Age of DataFrames 2: Polars Edition
May 23, 2024
Wouter Gins
In this publication, I showcase some Polars tricks and features.
As one of the newer kids on the block, Polars offers an exciting alternative to PySpark for small to medium-sized datasets, as already evidenced in other blog posts. Here, we’ll dangle our toes in the water with a more focused topic to showcase how powerful and expressive Polars can be: e-sports prediction. We’ll have a look at a tournament of the video game Age of Empires II: Definitive Edition (AoE), and try to make a prediction on how the first round will go. Aside from a similar age (Polars had its first commit in June 2020, AoE was released in November 2019), the AoE community has spawned many community projects that maintain vast databases, making it ideal for a small project like this. I hope that by the end of this blog post, you will share my joy in the features of Polars that just work 😄
To make a prediction, I’ll first have a look at some of the data gathered by the community, then build a simple model to calculate a win chance, and finally apply that model to the tournament.
If you want to play around with these concepts, here is a link to the GitHub repo you can use to download the data and get started.
📖 Data source
The data is scraped from aoestats.io, which maintains a database of online games played. I’m currently only interested in matches that are officially ranked, which corresponds to raw_match_type being between 6 and 9 (inclusive). I’ll add this as a filter when reading the data.
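A minimal sketch of what that read could look like, assuming the match data sits in parquet files (the path and exact schema are my own assumptions; the raw_match_type values 6 through 9 are the ranked match types mentioned above):

```python
import polars as pl

# Lazily scan the match data and keep only ranked games.
# The file path is a placeholder; is_between is inclusive on both ends.
matches = (
    pl.scan_parquet("data/matches/*.parquet")
    .filter(pl.col("raw_match_type").is_between(6, 9))
    .collect()
)
```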
I’ve already pre-processed the data and done a bit of cleaning, so after filtering on the match type, the data is ready to work with.
A neat trick that Polars offers is the .shrink_dtype() expression, which shrinks numeric columns to the smallest dtype that supports the values currently in the dataframe. Quite handy to minimize memory usage!
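A small toy example of what that looks like (the data here is made up, but shrink_dtype itself is a real Polars expression):

```python
import polars as pl

# Small integers are stored as Int64 by default...
df = pl.DataFrame({"raw_match_type": [6, 7, 8, 9], "rating": [1050, 1200, 980, 1430]})
print(df.dtypes)  # [Int64, Int64]

# ...but shrink_dtype downcasts each column to the smallest dtype
# that still holds its current values.
print(df.with_columns(pl.all().shrink_dtype()).dtypes)  # [Int8, Int16]
```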
Why filter: Eagle-eyed readers will know of the existence of the semi-join, which I could also have used to filter the data by making a small dataframe with a raw_match_type column. In testing, I found only a minute difference in performance between the two (0.4s versus 0.6s). My takeaway here is that the semi-join probably has some overhead. For filtering on a larger set of values, or on values that are not known beforehand, I would recommend the semi-join.
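For completeness, a sketch of what that semi-join variant could look like, reusing the (assumed) matches dataframe from the sketch above:

```python
import polars as pl

# Keep only the rows of `matches` whose raw_match_type appears in a
# small lookup dataframe; a semi-join returns rows from the left side
# only, without adding any columns from the right side.
ranked_types = pl.DataFrame({"raw_match_type": [6, 7, 8, 9]})
ranked_matches = matches.join(ranked_types, on="raw_match_type", how="semi")
```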
Now, aside from the matches, I also need the data of the players involved in these games. This is an ideal case for the semi-join!
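A sketch of how that could look; the players file path and the game_id join key are assumptions on my part:

```python
import polars as pl

# Keep only the player records whose game survives the match filter.
players = (
    pl.scan_parquet("data/players/*.parquet")
    .join(matches.lazy(), on="game_id", how="semi")
    .collect()
)
```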
As the dataset contains almost 12 million played games and 42 million records of players in those games, the dataset is large enough to play around with and just small enough to fit in memory on my modest laptop (once I close Chrome 😉).
🔮 Win chance prediction: making a lookup table
Of particular interest in this dataset are the rating (or elo) columns, which hold a number that represents how strong a player is. This concept originated in chess and is widely used across different e-sports to rank players. As a naive estimator for a player’s chance to win a match, we can use the rating difference. First, let’s transform the data so it can be fed into a classifier, and then make a lookup table for the rating difference:
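One possible way to do that reshaping, assuming `players` has a `game_id`, a numeric `rating` and a boolean `winner` column with two players per ranked game (all of these column names are assumptions on my side):

```python
import polars as pl

# Collapse each game to a single row: the feature is the rating
# difference between the two players, the target is whether the
# first of the two won. This is a simplified sketch, not the
# full cleaning done in the original post.
training = (
    players
    .filter(pl.col("rating").is_not_null())
    .group_by("game_id", maintain_order=True)
    .agg(
        rating_diff=pl.col("rating").first() - pl.col("rating").last(),
        first_player_won=pl.col("winner").first(),
    )
)
```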
For the classifier, I simply took the Gaussian Naive Bayes classifier available in scikit-learn. This is not necessarily the most suitable choice, and I will also not split the data into a training and testing set. For building good models, these steps are absolutely essential, but I want to focus on how to use Polars, not on machine learning 😉 Just for fun, I’ve also scored the model to see how it performs:
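A minimal sketch of that fit and score, reusing the (assumed) `training` dataframe from the sketch above:

```python
from sklearn.naive_bayes import GaussianNB

# Fit on the rating difference only and score on the same data
# (no train/test split, as mentioned above).
X = training.select("rating_diff").to_numpy()
y = training.get_column("first_player_won").to_numpy()

model = GaussianNB().fit(X, y)
print(f"accuracy: {model.score(X, y):.2%}")
```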
Our classifier scores 55%, so slightly better than a straight-up coin toss! The classification simply selects whoever has the higher win percentage, which is somewhat boring. However, the model can also give the win percentage itself, which is not so boring! Let’s evaluate this in steps of 5 ELO points over a decent range, so we can see the win chance changing smoothly. Since we need to interface with the classifier, the Polars version of a User Defined Function is required, which can be either map_elements(...) or map_batches(...). The exact difference is quite nuanced, and it has a large entry in the documentation. In this situation, map_batches can be used to evaluate all the data at once, rather than evaluating record by record.
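A sketch of how that lookup table could be built with map_batches; the range of ±1000 rating points is my own choice, and `model` is the classifier fitted in the sketch above:

```python
import numpy as np
import polars as pl

# Evaluate the classifier for every rating difference from -1000 to
# +1000 in steps of 5. map_batches hands the whole Series to the
# function in one go, so predict_proba is called once, not per row.
lookup = pl.DataFrame({"rating_diff": np.arange(-1000, 1005, 5)}).with_columns(
    win_chance=pl.col("rating_diff").map_batches(
        lambda s: pl.Series(model.predict_proba(s.to_numpy().reshape(-1, 1))[:, 1])
    )
)
```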