In this tutorial, we'll learn how to create a decile column using Python's Polars library. Deciles are a common way to divide data into ten equal parts, each containing 10% of the values. They are often used in statistics to understand data distribution, making them a powerful tool in data analysis.
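Before bringing in Polars, it can help to see the idea in plain Python. The following is a minimal sketch using only the standard library; the sample values and the helper name decile_of are illustrative assumptions, not part of the tutorial's dataset.

```python
import bisect
import statistics

# Hypothetical sample: the integers 1..100
values = list(range(1, 101))

# With n=10, statistics.quantiles returns the 9 cut points that separate the deciles
cut_points = statistics.quantiles(values, n=10)

def decile_of(x, cuts):
    """Return the 1-based decile of x, treating each cut point as an inclusive upper edge."""
    return bisect.bisect_left(cuts, x) + 1

deciles = [decile_of(v, cut_points) for v in values]

# Each decile should hold roughly 10% of the rows
for d in range(1, 11):
    print(d, deciles.count(d))
```

Nine cut points produce ten groups, which is the same structure the Polars code later in this tutorial builds for a DataFrame column.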
Installing Polars
Polars is a high-performance DataFrame library that's well-suited for handling large datasets. We can install it with pip, along with NumPy, which the examples use to generate sample data.
pip install polars numpy
Loading and Understanding the Data
Let's start by loading a dataset. For this example, we'll create a DataFrame with some random numerical values:
import polars as pl
import numpy as np
# Creating a DataFrame with random data
data = pl.DataFrame({
    'id': np.arange(1, 101),
    'value': np.random.randint(100, 1000, 100)
})
print(data)

This generates a dataset with 100 rows and two columns: id and value. The value column contains random integers from 100 to 999, because np.random.randint excludes the upper bound. The values differ on every run unless a seed is set.
Calculating Deciles
The decile for each row will be based on the value column. We'll use the qcut expression method in Polars, which bins values by quantile, to divide the data into deciles.
Here’s how to create a decile column:
import polars as pl
import numpy as np
# Creating a DataFrame with random data
data = pl.DataFrame({
    'id': np.arange(1, 101),
    'value': np.random.randint(100, 1000, 100)
})
print(data)
# Define the number of deciles
decile_bins = 10
# Calculate deciles and create a new 'decile' column
data = data.with_columns(
    # Use pl.col('value') to access the column and then apply qcut
    pl.col('value').qcut(decile_bins).alias('decile')
)
print(data)

Explanation:
- pl.col('value').qcut(decile_bins) divides the value column into 10 quantile bins (deciles).
- The result is a new column called decile. By default it is a Categorical column whose labels describe each bin's interval boundaries, not integer ranks. Pass the labels argument to qcut to supply custom labels.
Sorting and Grouping by Deciles
We might also want to sort the data or group it by deciles to get an overview:
1. Sorting by Decile:
# ...
sorted_data = data.sort('decile')
print(sorted_data)

2. Grouping by Decile and Calculating Summary Statistics:
# Group by decile and calculate summary statistics
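# group_by and pl.len are the current names; older Polars releases used groupby and pl.count
summary_stats = data.group_by('decile').agg(
    [
        pl.col('value').mean().alias('mean_value'),
        pl.col('value').min().alias('min_value'),
        pl.col('value').max().alias('max_value'),
        pl.len().alias('count')
    ]
).sort('min_value')  # group_by does not guarantee row order, so sort from lowest to highest decile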
print(summary_stats)

This will give us a summary of each decile, showing the mean, minimum, and maximum values for the value column, along with the number of rows in each decile.
Conclusion
Creating decile columns in Python using Polars is straightforward and efficient. With the qcut function, we can quickly assign deciles to our data and use them for analysis, sorting, or grouping.