Polars is a fast and efficient DataFrame library designed for handling large datasets in Python. While Pandas is the go-to for many, Polars is gaining traction due to its performance advantages, especially with larger datasets. If we're transitioning from Pandas or exploring Polars for our data manipulation tasks, understanding how to drop rows from a DataFrame is essential. In this article, we'll explore different methods to drop rows in a Polars DataFrame.
Installing Polars
First, make sure that Polars is installed. If not, we can install it using pip:
pip install polars
Creating a Polars DataFrame
Let’s start by creating a simple DataFrame in Polars:
import polars as pl
# Creating a sample DataFrame
df = pl.DataFrame({
    "Name": ["Amit", "Raj", "Sita", "Pooja"],
    "Age": [25, 30, 35, 40],
    "City": ["Mumbai", "Delhi", "Kolkata", "Chennai"]
})
print("Original DataFrame:")
print(df)
Output:
shape: (4, 3)
┌───────┬─────┬─────────┐
│ Name  ┆ Age ┆ City    │
│ ---   ┆ --- ┆ ---     │
│ str   ┆ i64 ┆ str     │
╞═══════╪═════╪═════════╡
│ Amit  ┆ 25  ┆ Mumbai  │
│ Raj   ┆ 30  ┆ Delhi   │
│ Sita  ┆ 35  ┆ Kolkata │
│ Pooja ┆ 40  ┆ Chennai │
└───────┴─────┴─────────┘
1. Dropping Rows Based on Condition
To drop rows based on a condition, we can use the filter method in Polars. filter keeps only the rows that satisfy a predicate, so we express the condition in terms of the rows we want to keep. For example, to drop rows where the age is greater than 30, we keep the rows where Age is at most 30:
# ...
# Dropping rows where Age > 30
df_filtered = df.filter(pl.col("Age") <= 30)
print("\nDataFrame after Dropping Rows where Age > 30:")
print(df_filtered)
Output:
DataFrame after Dropping Rows where Age > 30:
shape: (2, 3)
┌──────┬─────┬────────┐
│ Name ┆ Age ┆ City   │
│ ---  ┆ --- ┆ ---    │
│ str  ┆ i64 ┆ str    │
╞══════╪═════╪════════╡
│ Amit ┆ 25  ┆ Mumbai │
│ Raj  ┆ 30  ┆ Delhi  │
└──────┴─────┴────────┘
2. Dropping Rows by Index
Polars DataFrames have no index, so there is no direct method for dropping rows by index. Rows are identified by their position, and we can drop specific positions by building a boolean mask:
# Dropping the row at index 2 (Sita)
indexes_to_drop = [2] # This is the index list to drop
df_dropped = df.filter(~pl.Series(range(len(df))).is_in(indexes_to_drop))
print("\nDataFrame after Dropping Row at Index 2:")
print(df_dropped)
Output:
DataFrame after Dropping Row at Index 2:
shape: (3, 3)
┌───────┬─────┬─────────┐
│ Name  ┆ Age ┆ City    │
│ ---   ┆ --- ┆ ---     │
│ str   ┆ i64 ┆ str     │
╞═══════╪═════╪═════════╡
│ Amit  ┆ 25  ┆ Mumbai  │
│ Raj   ┆ 30  ┆ Delhi   │
│ Pooja ┆ 40  ┆ Chennai │
└───────┴─────┴─────────┘
3. Dropping Rows with Missing Values
Sometimes rows contain missing (null) values, and we may want to drop those rows. The drop_nulls method removes every row that has a null in at least one column:
# ...
# Adding a row with missing values; pl.concat returns a new frame and leaves df unchanged
null_row = pl.DataFrame({"Name": [None], "Age": [None], "City": [None]}, schema=df.schema)
df_with_missing = pl.concat([df, null_row])
print("\nDataFrame with Missing Values:")
print(df_with_missing)
# Dropping rows with missing values
df_no_missing = df_with_missing.drop_nulls()
print("\nDataFrame after Dropping Rows with Missing Values:")
print(df_no_missing)
Output:
DataFrame with Missing Values:
shape: (5, 3)
┌───────┬──────┬─────────┐
│ Name  ┆ Age  ┆ City    │
│ ---   ┆ ---  ┆ ---     │
│ str   ┆ i64  ┆ str     │
╞═══════╪══════╪═════════╡
│ Amit  ┆ 25   ┆ Mumbai  │
│ Raj   ┆ 30   ┆ Delhi   │
│ Sita  ┆ 35   ┆ Kolkata │
│ Pooja ┆ 40   ┆ Chennai │
│ null  ┆ null ┆ null    │
└───────┴──────┴─────────┘
DataFrame after Dropping Rows with Missing Values:
shape: (4, 3)
┌───────┬─────┬─────────┐
│ Name  ┆ Age ┆ City    │
│ ---   ┆ --- ┆ ---     │
│ str   ┆ i64 ┆ str     │
╞═══════╪═════╪═════════╡
│ Amit  ┆ 25  ┆ Mumbai  │
│ Raj   ┆ 30  ┆ Delhi   │
│ Sita  ┆ 35  ┆ Kolkata │
│ Pooja ┆ 40  ┆ Chennai │
└───────┴─────┴─────────┘
4. Dropping Duplicate Rows
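To drop duplicate rows in Polars, we can use the unique method. By default it compares all columns and returns a DataFrame with duplicate rows removed; the subset parameter restricts the comparison to selected columns. The order of the result is not guaranteed unless we pass maintain_order=True: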
# Create a DataFrame with duplicate rows
df_with_duplicates = pl.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Bob", "Alice"],
    "Age": [24, 19, 34, 19, 24],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York"]
})
# Drop duplicate rows; maintain_order=True keeps the original row order
df_no_duplicates = df_with_duplicates.unique(maintain_order=True)
print(df_no_duplicates)
Output:
shape: (3, 3)
┌─────────┬─────┬─────────────┐
│ Name    ┆ Age ┆ City        │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ i64 ┆ str         │
╞═════════╪═════╪═════════════╡
│ Alice   ┆ 24  ┆ New York    │
│ Bob     ┆ 19  ┆ Los Angeles │
│ Charlie ┆ 34  ┆ Chicago     │
└─────────┴─────┴─────────────┘
Conclusion
Polars offers several ways to drop rows from a DataFrame: filter for conditions, a positional mask for dropping by index, drop_nulls for missing values, and unique for duplicates. Each of these methods returns a new DataFrame rather than modifying the original.
With its focus on performance and parallelism, Polars is an excellent choice for working with large datasets, and these row-dropping techniques are a solid starting point for exploring the rest of the library.