Handling Categorical Data in Python

Categorical data refers to features whose values come from a fixed set of categories, such as blood type or marital status. Handling it correctly matters because mistakes at this stage lead to inaccurate analysis and poor model performance. In this article, we will see how to clean, group, visualize and encode categorical data.

Why Do We Need to Handle Categorical Data?

Handling categorical data is important because:

Algorithms Require Numerical Inputs: Most machine learning algorithms cannot directly process categorical data and need it to be converted into numerical formats.
Inconsistent Categories: Categorical data often contains inconsistencies such as typos, mixed letter case or alternate spellings. We must standardize these so that they are not treated as separate categories.
Remapping Categories: Some categories might need to be grouped for simplicity and relevance. For example, remapping rare categories into an "Other" group.
Improves Model Performance: Choosing a suitable encoding technique, such as one-hot encoding or label encoding, helps models interpret categories correctly, which leads to better predictions.
Handles Real-World Complexity: Categorical features appear in many domains such as e-commerce, finance and healthcare, so handling them well is necessary for working with real-world data.
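The remapping of rare categories mentioned above can be sketched as follows. This is a minimal illustration on made-up data: the `city` values and the `min_count` threshold are assumptions chosen for the example, not part of the Demographics dataset.

```python
import pandas as pd

# Hypothetical sample data; the column name and values are illustrative only.
cities = pd.Series(
    ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi", "Kochi", "Mumbai"],
    name="city",
)

# Count how often each category occurs.
counts = cities.value_counts()

# Treat any category seen fewer than `min_count` times as rare (assumed threshold).
min_count = 2
rare_categories = counts[counts < min_count].index

# Keep frequent categories as they are and replace rare ones with "Other".
grouped = cities.where(~cities.isin(rare_categories), "Other")

print(grouped.value_counts())
```

The right threshold depends on the dataset size and the model, so it is worth checking the category counts before fixing a value.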

Implementation for Handling Categorical Data

Here we will be using a Demographics dataset that contains some incorrect, invalid or meaningless data (bogus values), caused by human error while filling in a survey form or by other reasons. You can download the dataset from here.

Step 1: Importing necessary Libraries

We will be using the NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn libraries for this implementation.

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

Step 2: Loading the Dataset

We load the dataset into a Pandas DataFrame for manipulation.

Python

file_path = '/content/demographics.csv'
main_data = pd.read_csv(file_path)
print(main_data.head())

Output:

First five rows of the dataset

Step 3: Identifying and Removing Bogus Blood Types

First we create a DataFrame containing all valid blood types to check for bogus values in the dataset:

Python

valid_blood_type_list = ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']
blood_type_categories = pd.DataFrame({'blood_type': valid_blood_type_list})
print(blood_type_categories)

Output:

Let's find the bogus blood types by comparing the dataset values to this valid list:

Python

unique_blood_types_main = set(main_data['blood_type'])
valid_blood_types_set = set(blood_type_categories['blood_type'])  
bogus_blood_types = unique_blood_types_main.difference(valid_blood_types_set)
bogus_blood_types

Output:

{'C+', 'D-'}

Once the bogus values are found, the corresponding rows can be dropped from the dataset.

Python

bogus_records_index = main_data['blood_type'].isin(bogus_blood_types)

without_bogus_records = main_data[~bogus_records_index].copy()
without_bogus_records['blood_type'].unique()

Output:

array(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'], dtype=object)
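An alternative to the set comparison is to declare the valid values as a categorical dtype. The sketch below uses a small made-up sample in place of `main_data`; it relies on the fact that Pandas turns values outside the declared categories into NaN during the conversion.

```python
import pandas as pd

valid_blood_types = ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']

# Hypothetical sample standing in for main_data['blood_type'].
sample = pd.DataFrame({'blood_type': ['A+', 'C+', 'O-', 'D-', 'AB+']})

# Values outside the declared categories become NaN during the conversion.
blood_dtype = pd.CategoricalDtype(categories=valid_blood_types)
sample['blood_type'] = sample['blood_type'].astype(blood_dtype)

# Dropping the NaN rows removes the bogus records.
cleaned = sample.dropna(subset=['blood_type']).reset_index(drop=True)
print(cleaned)
```

Note that this approach also drops rows where the blood type was missing to begin with, which may or may not be what the analysis needs.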

Step 4: Handling Inconsistent Marriage Status Categories

Checking the unique values in the marriage_status column:

Python

main_data['marriage_status'].unique()

Output:

array(['married', 'MARRIED', ' married', 'unmarried ', 'divorced', 'unmarried', 'UNMARRIED', 'separated'], dtype=object)

Standardizing the categories by converting all text to lowercase.

Python

inconsistent_data = main_data.copy()
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status'].str.lower()
inconsistent_data['marriage_status'].unique()

Output:

array(['married', ' married', 'unmarried ', 'divorced', 'unmarried', 'separated'], dtype=object)

Now we will standardize the categories by stripping extra spaces:

Python

inconsistent_data['marriage_status'] = inconsistent_data['marriage_status'].str.strip()

inconsistent_data['marriage_status'].unique()

Output:

array(['married', 'unmarried', 'divorced', 'separated'], dtype=object)
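The two cleaning steps can also be chained, and alternate spellings can be mapped onto one canonical label. The sketch below is only illustrative: the sample values and the `single` synonym are assumptions, since the dataset shown above does not contain alternate spellings.

```python
import pandas as pd

# Hypothetical sample; 'Single' is an assumed alternate spelling, not from the dataset.
status = pd.Series(['married', 'MARRIED', ' married', 'unmarried ', 'Single', 'separated'])

# Lowercase and strip surrounding whitespace in one chained expression.
cleaned = status.str.lower().str.strip()

# Map assumed synonyms onto a canonical label; unmapped values are left unchanged.
synonyms = {'single': 'unmarried'}
cleaned = cleaned.replace(synonyms)

print(cleaned.unique())
```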

Step 5: Grouping Income into Meaningful Bins

Numerical data like age or income can be mapped to groups. Let us check the income range to define the bin intervals:

Python

print(f"Max income - {main_data['income'].max()}, Min income - {main_data['income'].min()}")

Output:

Max income - 190000, Min income - 40000

Now let us define the bin edges and labels for the income feature. The Pandas cut method is used here.

Python

income_bins = [40000, 75000, 100000, 125000, 150000, np.inf]
income_labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']

remapping_data = main_data.copy()
remapping_data['income_groups'] = pd.cut(
    remapping_data['income'],
    bins=income_bins,
    labels=income_labels,
    include_lowest=True  # keep incomes equal to the lowest edge (40000) in the first bin
)

remapping_data.head()

Output:
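One detail of `pd.cut` is worth checking when the lowest bin edge equals the minimum value in the data. By default the intervals are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin and becomes NaN. The sketch below uses made-up incomes to show the difference that `include_lowest=True` makes.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including one equal to the lowest bin edge.
incomes = pd.Series([40000, 75000, 75001, 190000])
bins = [40000, 75000, 100000, 125000, 150000, np.inf]
labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']

# Default behaviour: intervals are (left, right], so 40000 falls outside the first bin.
default_groups = pd.cut(incomes, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_groups = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'income': incomes,
    'default': default_groups,
    'include_lowest': inclusive_groups,
}))
```

Because each interval is closed on the right, an income of exactly 75000 belongs to the '40k-75k' group rather than '75k-100k'.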

Step 6: Visualizing Income Group Distribution

Now let's visualize the distribution of income groups:

Python

remapping_data['income_groups'].value_counts().sort_index().plot.bar()
plt.title('Income Group Distribution')
plt.xlabel('Income Groups')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

Output:

Step 7: Cleaning Phone Number Data

Simulating phone numbers with inconsistent formats and cleaning them:

Python

import random
phone_numbers = []

for i in range(100):
    number = random.randint(100000000, 9999999999)  # length can be 9 or 10 digits
    if i % 2 == 0:
        phone_numbers.append('+91 ' + str(number))  # add +91 prefix for some
    else:
        phone_numbers.append(str(number))

phone_numbers_data = pd.DataFrame({
    'phone_numbers': phone_numbers
})

phone_numbers_data.head()

Output:

Depending on the use case, the country code can be dropped from the numbers that have it or added to those that lack it. Here we drop it and then discard phone numbers with fewer than 10 digits.

Python

phone_numbers_data['phone_numbers'] = phone_numbers_data['phone_numbers'].str.replace(r'\+91 ', '', regex=True)

num_digits = phone_numbers_data['phone_numbers'].str.len()

invalid_numbers_index = phone_numbers_data[num_digits < 10].index
phone_numbers_data.drop(invalid_numbers_index, inplace=True)

phone_numbers_data.dropna(inplace=True)
phone_numbers_data.reset_index(drop=True, inplace=True)

phone_numbers_data.head()

Output:

Finally, we can verify whether the data is clean.

Python

assert not phone_numbers_data['phone_numbers'].str.contains(r'\+91 ').any(), "Found phone numbers with '+91 ' prefix"
assert (phone_numbers_data['phone_numbers'].str.len() == 10).all(), "Some phone numbers do not have 10 digits"
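The length check above counts characters, so it would accept a 10-character string that contains spaces or letters. A stricter variant, sketched below on made-up numbers, strips everything that is not a digit and then keeps only entries made of exactly 10 digits. The sample values and the decision to treat missing entries as invalid are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of raw phone numbers in mixed formats.
raw = pd.Series(['+91 9876543210', '9876543210', '+91 123456789', '98765 43210', None])

# Remove an optional leading '+91' country code, then drop every non-digit character.
digits = (
    raw.str.replace(r'^\+91\s*', '', regex=True)
       .str.replace(r'\D', '', regex=True)
)

# Keep only entries that consist of exactly 10 digits; missing values count as invalid.
is_valid = digits.str.fullmatch(r'\d{10}', na=False)
clean_numbers = digits[is_valid].reset_index(drop=True)

print(clean_numbers.tolist())
```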

Step 8: Visualizing Categorical Data

Various plots can be used to visualize categorical data and gain more insight into it. Let us visualize the number of people belonging to each blood type.

Python

sns.countplot(x='blood_type', data=without_bogus_records)
plt.title('Count of Blood Types')
plt.xlabel('Blood Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

Output:

Now we can see the relationship between income and the marital status of a person using a boxplot.

Python

sns.boxplot(x='marriage_status', y='income', data=inconsistent_data)

plt.title('Income Distribution by Marriage Status')
plt.xlabel('Marriage Status')
plt.ylabel('Income') 
plt.tight_layout()
plt.show()

Output:
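The statistics that a boxplot draws can also be computed as a table, which is useful when exact values are needed. The sketch below uses a small made-up sample in place of the cleaned dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned marriage_status and income columns.
sample = pd.DataFrame({
    'marriage_status': ['married', 'married', 'unmarried', 'unmarried', 'divorced'],
    'income': [60000, 80000, 50000, 70000, 90000],
})

# Count, quartiles and median per category, the same statistics a boxplot displays.
summary = (
    sample.groupby('marriage_status')['income']
          .describe()[['count', '25%', '50%', '75%']]
)
print(summary)
```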

Step 9: Encoding Categorical Data

Many learning algorithms, such as linear regression and neural networks, require numerical input. Categorical data must therefore be converted to numbers before these algorithms can use it. Let us see some encoding methods.

1. Label Encoding

With label encoding we can number the categories from 0 to num_categories - 1. Let us apply label encoding on the blood type feature.

Python

le = LabelEncoder()
without_bogus_records['blood_type_encoded'] = le.fit_transform(without_bogus_records['blood_type'])

without_bogus_records[['blood_type', 'blood_type_encoded']].drop_duplicates()

Output:
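It helps to keep the mapping between codes and labels so that the encoding can be reversed later. With `LabelEncoder`, the fitted labels are available in `le.classes_` and `le.inverse_transform` maps codes back. The sketch below shows a Pandas-only alternative on made-up data, using the category dtype rather than `LabelEncoder`; for string labels both approaches sort the distinct values before numbering them.

```python
import pandas as pd

# Hypothetical sample of already validated blood types.
blood = pd.Series(['O+', 'A-', 'B+', 'A-', 'O+'])

# Converting to the category dtype sorts the distinct labels, then assigns codes 0..n-1.
as_category = blood.astype('category')
codes = as_category.cat.codes

# Keep the code -> label lookup so the encoding can be reversed later.
code_to_label = dict(enumerate(as_category.cat.categories))
decoded = codes.map(code_to_label)

print(code_to_label)
print(codes.tolist())
```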

2. One-hot Encoding in Python

Label encoding has certain limitations that one-hot encoding addresses:

Creates a false order: It assigns numbers like 0, 1, 2 to categories, which may make models treat one category as larger or better than another.
Misleads models: Algorithms like linear regression can interpret the codes as a ranking, which can reduce accuracy. Tree-based models are usually less sensitive to this, but their splits still depend on the arbitrary numbering.
Problem with distance-based models: In models like KNN or K-Means, the numeric labels can wrongly influence distance calculations.
Bias in training: Some models may give more importance to higher label values, even if all categories are equal.
Not suitable for nominal data: Label encoding is not a good choice when categories have no natural order, like colors or city names.

One-hot encoding instead creates a separate indicator column for each category. Let us apply it to the marriage status feature.

Python

inconsistent_data = pd.get_dummies(inconsistent_data, columns=['marriage_status'])
inconsistent_data.head()

Output:
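Two optional parameters of `pd.get_dummies` are often useful. The sketch below, on a made-up sample, uses `dtype=int` to produce 0/1 columns and `drop_first=True` to drop one category, which avoids a redundant column in linear models. Whether to drop a column is a modelling choice, not something the original dataset requires.

```python
import pandas as pd

# Hypothetical sample with the four cleaned marriage_status categories.
sample = pd.DataFrame({
    'marriage_status': ['married', 'unmarried', 'divorced', 'separated']
})

# dtype=int yields 0/1 columns instead of booleans;
# drop_first=True removes the first category in sorted order ('divorced').
encoded = pd.get_dummies(
    sample, columns=['marriage_status'], dtype=int, drop_first=True
)
print(encoded)
```

The dropped category is then represented by a row in which all remaining indicator columns are 0.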

3. Ordinal Encoding in Python

Categorical data can be ordinal, meaning that the order of the categories matters. For such features, we want to preserve the order after encoding as well. We will perform ordinal encoding on the income groups, preserving the order 40k-75k < 75k-100k < 100k-125k < 125k-150k < 150k+.

Python

custom_map = {
    '40k-75k': 1,
    '75k-100k': 2,
    '100k-125k': 3,
    '125k-150k': 4,
    '150k+': 5
}

remapping_data['income_groups_encoded'] = remapping_data['income_groups'].map(custom_map)

remapping_data[['income', 'income_groups', 'income_groups_encoded']].head()

Output:
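Instead of a hand-written dictionary, the order can be declared once with an ordered categorical dtype. The sketch below is an alternative on made-up labels, not the method used above; it adds 1 to the zero-based codes only to match the 1 to 5 numbering of the manual mapping.

```python
import pandas as pd

labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']

# Hypothetical sample of income group labels.
groups = pd.Series(['75k-100k', '150k+', '40k-75k', '100k-125k'])

# Declare the order explicitly; cat.codes then follows that order (0-based).
ordered_dtype = pd.CategoricalDtype(categories=labels, ordered=True)
ordered_groups = groups.astype(ordered_dtype)

# Add 1 to match the 1..5 numbering used in the manual mapping.
encoded = ordered_groups.cat.codes + 1
print(encoded.tolist())

# Ordered categoricals also support comparisons against a category label.
above_second_group = ordered_groups > '75k-100k'
print(above_second_group.tolist())
```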

With these techniques we can prepare categorical data for meaningful analysis and effective machine learning models.
