Data Preprocessing in R

Data preprocessing is essential in data analysis and machine learning because real-world data is often incomplete, noisy or inconsistent. In R, it involves cleaning, organizing and structuring data before analysis or modeling to ensure accurate and reliable results. Typical tasks include:

Handling missing values, outliers and inconsistent data entries.
Applying transformation methods such as normalization, scaling and categorical encoding.
Structuring and integrating data to prepare datasets for analysis and machine learning tasks.

Implementation

Step 1: Installing and Loading Required Packages

The tidyverse package provides a collection of essential tools for data manipulation, transformation and visualization in R.
The package is installed using the install.packages("tidyverse") function.
The library(tidyverse) function loads the package into the R session for use in the script.

install.packages("tidyverse")
library(tidyverse)

Step 2: Load the Dataset

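The dataset is loaded into R using read.csv(). From R 4.0 onward, character columns are kept as characters rather than being converted to factors automatically; on older versions, pass stringsAsFactors = FALSE to get the same behavior.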

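# stringsAsFactors = FALSE keeps text columns as character (the default since R 4.0)
data <- read.csv("/path/to/your/data.csv", stringsAsFactors = FALSE)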
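
As a minimal sketch of how these import options behave, the example below reads a small inline CSV instead of a file. The column names and values are made up for illustration, and the na.strings argument is an addition that the call above does not use; it asks read.csv() to treat empty fields as NA so that the later missing-value step can find them.

```r
# Hypothetical inline CSV standing in for the real file
csv_text <- "Name,Sex,Age,Fare
Alice,female,29,71.28
Bob,male,,8.05
Cara,,41,"

demo <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,   # keep text columns as character
  na.strings = c("", "NA")    # treat empty fields as missing
)

str(demo)               # column types after import
colSums(is.na(demo))    # number of missing values per column
```

Without na.strings, an empty field in a character column is read as the empty string "" rather than NA, so is.na() would not detect it.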

Step 3: Data Inspection

Before preprocessing, it is important to examine the dataset to understand its dimensions, structure and basic statistical details.

1. Check Dataset Dimensions: The dim() function returns the number of rows and columns in the dataset.

dim(data)

Output:

[1] 891   7

2. View Sample Records: The head() function displays the first six rows to give a quick preview of the dataset.

head(data)

Output:

[Image: Rows of the Data]

3. Examine Data Structure: The str() function shows the internal structure of the dataset, including column names and data types.

str(data)

4. Descriptive Statistics: The summary() function generates basic descriptive statistics for each column. For numeric columns it reports the minimum, first quartile, median, mean, third quartile and maximum, plus the number of NA values if any are present.

summary(data)

Output:

[Image: Statistics of the Data]
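
The inspection functions above can be complemented with a compact per-column overview. The following is only a sketch on a made-up data frame named demo (not the tutorial's dataset) that lists each column's class and its number of missing values.

```r
# Hypothetical toy data with a few missing values
demo <- data.frame(
  Age  = c(22, NA, 35, 54),
  Fare = c(7.25, 71.28, NA, NA),
  Sex  = c("male", "female", NA, "male"),
  stringsAsFactors = FALSE
)

# One row per column: its class and how many values are NA
overview <- data.frame(
  column    = names(demo),
  class     = vapply(demo, function(col) class(col)[1], character(1),
                     USE.NAMES = FALSE),
  n_missing = vapply(demo, function(col) sum(is.na(col)), integer(1),
                     USE.NAMES = FALSE)
)

print(overview)
```

A table like this makes it easier to decide which columns need type conversion and which need imputation in the next steps.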

Step 4: Convert Data Types

Convert columns to appropriate types so that numerical variables can be used in calculations and categorical variables can be grouped and encoded correctly.

data$Age  <- as.numeric(data$Age)   
data$Fare <- as.numeric(data$Fare)  
data$Sex  <- as.factor(data$Sex)
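
Two conversion pitfalls are worth checking before relying on the result. The sketch below uses made-up vectors (raw_age and fare_factor are illustrative names) to show that non-numeric text silently becomes NA, and that converting a factor directly returns its internal level codes rather than the values it displays.

```r
# Text that cannot be parsed as a number becomes NA (with a warning)
raw_age <- c("22", "38", "unknown", "")
age_num <- suppressWarnings(as.numeric(raw_age))
age_num

# A factor stores integer level codes; as.numeric() returns those codes
fare_factor <- factor(c("7.25", "10.5", "7.25"))
codes  <- as.numeric(fare_factor)                  # level codes, not fares
values <- as.numeric(as.character(fare_factor))    # the actual numbers
codes
values
```

If a column was imported as a factor, converting through as.character() first avoids replacing the data with level codes.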

Step 5: Handle Missing Values

Replace missing numeric values with the median and categorical values with the most frequent category.

data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE) 
data$Fare[is.na(data$Fare)] <- median(data$Fare, na.rm = TRUE)

mode_sex <- names(sort(table(data$Sex), decreasing = TRUE))[1]
data$Sex[is.na(data$Sex)] <- mode_sex
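
The same imputation logic can be wrapped in small reusable functions. This is a sketch under the assumption that median imputation suits numeric columns and mode imputation suits categorical ones; impute_median and impute_mode are hypothetical helper names, and the vectors are made up.

```r
# Replace NA in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Replace NA in a categorical vector with the most frequent category
impute_mode <- function(x) {
  counts <- table(x)    # table() leaves NA out of the counts by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

age <- c(22, NA, 35, 54)
sex <- factor(c("male", "female", NA, "male"))

age_filled <- impute_median(age)
sex_filled <- impute_mode(sex)
age_filled
sex_filled
```

When two categories are tied for most frequent, which.max() picks the first one in level order, so the choice of mode is deterministic but arbitrary.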

Step 6: Remove Duplicate Rows

distinct() from dplyr (loaded as part of the tidyverse) removes repeated rows, keeping one copy of each, to maintain clean and consistent data.

data <- distinct(data)
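
Before dropping duplicates it can help to know how many there are. The sketch below uses base R's duplicated() on a made-up data frame as a stand-in for distinct(), so it runs without any packages.

```r
# Hypothetical toy data: the third row repeats the second
demo <- data.frame(
  id   = c(1, 2, 2, 3),
  fare = c(7.25, 8.05, 8.05, 8.05)
)

n_dupes <- sum(duplicated(demo))       # rows identical to an earlier row
deduped <- demo[!duplicated(demo), ]   # keep the first copy of each row

n_dupes
nrow(deduped)
```

Only rows that match in every column count as duplicates here, which is why the last row is kept even though its fare repeats.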

Step 7: Feature Scaling

Feature scaling adjusts numerical variables to a comparable range so that no single feature dominates the model due to larger values. It improves the performance and convergence speed of many machine learning algorithms.

1. Standardization: Standardization transforms data so that it has a mean of 0 and a standard deviation of 1. This method is commonly used in algorithms that depend on distance calculations such as KNN and SVM.

data$Fare <- as.numeric(scale(data$Fare))
data$Age  <- as.numeric(scale(data$Age))
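
To make the transformation concrete, the sketch below computes z-scores by hand on a made-up fare vector and checks that they match scale(), which centers on the mean and divides by the sample standard deviation.

```r
# Hypothetical fare values
fare <- c(7.25, 71.28, 8.05, 53.10, 8.46)

z_manual <- (fare - mean(fare)) / sd(fare)   # z = (x - mean) / sd
z_scale  <- as.numeric(scale(fare))          # same result via scale()

all.equal(z_manual, z_scale)
mean(z_manual)   # approximately 0
sd(z_manual)     # approximately 1
```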

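2. Normalization: Normalization (min-max scaling) rescales the data to a fixed range, usually [0, 1]. This is useful when an algorithm expects features within a specific range. Treat it as an alternative to standardization rather than a follow-up step: running the code below after the standardization code overwrites the standardized values.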

data$Fare <- (data$Fare - min(data$Fare)) / 
             (max(data$Fare) - min(data$Fare))

data$Age <- (data$Age - min(data$Age)) / 
            (max(data$Age) - min(data$Age))
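
The min-max formula can be wrapped in a helper so it is not repeated per column. The function name min_max and the handling of constant columns below are assumptions made for this sketch, not part of the steps above.

```r
# Rescale a numeric vector to [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))   # constant column: avoid dividing by zero
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

age <- c(22, 38, 26, 35, 54)   # hypothetical ages
age_scaled <- min_max(age)
age_scaled

# Min-max scaling of standardized values gives the same result as on raw values
same_result <- all.equal(age_scaled, min_max(as.numeric(scale(age))))
same_result
```

Because standardization is a linear rescaling, applying min-max scaling afterwards yields the same values as applying it to the raw data, so only one of the two methods needs to be applied to a given column.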

Step 8: Encode Categorical Variables

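One-hot encoding converts categorical values into binary numeric columns. Here model.matrix() creates one 0/1 column per level of Sex, and the result is stored in encoded_data, while the following steps continue to work on data.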

encoded_data <- cbind(data, model.matrix(~ Sex - 1, data))
encoded_data$Sex <- NULL
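
To see what the encoding produces, the sketch below applies the same model.matrix() formula to a made-up three-row data frame. The names demo, dummies and encoded are illustrative only.

```r
# Hypothetical toy data with one categorical column
demo <- data.frame(
  Fare = c(7.25, 71.28, 8.05),
  Sex  = factor(c("male", "female", "female"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ Sex - 1, data = demo)
colnames(dummies)

# Combine the numeric column with the indicator columns
encoded <- cbind(demo["Fare"], dummies)
encoded
```

The new columns are named by pasting the variable name and the level together. With R's default NA handling, model.matrix() drops rows that contain NA, so the row counts passed to cbind() only line up if missing values have been imputed first.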

Step 9: Handling Outliers

Outliers are extreme values that can affect statistical analysis and model performance, so detecting and treating them improves reliability. Using the IQR method, we calculate lower and upper bounds and remove the rows whose Fare values fall outside this range.

boxplot(data$Fare, main = "Before Removing Outliers")

Q1 <- quantile(data$Fare, 0.25)
Q3 <- quantile(data$Fare, 0.75)
IQR_value <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

data <- data[data$Fare >= lower_bound & data$Fare <= upper_bound, ]
boxplot(data$Fare, main = "After Removing Outliers")
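
The bounds are easier to follow on a small example. The sketch below applies the same IQR rule to a made-up fare vector and flags the values outside the bounds instead of removing them, which is an alternative worth considering when the extreme values may be genuine.

```r
# Hypothetical fares with one extreme value
fare <- c(7.25, 7.90, 8.05, 8.46, 13.00, 26.55, 512.33)

q1 <- quantile(fare, 0.25, names = FALSE)
q3 <- quantile(fare, 0.75, names = FALSE)
iqr_value <- q3 - q1              # same value as IQR(fare)

lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

# Flag values outside the bounds
is_outlier <- fare < lower_bound | fare > upper_bound
fare[is_outlier]
```

Here the quartiles are 7.975 and 19.775, so the upper bound is 37.475 and only 512.33 is flagged.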

Step 10: Review the Cleaned Dataset

After completing preprocessing, it is important to verify the final structure and summary of the dataset. This ensures that all transformations, scaling and cleaning steps were applied correctly.

summary(data)

Step 11: Correlation Analysis and Data Splitting

Correlation analysis helps identify relationships between numerical variables, while splitting the dataset into training and testing sets prepares it for model building and evaluation.

cor(data[, sapply(data, is.numeric)])

set.seed(123)
# floor() makes the training-set size an explicit whole number
train_index <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
train_data <- data[train_index, ]
test_data  <- data[-train_index, ]
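
After splitting, a quick check confirms that the two sets have the expected sizes and share no rows. This is a sketch on a made-up ten-row data frame; the id column and the names demo, train_demo and test_demo are assumptions added for illustration.

```r
set.seed(123)   # make the random split reproducible

# Hypothetical toy data with a unique id per row
demo <- data.frame(id = 1:10, fare = seq(5, 50, by = 5))

n_train <- floor(0.7 * nrow(demo))
train_index <- sample(seq_len(nrow(demo)), size = n_train)

train_demo <- demo[train_index, ]
test_demo  <- demo[-train_index, ]

nrow(train_demo)                                   # 7 rows for training
nrow(test_demo)                                    # 3 rows for testing
length(intersect(train_demo$id, test_demo$id))     # 0 shared rows
```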
