Data preprocessing is essential in data analysis and machine learning because real-world data is often incomplete, noisy or inconsistent. In R, it involves cleaning, organizing and structuring data before analysis or modeling so that results are accurate and reliable. Typical tasks include:
- Handling missing values, outliers and inconsistent data entries.
- Applying transformation methods such as normalization, scaling and categorical encoding.
- Structuring and integrating data to prepare datasets for analysis and machine learning tasks.
Implementation
Step 1: Installing and Loading Required Packages
- The tidyverse package provides a collection of essential tools for data manipulation, transformation and visualization in R.
- The package is installed using the install.packages("tidyverse") function.
- The library(tidyverse) function loads the package into the R session for use in the script.
install.packages("tidyverse")
library(tidyverse)
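Calling install.packages() on every run downloads the package again, so scripts often install only when the package is missing. A minimal sketch of that pattern follows; ensure_package is a hypothetical helper name used for illustration, not a tidyverse function.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("stats")        # ships with R, so nothing is downloaded
# ensure_package("tidyverse")  # would install tidyverse only on first use
```

After the check, library(tidyverse) is still needed to attach the package to the session.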
Step 2: Load the Dataset
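The dataset is loaded into R using read.csv(). Since R 4.0.0, read.csv() keeps character columns as characters by default instead of converting them to factors; on older versions, pass stringsAsFactors = FALSE to get the same behavior.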
data <- read.csv("/path/to/your/data.csv")
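By default, empty text fields are read as empty strings rather than NA, which can hide missing categorical values. The sketch below shows one way to make this explicit with the na.strings argument. The CSV rows are made up for illustration and are written to a temporary file so the example runs on its own.

```r
# Write a tiny illustrative CSV to a temporary file (made-up rows, not the real dataset).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Sex,Age,Fare",
  "A,male,22,7.25",
  "B,,38,71.28",
  "C,female,,8.05"
), csv_path)

# na.strings treats empty fields as NA; stringsAsFactors = FALSE keeps text as character.
toy <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

str(toy)
is.na(toy$Sex)  # the empty Sex field in the second row is now NA
```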
Step 3: Data Inspection
Before preprocessing, it is important to examine the dataset to understand its dimensions, structure and basic statistical details.
1. Check Dataset Dimensions: This function returns the total number of rows and columns in the dataset.
dim(data)
Output:
[1] 891   7
2. View Sample Records: This displays the first six rows to get a quick preview of the dataset.
head(data)
3. Examine Data Structure: This shows the internal structure of the dataset, including column names and data types.
str(data)
4. Descriptive Statistics: The summary() function gives an overview of each column. For numeric columns it reports the minimum, first quartile, median, mean, third quartile and maximum; for factor columns it reports the count of each level.
summary(data)
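For character columns, summary() reports only length, class and mode, so missing values there are easy to overlook. A per-column overview of type and missing count can complement it. The following is a small sketch on a made-up data frame; the name overview is illustrative.

```r
# Toy data frame standing in for the loaded dataset (values are made up).
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.28, NA, NA),
  Sex  = c("male", "female", "female", NA),
  stringsAsFactors = FALSE
)

# One row per column: its class and how many values are missing.
overview <- data.frame(
  column  = names(toy),
  class   = vapply(toy, function(col) class(col)[1], character(1)),
  missing = colSums(is.na(toy)),
  row.names = NULL
)
print(overview)
```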
Step 4: Convert Data Types
Convert each column to an appropriate type so that numeric variables support arithmetic and categorical variables can be grouped and encoded correctly.
data$Age <- as.numeric(data$Age)
data$Fare <- as.numeric(data$Fare)
data$Sex <- as.factor(data$Sex)
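Two conversion details are worth checking. Strings that are not valid numbers become NA, and calling as.numeric() directly on a factor returns its internal level codes rather than the values. The sketch below illustrates both on made-up vectors.

```r
# Made-up values: one entry is not a valid number.
raw_age <- c("22", "38", "unknown", "35")

# Non-numeric strings become NA (R also emits a warning, silenced here).
age_num <- suppressWarnings(as.numeric(raw_age))
age_num
sum(is.na(age_num))  # entries that failed to convert

# Pitfall: as.numeric() on a factor returns the internal level codes.
fare_factor <- factor(c("7.25", "71.28", "8.05"))
as.numeric(fare_factor)                # level codes, not the fares
as.numeric(as.character(fare_factor))  # the actual numeric values
```

Counting NAs before and after a conversion is a quick way to see whether it introduced new missing values.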
Step 5: Handle Missing Values
Replace missing numeric values with the median and categorical values with the most frequent category.
data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)
data$Fare[is.na(data$Fare)] <- median(data$Fare, na.rm = TRUE)
mode_sex <- names(sort(table(data$Sex), decreasing = TRUE))[1]
data$Sex[is.na(data$Sex)] <- mode_sex
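When several columns need the same treatment, the imputation logic can be wrapped in small functions. This is a sketch in base R; impute_median and impute_mode are hypothetical helper names, and the vectors are made up.

```r
# Hypothetical helpers, written in base R for illustration.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

age <- c(20, NA, 30, 40)
sex <- factor(c("male", "female", NA, "male"))

impute_median(age)  # NA replaced by the median of 20, 30 and 40
impute_mode(sex)    # NA replaced by the most frequent level
```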
Step 6: Remove Duplicate Rows
distinct() from dplyr (loaded with tidyverse) keeps only the first occurrence of each identical row, which keeps the data clean and consistent.
data <- distinct(data)
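It can help to count duplicates before dropping them. The sketch below uses base R's duplicated() on a made-up data frame; for exact duplicate rows it should select the same rows as distinct().

```r
# Toy data with one exact duplicate row (values are made up).
toy <- data.frame(
  Name = c("A", "B", "A", "C"),
  Fare = c(7.25, 71.28, 7.25, 8.05)
)

sum(duplicated(toy))                 # number of rows that repeat an earlier row
deduped <- toy[!duplicated(toy), ]   # keep the first occurrence of each row
nrow(deduped)
```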
Step 7: Feature Scaling
Feature scaling adjusts numerical variables to a comparable range so that no single feature dominates the model due to larger values. It improves the performance and convergence speed of many machine learning algorithms.
1. Standardization: Standardization transforms data so that it has a mean of 0 and a standard deviation of 1. This method is commonly used in algorithms that depend on distance calculations such as KNN and SVM.
data$Fare <- as.numeric(scale(data$Fare))
data$Age <- as.numeric(scale(data$Age))
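To see what scale() computes, the z-score can be reproduced by hand: subtract the mean and divide by the sample standard deviation. A short check on made-up values:

```r
x <- c(10, 20, 30, 40, 50)  # made-up values

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # the two approaches agree
mean(z_scale)                 # approximately 0
sd(z_scale)                   # approximately 1
```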
2. Normalization: Normalization (min-max scaling) rescales the data to a fixed range, usually [0, 1]. This is useful when an algorithm expects features within a specific range. Standardization and normalization are alternatives, so apply only one of them to a given column.
data$Fare <- (data$Fare - min(data$Fare)) /
  (max(data$Fare) - min(data$Fare))
data$Age <- (data$Age - min(data$Age)) /
  (max(data$Age) - min(data$Age))
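The min-max formula divides by zero when a column has a single distinct value. The sketch below wraps the formula in a function with a guard for that case. min_max is a hypothetical helper name, and mapping a constant column to 0 is an assumption made for this example.

```r
# Hypothetical helper: rescale a numeric vector to [0, 1].
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Constant column: avoid dividing by zero, map every non-missing value to 0.
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

min_max(c(5, 10, 15, 25))  # smallest value maps to 0, largest to 1
min_max(c(3, 3, 3))        # constant input maps to 0
```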
Step 8: Encode Categorical Variables
One-hot encoding converts categorical values into binary numeric columns.
encoded_data <- cbind(data, model.matrix(~ Sex - 1, data))
encoded_data$Sex <- NULL
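A small made-up example shows what model.matrix() produces. Note that by default it drops rows with missing values in the encoded column, so the cbind() step relies on missing values having been handled beforehand.

```r
toy <- data.frame(
  Age = c(22, 38, 26),
  Sex = factor(c("male", "female", "female"))
)

# "- 1" drops the intercept so every level gets its own indicator column.
dummies <- model.matrix(~ Sex - 1, data = toy)
colnames(dummies)  # one column per level, named after the variable and the level

# Combine the indicator columns with the remaining variables.
encoded <- cbind(toy["Age"], dummies)
encoded
```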
Step 9: Handling Outliers
Outliers are extreme values that can distort statistical analysis and model performance, so detecting and treating them improves reliability. Using the IQR method, we compute a lower bound (Q1 - 1.5 × IQR) and an upper bound (Q3 + 1.5 × IQR) and remove rows whose Fare falls outside this range. Box plots drawn before and after the filter show the effect.
boxplot(data$Fare, main = "Before Removing Outliers")
Q1 <- quantile(data$Fare, 0.25)
Q3 <- quantile(data$Fare, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
data <- data[data$Fare >= lower_bound & data$Fare <= upper_bound, ]
boxplot(data$Fare, main = "After Removing Outliers")
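The bound calculation can be reused across columns by wrapping it in a function. The sketch below also shows capping values at the bounds as an alternative to removing rows, which is a different technique from the one used in this step. iqr_bounds is a hypothetical helper name, and the fares are made up.

```r
# Hypothetical helper: lower and upper fences based on the interquartile range.
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

fare <- c(7, 8, 9, 10, 11, 12, 13, 200)  # made-up values with one extreme fare
bounds <- iqr_bounds(fare)

is_outlier <- fare < bounds["lower"] | fare > bounds["upper"]
fare[is_outlier]  # values flagged as outliers

# Alternative to dropping rows: cap values at the bounds.
fare_capped <- pmin(pmax(fare, bounds["lower"]), bounds["upper"])
```

Capping keeps every row, which may matter when the dataset is small; removal is simpler when the extreme values are likely errors.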
Step 10: Review the Cleaned Dataset
After completing preprocessing, it is important to verify the final structure and summary of the dataset. This ensures that all transformations, scaling and cleaning steps were applied correctly.
summary(data)
Step 11: Correlation Analysis and Data Splitting
Correlation analysis helps identify relationships between numerical variables, while splitting the dataset into training and testing sets prepares it for model building and evaluation.
cor(data[, sapply(data, is.numeric)])
set.seed(123)
train_index <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
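A quick sanity check after splitting is to confirm that the two sets have the expected sizes and share no rows. The sketch below does this on a made-up data frame with an id column added for the check.

```r
toy <- data.frame(id = 1:10, value = (1:10)^2)  # made-up data

set.seed(123)  # fixes the random draw so the split is reproducible
n_train <- floor(0.7 * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), size = n_train)

train_toy <- toy[train_index, ]
test_toy  <- toy[-train_index, ]

nrow(train_toy)                               # rows in the training set
nrow(test_toy)                                # rows in the test set
length(intersect(train_toy$id, test_toy$id))  # 0 means no row is in both sets
```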