Employee Attrition Prediction in R

Last Updated : 23 Jul, 2025

For large industrial companies with many employees, attrition is a significant concern because it affects productivity and financial health. Predicting which employees are likely to leave can help companies implement strategies to retain valuable talent and plan hiring. In this article, we will explore how to predict employee attrition using the R programming language.

Objective and Goals of Employee Attrition Prediction

The main objectives of this project are:

  • To understand the factors contributing to employee attrition.
  • To build predictive models that accurately identify employees at risk of leaving.
  • To propose retention strategies based on the insights gained from the models.

Dataset Explanation

This is the IBM HR Analytics Employee Attrition dataset, a fictional dataset created by IBM data scientists. It contains 1470 observations and 35 variables, of which “Attrition” is the dependent variable.

Dataset Link: Employee Attrition

The variables resemble what HR records and employee surveys would capture. Key features in the dataset include:

  • Personal Information: Age, Gender, Marital Status.
  • Job Information: Job Role, Department, Job Level, Job Satisfaction.
  • Compensation: Monthly Income, Stock Option Level.
  • Performance Metrics: Performance Rating, Training Times Last Year.
  • Work-Life Balance: Work Life Balance, OverTime.

Step 1: Loading and Inspecting the Data

We use the readr and dplyr packages to load and inspect the dataset. The caret package is loaded here as well because it is used later for splitting the data and evaluating the model.

R
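# Load necessary libraries
library(readr)   # read_csv()
library(dplyr)   # mutate(), filter()
library(caret)   # createDataPartition(), confusionMatrix()

# Load the dataset
data <- read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# read_csv() keeps text columns as character; convert them to factors,
# which randomForest() and confusionMatrix() need for Attrition
data <- data %>% mutate(across(where(is.character), as.factor))

# Inspect the data
head(data)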

Output:

  Age Attrition    BusinessTravel DailyRate             Department DistanceFromHome
1  41       Yes     Travel_Rarely      1102                  Sales                1
2  49        No Travel_Frequently       279 Research & Development                8
3  37       Yes     Travel_Rarely      1373 Research & Development                2
4  33        No Travel_Frequently      1392 Research & Development                3
5  27        No     Travel_Rarely       591 Research & Development                2
6  32        No Travel_Frequently      1005 Research & Development                2
  Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
1         2  Life Sciences             1              1                       2
2         1  Life Sciences             1              2                       3
3         2          Other             1              4                       4
4         4  Life Sciences             1              5                       4
5         1        Medical             1              7                       1
6         2  Life Sciences             1              8                       4
  Gender HourlyRate JobInvolvement JobLevel               JobRole JobSatisfaction
1 Female         94              3        2       Sales Executive               4
2   Male         61              2        2    Research Scientist               2
3   Male         92              2        1 Laboratory Technician               3
4 Female         56              3        1    Research Scientist               3
5   Male         40              3        1 Laboratory Technician               2
6   Male         79              3        1 Laboratory Technician               4
  MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime
1        Single          5993       19479                  8      Y      Yes
2       Married          5130       24907                  1      Y       No
3        Single          2090        2396                  6      Y      Yes
4       Married          2909       23159                  1      Y      Yes
5       Married          3468       16632                  9      Y       No
6        Single          3068       11864                  0      Y       No
  PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
1                11                 3                        1            80
2                23                 4                        4            80
3                15                 3                        2            80
4                11                 3                        3            80
5                12                 3                        4            80
6                13                 3                        3            80
  StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
1                0                 8                     0               1
2                1                10                     3               3
3                0                 7                     3               3
4                0                 8                     3               3
5                1                 6                     3               3
6                0                 8                     2               2
  YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
1              6                  4                       0                    5
2             10                  7                       1                    7
3              0                  0                       0                    0
4              8                  7                       3                    0
5              2                  2                       2                    2
6              7                  7                       3                    6

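Step 2: Detecting Missing Values and Outliers

We first check whether the dataset contains any missing values. sum(is.na(data)) returns the total number of missing values, so an output of 0 means there are none. We then look for outliers in MonthlyIncome using a boxplot and the 1.5 × IQR rule.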

R
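# Count missing values across the whole dataset
sum(is.na(data))

# Identify outliers in Monthly Income
boxplot(data$MonthlyIncome, main = "Boxplot of Monthly Income")

# Remove outliers (example using the IQR method). The result is stored in a
# separate object because the outputs later in this article total 1470 rows,
# i.e. they were produced from the unfiltered data
Q1 <- quantile(data$MonthlyIncome, 0.25)
Q3 <- quantile(data$MonthlyIncome, 0.75)
iqr <- Q3 - Q1   # named iqr to avoid confusion with the built-in IQR() function
data_no_outliers <- data %>% filter(MonthlyIncome >= (Q1 - 1.5 * iqr) &
                                    MonthlyIncome <= (Q3 + 1.5 * iqr))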

Output:

[1] 0
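
Both checks can be tried on a small example first. The sketch below is an illustration only: it uses a hypothetical toy data frame rather than the IBM file, relies on base R alone, and the income value 19000 is made up so that the rule has something to flag. It counts missing values per column and marks rows outside the 1.5 × IQR fences instead of deleting them.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Age           = c(41, 49, NA, 33, 27, 32, 45),
  MonthlyIncome = c(5993, 5130, 2090, 2909, NA, 3068, 19000)
)

# Total and per-column counts of missing values
total_missing  <- sum(is.na(toy))
missing_by_col <- colSums(is.na(toy))
print(total_missing)
print(missing_by_col)

# IQR fences for MonthlyIncome, ignoring missing values
income     <- toy$MonthlyIncome
q1         <- quantile(income, 0.25, na.rm = TRUE)
q3         <- quantile(income, 0.75, na.rm = TRUE)
iqr_income <- q3 - q1
lower      <- q1 - 1.5 * iqr_income
upper      <- q3 + 1.5 * iqr_income

# Flag (rather than drop) rows whose income lies outside the fences
toy$IncomeOutlier <- !is.na(income) & (income < lower | income > upper)
print(toy[toy$IncomeOutlier, ])
```

Flagging keeps every row available, which makes it easy to compare model results with and without the flagged observations before deciding to remove anything.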

Step 3: Exploratory Data Analysis (EDA)

Exploring the data visually before modeling is an important step: it helps the audience follow the analysis and shows what the predictions are based on.

Visualizing the trends using EDA

Visualizing the data before feeding it into a model is crucial for several reasons:

  • Understanding Data Distribution: Helps identify the distribution of different variables and the target variable. This can reveal class imbalances, outliers, and skewness.
  • Identifying Relationships: Helps understand relationships between features and the target variable, which can inform feature selection and engineering.
  • Detecting Anomalies: Helps spot anomalies and outliers that might affect model performance.

R
# Load ggplot2 for plotting
library(ggplot2)

# Plot the distribution of attrition
ggplot(data, aes(x = Attrition)) + 
  geom_bar(fill = "blue") +
  labs(title = "Attrition Distribution", x = "Attrition", y = "Count")

Output:

[Figure: bar chart "Attrition Distribution" (x-axis: Attrition, y-axis: Count)]
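
The class imbalance visible in the bar chart can also be expressed as numbers. The sketch below assumes the class counts implied by the confusion matrices in Step 5 (864 + 369 "No" and 166 + 71 "Yes"); with the real data the same table would come from table(data$Attrition). It also computes the accuracy of a baseline that always predicts the majority class, which is a useful yardstick for any model.

```r
# Class counts implied by the Step 5 confusion matrices (assumption stated above)
attrition_counts <- c(No = 1233, Yes = 237)

# Share of each class in the whole dataset
attrition_share <- attrition_counts / sum(attrition_counts)
print(round(attrition_share, 3))

# Accuracy of a naive baseline that always predicts the majority class
baseline_accuracy <- max(attrition_share)
print(round(baseline_accuracy, 3))
```

Because roughly five out of six employees stay, accuracy alone can look high even when a model rarely identifies the employees who leave.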

Plot the relationship between job satisfaction and attrition

Next, we plot the relationship between job satisfaction and attrition.

R
# Plot the relationship between job satisfaction and attrition
ggplot(data, aes(x = JobSatisfaction, fill = Attrition)) + 
  geom_bar(position = "dodge") +
  labs(title = "Job Satisfaction vs Attrition", x = "Job Satisfaction", y = "Count")

Output:

[Figure: grouped bar chart "Job Satisfaction vs Attrition" (x-axis: Job Satisfaction, y-axis: Count, bars filled by Attrition)]

Plot the relationship between monthly income and attrition

Next, we plot the relationship between monthly income and attrition.

R
# Plot the relationship between monthly income and attrition
ggplot(data, aes(x = Attrition, y = MonthlyIncome)) + 
  geom_boxplot() +
  labs(title = "Monthly Income vs Attrition", x = "Attrition", y = "Monthly Income")

Output:

[Figure: boxplot "Monthly Income vs Attrition" (x-axis: Attrition, y-axis: Monthly Income)]

Plot the relationship between age and attrition

Finally, we plot the relationship between age and attrition.

R
# Plot the relationship between age and attrition
ggplot(data, aes(x = Age, fill = Attrition)) + 
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Age vs Attrition", x = "Age", y = "Count")

Output:

[Figure: histogram "Age vs Attrition" (x-axis: Age, y-axis: Count, bars filled by Attrition)]

Step 4: Splitting the dataset

We build the model on the training set and then test its accuracy on the testing set.

R
# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(data$Attrition, p = .7, 
                                  list = FALSE, 
                                  times = 1)
trainData <- data[ trainIndex,]
testData  <- data[-trainIndex,]

We have successfully split the dataset into two parts. Judging by the model outputs in Step 5, the training set has 1030 rows and the testing set has 440 rows (1470 in total).
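
Passing the Attrition column to createDataPartition() makes the split stratified, so both parts keep roughly the same share of "Yes" and "No". The base-R sketch below illustrates that idea on a hypothetical label vector with the class counts implied by the Step 5 outputs. It is not caret's implementation; sampling 70% within each class and rounding up with ceiling() are assumptions chosen for the illustration.

```r
set.seed(123)

# Hypothetical label vector with the class counts implied by the Step 5 outputs
attrition <- factor(rep(c("No", "Yes"), times = c(1233, 237)))

# Row indices grouped by class
idx_by_class <- split(seq_along(attrition), attrition)

# Sample 70% of the rows within each class (rounded up)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = ceiling(0.7 * length(idx)))
}), use.names = FALSE)

train_labels <- attrition[train_idx]
test_labels  <- attrition[-train_idx]

# Class shares are nearly identical in both parts
print(table(train_labels))
print(table(test_labels))
print(round(prop.table(table(train_labels)), 3))
print(round(prop.table(table(test_labels)), 3))
```

A purely random 70/30 split could, by chance, leave the smaller "Yes" class under-represented in one of the two parts; stratifying avoids that.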

Step 5: Model Building

Now we will build a model using Random Forest, which constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Compared with a single decision tree, it generally improves predictive accuracy and reduces overfitting.
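
To make "the mode of the classes" concrete, the sketch below combines the votes of five hypothetical trees for three hypothetical employees. The vote matrix is invented for illustration; a real forest produces these votes internally from trees grown on bootstrap samples.

```r
# Votes of five hypothetical trees (columns) for three employees (rows)
tree_votes <- matrix(
  c("No",  "No",  "Yes", "No",  "Yes",
    "Yes", "Yes", "No",  "Yes", "Yes",
    "No",  "No",  "No",  "No",  "Yes"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("employee_", 1:3), paste0("tree_", 1:5))
)

# The forest's prediction is the most frequent class among the trees
majority_vote <- function(votes) names(which.max(table(votes)))
forest_pred <- apply(tree_votes, 1, majority_vote)
print(forest_pred)
```

Averaging over many trees is what smooths out the errors of any individual tree.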

R
library(randomForest)

# Create a new feature for total satisfaction
data$TotalSatisfaction <- data$JobSatisfaction + data$EnvironmentSatisfaction + 
                                                 data$RelationshipSatisfaction

# Create a new feature for years at company divided by age
data$YearsAtCompanyByAge <- data$YearsAtCompany / data$Age
# Note: both features are added to `data` after the split, so trainData and
# testData do not contain them; create them before Step 4 to use them in the model

# Random Forest (Attrition and the other text columns must be factors)
rf_model <- randomForest(Attrition ~ ., data = trainData, importance = TRUE)
print(rf_model)

# Predict on test data
rf_pred <- predict(rf_model, testData)
confusionMatrix(rf_pred, testData$Attrition)

Output:

Call:
 randomForest(formula = Attrition ~ ., data = trainData, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of error rate: 13.88%
Confusion matrix:
     No Yes class.error
No  857   7 0.008101852
Yes 136  30 0.819277108

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  366  59
       Yes   3  12

               Accuracy : 0.8591
                 95% CI : (0.823, 0.8902)
    No Information Rate : 0.8386
    P-Value [Acc > NIR] : 0.1345

                  Kappa : 0.2361

 Mcnemar's Test P-Value : 2.848e-12

            Sensitivity : 0.9919
            Specificity : 0.1690
         Pos Pred Value : 0.8612
         Neg Pred Value : 0.8000
             Prevalence : 0.8386
         Detection Rate : 0.8318
   Detection Prevalence : 0.9659
      Balanced Accuracy : 0.5804

       'Positive' Class : No
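
The headline statistics can be recomputed by hand from the test-set confusion matrix, which helps when reading the report. The sketch below copies the four counts from the output above and uses base R only; note that the positive class is "No", so sensitivity describes employees who stay and specificity describes employees who leave.

```r
# Test-set confusion matrix copied from the output above
cm <- matrix(
  c(366, 3, 59, 12), nrow = 2,
  dimnames = list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
)

accuracy    <- sum(diag(cm)) / sum(cm)               # correct / all
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # stayers correctly predicted
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # leavers correctly predicted
balanced    <- (sensitivity + specificity) / 2

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, BalancedAccuracy = balanced), 4))
```

Only 12 of the 71 employees who actually left were identified, and the accuracy of 0.8591 is not significantly above the no-information rate of 0.8386 (p = 0.1345). For a retention use case, the low specificity is therefore the number to watch.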

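OverTime and MonthlyIncome are the most important factors influencing attrition. Because the model was fit with importance = TRUE, the per-variable scores can be inspected with importance(rf_model) or plotted with varImpPlot(rf_model).

  • Employees who have been with the company longer, or who have more total working years, are less likely to leave.
  • Business travel and job role show patterns similar to those found with logistic regression.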
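
Logistic regression is referred to in this article, but no code for it is shown. As a rough sketch of how such a comparison model could be fit, the example below uses base R's glm() on simulated toy data. The data, the coefficients used to simulate it, and the 0.5 cut-off are all assumptions made for illustration; none of the numbers are results from the IBM dataset.

```r
set.seed(42)
n <- 500

# Simulated toy data (hypothetical, for illustration only)
toy <- data.frame(
  OverTime      = factor(sample(c("No", "Yes"), n, replace = TRUE,
                                prob = c(0.7, 0.3))),
  MonthlyIncome = round(runif(n, min = 1000, max = 20000))
)

# Simulate attrition: more likely with overtime, less likely with higher income
log_odds <- -1.5 + 1.2 * (toy$OverTime == "Yes") - 0.0001 * toy$MonthlyIncome
toy$Attrition <- factor(ifelse(runif(n) < plogis(log_odds), "Yes", "No"),
                        levels = c("No", "Yes"))

# Logistic regression: models the probability of the second level ("Yes")
logit_model <- glm(Attrition ~ OverTime + MonthlyIncome,
                   data = toy, family = binomial)
print(summary(logit_model)$coefficients)

# Predicted probabilities and class labels at an assumed 0.5 cut-off
prob_yes   <- predict(logit_model, newdata = toy, type = "response")
logit_pred <- factor(ifelse(prob_yes > 0.5, "Yes", "No"),
                     levels = c("No", "Yes"))
print(table(Prediction = logit_pred, Reference = toy$Attrition))
```

With the real data, the same call would use trainData for fitting and testData for prediction, and the signs of the coefficients could then be compared with the random forest's importance ranking.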

Conclusion

Based on the above analysis and findings, Logistic Regression and Random Forest models provide valuable insights into the factors driving employee attrition. By understanding these factors, organizations can implement targeted strategies to retain their top talent, thereby enhancing productivity and reducing the costs associated with employee turnover.
