Employee Attrition Prediction in R

From the perspective of Industrial Companies that run with huge employees, employee attrition is a significant concern for many organizations as it affects productivity and financial health. Predicting which employees are likely to leave can help companies implement strategies to retain and hire valuable talent. In this article, we will explore how to predict employee attrition using the R Programming Language.

Objective and Goals of Employee Attrition Prediction

The main objectives of this project are:

To understand the factors contributing to employee attrition.
To build predictive models that accurately identify employees at risk of leaving.
To propose retention strategies based on the insights gained from the models.

Dataset Explanation

This data set is collected from the IBM Human Resources department. The dataset contains 1470 observations and 35 variables. Within 35 variables “Attrition” is the dependent variable in the dataset.

Dataset Link: Employee Attrition

The dataset was collected through HR records and employee surveys. Key features inside the dataset includes:

Personal Information: Age, Gender, Marital Status.
Job Information: Job Role, Department, Job Level, Job Satisfaction.
Compensation: Monthly Income, Stock Option Level.
Performance Metrics: Performance Rating, Training Times Last Year.
Work-Life Balance: Work Life Balance, Overtime

Step 1: Loading and Inspecting the Data

We use the readr and the dplyr packages in R for loading and inspecting the dataset.

# Load necessary libraries
library(readr)
library(dplyr)
library(caret)

# Load the dataset
data <- read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Inspect the data
head(data)

Output:

  Age Attrition    BusinessTravel DailyRate             Department DistanceFromHome
1  41       Yes     Travel_Rarely      1102                  Sales                1
2  49        No Travel_Frequently       279 Research & Development                8
3  37       Yes     Travel_Rarely      1373 Research & Development                2
4  33        No Travel_Frequently      1392 Research & Development                3
5  27        No     Travel_Rarely       591 Research & Development                2
6  32        No Travel_Frequently      1005 Research & Development                2
  Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
1         2  Life Sciences             1              1                       2
2         1  Life Sciences             1              2                       3
3         2          Other             1              4                       4
4         4  Life Sciences             1              5                       4
5         1        Medical             1              7                       1
6         2  Life Sciences             1              8                       4
  Gender HourlyRate JobInvolvement JobLevel               JobRole JobSatisfaction
1 Female         94              3        2       Sales Executive               4
2   Male         61              2        2    Research Scientist               2
3   Male         92              2        1 Laboratory Technician               3
4 Female         56              3        1    Research Scientist               3
5   Male         40              3        1 Laboratory Technician               2
6   Male         79              3        1 Laboratory Technician               4
  MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime
1        Single          5993       19479                  8      Y      Yes
2       Married          5130       24907                  1      Y       No
3        Single          2090        2396                  6      Y      Yes
4       Married          2909       23159                  1      Y      Yes
5       Married          3468       16632                  9      Y       No
6        Single          3068       11864                  0      Y       No
  PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
1                11                 3                        1            80
2                23                 4                        4            80
3                15                 3                        2            80
4                11                 3                        3            80
5                12                 3                        4            80
6                13                 3                        3            80
  StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
1                0                 8                     0               1
2                1                10                     3               3
3                0                 7                     3               3
4                0                 8                     3               3
5                1                 6                     3               3
6                0                 8                     2               2
  YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
1              6                  4                       0                    5
2             10                  7                       1                    7
3              0                  0                       0                    0
4              8                  7                       3                    0
5              2                  2                       2                    2
6              7                  7                       3                    6

Step 2: Detect the missing values and Outliers

We have to check if there are any missing values in the dataset. In case of any missing values found, the function below returns 1 in the output.

sum(is.na(data))

# Identify outliers in Monthly Income
boxplot(data$MonthlyIncome, main = "Boxplot of Monthly Income")
# Remove outliers (example using IQR method)
Q1 <- quantile(data$MonthlyIncome, 0.25)
Q3 <- quantile(data$MonthlyIncome, 0.75)
IQR <- Q3 - Q1
data <- data %>% filter(MonthlyIncome >= (Q1 - 1.5 * IQR) & 
                        MonthlyIncome <= (Q3 + 1.5 * IQR)

Output:

[1] 0

Step 3: Exploratory Data Analysis (EDA)

Visualizing the data before feeding it into a model is a most important step in making the target audience understand you analysis and prediction and upon what base does your prediction stand by.

Visualizing the trends using EDA

Visualizing the data before feeding it into a model is crucial for several reasons:

Understanding Data Distribution: Helps identify the distribution of different variables and the target variable. This can reveal class imbalances, outliers, and skewness.
Identifying Relationships: Helps understand relationships between features and the target variable, which can inform feature selection and engineering.
Detecting Anomalies: Helps spot anomalies and outliers that might affect model performance.

# Plot the distribution of attrition
ggplot(data, aes(x = Attrition)) + 
  geom_bar(fill = "blue") +
  labs(title = "Attrition Distribution", x = "Attrition", y = "Count")

Output:

Plot the relationship between job satisfaction and attrition

Now we will Plot the relationship between job satisfaction and attrition.

# Plot the relationship between job satisfaction and attrition
ggplot(data, aes(x = JobSatisfaction, fill = Attrition)) + 
  geom_bar(position = "dodge") +
  labs(title = "Job Satisfaction vs Attrition", x = "Job Satisfaction", y = "Count")

Output:

sc2 — Employee Attrition Prediction in R

Plot the relationship between monthly income and attrition

Now we will Plot the relationship between monthly income and attrition.

# Plot the relationship between monthly income and attrition
ggplot(data, aes(x = Attrition, y = MonthlyIncome)) + 
  geom_boxplot() +
  labs(title = "Monthly Income vs Attrition", x = "Attrition", y = "Monthly Income")

Output:

Plot the relationship between age and attrition

Now we will Plot the relationship between age and attrition.

# Plot the relationship between age and attrition
ggplot(data, aes(x = Age, fill = Attrition)) + 
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Age vs Attrition", x = "Age", y = "Count")

Output:

Step 4: Splitting the dataset

With the help of the Training data set we will build up our model and test its accuracy using the Testing Data set.

# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(data$Attrition, p = .7, 
                                  list = FALSE, 
                                  times = 1)
trainData <- data[ trainIndex,]
testData  <- data[-trainIndex,]

We have successfully split the whole data set into two parts. Now we have 1025 Training data & 445 Testing data.

Step 5: Model Building

Now we will build model using Random Forest that constructs multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. It improves predictive accuracy and controls overfitting.

# Create a new feature for total satisfaction
data$TotalSatisfaction <- data$JobSatisfaction + data$EnvironmentSatisfaction + 
                                                 data$RelationshipSatisfaction

# Create a new feature for years at company divided by age
data$YearsAtCompanyByAge <- data$YearsAtCompany / data$Age

# Random Forest
rf_model <- randomForest(Attrition ~ ., data = trainData, importance = TRUE)
print(rf_model)
# Predict on test data
rf_pred <- predict(rf_model, testData)
confusionMatrix(rf_pred, testData$Attrition)

Output:

Call:
 randomForest(formula = Attrition ~ ., data = trainData, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 13.88%
Confusion matrix:
     No Yes class.error
No  857   7 0.008101852
Yes 136  30 0.819277108

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  366  59
       Yes   3  12
                                         
               Accuracy : 0.8591         
                 95% CI : (0.823, 0.8902)
    No Information Rate : 0.8386         
    P-Value [Acc > NIR] : 0.1345         
                                         
                  Kappa : 0.2361         
                                         
 Mcnemar's Test P-Value : 2.848e-12      
                                         
            Sensitivity : 0.9919         
            Specificity : 0.1690         
         Pos Pred Value : 0.8612         
         Neg Pred Value : 0.8000         
             Prevalence : 0.8386         
         Detection Rate : 0.8318         
   Detection Prevalence : 0.9659         
      Balanced Accuracy : 0.5804         
                                         
       'Positive' Class : No

OverTime and MonthlyIncome are the most important factors influencing attrition.

Employees who have been with the company for a longer time or have more total working years are less likely to leave.
Similar patterns regarding travel and job roles as logistic regression.

Conclusion

Based on the above analysis, prediction and findings, Logistic Regression and Random Forest models provide valuable insights into the factors influencing employee attrition.Predicting employee attrition using machine learning models in R provides a valuable insights into the factors driving turnover(Employee Attrition). By understanding these factors, organizations can implement targeted strategies to retain their top talent, thereby enhancing productivity and reducing the costs associated with employee turnover.

Employee Attrition Prediction in R

Objective and Goals of Employee Attrition Prediction

Dataset Explanation

Step 1: Loading and Inspecting the Data

Step 2: Detect the missing values and Outliers

Step 3: Exploratory Data Analysis (EDA)

Visualizing the trends using EDA

Plot the relationship between job satisfaction and attrition

Plot the relationship between monthly income and attrition

Plot the relationship between age and attrition

Step 4: Splitting the dataset

Step 5: Model Building

Conclusion

Explore