Evaluation metrics are essential for assessing the performance of classification models in R. After building a model, it is important to examine how correctly it predicts class labels and how well it performs on unseen data. These metrics provide a clear understanding of model behavior and support informed decision-making.
- Measure how accurately a classification model predicts target classes.
- Measure the balance between correct positive predictions and missed detections through precision and recall.
- Help compare multiple models to choose the best-performing approach.

Confusion Matrix–Based Metrics
Confusion matrix–based metrics evaluate classification performance by analyzing the counts of correct and incorrect predictions for each class.
1. Confusion Matrix
A confusion matrix is a table used to evaluate a classification model by comparing actual class labels with predicted class labels. It provides a clear summary of how many predictions are correct and how many are misclassified.
- Displays counts of True Positive, True Negative, False Positive and False Negative predictions.
- Helps identify different types of classification errors made by the model.

- True Positive (TP): Model correctly predicts a positive class.
- True Negative (TN): Model correctly predicts a negative class.
- False Positive (FP): Model predicts positive when the actual class is negative (Type I error).
- False Negative (FN): Model predicts negative when the actual class is positive (Type II error).
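The four counts above can be read directly off a cross-tabulation of predicted and actual labels. The following is a minimal sketch using base R's table(); the label vectors and the choice of "yes" as the positive class are made up for illustration.

```r
# Toy labels (hypothetical data); "yes" is treated as the positive class
actual    <- factor(c("yes", "yes", "yes", "no", "no",  "no", "no", "yes"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "no",  "yes", "no", "yes", "no", "no", "yes"),
                    levels = c("yes", "no"))

# Rows hold the predictions, columns hold the actual labels
cm_toy <- table(Predicted = predicted, Actual = actual)
print(cm_toy)

# Pick the four cells by name: first index = predicted, second = actual
TP <- cm_toy["yes", "yes"]  # predicted yes, actually yes
TN <- cm_toy["no",  "no"]   # predicted no,  actually no
FP <- cm_toy["yes", "no"]   # predicted yes, actually no  (Type I error)
FN <- cm_toy["no",  "yes"]  # predicted no,  actually yes (Type II error)
```

Keeping predictions in rows and actual labels in columns mirrors the layout caret uses, so the same indexing carries over to larger examples.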
2. Accuracy
Accuracy measures the proportion of correctly classified observations out of all observations. It gives an overall summary of model performance but can be misleading when the classes are imbalanced.
- Suitable when the dataset is balanced and all classes are equally important.
- Simple to understand and provides a quick overall performance measure.
Accuracy=\frac{TP+TN}{TP+TN+FP+FN}
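To see why accuracy can mislead on imbalanced data, consider a hypothetical dataset with 95 negatives and 5 positives and a trivial "model" that always predicts the negative class. The numbers below are assumptions chosen only to illustrate the formula.

```r
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
actual <- c(rep(0, 95), rep(1, 5))
# A trivial classifier that always predicts the negative class
predicted <- rep(0, 100)

TP <- sum(predicted == 1 & actual == 1)
TN <- sum(predicted == 0 & actual == 0)
FP <- sum(predicted == 1 & actual == 0)
FN <- sum(predicted == 0 & actual == 1)

accuracy <- (TP + TN) / (TP + TN + FP + FN)
cat("Accuracy:", accuracy, "\n")
cat("Positives detected:", TP, "of", TP + FN, "\n")
```

The accuracy comes out at 0.95 even though not a single positive case is detected, which is why the metrics in the following sections are usually reported alongside it.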
3. Precision
Precision measures the proportion of correctly predicted positive observations out of all predicted positive observations. It focuses on prediction quality for the positive class.
- Important when False Positives are costly (e.g., spam detection).
- Helps evaluate how reliable positive predictions are.
Precision=\frac{TP}{TP+FP}
4. Recall (Sensitivity)
Recall, also called Sensitivity, measures the proportion of actual positive cases correctly identified by the model. It evaluates the model’s ability to detect positive instances.
- Important when missing positive cases is costly (e.g., disease detection).
- A high Recall means few positive cases are overlooked.
Recall=\frac{TP}{TP+FN}
5. F1 Score
F1 Score is the harmonic mean of Precision and Recall. It provides a balanced measure when both False Positives and False Negatives are important.
- Suitable for imbalanced datasets where both Precision and Recall matter.
- Stays low unless both Precision and Recall are high, because the harmonic mean is pulled toward the smaller of the two values.
F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
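The Precision, Recall and F1 formulas can be checked by hand from confusion matrix counts. This sketch uses hypothetical counts (TP = 40, FP = 10, FN = 20) and compares F1 with the plain arithmetic mean to show how the harmonic mean leans toward the weaker of the two values.

```r
# Hypothetical counts for the positive class
TP <- 40
FP <- 10
FN <- 20

precision <- TP / (TP + FP)   # share of predicted positives that are correct
recall    <- TP / (TP + FN)   # share of actual positives that are found
f1        <- 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_counts <- 2 * TP / (2 * TP + FP + FN)

arithmetic_mean <- (precision + recall) / 2

cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1:", f1, "\n")
cat("Arithmetic mean:", arithmetic_mean, "\n")
```

F1 never exceeds the arithmetic mean, and the gap widens as Precision and Recall drift apart.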
6. Specificity
Specificity measures the proportion of actual negative cases correctly identified by the model. It reflects the model’s ability to correctly reject negative instances.
- Important when correctly identifying negative cases is critical.
- High Specificity corresponds to a low False Positive Rate, since FPR = 1 - Specificity.
Specificity=\frac{TN}{TN+FP}
7. Kappa Score (Cohen’s Kappa)
Kappa measures the agreement between predicted and actual class labels while adjusting for agreement that could occur by chance. It gives a more dependable assessment of model performance than accuracy alone, especially when class distributions are uneven.
- Recommended for imbalanced datasets or when accuracy may give misleading results.
- Accounts for chance agreement, offering a more realistic evaluation of classification performance.
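\kappa = \frac{P_o - P_e}{1 - P_e}
where
- P_o: Observed accuracy (proportion of predictions that match the actual labels)
- P_e: Accuracy expected by chance, given the marginal totals of the confusion matrix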
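As a worked example of the Kappa formula, the sketch below computes the observed and chance-expected accuracy from a hypothetical 2x2 confusion matrix in base R. The counts are made up; the chance term is the sum, over classes, of the product of the predicted and actual class proportions.

```r
# Hypothetical 2x2 confusion matrix: rows = predicted, columns = actual
# (matrix() fills column by column)
cm_toy <- matrix(c(20, 5,     # actual pos: 20 predicted pos, 5 predicted neg
                   10, 65),   # actual neg: 10 predicted pos, 65 predicted neg
                 nrow = 2,
                 dimnames = list(Predicted = c("pos", "neg"),
                                 Actual    = c("pos", "neg")))

n   <- sum(cm_toy)
p_o <- sum(diag(cm_toy)) / n                        # observed accuracy
p_e <- sum(rowSums(cm_toy) * colSums(cm_toy)) / n^2 # chance agreement

kappa_score <- (p_o - p_e) / (1 - p_e)

cat("Observed accuracy:", p_o, "\n")
cat("Expected accuracy:", p_e, "\n")
cat("Kappa:", kappa_score, "\n")
```

Here the observed accuracy of 0.85 shrinks to a Kappa of 0.625 once the 0.60 agreement expected by chance is removed.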
Probability-Based Metrics
Probability-based metrics evaluate classification models using predicted probabilities rather than only final class labels. These metrics provide deeper insight into model confidence, ranking ability and overall predictive quality.
1. ROC Curve (Receiver Operating Characteristic Curve)
The ROC Curve is a graphical representation that shows the trade-off between True Positive Rate (Recall) and False Positive Rate at different classification thresholds. In R, it is commonly used to evaluate how well a model distinguishes between classes.
- Suitable for binary classification problems where class separation ability is important.
- Evaluates model performance across all threshold values.
TPR = \frac{TP}{TP + FN}
FPR = \frac{FP}{FP + TN}
where
- TPR (True Positive Rate / Recall): Proportion of actual positives correctly identified.
- FPR (False Positive Rate): Proportion of actual negatives incorrectly classified as positive.
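Each point on a ROC curve is one (FPR, TPR) pair obtained at a single threshold. The sketch below computes three such points in base R; the labels, probabilities, thresholds and the helper name rates_at are all hypothetical.

```r
# Made-up labels (1 = positive) and predicted probabilities of the positive class
actual <- c(1, 1, 1, 0, 1, 0, 0, 0)
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# TPR and FPR when every observation with prob >= th is predicted positive
rates_at <- function(th) {
  pred <- as.integer(prob >= th)
  TP <- sum(pred == 1 & actual == 1)
  FN <- sum(pred == 0 & actual == 1)
  FP <- sum(pred == 1 & actual == 0)
  TN <- sum(pred == 0 & actual == 0)
  c(threshold = th, TPR = TP / (TP + FN), FPR = FP / (FP + TN))
}

# One row per threshold
roc_points <- t(sapply(c(0.25, 0.5, 0.75), rates_at))
print(roc_points)
```

Raising the threshold lowers both rates at once, which is the trade-off the curve visualizes.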
2. AUC (Area Under the ROC Curve)
AUC measures the total area under the ROC curve and represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. It summarizes the model’s discrimination ability into a single value.
- Useful for comparing classification models with a single, threshold-independent score.
- Ranges from 0 to 1: a value of 0.5 corresponds to random ranking and 1 to perfect separation of the classes.

AUC is calculated as the integral of the ROC curve:
AUC = \int_{0}^{1} TPR(FPR)\, d(FPR)
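The ranking interpretation of AUC can be verified without drawing the curve: compare every positive score with every negative score and count how often the positive one is higher. This is a sketch on made-up data; counting ties as one half is a common convention and is stated here as an assumption.

```r
# Made-up labels (1 = positive) and predicted probabilities of the positive class
actual <- c(1, 1, 1, 0, 1, 0, 0, 0)
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

pos <- prob[actual == 1]
neg <- prob[actual == 0]

# All positive-vs-negative pairs: 1 if the positive ranks higher, 0.5 for a tie
pair_scores <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")

# AUC = share of pairs in which the positive is ranked above the negative
auc_pairwise <- mean(pair_scores)
cat("Pairwise AUC:", auc_pairwise, "\n")
```

Of the 16 pairs, 15 are ordered correctly, so the value is 15/16. The pairwise approach costs one comparison per pair, so it is meant for understanding rather than for large datasets.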
3. Log Loss (Cross-Entropy Loss)
Log Loss measures the performance of a classification model by evaluating the predicted probability values against actual class labels. It penalizes confident but incorrect predictions more heavily.
- Suitable when probability estimates need to be accurate and well-calibrated.
- Considers prediction confidence rather than just final class labels.
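\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]
where
- N: Number of observations
- y_i: Actual label of observation i (1 for positive, 0 for negative)
- p_i: Predicted probability that observation i is positive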
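The binary Log Loss formula translates into a few lines of base R. In this sketch the function name log_loss, the example vectors and the clipping constant eps are assumptions; clipping is not part of the formula and is added only so that a probability of exactly 0 or 1 does not produce an infinite loss.

```r
# Binary log loss; probabilities are clipped to [eps, 1 - eps] (an assumption)
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)

# Reasonably confident and correct on every observation
p_good <- c(0.9, 0.1, 0.8, 0.2)
# Same, except the last observation is predicted positive with 0.99 and is wrong
p_bad  <- c(0.9, 0.1, 0.8, 0.99)

loss_good <- log_loss(y, p_good)
loss_bad  <- log_loss(y, p_bad)

cat("Log loss (good):", loss_good, "\n")
cat("Log loss (one confident mistake):", loss_bad, "\n")
```

A single confident mistake raises the loss sharply, even though three of the four predictions are unchanged.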
Multi-Class Classification Metrics
In multi-class classification, evaluation metrics extend binary classification measures to handle multiple classes. Averaging schemes such as macro, micro and weighted averaging summarize per-class performance into a single score.
1. Macro-Averaged F1-Score
Macro F1-Score calculates the F1-Score for each class independently and then takes the unweighted average. It treats all classes equally, regardless of their frequency.
F1_{\text{macro}} = \frac{1}{C} \sum_{i=1}^{C} F1_i
2. Micro-Averaged F1-Score
Micro F1-Score aggregates the contributions of all classes to compute the average metric globally. It counts total true positives, false positives and false negatives across classes.
F1_{\text{micro}} = \frac{2 \sum_{i} TP_i}{2 \sum_{i} TP_i + \sum_{i} FP_i + \sum_{i} FN_i}
3. Weighted F1-Score
Weighted F1-Score calculates the F1-Score for each class and takes the average weighted by the number of true instances in each class.
F1_{\text{weighted}} = \frac{\sum_{i=1}^{C} n_i F1_i}{\sum_{i=1}^{C} n_i}
where n_i is the number of true instances in class i.
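The three averaging schemes differ only in how the per-class results are combined, which is easiest to see on an imbalanced example. The sketch below uses a hypothetical three-class confusion matrix with predictions in rows and actual labels in columns.

```r
# Hypothetical confusion matrix: rows = predicted, columns = actual
# (matrix() fills column by column, so each line below is one actual class)
cm_toy <- matrix(c(50,  2, 0,    # actual a
                    3, 20, 2,    # actual b
                    0,  1, 4),   # actual c
                 nrow = 3,
                 dimnames = list(Predicted = c("a", "b", "c"),
                                 Actual    = c("a", "b", "c")))

TP <- diag(cm_toy)
FP <- rowSums(cm_toy) - TP   # predicted as the class but actually another
FN <- colSums(cm_toy) - TP   # actually the class but predicted as another

f1_per_class <- 2 * TP / (2 * TP + FP + FN)

macro_f1    <- mean(f1_per_class)
micro_f1    <- 2 * sum(TP) / (2 * sum(TP) + sum(FP) + sum(FN))
support     <- colSums(cm_toy)   # number of true instances per class
weighted_f1 <- sum(support * f1_per_class) / sum(support)

print(round(f1_per_class, 4))
cat("Macro F1:", macro_f1, "\n")
cat("Micro F1:", micro_f1, "\n")
cat("Weighted F1:", weighted_f1, "\n")
```

The macro average is pulled down by the rare class c, while the weighted average stays close to the score of the dominant class a. When every observation has exactly one label, the micro F1 equals the overall accuracy, because each misclassification counts once as a False Positive and once as a False Negative.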
Imbalanced Classification Metrics
In imbalanced datasets, accuracy can be misleading, as models may favor the majority class. Metrics designed for imbalanced data assess performance on each class separately, so weak predictions for the minority class are not hidden by strong predictions for the majority class.
1. G-Mean (Geometric Mean)
G-Mean measures the balance between classification performance on positive and negative classes. It evaluates how well a model performs across both majority and minority classes.
\text{G-Mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}
2. Lift
Lift measures how much better a model performs compared to random selection. It is used in marketing and risk modeling to evaluate targeting effectiveness.
\text{Lift} = \frac{\text{Precision}}{\text{Baseline Positive Rate}} = \frac{TP / (TP + FP)}{\text{Total Positives} / \text{Total Observations}}
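Both metrics follow directly from the four confusion matrix counts. The sketch below uses hypothetical counts for a rare positive class (50 positives in 1000 observations) and computes Lift as the positive rate among predicted positives divided by the positive rate in the whole dataset.

```r
# Hypothetical counts for a rare positive class
TP <- 30
FN <- 20
FP <- 70
TN <- 880
n  <- TP + FN + FP + TN

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
g_mean      <- sqrt(sensitivity * specificity)

# Positive rate among the cases the model flags, versus the overall positive rate
precision <- TP / (TP + FP)
baseline  <- (TP + FN) / n
lift      <- precision / baseline

cat("G-Mean:", g_mean, "\n")
cat("Lift:", lift, "\n")
```

A Lift of 6 means the cases flagged by the model are six times as likely to be positive as a case picked at random.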
Step By Step Implementation
Step 1: Install and Load Required Libraries
Install and load the R packages needed for model training, evaluation and probability-based metrics.
install.packages("caret")
install.packages("randomForest")
install.packages("pROC")
install.packages("MLmetrics")
library(caret)
library(randomForest)
library(pROC)
library(MLmetrics)
Step 2: Load Dataset
Load the built-in Iris dataset and check its structure and summary statistics.
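data(iris)
str(iris)      # structure: column types and first values
summary(iris)  # summary statistics for each column
plot(iris)     # scatterplot matrix of all variable pairs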
Step 3: Split Dataset into Training and Testing Sets
We split the dataset into 80% training and 20% testing to evaluate model performance on unseen data.
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
Step 4: Train Random Forest Classifier
Here we train a Random Forest model to predict the species using all other features.
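# Fit a random forest through caret's train() with its default resampling
model <- train(Species ~ ., data = trainData, method = "rf")
# Predicted class labels for the test set
predictions <- predict(model, newdata = testData)
# Predicted class probabilities, one column per species
pred_probs <- predict(model, newdata = testData, type = "prob")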
Step 5: Compute Confusion Matrix and Basic Metrics
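confusionMatrix() cross-tabulates the predicted species against the actual species and reports the overall Accuracy and Kappa. Its byClass element holds the one-vs-all metrics for each species, including Sensitivity (Recall), Specificity, Precision and F1.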
cm <- confusionMatrix(predictions, testData$Species)
print(cm)
class_metrics <- cm$byClass
print(class_metrics)
Step 6: Compute Multi-Class F1 Scores
We calculate Macro, Micro and Weighted F1-Scores for multi-class evaluation.
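# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 <- mean(class_metrics[, "F1"])
cat("Macro F1-Score:", macro_f1, "\n")
# Micro F1: pool TP, FP and FN over all classes
TP_total <- sum(diag(cm$table))
FP_total <- sum(rowSums(cm$table)) - TP_total
FN_total <- sum(colSums(cm$table)) - TP_total
micro_f1 <- ifelse((2 * TP_total + FP_total + FN_total) == 0, NA,
                   2 * TP_total / (2 * TP_total + FP_total + FN_total))
cat("Micro F1-Score:", micro_f1, "\n")
# Weighted F1: weight each class by its number of true instances.
# cm$table has predictions in rows and reference labels in columns,
# so the per-class support is given by colSums(), not rowSums().
support <- colSums(cm$table)
weighted_f1 <- sum(class_metrics[, "F1"] * support / sum(support))
cat("Weighted F1-Score:", weighted_f1, "\n")
Output:
Macro F1-Score: 0.9326599
Micro F1-Score: 0.9333333
Weighted F1-Score: 0.9326599
The test set contains 10 flowers of each species, so every class has the same weight and the Weighted F1-Score equals the Macro F1-Score here.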
Step 7: Compute Probability-Based Metrics
Here we calculate ROC curves and AUC for each class using a one-vs-all approach.
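# ROC and AUC, one-vs-all for each class
class_levels <- levels(testData$Species)
roc_list <- list()
auc_list <- c()
for (i in seq_along(class_levels)) {
  # Response: 1 = current class, 0 = any other class.
  # levels and direction are fixed so that a higher probability always
  # counts as evidence for the current class.
  roc_obj <- roc(response = as.numeric(testData$Species == class_levels[i]),
                 predictor = pred_probs[, i],
                 levels = c(0, 1), direction = "<")
  roc_list[[i]] <- roc_obj
  auc_list[i] <- as.numeric(auc(roc_obj))
}
names(auc_list) <- class_levels
cat("AUC for each class:\n")
print(auc_list)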
Step 8: Compute Imbalanced Classification Metrics
Compute G-Mean and Lift for each class using a one-vs-all approach. These metrics are helpful when dealing with class imbalance.
class_levels <- levels(testData$Species)
gmean_list <- c()
lift_list <- c()
for (i in seq_along(class_levels)) {
  cls <- class_levels[i]
  # One-vs-all counts for the current class
  TP <- sum(predictions == cls & testData$Species == cls)
  TN <- sum(predictions != cls & testData$Species != cls)
  FP <- sum(predictions == cls & testData$Species != cls)
  FN <- sum(predictions != cls & testData$Species == cls)
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  gmean_list[i] <- sqrt(sensitivity * specificity)
  # Lift: precision among predicted positives relative to the class base rate
  baseline_pos <- sum(testData$Species == cls) / nrow(testData)
  precision <- TP / (TP + FP)
  lift_list[i] <- precision / baseline_pos
}
names(gmean_list) <- class_levels
names(lift_list) <- class_levels
cat("G-Mean for each class:\n")
print(gmean_list)
cat("Lift for each class:\n")
print(lift_list)