Computing Classification Evaluation Metrics in R

Last Updated : 30 Apr, 2026

Evaluation metrics are essential for assessing the performance of classification models in R. After building a model, it is important to examine how correctly it predicts class labels and how well it performs on unseen data. These metrics provide a clear understanding of model behavior and support informed decision-making.

  • Measure how accurately a classification model predicts target classes.
  • Measure the balance between correct positive predictions and missed detections through precision and recall.
  • Help compare multiple models to choose the best-performing approach.
[Figure: Evaluation Metrics]

Confusion Matrix–Based Metrics

Confusion matrix–based metrics evaluate classification performance by analyzing the counts of correct and incorrect predictions for each class.

1. Confusion Matrix

A confusion matrix is a table used to evaluate a classification model by comparing actual class labels with predicted class labels. It provides a clear summary of how many predictions are correct and how many are misclassified.

  • Displays counts of True Positive, True Negative, False Positive and False Negative predictions.
  • Helps identify different types of classification errors made by the model.
[Figure: Confusion Matrix]
  • True Positive (TP): Model correctly predicts a positive class.
  • True Negative (TN): Model correctly predicts a negative class.
  • False Positive (FP): Model predicts positive when the actual class is negative (Type I error).
  • False Negative (FN): Model predicts negative when the actual class is positive (Type II error).
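
The four outcomes above can be counted directly from two label vectors. The following is a minimal base R sketch; the `actual` and `predicted` vectors are made-up values for illustration (1 = positive, 0 = negative), not data from this article.

```r
# Hypothetical labels for illustration only (1 = positive, 0 = negative)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Cross-tabulate predictions (rows) against actual labels (columns)
cm_table <- table(Predicted = predicted, Actual = actual)
print(cm_table)

# Count each outcome directly from the label vectors
TP <- sum(predicted == 1 & actual == 1)  # predicted positive, actually positive
TN <- sum(predicted == 0 & actual == 0)  # predicted negative, actually negative
FP <- sum(predicted == 1 & actual == 0)  # predicted positive, actually negative
FN <- sum(predicted == 0 & actual == 1)  # predicted negative, actually positive

counts <- c(TP = TP, TN = TN, FP = FP, FN = FN)
print(counts)
```

The four counts always add up to the number of observations, which is a quick sanity check when computing them by hand.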

2. Accuracy

Accuracy measures the proportion of correctly classified observations out of all observations. It gives a quick overall summary of model performance but does not account for class imbalance: a model that always predicts the majority class can still achieve a high accuracy.

  • Suitable when the dataset is balanced and all classes are equally important.
  • Simple to understand and provides a quick overall performance measure.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
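
To see why accuracy can mislead on imbalanced data, consider the following sketch. It assumes a hypothetical dataset with 95 negatives and 5 positives and a trivial "model" that always predicts the negative class; both are invented for illustration.

```r
# Hypothetical imbalanced data: 95 negatives (0) and 5 positives (1)
actual <- c(rep(0, 95), rep(1, 5))

# A trivial "model" that always predicts the negative class
predicted <- rep(0, 100)

# Accuracy: share of observations where prediction and label agree
accuracy <- mean(predicted == actual)

# Recall for the positive class: share of actual positives that were found
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

cat("Accuracy:", accuracy, "\n")
cat("Recall:", recall, "\n")
```

The accuracy looks strong even though the model never detects a single positive case, which is why the metrics in the following sections are usually reported alongside it.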

3. Precision

Precision measures the proportion of correctly predicted positive observations out of all predicted positive observations. It focuses on prediction quality for the positive class.

  • Important when False Positives are costly (e.g., spam detection).
  • Helps evaluate how reliable positive predictions are.

\text{Precision} = \frac{TP}{TP + FP}

4. Recall (Sensitivity)

Recall, also called Sensitivity, measures the proportion of actual positive cases correctly identified by the model. It evaluates the model’s ability to detect positive instances.

  • Important when missing positive cases is costly (e.g., disease detection).
  • A high Recall means few actual positive cases are overlooked.

\text{Recall} = \frac{TP}{TP + FN}

5. F1 Score

F1 Score is the harmonic mean of Precision and Recall. It provides a single balanced measure when both False Positives and False Negatives matter.

  • Suitable for imbalanced datasets where both Precision and Recall matter.
  • Stays low unless both Precision and Recall are high, because the harmonic mean is dominated by the smaller of the two values.

F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
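
As a worked example of the Precision, Recall and F1 formulas, the sketch below plugs in made-up counts (TP = 8, FP = 2, FN = 4). The counts are assumptions chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical counts for illustration
TP <- 8
FP <- 2
FN <- 4

precision <- TP / (TP + FP)  # 8 / 10
recall    <- TP / (TP + FN)  # 8 / 12

# F1 as the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_from_counts <- 2 * TP / (2 * TP + FP + FN)

cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1:", f1, "\n")
```

Both expressions for F1 give the same value; the count-based form is the one that generalizes to the micro-averaged F1 used for multi-class problems.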

6. Specificity

Specificity measures the proportion of actual negative cases correctly identified by the model. It reflects the model’s ability to recognize negative instances without raising false alarms.

  • Important when correctly identifying negative cases is critical.
  • A high Specificity means few False Positives relative to the number of actual negatives.

\text{Specificity} = \frac{TN}{TN + FP}

7. Kappa Score (Cohen’s Kappa)

Kappa measures the agreement between predicted and actual class labels while adjusting for agreement that could occur by chance. It provides a more dependable assessment of model performance than accuracy alone, especially when class distributions are uneven.

  • Recommended for imbalanced datasets or when accuracy may give misleading results.
  • Accounts for chance agreement, offering a more realistic evaluation of classification performance.

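\kappa = \frac{P_o - P_e}{1 - P_e}

where

  • P_o: Observed accuracy (proportion of observations where prediction and actual label agree)
  • P_e: Expected accuracy by chance (agreement expected from the class frequencies alone)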
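
The Kappa formula can be traced step by step on a small confusion matrix. The sketch below uses an invented 2x2 table (rows are predictions, columns are actual labels); the counts are assumptions for illustration only.

```r
# Hypothetical 2x2 confusion matrix: rows = predicted, columns = actual.
# matrix() fills column by column, so the first two values belong to the
# "actual pos" column and the last two to the "actual neg" column.
cm <- matrix(c(40, 5, 10, 45), nrow = 2,
             dimnames = list(Predicted = c("pos", "neg"),
                             Actual = c("pos", "neg")))
n <- sum(cm)

# Observed accuracy: share of observations on the diagonal
p_o <- sum(diag(cm)) / n

# Expected accuracy by chance: for each class, multiply the predicted
# share by the actual share, then sum over classes
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa_value <- (p_o - p_e) / (1 - p_e)

cat("Observed accuracy:", p_o, "\n")
cat("Expected accuracy:", p_e, "\n")
cat("Kappa:", kappa_value, "\n")
```

Here the observed accuracy is 0.85, but half of that agreement would be expected by chance, so Kappa comes out noticeably lower than accuracy.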

Probability-Based Metrics

Probability-based metrics evaluate classification models using predicted probabilities rather than only final class labels. These metrics provide deeper insight into model confidence, ranking ability and overall predictive quality.

1. ROC Curve (Receiver Operating Characteristic Curve)

The ROC Curve is a graphical representation that shows the trade-off between True Positive Rate (Recall) and False Positive Rate at different classification thresholds. In R, it is commonly used to evaluate how well a model distinguishes between classes.

  • Suitable for binary classification problems where class separation ability is important.
  • Evaluates model performance across all threshold values.

TPR = \frac{TP}{TP + FN}

FPR = \frac{FP}{FP + TN}

where

  • TPR (True Positive Rate / Recall): Proportion of actual positives correctly identified.
  • FPR (False Positive Rate): Proportion of actual negatives incorrectly classified as positive.
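
Each point on a ROC curve is one (FPR, TPR) pair obtained at a particular threshold. The base R sketch below computes a few such points by hand; the `actual` labels, `scores` and `thresholds` are made-up values for illustration.

```r
# Hypothetical labels (1 = positive) and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

# A handful of thresholds chosen for illustration
thresholds <- c(0.85, 0.65, 0.5, 0.1)

# For each threshold, classify as positive when score >= threshold
# and compute the resulting TPR and FPR
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(scores >= th)
  c(threshold = th,
    TPR = sum(pred == 1 & actual == 1) / sum(actual == 1),
    FPR = sum(pred == 1 & actual == 0) / sum(actual == 0))
}))

print(roc_points)
```

Lowering the threshold moves the point up and to the right: more positives are caught (higher TPR) at the cost of more false alarms (higher FPR).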

2. AUC (Area Under the ROC Curve)

AUC measures the total area under the ROC curve and represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. It summarizes the model’s discrimination ability into a single value.

  • Useful for comparing classification models independently of any single threshold.
  • Ranges from 0 to 1: 0.5 corresponds to random ranking and 1 to perfect separation of the classes.
[Figure: ROC-AUC Classification Evaluation Metric]

AUC is calculated as the integral of the ROC curve:

AUC = \int_{0}^{1} TPR(FPR)\, d(FPR)
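
The ranking interpretation of AUC can be checked directly: compare every positive score with every negative score and count how often the positive one is higher. The sketch below does this in base R on made-up labels and scores; counting ties as one half is a common convention and is an assumption here.

```r
# Hypothetical labels (1 = positive) and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

pos_scores <- scores[actual == 1]
neg_scores <- scores[actual == 0]

# Compare every positive score with every negative score:
# 1 if the positive is ranked higher, 0.5 for a tie, 0 otherwise
pair_wins <- outer(pos_scores, neg_scores, ">") +
  0.5 * outer(pos_scores, neg_scores, "==")

# AUC = share of positive/negative pairs that are ranked correctly
auc_value <- mean(pair_wins)

cat("AUC:", auc_value, "\n")
```

In this example 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.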

3. Log Loss (Cross-Entropy Loss)

Log Loss measures the performance of a classification model by evaluating the predicted probability values against actual class labels. It penalizes confident but incorrect predictions more heavily.

  • Suitable when probability estimates need to be accurate and well-calibrated.
  • Considers prediction confidence rather than just final class labels.

\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]
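
The binary Log Loss formula translates almost line for line into R. The following is a minimal sketch: the helper name `log_loss`, the clipping constant `eps` (used to avoid `log(0)`) and the example vectors are all assumptions for illustration, not part of any package.

```r
# Hypothetical helper: binary log loss for labels y (0/1) and probabilities p
log_loss <- function(y, p, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so log() stays finite
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)

# Reasonably confident and correct predictions
p_good <- c(0.9, 0.1, 0.8, 0.2)

# Same predictions, except the last one is confident and wrong
p_bad <- c(0.9, 0.1, 0.8, 0.99)

cat("Log Loss (good):", log_loss(y, p_good), "\n")
cat("Log Loss (one confident error):", log_loss(y, p_bad), "\n")
```

A single confident mistake raises the loss sharply, which illustrates the heavy penalty on confident but incorrect predictions.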

Multi-Class Classification Metrics

In multi-class classification, evaluation metrics extend binary classification measures to handle multiple classes. Metrics like macro, micro and weighted averages summarize model performance across all classes.

1. Macro-Averaged F1-Score

Macro F1-Score calculates the F1-Score for each class independently and then takes the unweighted average. It treats all classes equally, regardless of their frequency.

F1_{\text{macro}} = \frac{1}{C} \sum_{i=1}^{C} F1_i

2. Micro-Averaged F1-Score

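Micro F1-Score aggregates the contributions of all classes to compute the metric globally. It counts total true positives, false positives and false negatives across classes. In single-label multi-class problems, where every observation receives exactly one predicted class, the Micro F1-Score equals overall accuracy.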

F1_{\text{micro}} = \frac{2 \sum_{i} TP_i}{2 \sum_{i} TP_i + \sum_{i} FP_i + \sum_{i} FN_i}

3. Weighted F1-Score

Weighted F1-Score calculates the F1-Score for each class and takes the average weighted by the number of true instances in each class.

F1_{\text{weighted}} = \frac{\sum_{i=1}^{C} n_i F1_i}{\sum_{i=1}^{C} n_i}

where n_i is the number of true instances in class i and C is the number of classes.
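
The three averaging schemes can be compared on one confusion matrix. The sketch below uses an invented 3-class table with deliberately unequal class sizes (rows are predictions, columns are actual labels); all counts are assumptions for illustration.

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual.
# matrix() fills column by column, so each group of three values below
# is one actual class (a, b, c).
cm <- matrix(c(50, 5, 0,
               2, 20, 3,
               0, 3, 7),
             nrow = 3,
             dimnames = list(Predicted = c("a", "b", "c"),
                             Actual = c("a", "b", "c")))

TP <- diag(cm)
FP <- rowSums(cm) - TP  # predicted as the class but actually another class
FN <- colSums(cm) - TP  # actually the class but predicted as another class

# Per-class F1, then the three averages
f1_per_class <- 2 * TP / (2 * TP + FP + FN)
macro_f1 <- mean(f1_per_class)
micro_f1 <- 2 * sum(TP) / (2 * sum(TP) + sum(FP) + sum(FN))

support <- colSums(cm)  # number of true instances per class
weighted_f1 <- sum(support * f1_per_class) / sum(support)

print(round(f1_per_class, 4))
cat("Macro F1:", macro_f1, "\n")
cat("Micro F1:", micro_f1, "\n")
cat("Weighted F1:", weighted_f1, "\n")
```

Because the smallest class has the lowest F1 here, the macro average is pulled down more than the micro and weighted averages, which are dominated by the largest class.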

Imbalanced Classification Metrics

In imbalanced datasets, accuracy can be misleading because a model can score well simply by favoring the majority class. Metrics designed for imbalanced data assess performance on both the minority and the majority class.

1. G-Mean (Geometric Mean)

G-Mean measures the balance between classification performance on positive and negative classes. It evaluates how well a model performs across both majority and minority classes.

\text{G-Mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}

2. Lift

Lift measures how much better a model performs compared to random selection. It is used in marketing and risk modeling to evaluate targeting effectiveness. A Lift above 1 means the cases the model flags as positive contain a higher share of actual positives than the dataset as a whole.

\text{Lift} = \frac{\text{Precision}}{\text{Baseline Positive Rate}} = \frac{TP / (TP + FP)}{(TP + FN) / (TP + TN + FP + FN)}
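
Both metrics follow directly from the four confusion-matrix counts. The sketch below uses made-up counts and computes Lift as the share of actual positives among predicted positives divided by the share of positives in the whole dataset.

```r
# Hypothetical counts for illustration
TP <- 40
FP <- 10
FN <- 5
TN <- 45
n <- TP + FP + FN + TN

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

# G-Mean: geometric mean of sensitivity and specificity
g_mean <- sqrt(sensitivity * specificity)

# Lift: positives among predicted positives vs. positives overall
precision <- TP / (TP + FP)
baseline_rate <- (TP + FN) / n
lift <- precision / baseline_rate

cat("G-Mean:", g_mean, "\n")
cat("Lift:", lift, "\n")
```

A G-Mean close to 1 requires both sensitivity and specificity to be high; if either drops toward 0, the G-Mean does too.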

Step By Step Implementation

Step 1: Install and Load Required Libraries

Install and load the R packages needed for model training, evaluation and probability-based metrics.

R
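# Install any required packages that are not yet installed (needed only once)
pkgs <- c("caret", "randomForest", "pROC", "MLmetrics")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load the packages for this session
library(caret)         # data splitting, model training, confusion matrix
library(randomForest)  # Random Forest backend used by caret
library(pROC)          # ROC curves and AUC
library(MLmetrics)     # additional evaluation metrics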

Step 2: Load Dataset

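Load the built-in Iris dataset and inspect its structure, summary statistics and pairwise scatter plots.

R
data(iris)
str(iris)      # structure: 150 observations of 5 variables
summary(iris)  # summary statistics for each column
plot(iris)     # scatter-plot matrix of all variable pairs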

Step 3: Split Dataset into Training and Testing Sets

We split the dataset into 80% training and 20% testing to evaluate model performance on unseen data.

R
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
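
For a factor outcome, `createDataPartition()` samples within each class so that class proportions are preserved in both sets. The base R sketch below illustrates that idea of a stratified split; it is not caret's implementation, and the names `train_manual` and `test_manual` are hypothetical.

```r
set.seed(123)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample 80% of the rows within each class
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_manual <- iris[train_idx, ]
test_manual  <- iris[-train_idx, ]

# Each class contributes the same number of rows to the test set
print(table(test_manual$Species))
```

With 50 observations per species, this leaves 40 rows per class for training and 10 per class for testing, so the test set stays balanced.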

Step 4: Train Random Forest Classifier

Here we train a Random Forest model to predict the species using all other features.

R
model <- train(Species ~ ., data = trainData, method = "rf")
predictions <- predict(model, newdata = testData)
pred_probs <- predict(model, newdata = testData, type = "prob") 

Step 5: Compute Confusion Matrix and Basic Metrics

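confusionMatrix() from caret cross-tabulates predicted against actual classes and reports overall statistics (including Accuracy and Kappa) together with per-class statistics such as Sensitivity (Recall), Specificity, Precision and F1. The per-class values are stored in cm$byClass.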

R
cm <- confusionMatrix(predictions, testData$Species)
print(cm)

class_metrics <- cm$byClass
print(class_metrics)

Step 6: Compute Multi-Class F1 Scores

We calculate Macro, Micro and Weighted F1-Scores for multi-class evaluation.

R
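# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 <- mean(class_metrics[,"F1"])
cat("Macro F1-Score:", macro_f1, "\n")

# Micro F1: pool TP, FP and FN over all classes
# (cm$table has predictions in rows and actual classes in columns)
TP_total <- sum(diag(cm$table))
FP_total <- sum(rowSums(cm$table)) - TP_total
FN_total <- sum(colSums(cm$table)) - TP_total

micro_f1 <- ifelse((2 * TP_total + FP_total + FN_total) == 0, NA,
                   2 * TP_total / (2 * TP_total + FP_total + FN_total))
cat("Micro F1-Score:", micro_f1, "\n")

# Weighted F1: weight each class by its number of true instances,
# i.e. the column totals of cm$table
support <- colSums(cm$table)
weighted_f1 <- sum(class_metrics[,"F1"] * support / sum(support))
# The test set is balanced (10 per class), so this equals the Macro F1
cat("Weighted F1-Score:", weighted_f1, "\n")

Output:

Macro F1-Score: 0.9326599

Micro F1-Score: 0.9333333

Weighted F1-Score: 0.9326599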

Step 7: Compute Probability-Based Metrics

Here we calculate ROC curves and AUC for each class using a one-vs-all approach.

R
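# ROC and AUC for each class (one-vs-all)
classes <- levels(testData$Species)
roc_list <- list()
auc_list <- numeric(length(classes))
for (i in seq_along(classes)) {
  # Binary response: 1 if the observation belongs to the current class, else 0
  is_class <- as.numeric(testData$Species == classes[i])
  # Select the probability column by class name and state levels and
  # direction explicitly so pROC does not have to guess them
  roc_obj <- roc(response = is_class, predictor = pred_probs[, classes[i]],
                 levels = c(0, 1), direction = "<", quiet = TRUE)
  roc_list[[classes[i]]] <- roc_obj
  auc_list[i] <- as.numeric(auc(roc_obj))
}
names(auc_list) <- classes
cat("AUC for each class:\n")
print(auc_list)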

Output:

[Screenshot: AUC for each class]

Step 8: Compute Imbalanced Classification Metrics

Compute G-Mean and Lift for each class using a one-vs-all approach. These metrics are helpful when dealing with class imbalance.

R
classes <- levels(testData$Species)
gmean_list <- numeric(length(classes))
lift_list <- numeric(length(classes))
for (i in seq_along(classes)) {
  # One-vs-all indicators for the current class
  pred_pos <- predictions == classes[i]
  actual_pos <- testData$Species == classes[i]

  TP <- sum(pred_pos & actual_pos)
  TN <- sum(!pred_pos & !actual_pos)
  FP <- sum(pred_pos & !actual_pos)
  FN <- sum(!pred_pos & actual_pos)

  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  gmean_list[i] <- sqrt(sensitivity * specificity)

  # Lift = precision among predicted positives / baseline positive rate
  # (precision is NaN if the class is never predicted)
  precision <- TP / (TP + FP)
  baseline_pos <- mean(actual_pos)
  lift_list[i] <- precision / baseline_pos
}
names(gmean_list) <- classes
names(lift_list) <- classes

cat("G-Mean for each class:\n")
print(gmean_list)
cat("Lift for each class:\n")
print(lift_list)

Output:

[Screenshot: G-Mean and Lift for each class]
