Naive Bayes Classifier in R Programming

Naive Bayes is a machine learning algorithm for classifying data into categories. It uses Bayes' Theorem to compute the probability of each class given the input features, and it assumes that the features are conditionally independent of one another given the class (the "naive" assumption).

Bayes’ Theorem Formula

The Naive Bayes algorithm is based on Bayes' Theorem, which gives the conditional probability of an event A given that another event B has occurred.

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

Where:

P(A|B) = Conditional probability of A given B.
P(B|A) = Conditional probability of B given A.
P(A) = Probability of event A.
P(B) = Probability of event B.

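For several predictors B_1, B_2, ..., B_n, the independence assumption lets the likelihood factor into a product, so the posterior probability is proportional to the prior times the individual likelihoods:

P(A \mid B_1, \ldots, B_n) \propto P(A) \cdot P(B_1 \mid A) \cdot P(B_2 \mid A) \cdots P(B_n \mid A)

The omitted denominator P(B_1, ..., B_n) is the same for every class, so it does not change which class has the highest posterior.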
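To make the product form concrete, the following base R sketch computes the posterior for a single iris flower by hand: the class prior multiplied by one likelihood per feature, then normalized so the values sum to 1. It assumes each numeric feature is normally distributed within a class, and it estimates the means and standard deviations from all 150 rows purely for illustration. The variable names are ours, not part of any package.

```r
# Hand-computed Gaussian Naive Bayes posterior for a single iris flower.
# Assumption: each numeric feature is normally distributed within a class.
data(iris)

features <- names(iris)[1:4]
x_new <- iris[100, features]          # the flower to classify (row 100)

classes <- levels(iris$Species)
prior <- table(iris$Species) / nrow(iris)

# Unnormalized posterior: prior times the product of per-feature densities
unnormalized <- sapply(classes, function(cl) {
  rows <- iris[iris$Species == cl, features]
  likelihoods <- mapply(
    function(value, column) dnorm(value, mean = mean(column), sd = sd(column)),
    as.numeric(x_new), rows
  )
  as.numeric(prior[cl]) * prod(likelihoods)
})

# Dividing by the sum plays the role of the denominator in Bayes' Theorem
posterior <- unnormalized / sum(unnormalized)
predicted_class <- names(which.max(posterior))

print(round(posterior, 4))
print(predicted_class)
```

The class with the largest posterior is the prediction. Because the denominator is shared by all classes, comparing the unnormalized values gives the same answer.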

Example Using Bayes’ Theorem

Consider tossing two fair coins, which gives the sample space {HH, HT, TH, TT},

where H = Head and T = Tail.

We are asked to find the probability that the second coin is a Head given that the first coin is a Tail.

Event A: the second coin is a Head.
Event B: the first coin is a Tail.
P(A | B) is the conditional probability we want to find.
P(B | A) is the probability that the first coin is a Tail given that the second coin is a Head. The two tosses are independent, so this is 1/2.
P(A) is the probability that the second coin is a Head, which is 1/2.
P(B) is the probability that the first coin is a Tail, which is also 1/2.

Now applying Bayes’ Theorem:

P(A \mid B) = \frac{(1/2) \cdot (1/2)}{1/2}= \frac{1/4}{1/2}= \frac{1}{2}= 0.5

Therefore, the probability that the second coin is a Head, given that the first coin is a Tail, is 0.5.
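The same result can be checked by enumerating the sample space in R. This is a small sketch in base R; the variable names are chosen here for illustration. It computes P(A | B) once through Bayes' Theorem and once directly from the definition of conditional probability.

```r
# Enumerate the four equally likely outcomes of two fair coin tosses
outcomes <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                        stringsAsFactors = FALSE)

A <- outcomes$second == "H"   # event A: second coin is a Head
B <- outcomes$first == "T"    # event B: first coin is a Tail

p_A <- mean(A)                       # P(A)
p_B <- mean(B)                       # P(B)
p_B_given_A <- sum(A & B) / sum(A)   # P(B | A)

# Bayes' Theorem
p_A_given_B <- p_B_given_A * p_A / p_B

# Direct definition of conditional probability, for comparison
p_direct <- sum(A & B) / sum(B)

print(c(bayes = p_A_given_B, direct = p_direct))
```

Both routes give 0.5, matching the calculation above.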

Implementation of Naive Bayes Classifier

We follow these steps to build and evaluate a Naive Bayes model using the Iris dataset.

1. Installing and Loading Required Packages

We install the necessary packages and load them.

e1071: Contains Naive Bayes classifier (naiveBayes()) and other useful machine learning functions.
caTools: Provides utilities for data splitting (for training and test sets).
caret: Simplifies machine learning tasks like training models, evaluating them and creating confusion matrices.
library(): This function loads the installed packages into the R environment, allowing their functions to be used.

install.packages("e1071")
install.packages("caTools")
install.packages("caret")

library(e1071)
library(caTools)
library(caret)

2. Loading the Dataset

We begin by loading the dataset and checking its structure.

data(): Loads a built-in dataset into R. Here we load iris, which contains sepal and petal length and width measurements for 150 flowers from three Iris species.
head(): Displays the first few rows (default is 6) of the dataset for a quick overview.

data(iris)
head(iris)
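Beyond head(), a few base R calls give a fuller picture of the data before modelling. The sketch below checks the column types, the number of flowers per species (which determines the class priors) and whether any values are missing.

```r
data(iris)

str(iris)                 # column types: four numeric features and one factor
table(iris$Species)       # number of flowers per species
colSums(is.na(iris))      # missing values per column
```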

3. Splitting the Dataset

We split the data into training and testing sets using a 70:30 ratio.

set.seed(): Ensures reproducibility by setting the seed for the random number generator.
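sample.split(): From the caTools package, it splits the data. It takes the vector of class labels (here iris$Species), and the SplitRatio argument defines the proportion of rows for the training set (here 70%). It returns a logical vector with one entry per row, TRUE for rows in the training set, and it preserves the class proportions in both sets.
subset(): Creates subsets of the iris dataset; here it generates the train_cl and test_cl datasets based on the split.

set.seed(123)
# Pass the label column, not the whole data frame, so the split has one entry per row
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)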
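For readers who want to see what a class-preserving 70:30 split involves, here is a minimal sketch in base R. The helper stratified_split() is hypothetical (it is not part of caTools or any other package) and is only meant to illustrate sampling 70% of the rows within each species.

```r
# Hypothetical helper: stratified train/test split using base R only
stratified_split <- function(labels, ratio = 0.7) {
  in_train <- logical(length(labels))
  for (cl in as.character(unique(labels))) {
    idx <- which(labels == cl)
    n_train <- round(ratio * length(idx))
    # Sample positions within idx so a single-row class is handled correctly
    in_train[idx[sample(length(idx), n_train)]] <- TRUE
  }
  in_train
}

data(iris)
set.seed(123)
in_train <- stratified_split(iris$Species, ratio = 0.7)
train_base <- iris[in_train, ]
test_base <- iris[!in_train, ]

table(train_base$Species)   # class counts in the training set
table(test_base$Species)    # class counts in the test set
```

With 50 flowers per species, each class contributes 35 rows to the training set and 15 to the test set.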

4. Scaling the Features

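We standardize the numeric features. This step is optional here: the model in the next step is trained on the unscaled train_cl, and the scaled matrices are not used later in this walkthrough.

scale(): Standardizes the numeric columns (1 to 4, the four iris features) so that each feature has a mean of 0 and a standard deviation of 1 in the training set. The test set is scaled with the training set's means and standard deviations so that both sets share the same transformation.

train_scale <- scale(train_cl[, 1:4])
# Reuse the training means and standard deviations for the test set
test_scale <- scale(test_cl[, 1:4], center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))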
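The sketch below shows what scale() stores and how the training parameters can be applied to new rows. To stay self-contained it uses a simple fixed split on odd and even rows, which is an assumption made for this example only and is not the split used in the walkthrough.

```r
data(iris)
# Assumption: a simple fixed split, used only to keep this example self-contained
train_x <- iris[seq(1, 150, by = 2), 1:4]
test_x <- iris[seq(2, 150, by = 2), 1:4]

train_scaled <- scale(train_x)
train_center <- attr(train_scaled, "scaled:center")  # training means
train_sd <- attr(train_scaled, "scaled:scale")       # training standard deviations

# Apply the training transformation to the test rows
test_scaled <- scale(test_x, center = train_center, scale = train_sd)

round(colMeans(train_scaled), 10)   # zero for every training column
apply(train_scaled, 2, sd)          # one for every training column
round(colMeans(test_scaled), 3)     # test means under the training transformation
```

The test columns generally do not have a mean of exactly 0 after this transformation, which is expected: they are measured on the training set's scale.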

5. Training the Naive Bayes Model

We train the Naive Bayes classifier using the training set.

naiveBayes(): From the e1071 package, this function trains the Naive Bayes classifier. The formula Species ~ . means we predict Species from all other columns in the dataset (the dot stands for "all other columns"). For numeric features such as these, naiveBayes() models each feature with a Gaussian (normal) distribution within each class.
The trained model is stored in the classifier_cl variable.

classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl
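Printing the model shows the class priors and, for each feature, a table of per-class means and standard deviations. The sketch below inspects those pieces directly and reproduces one table with base R. It fits on all 150 rows for illustration, and it assumes the fitted object exposes the apriori and tables components, as the e1071 implementation we are aware of does.

```r
library(e1071)

data(iris)
model <- naiveBayes(Species ~ ., data = iris)  # fitted on all rows for illustration

model$apriori              # class counts used for the prior
model$tables$Petal.Length  # per class: mean (column 1) and standard deviation (column 2)

# The same numbers computed directly in base R
aggregate(Petal.Length ~ Species, data = iris,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```

Seeing that the model is just a set of priors plus per-class means and standard deviations helps explain why training is so fast.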

6. Making Predictions

We use the trained model to predict species on the test data.

predict(): Uses the trained classifier to predict the target variable (Species) for the test data (test_cl). The predicted values are stored in the y_pred variable.

y_pred <- predict(classifier_cl, newdata = test_cl)
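By default predict() returns the most probable class. Passing type = "raw" returns the posterior probability of every class instead, which is useful for judging how confident each prediction is. The sketch below uses a fixed odd/even row split, an assumption made only to keep the example self-contained.

```r
library(e1071)

data(iris)
# Assumption: a simple fixed split, used only to keep this example self-contained
train_rows <- seq(1, 150, by = 2)
model <- naiveBayes(Species ~ ., data = iris[train_rows, ])
new_rows <- iris[-train_rows, ]

probs <- predict(model, newdata = new_rows, type = "raw")   # one column per class
labels <- predict(model, newdata = new_rows)                # most probable class

head(round(probs, 3))
head(labels)
```

Each row of probs sums to 1, and the predicted label is the class with the largest value in that row.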

7. Evaluating the Model

We create a confusion matrix and evaluate the model performance.

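table(): Creates a confusion matrix by cross-tabulating the predicted class labels (y_pred) against the true class labels (test_cl$Species). Predictions go in the rows and true labels in the columns, which is the layout confusionMatrix() expects when it is given a table.
confusionMatrix(): From the caret package, this function calculates metrics such as overall accuracy, kappa, and per-class sensitivity and specificity. Precision, recall and F1-score are reported when mode = "prec_recall" or mode = "everything" is passed.

# Rows are predictions, columns are the true labels
cm <- table(y_pred, test_cl$Species)
confusionMatrix(cm)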

In the original run, the Naive Bayes model achieved about 95% accuracy, with strong performance across all classes and a few misclassifications between versicolor and virginica. Exact figures depend on the random split.
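To see where these metrics come from, the following base R sketch computes accuracy, precision, recall and F1-score by hand from a confusion matrix. The label vectors are hypothetical, invented only to illustrate the arithmetic; they are not results from the iris model.

```r
# Hypothetical labels, invented only to illustrate the calculations
actual    <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))
predicted <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "c", "a"),
                    levels = levels(actual))

# Rows are predictions, columns are the true labels
cm <- table(Predicted = predicted, Actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm)   # correct / all predictions of that class
recall    <- diag(cm) / colSums(cm)   # correct / all actual members of that class
f1        <- 2 * precision * recall / (precision + recall)

print(cm)
print(accuracy)
print(round(cbind(precision, recall, f1), 3))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total count.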
