Text Classification using Logistic Regression

Logistic Regression classifies text into predefined categories by learning patterns from numerical representations of the text. It is widely used for tasks such as spam detection, sentiment analysis, and document classification because it is simple, efficient, and performs well on sparse features such as word counts.

Provides a simple and effective baseline for text classification tasks.
Works efficiently with sparse text representations such as Bag of Words (BoW) and TF-IDF.
Requires less computational power than deep learning-based NLP models.
Produces interpretable probability scores for classification decisions.

Working

Logistic Regression classifies text by converting documents into numerical feature vectors and learning the relationship between these features and their corresponding class labels.

Input Text: The process begins by providing a collection of text documents along with their corresponding class labels (for example, spam or ham).
Text Vectorization: The text documents are converted into numerical feature vectors using CountVectorizer, which represents each document based on the frequency of words it contains.
Feature Representation: The generated feature vectors are used as input features, where each feature represents the occurrence of a word in the document.
Model Training: The Logistic Regression model learns the relationship between the feature vectors and their corresponding class labels by estimating the probability of each class.
Probability Estimation: For a new text document, the trained model computes the probability of belonging to each class using the sigmoid function.
Text Classification: The document is assigned to the class with the highest predicted probability, such as spam or ham.
Model Evaluation: The trained model is evaluated on unseen test data using performance metrics such as accuracy and the confusion matrix to measure its classification performance.
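
The probability estimation step can be illustrated with a minimal sketch. It assumes a small made-up corpus (not the SMS dataset used later) and a binary problem, and it checks that applying the sigmoid to the model's linear score reproduces the spam probability that scikit-learn reports.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus for illustration only: 1 = spam, 0 = ham
texts = ["win free prize now", "free cash offer",
         "are we meeting today", "see you at lunch"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Linear score z = w.x + b for a new message
x_new = vectorizer.transform(["free prize today"])
z = model.decision_function(x_new)

# The sigmoid maps the score to a probability between 0 and 1
p_spam = 1.0 / (1.0 + np.exp(-z))

# In the binary case this equals the class-1 column of predict_proba
print(p_spam, model.predict_proba(x_new)[:, 1])
```

For problems with more than two classes, scikit-learn normalizes scores across classes instead, so this identity is specific to the binary case.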

Implementation using Scikit-Learn

In this implementation, we use the SMS Spam Collection Dataset, which contains SMS (Short Message Service) messages labeled as ham (legitimate messages) and spam (unwanted messages).

Step 1. Import Libraries

Import the required libraries for data preprocessing, text vectorization, model training, and performance evaluation.

The code uses pandas for data handling and scikit-learn (sklearn) for vectorization, modeling, and evaluation.
CountVectorizer converts text messages into numerical feature vectors based on word frequencies.
train_test_split separates the dataset into training and testing subsets for model evaluation.
LogisticRegression is used to train a classifier that predicts whether a message is spam or ham.
accuracy_score and confusion_matrix are used to measure the classification performance of the trained model.

Python

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

Step 2. Load and Prepare the Data

Load the SMS Spam Collection dataset into a DataFrame and prepare it for training.
Read the dataset from the CSV file using latin-1 encoding.
Rename the columns to label and text for better readability.
Convert class labels from ham and spam to numerical values (0 and 1) for model training.

Python

data = pd.read_csv('/content/spam.csv', encoding='latin-1')
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
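
The label mapping can be tried without the CSV file. The sketch below assumes a hypothetical two-row stand-in that reuses the v1 and v2 column names, and adds a check that no label was left unmapped, since map() turns unknown values into NaN.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for spam.csv with the same column names
csv_text = "v1,v2\nham,See you at lunch\nspam,WINNER! Claim your free prize now\n"
data = pd.read_csv(io.StringIO(csv_text))

# Same renaming and label mapping as in the step above
data = data.rename(columns={"v1": "label", "v2": "text"})
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# Any label other than 'ham' or 'spam' would have become NaN
assert data["label"].notna().all()

# Class balance is worth inspecting before training
print(data["label"].value_counts())
```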

Step 3. Text Vectorization

Convert the text messages into numerical feature vectors using CountVectorizer.
Each message is represented as a vector based on the frequency of words it contains, making it suitable for Logistic Regression.

Python

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])
y = data['label']

Step 4. Split Data into Training and Testing Sets

Split the feature vectors and labels into training and testing datasets.
The training set is used to train the Logistic Regression model, while the testing set is used to evaluate its performance on unseen messages.

Python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
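
Note that Step 3 fits the vectorizer on the full dataset before splitting, so the vocabulary also includes words that occur only in test messages. One common alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn Pipeline so that the vocabulary is learned from the training messages only. The stratify argument is an addition that keeps the spam-to-ham ratio similar in both subsets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration: 1 = spam, 0 = ham
texts = ["free prize now", "win cash today", "claim free reward", "cash prize offer",
         "see you at lunch", "call me later", "meeting moved to monday", "thanks for dinner"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class ratio in both subsets
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits CountVectorizer on the training text only
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(text_train, y_train)

# Test messages are transformed with the training vocabulary
y_pred = pipe.predict(text_test)
print(len(text_train), len(text_test), y_pred)
```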

Step 5. Train the Logistic Regression Model

Create a Logistic Regression classifier and train it using the training dataset to learn the relationship between word features and message labels.

Python

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

Output:

[Figure: Logistic Regression]

Step 6. Model Evaluation

Predict the labels of the test messages and evaluate the model using accuracy and the confusion matrix.
The accuracy measures the percentage of correctly classified messages, while the confusion matrix provides a detailed summary of correct and incorrect predictions.

Python

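# Predict labels for the held-out test messages
y_pred = model.predict(X_test)
print("Accuracy score:", accuracy_score(y_test, y_pred))

# Rows are true labels (ham, spam); columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"[[{cm[0,0]} {cm[0,1]}]")
print(f" [{cm[1,0]} {cm[1,1]}]]")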

Output:

[Figure: Confusion Matrix]

The model achieved an accuracy of approximately 97.5% on the test dataset (1358 of 1393 messages classified correctly). The confusion matrix shows:

1199 ham messages were correctly classified.
159 spam messages were correctly classified.
32 ham messages were incorrectly classified as spam.
3 spam messages were incorrectly classified as ham.
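
When reading a confusion matrix, it helps to remember that scikit-learn puts true labels on the rows and predicted labels on the columns. The sketch below uses six made-up labels (not the results above) to show the layout and to derive precision and recall for the spam class, two metrics that are often more informative than accuracy when spam is the minority class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration: 0 = ham, 1 = spam
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

# Layout: [[ham->ham, ham->spam], [spam->ham, spam->spam]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
# [[3 1]
#  [1 1]]

# Precision: share of predicted spam that really is spam
precision = precision_score(y_true, y_pred)
# Recall: share of actual spam that was caught
recall = recall_score(y_true, y_pred)
print(precision, recall)  # 0.5 0.5
```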

Step 7. Manual Testing Function to Classify Text Messages

Create a helper function that accepts a new text message, converts it into a feature vector using the trained CountVectorizer, and predicts whether the message is spam or ham using the trained Logistic Regression model.

Python

def classify_message(model, vectorizer, message):
    # Reuse the fitted vectorizer so the features match the training vocabulary
    message_vect = vectorizer.transform([message])
    prediction = model.predict(message_vect)
    # Labels were encoded as ham = 0 and spam = 1 in Step 2
    return "spam" if prediction[0] == 1 else "ham"

message = "WINNER!! You have won a $1000 cash prize. Call now to claim."
print(classify_message(model, vectorizer, message))

Output:

spam
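
Since Logistic Regression produces probability scores, a variant of the helper can return the spam probability alongside the label. This is a sketch under stated assumptions: the function name classify_with_probability and the toy training data are hypothetical, and labels are assumed to be encoded as ham = 0 and spam = 1 as in Step 2.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def classify_with_probability(model, vectorizer, message):
    """Return (label, spam probability) for a single message."""
    message_vect = vectorizer.transform([message])
    # Column 1 of predict_proba corresponds to class 1 (spam)
    p_spam = float(model.predict_proba(message_vect)[0, 1])
    label = "spam" if p_spam >= 0.5 else "ham"
    return label, p_spam

# Hypothetical toy training data so the sketch runs on its own
texts = ["free prize now", "win cash today", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]
vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

label, p_spam = classify_with_probability(model, vectorizer, "claim your free prize")
print(label, round(p_spam, 3))
```

Returning the probability makes it possible to apply a stricter threshold than 0.5 when misclassifying legitimate messages is costly.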

Applications

Spam Detection: Classifies SMS or email messages as spam or legitimate using their textual content.
Sentiment Analysis: Identifies whether customer reviews or feedback express positive or negative opinions.
News Classification: Categorizes news articles into topics such as sports, business, or technology.
Intent Classification: Recognizes user intents in chatbots and virtual assistants for appropriate responses.
Language Identification: Detects the language of a text document for multilingual applications.

Advantages

Easy to implement, train, and interpret for text classification tasks.
Trains quickly even on large text datasets with sparse features.
Provides probability scores for each predicted class.
Performs effectively with Bag of Words and TF-IDF representations.
Requires fewer computational resources than deep learning models.
Delivers reliable performance for many binary and multi-class text classification problems.
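
The implementation above uses raw word counts, but TF-IDF features can be swapped in with a one-line change. The sketch below assumes made-up toy data and uses TfidfVectorizer, which downweights words that appear in many documents and, by default, scales each document vector to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus for illustration: 1 = spam, 0 = ham
texts = ["free prize now", "win free cash", "see you at lunch", "call me after lunch"]
labels = [1, 1, 0, 0]

# Drop-in replacement for CountVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# With the default norm='l2', every document vector has length 1
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)

model = LogisticRegression().fit(X, labels)
print(model.predict_proba(tfidf.transform(["free lunch"]))[0])
```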

Limitations

Cannot capture complex non-linear relationships in text data.
Performance relies heavily on the quality of text vectorization.
With Bag of Words or TF-IDF features, it does not consider word order or contextual meaning, unlike transformer models.
May underperform on advanced NLP problems requiring deeper language understanding.
