Logistic Regression classifies text into predefined categories by learning patterns from numerical representations of the text. It is widely used for tasks such as spam detection, sentiment analysis, and document classification because it is simple, efficient, and performs well on sparse, high-dimensional text features.
- Provides a simple and effective baseline for text classification tasks.
- Works efficiently with sparse text representations such as Bag of Words (BoW) and TF-IDF.
- Requires less computational power than deep learning-based NLP models.
- Produces interpretable probability scores for classification decisions.
Working
Logistic Regression classifies text by converting documents into numerical feature vectors and learning the relationship between these features and their corresponding class labels.
- Input Text: The process begins with a collection of text documents and their corresponding class labels (for example, spam or ham).
- Text Vectorization: The text documents are converted into numerical feature vectors using CountVectorizer, which represents each document by the frequency of the words it contains.
- Feature Representation: The generated feature vectors are used as input features, where each feature holds the number of times a vocabulary word occurs in the document.
- Model Training: The Logistic Regression model learns a weight for each feature so that the weighted sum of a document's features predicts the probability of its class label.
- Probability Estimation: For a new text document, the trained model computes the probability of each class. In the binary case (spam vs. ham) it applies the sigmoid function to the weighted sum of the features; with more than two classes the softmax function is typically used instead.
- Text Classification: The document is assigned to the class with the highest predicted probability, such as spam or ham.
- Model Evaluation: The trained model is evaluated on unseen test data using metrics such as accuracy and the confusion matrix.
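The probability estimation step can be made concrete with a small sketch. The corpus below is a hypothetical toy example (not the SMS dataset), and the variable names are illustrative. It recomputes the spam probability by hand as the sigmoid of the weighted sum and compares it with what scikit-learn reports for a binary model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free cash offer call now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# New document -> count vector -> linear score z = w . x + b
x_new = vectorizer.transform(["free prize call now"])
z = x_new.toarray() @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid turns the score into P(spam | document)
p_spam_manual = 1.0 / (1.0 + np.exp(-z))
p_spam_sklearn = model.predict_proba(x_new)[:, 1]

print(p_spam_manual, p_spam_sklearn)
```

Both values should agree, which shows that the binary classifier is a linear score passed through the sigmoid.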
Implementation using Scikit-Learn
In this implementation, we use the SMS Spam Collection Dataset, which contains SMS (Short Message Service) messages labeled as ham (legitimate messages) and spam (unwanted messages).
Step 1. Import Libraries
Import the required libraries for data preprocessing, text vectorization, model training, and performance evaluation.
- This code uses pandas for data handling and scikit-learn (sklearn) for vectorization, modeling, and evaluation.
- CountVectorizer converts text messages into numerical feature vectors based on word frequencies.
- train_test_split separates the dataset into training and testing subsets for model evaluation.
- LogisticRegression is used to train a classifier that predicts whether a message is spam or ham.
- accuracy_score and confusion_matrix are used to measure the classification performance of the trained model.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
Step 2. Load and Prepare the Data
- Load the SMS Spam Collection dataset into a DataFrame and prepare it for training.
- Read the dataset from the CSV file using latin-1 encoding.
- Rename the columns to label and text for better readability.
- Convert class labels from ham and spam to numerical values (0 and 1) for model training.
data = pd.read_csv('/content/spam.csv', encoding='latin-1')
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
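As a hedged illustration of the preparation step, the sketch below uses a hypothetical in-memory stand-in for the CSV with the same v1 and v2 column names. It keeps only the two renamed columns, on the assumption that some copies of the file carry extra unnamed columns, and it checks that every label was mapped, since map() produces NaN for any value missing from the dictionary.

```python
import io
import pandas as pd

# Hypothetical stand-in for spam.csv with the same v1/v2 column names
raw = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    "spam,WIN a free prize now\n"
    "ham,Call me later\n"
)
data = pd.read_csv(raw)

# Rename, then keep only the two columns the model needs
data = data.rename(columns={"v1": "label", "v2": "text"})[["label", "text"]]
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# map() yields NaN for labels that are not in the dictionary, so verify none slipped through
if data["label"].isna().any():
    raise ValueError("unexpected label value in dataset")
data["label"] = data["label"].astype(int)

# Class balance is worth a look before training
print(data["label"].value_counts().to_dict())
```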
Step 3. Text Vectorization
- Convert the text messages into numerical feature vectors using CountVectorizer.
- Each message is represented as a vector based on the frequency of words it contains, making it suitable for Logistic Regression.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])
y = data['label']
Step 4. Split Data into Training and Testing Sets
- Split the feature vectors and labels into training and testing datasets.
- The training set is used to train the Logistic Regression model, while the testing set is used to evaluate its performance on unseen messages.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
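In the steps above, the vectorizer is fitted on all messages before the split, so its vocabulary also includes words that occur only in test messages. One common alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn Pipeline so that the vocabulary is learned from the training portion only. The stratify argument is an additional assumption that keeps the spam/ham ratio similar in both subsets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages: 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free cash offer call now",
    "claim your free prize today", "urgent cash prize call now",
    "are we meeting for lunch today", "see you at home tonight",
    "can you call me after lunch", "thanks see you tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, preserving the class ratio in both subsets
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on training text only, then trains the classifier
pipe = make_pipeline(CountVectorizer(), LogisticRegression(random_state=42))
pipe.fit(X_train_txt, y_train)

# Test messages are transformed with the training vocabulary before prediction
test_pred = pipe.predict(X_test_txt)
print(list(test_pred), list(y_test))
```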
Step 5. Train the Logistic Regression Model
Create a Logistic Regression classifier and train it using the training dataset to learn the relationship between word features and message labels.
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
Step 6. Model Evaluation
- Predict the labels of the test messages and evaluate the model using accuracy and the confusion matrix.
- The accuracy measures the percentage of correctly classified messages, while the confusion matrix provides a detailed summary of correct and incorrect predictions.
y_pred = model.predict(X_test)
print("Accuracy_score", accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"[[{cm[0,0]} {cm[0,1]}]")
print(f" [{cm[1,0]} {cm[1,1]}]]")
Output:

The model achieved an accuracy of approximately 97.5% on the test dataset (1358 of 1393 messages classified correctly). The confusion matrix shows:
- 1199 ham messages were correctly classified.
- 159 spam messages were correctly classified.
- 3 ham messages were incorrectly classified as spam.
- 32 spam messages were incorrectly classified as ham.
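Accuracy alone can hide how the errors are distributed, so it can help to derive precision and recall from the confusion matrix as well. The sketch below uses hypothetical labels (not the SMS results) to show how the four cells map to these metrics in scikit-learn's layout, and then recomputes the reported accuracy from the four counts listed above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# scikit-learn layout: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(accuracy, precision, recall)

# Accuracy implied by the four counts reported for the SMS test set
reported_accuracy = (1199 + 159) / (1199 + 159 + 32 + 3)
print(round(reported_accuracy, 4))
```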
Step 7. Manual Testing Function to Classify Text Messages
Create a helper function that accepts a new text message, converts it into a feature vector using the trained CountVectorizer, and predicts whether the message is spam or ham using the trained Logistic Regression model.
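def classify_message(model, vectorizer, message):
    # Reuse the fitted vectorizer so the features match the training vocabulary
    message_vect = vectorizer.transform([message])
    prediction = model.predict(message_vect)
    # Labels were encoded as ham = 0 and spam = 1
    return "spam" if prediction[0] == 1 else "ham"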
message = "WINNER!! You have won a $1000 cash prize. Call now to claim."
print(classify_message(model, vectorizer, message))
Output:
spam
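Because Logistic Regression produces probabilities, a helper can also return the spam probability alongside the label. The following is a minimal sketch trained on hypothetical toy messages. The function name classify_with_probability and its threshold parameter are illustrative assumptions, not part of the walkthrough above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy messages: 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free cash offer call now",
    "claim your free prize today", "urgent cash prize call now",
    "are we meeting for lunch today", "see you at home tonight",
    "can you call me after lunch", "thanks see you tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer = CountVectorizer()
model = LogisticRegression(random_state=42)
model.fit(vectorizer.fit_transform(texts), labels)


def classify_with_probability(model, vectorizer, message, threshold=0.5):
    """Return (label, spam_probability); threshold is a hypothetical tuning knob."""
    features = vectorizer.transform([message])
    # Locate the probability column that belongs to the spam class (encoded as 1)
    spam_index = list(model.classes_).index(1)
    spam_prob = float(model.predict_proba(features)[0, spam_index])
    label = "spam" if spam_prob >= threshold else "ham"
    return label, spam_prob


print(classify_with_probability(model, vectorizer, "free cash prize call now"))
print(classify_with_probability(model, vectorizer, "see you at lunch tomorrow"))
```

Raising the threshold makes the classifier more conservative about flagging messages as spam, at the cost of letting more spam through.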
Applications
- Spam Detection: Classifies SMS or email messages as spam or legitimate using their textual content.
- Sentiment Analysis: Identifies whether customer reviews or feedback express positive or negative opinions.
- News Classification: Categorizes news articles into topics such as sports, business, or technology.
- Intent Classification: Recognizes user intents in chatbots and virtual assistants for appropriate responses.
- Language Identification: Detects the language of a text document for multilingual applications.
Advantages
- Easy to implement, train, and interpret for text classification tasks.
- Learns quickly even on large text datasets with sparse features.
- Provides probability scores for each predicted class.
- Performs effectively with Bag of Words and TF-IDF representations.
- Requires fewer computational resources than deep learning models.
- Delivers reliable performance for many binary and multi-class text classification problems.
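As a brief illustration of the TF-IDF and multi-class points, the sketch below trains on six hypothetical headlines across three topics. The data and the max_iter value are assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical headlines with three topic labels
texts = [
    "the team won the football match",
    "striker scores goal in final match",
    "stock market shares fall sharply",
    "bank raises interest rates again",
    "new smartphone chip released",
    "software update fixes security bug",
]
labels = ["sports", "sports", "business", "business", "technology", "technology"]

# TF-IDF features feeding a multi-class Logistic Regression
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

query = ["football match final"]
classes = list(clf.named_steps["logisticregression"].classes_)
probs = clf.predict_proba(query)[0]  # one probability per class, summing to 1

print(dict(zip(classes, probs.round(3))))
print(clf.predict(query)[0])
```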
Limitations
- Learns a linear decision boundary, so it cannot capture complex non-linear relationships between features.
- Performance relies heavily on the quality of text vectorization.
- Unlike transformer models, it does not consider word order or contextual meaning when used with Bag of Words features.
- May underperform on advanced NLP problems requiring deeper language understanding.
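The word-order limitation is easy to demonstrate. In this small sketch, two hypothetical sentences with opposite meanings produce identical Bag of Words vectors, so no classifier built on those features can tell them apart. Adding bigrams through the ngram_range parameter is one partial workaround, shown here as an option rather than a recommendation from the steps above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog"]

# Unigram counts ignore order, so both sentences map to the same vector
unigrams = CountVectorizer().fit_transform(docs).toarray()
print((unigrams[0] == unigrams[1]).all())  # True

# Adding bigrams ("dog bites", "bites man", ...) restores some local order
bigrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()
print((bigrams[0] == bigrams[1]).all())  # False
```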