Stepwise Regression in Python

Last Updated : 14 Apr, 2026

Stepwise regression fits a regression model by iteratively adding or removing predictor variables. The goal is a model that is both accurate and parsimonious, meaning that it explains the data with as few variables as possible.

Stepwise regression combines both forward selection and backward elimination approaches:

  • Forward Selection: In forward selection, the algorithm starts with an empty model and iteratively adds variables to the model until no further improvement is made.
  • Backward Elimination: In backward elimination, the algorithm starts with a model that includes all variables and iteratively removes variables until no further improvement is made.

Unlike pure forward or backward methods, stepwise regression dynamically adds or removes variables at each step based on a chosen criterion (such as AIC, BIC, or p-values).
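The forward-selection idea can be sketched with NumPy alone. This is a minimal illustration rather than a reference implementation: the helper names aic_ols and forward_select are hypothetical, the data is synthetic, and AIC is computed from the residual sum of squares under the usual Gaussian-error assumption (up to an additive constant).

```python
import numpy as np

def aic_ols(X, y):
    """AIC (up to a constant) of an OLS fit with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    rss = float(np.sum((y - A @ beta) ** 2))       # residual sum of squares
    k = A.shape[1]                                 # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_aic = aic_ols(X[:, :0], y)                # intercept-only model
    while remaining:
        cand_aic, cand = min((aic_ols(X[:, selected + [j]], y), j)
                             for j in remaining)
        if cand_aic >= best_aic:                   # no candidate improves AIC
            break
        selected.append(cand)
        remaining.remove(cand)
        best_aic = cand_aic
    return selected, best_aic

# Synthetic data (assumption): only columns 1 and 4 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.5, size=200)

selected, best_aic = forward_select(X, y)
print("Selected columns (in order added):", selected)
print("Final AIC:", round(best_aic, 2))
```

Backward elimination follows the same pattern in reverse: start from all columns and drop the one whose removal lowers the criterion most. A full stepwise procedure would alternate both moves at each step.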

Uses of Stepwise Regression

  • Builds accurate and parsimonious models (minimum necessary variables)
  • Automatically selects the most relevant features
  • Removes irrelevant or redundant variables
  • Reduces model complexity and improves interpretability
  • Can help reduce overfitting by dropping noisy predictors, provided the final model is checked on held-out data

Stepwise Regression vs. Other Regression Models

  • Stepwise Regression: Automatically adds/removes variables to select a good subset
  • Ordinary Least Squares (OLS): Uses all variables; no automatic feature selection
  • LASSO: Adds an L1 penalty that shrinks coefficients and can set some exactly to zero, which selects features as a side effect
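The contrast between OLS and LASSO can be seen on a small synthetic dataset. The sketch below is illustrative only: the data, the choice of alpha=0.2, and the variable names are assumptions, not values from this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data (assumption): only columns 0 and 2 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)      # keeps a coefficient for every column
lasso = Lasso(alpha=0.2).fit(X, y)      # L1 penalty can zero out coefficients

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("LASSO coefficients:", np.round(lasso.coef_, 3))
print("Columns kept by LASSO:", np.flatnonzero(lasso.coef_))
```

OLS assigns a nonzero weight to every column, including the irrelevant ones, while LASSO typically drives the weights of uninformative columns to exactly zero. How many are dropped depends on alpha, which is usually tuned by cross-validation.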

Advantages of Stepwise Regression:

  • Saves manual effort in feature selection
  • Reduces unnecessary variables

Limitations:

  • May not always find the optimal model
  • Sensitive to small changes in the data; the selected subset can also depend on the order in which variables are considered

Difference between Stepwise Regression and Linear Regression

Feature | Linear Regression | Stepwise Regression
Purpose | Models relationship between variables | Selects best subset of variables + builds model
Variables | Uses all given predictors | Selects important predictors automatically
Process | One-time model fitting | Iterative (add/remove variables)
Feature Selection | Not included | Built-in feature selection
Complexity | Fixed | Dynamic

Implementation of Stepwise Regression in Python

To perform stepwise regression in Python, you can follow these steps:

  • Install the mlxtend library by running pip install mlxtend in your command prompt or terminal.
  • Import SequentialFeatureSelector from mlxtend.feature_selection and linear_model from sklearn.
  • Define the features and target variables in your dataset.
  • Initialize the selector by passing SequentialFeatureSelector the estimator to be used (e.g. linear_model.LinearRegression() for linear regression).
  • Fit the selector to your dataset using the fit method.

Use the k_feature_names_ (or k_feature_idx_) attribute of the fitted selector to see which features were selected.

Importing Libraries

To implement stepwise regression, you will need to have the following libraries installed:

  • pandas: for data manipulation and analysis
  • NumPy: for working with arrays and matrices
  • scikit-learn (sklearn): for machine learning algorithms and preprocessing tools
  • mlxtend: for feature selection algorithms

The first step is to define an array of data with NumPy and convert it into a pandas dataframe. The features (all columns except the last) and the target (the last column) are then selected from the dataframe using the iloc indexer.

Python
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from mlxtend.feature_selection import SequentialFeatureSelector

# Define the array of data
data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# Convert the array into a dataframe
df = pd.DataFrame(data)

# Select the features and target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

Model Development in Stepwise Regression

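Next, feature selection is performed with the SequentialFeatureSelector class from the mlxtend library. The selector wraps whatever estimator is passed to it. This example passes a logistic regression model scored by accuracy, so the target values are treated as class labels rather than continuous values. The number of features to keep is set with the k_features parameter; because X has only three columns and k_features=3, all three are kept here. With forward=True the selector only adds features (forward selection); mlxtend also provides a floating option that allows previously chosen features to be dropped again.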

Python
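# Run sequential (forward) feature selection.
# LogisticRegression is a classifier, so y is treated as class labels.
sfs = SequentialFeatureSelector(linear_model.LogisticRegression(),
                                k_features=3,        # number of features to keep
                                forward=True,        # add features one at a time
                                scoring='accuracy',  # metric used to compare subsets
                                cv=None)             # no cross-validation (tiny dataset)
selected_features = sfs.fit(X, y)  # fit() returns the fitted selector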

After the selector has been fit, the chosen features are available through the selected_features.k_feature_names_ attribute (names) and the selected_features.k_feature_idx_ attribute (column positions), and a dataframe containing only those features is created. The data is then split into train and test sets with the train_test_split() function from sklearn, a logistic regression model is fit on the selected features, and its prediction for the test set is printed. The prediction can be compared with y_test using the accuracy_score() function from sklearn.

Python
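# Create a dataframe with only the selected features
# (k_feature_idx_ holds the positions of the chosen columns in X;
# the target column must not be included here)
selected_columns = list(selected_features.k_feature_idx_)
df_selected = X.iloc[:, selected_columns]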

# Split the data into train and test sets
X_train, X_test,\
    y_train, y_test = train_test_split(
        df_selected, y,
        test_size=0.3,
        random_state=42)

# Fit a logistic regression model using the selected features
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)

# Make predictions using the test set
y_pred = logreg.predict(X_test)

# Print the predicted label for the held-out row
print(y_pred)

Output:

[8]

In summary, linear regression models the relationship between a response and one or more predictor variables using every predictor it is given, while stepwise regression is a procedure for choosing which predictors go into the model by iteratively adding or removing them.

In the example above, the mlxtend library adds predictors one at a time according to a scoring metric and the final model is fit on the selected subset, whereas a plain linear regression would be fit once on all predictors.
