
Auto MPG Linear Regression Analysis

This notebook performs a complete linear regression analysis on the Auto MPG dataset. The goal is to predict the miles per gallon (mpg) of vehicles based on various features such as engine size, weight, and model year.

Dataset

The dataset is used to predict city-cycle fuel consumption in miles per gallon (MPG) from various vehicle attributes. It originally comes from the StatLib library maintained at Carnegie Mellon University and was used in the 1983 American Statistical Association Exposition. Predicting fuel efficiency helps manufacturers design better vehicles, assists consumers in making informed decisions, and supports environmental policy planning.

Use Case Description

Our objective is to predict mpg using linear regression techniques. This involves:

  • Exploring relationships between car attributes and fuel consumption
  • Identifying statistically significant predictors
  • Validating model assumptions and performance
  • Diagnosing issues such as multicollinearity, autocorrelation, and outliers
  • Evaluating model robustness using cross-validation and residual analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from statsmodels.api import OLS, add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('../Data/auto-mpg.csv')
df.dropna(inplace=True)
df
mpg cylinders displacement horsepower weight acceleration model-year
0 18.0 8 307.0 130.0 3504 12.0 70
1 15.0 8 350.0 165.0 3693 11.5 70
2 18.0 8 318.0 150.0 3436 11.0 70
3 16.0 8 304.0 150.0 3433 12.0 70
4 17.0 8 302.0 140.0 3449 10.5 70
... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790 15.6 82
394 44.0 4 97.0 52.0 2130 24.6 82
395 32.0 4 135.0 84.0 2295 11.6 82
396 28.0 4 120.0 79.0 2625 18.6 82
397 31.0 4 119.0 82.0 2720 19.4 82

396 rows × 7 columns

Feature Engineering and Cleaning

Data attributes and description

Feature Type Description
mpg Continuous Miles per gallon (target variable)
cylinders Multi-valued discrete Number of cylinders
displacement Continuous Engine displacement
horsepower Continuous Engine horsepower
weight Continuous Vehicle weight
acceleration Continuous Time to accelerate from 0 to 60 mph
model year Multi-valued discrete Year of manufacture
df.describe()
mpg cylinders displacement horsepower weight acceleration model-year
count 396.000000 396.000000 396.000000 396.000000 396.000000 396.000000 396.000000
mean 23.517172 5.457071 193.650253 104.189394 2973.000000 15.555808 76.027778
std 7.834368 1.703511 104.422387 38.402030 847.690354 2.758295 3.696969
min 9.000000 3.000000 68.000000 46.000000 1613.000000 8.000000 70.000000
25% 17.375000 4.000000 104.750000 75.000000 2225.250000 13.800000 73.000000
50% 23.000000 4.000000 148.500000 92.000000 2803.500000 15.500000 76.000000
75% 29.000000 8.000000 263.250000 125.000000 3610.000000 17.125000 79.000000
max 46.600000 8.000000 455.000000 230.000000 5140.000000 24.800000 82.000000

Categorical encoding

To use categorical variables in machine learning models, they must be converted into numerical form through encoding. Common types include:
  • Label Encoding: assigns a unique integer to each category. Simple, but may imply an order where none exists.
  • One-Hot Encoding: creates a binary column for each category. Prevents ordinal assumptions.
  • Dummy Encoding: a variant of one-hot encoding that drops one column to avoid multicollinearity (the dummy variable trap).

For example, if the feature 'cylinders' has the categories 3, 4, 5, 6, and 8, dummy encoding creates four columns (cylinders_4, cylinders_5, cylinders_6, cylinders_8), with cylinders_3 implied when all four are zero.
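
As a small, hypothetical illustration (a toy sketch separate from the notebook's pipeline), one-hot and dummy encoding of the cylinder categories can be compared directly:

# Toy example: the five cylinder categories from the dataset
toy = pd.DataFrame({'cylinders': [3, 4, 5, 6, 8]})

# One-hot encoding: one binary column per category
print(pd.get_dummies(toy, columns=['cylinders']))

# Dummy encoding: drop the first category, so cylinders_3 becomes the implicit baseline
print(pd.get_dummies(toy, columns=['cylinders'], drop_first=True))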

df = pd.get_dummies(df, columns=['cylinders', 'model-year'], drop_first=True)
df.columns
Index(['mpg', 'displacement', 'horsepower', 'weight', 'acceleration',
       'cylinders_4', 'cylinders_5', 'cylinders_6', 'cylinders_8',
       'model-year_71', 'model-year_72', 'model-year_73', 'model-year_74',
       'model-year_75', 'model-year_76', 'model-year_77', 'model-year_78',
       'model-year_79', 'model-year_80', 'model-year_81', 'model-year_82'],
      dtype='object')

Train test split

We separate the dataset into independent variables (features) and the dependent variable (target). The independent variables (X) are the predictors that influence the outcome, while the dependent variable (y) is what we aim to predict: in this case, the car's miles per gallon (mpg).

X = df.drop(columns=['mpg'])
y = df['mpg']

We divide our data into two parts: training data (80%) and testing data (20%). The training set is used to fit the model, while the test set evaluates how well the model generalizes to unseen data. This helps prevent overfitting and gives us a realistic estimate of model performance.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Baseline Linear Regression model

We fit a baseline linear regression on the training data to establish reference performance. The model assumes a linear relationship between features and MPG, with no hyperparameter tuning. We evaluate RMSE and R² on the test set.

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

Accuracy

After training, we evaluate the model using:

  • RMSE (Root Mean Squared Error): quantifies the average prediction error in MPG (lower is better).
    \(\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)
  • R² (Coefficient of Determination): indicates how much variance in the target is explained by the model (higher is better).
    \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\)
  • Adjusted R²: adjusts R² for the number of predictors, penalizing unnecessary complexity.
    \(\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\)
  • MAPE (Mean Absolute Percentage Error): expresses prediction error as a percentage of the actual values (lower is better).
    \(\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|\)

These metrics help us understand both accuracy and model fit.

def calculate_adjusted_r_squared(X_test, r2):
    n = X_test.shape[0]
    p = X_test.shape[1]
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adj_r2

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred) * 100

print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")
print('Adjusted R-squared: {0}'.format(round(calculate_adjusted_r_squared(X_test, r2), 3)))
print(f"MAPE: {mape:.2f}%")
RMSE: 2.552
R²: 0.874
Adjusted R-squared: 0.832
MAPE: 9.25%

The MAPE is about 9% (lower is better), and the model explains roughly 87% of the variance in MPG on the test set (R² = 0.874). Next we examine the significance of the individual coefficients and of the overall model; these statistical tests help assess the reliability and explanatory power of the regression.

Detailed statistical summary

t-test for Individual Coefficients

Used to determine whether each independent variable significantly contributes to the model.

  • Null Hypothesis (H₀): the coefficient of the predictor variable is equal to zero (i.e., the variable has no effect).
    $$ H_0: \beta_i = 0 $$
  • Alternative Hypothesis (H₁): the coefficient of the predictor variable is not equal to zero (i.e., the variable has a significant effect).
    $$ H_1: \beta_i \ne 0 $$

F-test for Overall Model Significance

Used to test whether the regression model as a whole is statistically significant.

  • Null Hypothesis (H₀): all regression coefficients are equal to zero (i.e., the model has no explanatory power).
    $$ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 $$
  • Alternative Hypothesis (H₁): at least one regression coefficient is not zero (i.e., the model explains some variability in the response).
    $$ H_1: \text{At least one } \beta_i \ne 0 $$

Using Ordinary Least Squares (OLS) from statsmodels, we get a detailed statistical summary:

  • p-values: test the significance of each feature. A low p-value (< 0.05) suggests the feature is statistically significant.
  • t-statistics: measure how many standard deviations the coefficient is from zero.
  • F-statistic: tests whether the overall regression model is a good fit.
  • Confidence intervals: provide a range in which the true coefficient likely falls.

This analysis helps us interpret the model beyond just accuracy.

X_const = add_constant(X)
model = OLS(y, X_const).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.864
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     119.5
Date:                Sun, 26 Oct 2025   Prob (F-statistic):          5.23e-149
Time:                        15:42:54   Log-Likelihood:                -980.93
No. Observations:                 396   AIC:                             2004.
Df Residuals:                     375   BIC:                             2087.
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            33.1314      2.397     13.822      0.000      28.418      37.845
displacement      0.0008      0.007      0.129      0.898      -0.012       0.014
horsepower       -0.0236      0.013     -1.788      0.075      -0.050       0.002
weight           -0.0057      0.001     -9.036      0.000      -0.007      -0.004
acceleration      0.0326      0.089      0.365      0.715      -0.143       0.208
cylinders_4       6.5792      1.573      4.184      0.000       3.487       9.671
cylinders_5       7.3758      2.368      3.115      0.002       2.719      12.032
cylinders_6       4.1555      1.760      2.361      0.019       0.695       7.616
cylinders_8       7.0737      2.032      3.480      0.001       3.077      11.070
model-year_71     1.0667      0.844      1.264      0.207      -0.592       2.726
model-year_72    -0.4355      0.831     -0.524      0.601      -2.070       1.199
model-year_73    -0.3859      0.749     -0.516      0.606      -1.858       1.086
model-year_74     1.4282      0.883      1.618      0.107      -0.308       3.164
model-year_75     1.1654      0.865      1.347      0.179      -0.536       2.867
model-year_76     1.6378      0.831      1.972      0.049       0.005       3.271
model-year_77     3.0367      0.845      3.592      0.000       1.374       4.699
model-year_78     3.0020      0.802      3.741      0.000       1.424       4.580
model-year_79     4.7447      0.850      5.579      0.000       3.073       6.417
model-year_80     9.3696      0.884     10.595      0.000       7.631      11.109
model-year_81     6.7887      0.874      7.767      0.000       5.070       8.507
model-year_82     7.4486      0.856      8.699      0.000       5.765       9.132
==============================================================================
Omnibus:                       30.461   Durbin-Watson:                   1.513
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               57.549
Skew:                           0.460   Prob(JB):                     3.19e-13
Kurtosis:                       4.625   Cond. No.                     7.88e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.88e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Feature Significance Summary

Statistically Significant Features

These features have p-values below 0.05, indicating they contribute significantly to predicting MPG:

  • weight (negative relationship with mpg)
  • cylinders_4, cylinders_5, cylinders_6, cylinders_8
  • model-year_76 through model-year_82

These variables are strong predictors of fuel efficiency in the dataset.

Non-Significant Features

These features have p-values above 0.05, suggesting they do not significantly impact the target variable in this model:

  • displacement
  • horsepower
  • acceleration
  • model-year_71 through model-year_75

These may be candidates for removal or further investigation (e.g., multicollinearity or interaction effects).
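
The same split can also be read off programmatically from the fitted statsmodels results. The snippet below is a small sketch using the model object fitted above; the 0.05 threshold is the conventional cut-off used in this section:

# Partition predictors by the conventional 0.05 significance threshold
pvalues = model.pvalues.drop('const')
significant = pvalues[pvalues < 0.05].index.tolist()
not_significant = pvalues[pvalues >= 0.05].index.tolist()
print('Significant features:', significant)
print('Non-significant features:', not_significant)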

Overall Model Significance

  • R-squared = 0.864: The model explains 86.4% of the variance in MPG, which indicates a strong fit.
  • F-statistic = 119, with a p-value < 0.05: This means the overall regression model is highly statistically significant, and at least one predictor variable is meaningfully related to the target.

Assumptions of Linear Regression

  1. Linearity: The relationship between predictors and response is linear.
  2. Independence: Observations are independent.
  3. Homoscedasticity: Constant variance of residuals.
  4. Normality of residuals: Residuals are normally distributed.
  5. No autocorrelation: Residuals are not autocorrelated.
  6. No multicollinearity: Predictors are not highly correlated.

Testing residuals

For a linear regression, we make the following assumptions about the residuals (errors):

  1. Randomly scattered (no patterns): indicates linearity and homoscedasticity.
  2. Normally distributed: checked using a Q-Q plot.
  3. Mean of zero: ensures unbiased predictions.
  4. No autocorrelation: ensures no lag or lead patterns in the data (checked with the Durbin-Watson test).

We use residual plots to visually inspect these assumptions. Violations may suggest model misspecification or the need for transformation.

residuals = y_test - y_pred
sns.histplot(residuals, kde=True)
plt.title("Residuals Distribution")
plt.show()

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()

plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted MPG")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")
plt.show()

[Figure: Residuals Distribution (histogram with KDE)]

[Figure: Q-Q Plot of Residuals]

[Figure: Residuals vs Predicted Values]

Results

  1. Residuals vs. Predicted Values Plot
     • Residuals are scattered around the horizontal line at zero, suggesting the model captures the linear relationship reasonably well.
     • No clear pattern is visible, which supports the assumptions of linearity and homoscedasticity (constant variance of residuals).
     • Any slight funnel shape or curvature would hint at mild heteroscedasticity or non-linearity.

  2. Q-Q Plot of Residuals
     • Most points lie close to the red reference line, indicating the residuals are approximately normally distributed, which satisfies a key assumption of linear regression.
     • Minor deviations at the tails suggest a few outliers or slight skewness.

  3. Histogram of Residuals
     • The distribution is roughly symmetric but slightly right-skewed.
     • Most residuals are centered around zero, which is good. The slight skew suggests a few larger positive errors, but overall the distribution is close to normal.

Durbin-Watson statistic

The Durbin-Watson (DW) statistic tests for autocorrelation in the residuals:

  • Value ≈ 2 → no autocorrelation
  • Value < 2 → positive autocorrelation
  • Value > 2 → negative autocorrelation

The interpretation of the DW test is limited for cross-sectional data without a natural ordering: the test assumes a meaningful sequence within the observations, so its result should be read with caution for the MPG dataset.

dw_stat = durbin_watson(residuals)
print('Durbin-Watson statistic:', dw_stat)
Durbin-Watson statistic: 2.2787422579479517

Conclusion

The residual diagnostics suggest that the model assumptions are mostly satisfied:

  • Linearity: supported by the residuals vs. predicted plot.
  • Homoscedasticity: no strong evidence of non-constant variance.
  • Normality of residuals: mostly satisfied, with minor deviations in the tails.
  • Autocorrelation: the Durbin-Watson statistic (≈ 2.28) shows no strong evidence of autocorrelation, subject to the caveat above for cross-sectional data.

Multicollinearity

One of the assumptions of linear regression is "no perfect multicollinearity": no independent variable should be an exact linear function of the others. Multicollinearity arises when two or more independent variables are highly correlated; when it is high, coefficient estimates become unstable and difficult to interpret. The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated by multicollinearity and helps diagnose the issue.

Multicollinearity violates a key assumption of linear regression, namely that the independent variables are not highly correlated. It can lead to:

  • Inflated standard errors
  • Unreliable p-values
  • Difficulty in identifying the true effect of each predictor

When multicollinearity is high, the coefficients (\(\beta\) values) and p-values become unstable, even though predictions can remain reasonably accurate; multicollinearity mainly harms interpretation rather than predictive accuracy.

For each predictor \(X_i\), VIF is calculated as: $$ \text{VIF}_i = \frac{1}{1 - R_i^2} $$ Where \(R_i^2\) is the R-squared value from regressing \(X_i\) on all other predictors.
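
To make the formula concrete, the VIF for a single predictor can be computed by hand by regressing it on the remaining predictors. This is only an illustrative sketch using scikit-learn on the encoded feature matrix X defined earlier; the statsmodels helper used below performs the same computation for every column:

# Manual VIF for 'weight': regress it on all other predictors and apply 1 / (1 - R^2)
others = X.drop(columns=['weight'])
r2_weight = LinearRegression().fit(others, X['weight']).score(others, X['weight'])
print('VIF(weight):', 1 / (1 - r2_weight))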

Interpreting VIF Values

VIF Value Interpretation
1 No multicollinearity
1 – 5 Moderate correlation (usually acceptable)
> 5 High correlation (potential concern)
> 10 Serious multicollinearity (consider action)
vif_data = pd.DataFrame()
# df is already dummy-encoded above, so get_dummies here is effectively a no-op
X_vif = add_constant(pd.get_dummies(df.drop(columns=['mpg']), drop_first=True))
vif_data['feature'] = X_vif.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif, i) for i in range(X_vif.shape[1])]
print(vif_data.sort_values("VIF", ascending=False))
          feature         VIF
0           const  259.596012
8     cylinders_8   35.914435
5     cylinders_4   27.913953
7     cylinders_6   23.177861
1    displacement   21.175481
3          weight   12.802439
2      horsepower   11.594109
4    acceleration    2.734647
14  model-year_76    2.446511
16  model-year_78    2.404265
18  model-year_80    2.398209
20  model-year_82    2.390149
13  model-year_75    2.368613
19  model-year_81    2.342316
11  model-year_73    2.298801
17  model-year_79    2.217575
12  model-year_74    2.159981
15  model-year_77    2.122018
10  model-year_72    2.051657
9   model-year_71    2.043253
6     cylinders_5    1.904919

We can see that several features are highly collinear, notably cylinders_4, cylinders_6, cylinders_8, displacement, weight, and horsepower, all with VIF values above 10.

Outlier detection

Outliers can distort model performance. We use z-scores to detect them: \(z = \frac{x - \mu}{\sigma}\)
A z-score greater than 3 or less than -3 typically indicates an outlier. Identifying and handling outliers ensures the model is not overly influenced by extreme values.

continuous = ['displacement','horsepower','weight','acceleration']
numeric_df = df[continuous]
z_scores = np.abs(stats.zscore(numeric_df))
outliers = (z_scores > 3).sum(axis=1)
num_outliers = (outliers > 0).sum()
print('Number of outliers detected using Z-score method:', num_outliers)
Number of outliers detected using Z-score method: 7
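
If we wanted to act on these points, one simple option (shown only as a sketch; it is not applied in the rest of this notebook) is to drop the flagged rows before refitting:

# Keep only rows where no continuous feature has |z| > 3 (not applied in the analysis below)
df_no_outliers = df[outliers == 0]
print('Rows before:', len(df), '| after removing outliers:', len(df_no_outliers))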

Cross-validation accuracy scores

Cross-validation splits the data into multiple folds to train and test the model on different subsets. This gives a more robust estimate of model performance. We typically use k-fold cross-validation (e.g., k=5 or 10) to reduce variance and avoid overfitting.

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(lr, X, y, cv=cv, scoring='neg_root_mean_squared_error')
print("CV RMSE (mean ± std):", -cv_scores.mean(), "±", cv_scores.std())
CV RMSE (mean ± std): 3.094602349591984 ± 0.5105627986552046

The cross-validation RMSE varies noticeably across folds (3.09 ± 0.51), which suggests the baseline model is not yet robust on unseen data.

Building a better model

We improve the model through a multi‑step approach: addressing multicollinearity, feature engineering, and regularization with cross‑validated hyperparameters:
  • Address multicollinearity using techniques such as removing high-VIF features, combining correlated variables, or applying dimensionality reduction.
  • Enhance the model with feature engineering: create interaction terms, polynomial features, or log transformations guided by bivariate visualizations.
  • Improve generalization and reduce overfitting with regularization techniques such as Lasso (for feature selection) and Ridge (for coefficient shrinkage), tuning their hyperparameters using grid search.
  • Validate the model using k-fold cross-validation to ensure stability and robustness across different data splits.

Visualizations

# Importing the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from ml_plottings import regression_bivariate_analysis, plot_correlation_matrix
df = pd.read_csv('../Data/auto-mpg.csv')
df['cylinders'] = df.cylinders.astype(str)
df['model-year'] = df['model-year'].astype(str)
df.dropna(inplace=True)
regression_bivariate_analysis(df, 'mpg')
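
ml_plottings is a project-local helper module that is not shown in this notebook. As a rough stand-in (an assumption about what it does: scatter plots for numeric features and box plots for categorical ones against the target), a minimal version could look like:

def simple_bivariate_analysis(data, target):
    # Scatter plot for numeric features, box plot for categorical ones, each versus the target
    for col in data.columns.drop(target):
        plt.figure(figsize=(6, 4))
        if pd.api.types.is_numeric_dtype(data[col]):
            sns.scatterplot(x=data[col], y=data[target])
        else:
            sns.boxplot(x=data[col], y=data[target])
        plt.title(f'{col} vs {target}')
        plt.show()

# simple_bivariate_analysis(df, 'mpg')  # usage mirroring the helper call above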

[Figures: bivariate plots of cylinders, displacement, horsepower, weight, acceleration, and model-year against mpg]

Observations

  1. Displacement and horsepower show a non-linear relationship with mpg, indicating that linear regression may not capture their effects accurately without transformation.
  2. MPG varies significantly by cylinder count: vehicles with 4 cylinders have the highest average mpg, followed by those with 5 cylinders, suggesting a strong categorical influence.
  3. Model year has a noticeable impact on mpg, with vehicles from 1977 onwards showing a higher average and a wider range of mpg values, likely due to improvements in fuel efficiency over time.

Correlation matrix

A correlation matrix is a table that shows the pairwise correlation coefficients between features in a dataset. It helps identify linear relationships between variables, where values close to +1 or -1 indicate strong positive or negative correlations, respectively. This is used to detect multicollinearity.

col_for_corr = ['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model-year']

plot_correlation_matrix(df, col_for_corr)
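
plot_correlation_matrix is also part of the project-local ml_plottings module. A standard equivalent (a sketch, assuming the helper essentially dummy-encodes the categorical columns and draws a heatmap of pairwise Pearson correlations) can be built with pandas and seaborn:

# Dummy-encode the categorical columns, then draw a heatmap of pairwise Pearson correlations
corr_df = pd.get_dummies(df[col_for_corr], drop_first=True).astype(float)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_df.corr(), cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix')
plt.show()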

[Figure: correlation matrix heatmap of the selected features]

From the correlation matrix, we observe that displacement, horsepower, cylinders_8, and weight are strongly positively correlated with each other, indicating that these features tend to increase together. In contrast, cylinders_4 and acceleration show negative correlations with this group, suggesting that vehicles with fewer cylinders and higher acceleration typically have lower displacement and weight (and are more fuel-efficient). These relationships are further confirmed by the high VIF scores for the positively correlated features, all exceeding 10, which signals severe multicollinearity and potential redundancy in the model.

Feature engineering

Feature engineering is done to address the non-linear relationships and improve model performance:

  1. The patterns suggest a hyperbolic (1/x) or logarithmic relationship, so we create \(1/\text{horsepower}\), \(1/\text{weight}\), \(1/\text{displacement}\), \(\log(\text{horsepower})\), \(\log(\text{weight})\) and \(\log(\text{displacement})\).
  2. From domain knowledge, horsepower and weight jointly influence fuel efficiency. To capture this interaction, we create \(\frac{1}{\text{horsepower} \times \text{weight}}\). Combining highly correlated features this way also helps reduce multicollinearity (lower VIF).

# Creating 1/var
df['1_hp'] = 1/df['horsepower']
df['1_weight'] = 1/df['weight']
df['1_disp'] = 1/df['displacement']

# Creating log(var)
df['log_disp'] = np.log(df.displacement)
df['log_hp'] = np.log(df.horsepower)
df['log_weight'] = np.log(df.weight)

# Creating domain knowledge vars
df['1_hp_times_weight'] = 1/(df['horsepower']*df['weight'])

regression_bivariate_analysis(df[['1_hp', '1_weight', '1_disp', '1_hp_times_weight', 'log_hp','log_weight', 'log_disp', 'mpg']], 'mpg')

[Figures: bivariate plots of the engineered features (1_hp, 1_weight, 1_disp, 1_hp_times_weight, log_hp, log_weight, log_disp) against mpg]

We can see that most of the newly created variables have an approximately linear relationship with mpg.

Instead of treating each model year as a separate category, we can group the years into three meaningful segments based on mpg trends:

  1. 1970–1975: lower average mpg and a narrower range, representing older, less fuel-efficient vehicles.
  2. 1976–1979: moderate increase in mpg, a transitional period with improving fuel standards.
  3. 1980–1982: significant jump in average mpg and a wider range, reflecting newer, more efficient models.

This can be due to regulatory changes or fuel efficiency improvements in these eras.

df['model_year_int'] = df['model-year'].astype(int)
bins = [1969, 1975, 1979, 1982]
labels = ['1970–1975', '1976–1979', '1980–1982']
df['model_year_group'] = pd.cut(1900 + df['model_year_int'], bins=bins, labels=labels)

regression_bivariate_analysis(df[['model_year_group', 'mpg']], 'mpg')

[Figure: mpg distribution by model_year_group]

We can see that there is a significant difference in mpg between the three model-year groups.

Feature selection

Since several variables are highly correlated (as seen in the correlation matrix and the VIF scores), some are not significant (from the p-values in the statistical summary), and multiple new variables have been created, it is important to apply feature selection to reduce redundancy, improve model interpretability, and prevent overfitting. Two effective techniques are:

  1. Forward and backward selection: stepwise methods that identify a subset of features contributing most to predicting mpg.
  2. Regularization (Lasso and Ridge regression): regularization stabilizes estimates under multicollinearity and improves generalization. Ridge (L2) shrinks coefficients; Lasso (L1) can set some to zero, effectively performing feature selection. We select the regularization strength via cross-validated grid search in a leakage-safe pipeline.

Forward and backward selection

These are feature selection techniques:

  • Forward selection: start with no features and add one at a time based on performance improvement.
  • Backward elimination: start with all features and iteratively remove the least significant one.

These methods help in building a simpler, more interpretable model by retaining only the most relevant predictors.

df_fb_selection = pd.get_dummies(df, columns=['model_year_group', 'cylinders'], drop_first=True)
X_fb_selection = df_fb_selection.drop(['mpg', 'model-year'], axis=1, inplace=False)
y_fb_selection = df_fb_selection.mpg

cv5 = KFold(n_splits=5, shuffle=True, random_state=42)
sfs_forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='forward', cv=cv5)
sfs_forward.fit(X_fb_selection, y_fb_selection)

sfs_backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='backward', cv=cv5)
sfs_backward.fit(X_fb_selection, y_fb_selection)

forward_features = X_fb_selection.columns[sfs_forward.get_support()].tolist()
backward_features = X_fb_selection.columns[sfs_backward.get_support()].tolist()
print('Forward selected features:', forward_features)
print('Backward selected features:', backward_features)
Forward selected features: ['weight', '1_hp', 'log_weight', 'model_year_int', 'model_year_group_1980–1982']
Backward selected features: ['1_hp', '1_weight', 'model_year_int', 'model_year_group_1980–1982', 'cylinders_4']

Lasso and Ridge Regularization

Regularization techniques shrink the magnitude of the coefficients, which mitigates the effects of multicollinearity.

  • Lasso regression adds an L1 penalty, equal to the absolute value of the coefficients, which can shrink some coefficients exactly to zero, effectively performing feature selection.
    \(\text{Loss} = \text{RSS} + \alpha \sum |\beta_i|\)
  • Ridge regression adds an L2 penalty, equal to the square of the coefficients, which reduces the magnitude of the coefficients but retains all features.
    \(\text{Loss} = \text{RSS} + \alpha \sum \beta_i^2\)

Because the penalty acts on the magnitude of the \(\beta\) values, the scale of the variables becomes important, so we first bring all variables onto the same scale before applying Lasso and Ridge regularization.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# list of features
num_features = ['displacement','horsepower','weight','acceleration',
                '1_hp','1_weight','1_disp','log_disp','log_hp','log_weight',
                '1_hp_times_weight']
cat_features = ['model_year_group','cylinders']

# Scaling and one hot encoding
preprocess = ColumnTransformer([
    ('num', MinMaxScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_features)
])

# Test train split
X_lasso = df[num_features + cat_features]
y_lasso = df['mpg']
X_train, X_test, y_train, y_test = train_test_split(X_lasso, y_lasso, test_size=0.2, random_state=42)

# Creating pipelines
lasso_pipe = Pipeline([('preprocess', preprocess), ('reg', Lasso(alpha=0.1, max_iter=10000))])
ridge_pipe = Pipeline([('preprocess', preprocess), ('reg', Ridge(alpha=1.0))])

# Fitting the models
lasso_pipe.fit(X_train, y_train)
ridge_pipe.fit(X_train, y_train)


# Create DataFrame for coefficients
ohe = lasso_pipe.named_steps['preprocess'].transformers_[1][1]
cat_names = list(ohe.get_feature_names_out(cat_features))
feat_names = num_features + cat_names

coef_df = pd.DataFrame({
    'Feature': feat_names,
    'Lasso Coefficient': lasso_pipe.named_steps['reg'].coef_,
    'Ridge Coefficient': ridge_pipe.named_steps['reg'].coef_
})

# Plot
coef_df.set_index('Feature').plot(kind='bar', figsize=(10, 6))
plt.title('Lasso vs Ridge Coefficients')
plt.ylabel('Coefficient Value')
plt.grid(True)
plt.show();

png

These plots illustrate how two different regularization approaches can be used for feature selection and coefficient shrinkage.

We have looked at several approaches that yield different feature selections and different results, and within each model different hyperparameters can be tuned. Cross-validation and grid search let us identify the best parameters and models across all possible combinations, improving the robustness and performance of a base machine learning model.

  • Cross-validation evaluates a model's generalizability by splitting the training data into multiple folds, ensuring that the model is tested on different subsets rather than on a single train-test split. This reduces the risk of overfitting and gives a more reliable estimate of model performance.
  • GridSearchCV builds on this by systematically searching a predefined set of hyperparameters (such as alpha in regularized regressions) using cross-validation to find the combination that yields the best performance metric (e.g., R² or RMSE).

Together, they help in selecting the most optimal model configuration, making the final model more accurate and stable when applied to unseen data.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


# Define models
lr = LinearRegression()
lasso = Lasso(max_iter=10000)
ridge = Ridge()
elastic_net = ElasticNet(max_iter=10000)

# Features
num_features = ['displacement','horsepower','weight','acceleration', '1_hp','1_weight','1_disp',
                'log_disp','log_hp','log_weight', '1_hp_times_weight']
cat_features = ['model_year_group','cylinders']

# Ensure correct types
df['model_year_group'] = df['model_year_group'].astype('category')
df['cylinders'] = df['cylinders'].astype('category')

# Create a pipeline
preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_features)
    ],
    remainder='drop'
)

pipe = Pipeline([
    ('preprocess', preprocess),
    ('regressor', Lasso())  # placeholder; will be overridden by grid
])

# Define hyperparameter grids
param_lr = {'regressor': [lr]}
param_lasso = {'regressor': [lasso], 'regressor__alpha': [0.01, 0.1, 1, 10]}
param_ridge = {'regressor': [ridge], 'regressor__alpha': [0.01, 0.1, 1, 10]}
param_elastic_net = {'regressor': [elastic_net], 'regressor__alpha': [0.01, 0.1, 1, 10], 'regressor__l1_ratio':[0.25, 0.5, 0.75]}

# Combine all
param_grid = [param_lr, param_lasso, param_ridge, param_elastic_net]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df[num_features+cat_features], df.mpg, test_size=0.2, random_state=42)

# Grid Search CV
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best model and coefficients
best_model = grid_search.best_estimator_.named_steps['regressor']
feature_names = num_features + list(grid_search.best_estimator_.named_steps['preprocess'].transformers_[1][1].get_feature_names_out(cat_features))
coefficients = pd.Series(best_model.coef_, index=feature_names)

# Output
print("Best Parameters:", grid_search.best_params_)
print("Best(least) RMSE Error:", -grid_search.best_score_)
print("\nFeature Coefficients (β values): (for scaled variables)")
print(coefficients)
Best Parameters: {'regressor': Ridge(), 'regressor__alpha': 0.1}
Best(least) RMSE Error: 2.9181586103855244

Feature Coefficients (β values): (for scaled variables)
displacement                  0.849895
horsepower                   -0.421171
weight                        3.749970
acceleration                 -0.862817
1_hp                          9.801057
1_weight                      5.663238
1_disp                        3.234586
log_disp                      1.114903
log_hp                        2.405622
log_weight                   -4.632756
1_hp_times_weight            -8.933534
model_year_group_1976–1979    2.956526
model_year_group_1980–1982    7.097160
cylinders_4                   7.177942
cylinders_5                   8.560141
cylinders_6                   6.633527
cylinders_8                   7.123100
dtype: float64

We can see that the cross-validated root mean squared error drops from about 3.1 to 2.9. This indicates that linear regression, augmented with thoughtful feature engineering, preprocessing, and regularization, produces a strong and interpretable MPG model on this historical data, while the diagnostics and cross-validation guard against over-confidence and point to avenues for further improvement.
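
As a final check (a sketch only; the output is not reproduced here), the selected pipeline can also be scored on the held-out test split created for the grid search:

# Evaluate the best cross-validated pipeline on the held-out test split
best_pipe = grid_search.best_estimator_
y_test_pred = best_pipe.predict(X_test)
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('Test R²:', r2_score(y_test, y_test_pred))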

References

Dataset: Auto MPG Data Set, UCI Machine Learning Repository (originally from StatLib / CMU; used in the 1983 ASA Exposition)
