Auto MPG Linear Regression Analysis¶
This notebook performs a complete linear regression analysis on the Auto MPG dataset. The goal is to predict the miles per gallon (mpg) of vehicles based on various features such as engine size, weight, and model year.
Dataset¶
The dataset is designed to predict city-cycle fuel consumption in miles per gallon (MPG) based on various vehicle attributes. It comes from the StatLib library maintained at Carnegie Mellon University and was used in the 1983 American Statistical Association Exposition. Predicting fuel efficiency helps manufacturers design better vehicles, assists consumers in making informed decisions, and supports environmental policy planning.
Use Case Description¶
Our objective is to predict mpg using linear regression techniques. This involves:
- Exploring relationships between car attributes and fuel consumption
- Identifying statistically significant predictors
- Validating model assumptions and performance
- Diagnosing issues such as multicollinearity, autocorrelation, and outliers
- Evaluating model robustness using cross-validation and residual analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from statsmodels.api import OLS, add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('../Data/auto-mpg.csv')
df.dropna(inplace=True)
df
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model-year |
|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 |
396 rows × 7 columns
Feature Engineering and Cleaning¶
Data attributes and description
| Feature | Type | Description |
|---|---|---|
| mpg | Continuous | Miles per gallon (target variable) |
| cylinders | Multi-valued discrete | Number of cylinders |
| displacement | Continuous | Engine displacement (cubic inches) |
| horsepower | Continuous | Engine horsepower |
| weight | Continuous | Vehicle weight (lbs) |
| acceleration | Continuous | Time to accelerate from 0 to 60 mph (seconds) |
| model-year | Multi-valued discrete | Year of manufacture (last two digits, e.g., 70 = 1970) |
df.describe()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model-year |
|---|---|---|---|---|---|---|---|
| count | 396.000000 | 396.000000 | 396.000000 | 396.000000 | 396.000000 | 396.000000 | 396.000000 |
| mean | 23.517172 | 5.457071 | 193.650253 | 104.189394 | 2973.000000 | 15.555808 | 76.027778 |
| std | 7.834368 | 1.703511 | 104.422387 | 38.402030 | 847.690354 | 2.758295 | 3.696969 |
| min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 | 70.000000 |
| 25% | 17.375000 | 4.000000 | 104.750000 | 75.000000 | 2225.250000 | 13.800000 | 73.000000 |
| 50% | 23.000000 | 4.000000 | 148.500000 | 92.000000 | 2803.500000 | 15.500000 | 76.000000 |
| 75% | 29.000000 | 8.000000 | 263.250000 | 125.000000 | 3610.000000 | 17.125000 | 79.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 82.000000 |
Categorical encoding¶
To use categorical variables in machine learning models, they must be converted into numerical form through encoding. Common types include:
- Label Encoding: Assigns a unique integer to each category. Simple, but may imply an order where none exists.
- One-Hot Encoding: Creates a binary column for each category. Prevents ordinal assumptions.
- Dummy Encoding: A variant of one-hot encoding that drops one column to avoid perfect multicollinearity (the dummy variable trap).
For example, if the feature 'cylinders' has categories 3, 4, 5, 6, and 8, dummy encoding creates four columns (cylinders_4, cylinders_5, cylinders_6, cylinders_8), with cylinders_3 implied when all four are zero.
df = pd.get_dummies(df, columns=['cylinders', 'model-year'], drop_first=True)
df.columns
Index(['mpg', 'displacement', 'horsepower', 'weight', 'acceleration',
'cylinders_4', 'cylinders_5', 'cylinders_6', 'cylinders_8',
'model-year_71', 'model-year_72', 'model-year_73', 'model-year_74',
'model-year_75', 'model-year_76', 'model-year_77', 'model-year_78',
'model-year_79', 'model-year_80', 'model-year_81', 'model-year_82'],
dtype='object')
Train test split¶
We separate the dataset into independent variables (features) and the dependent variable (target). The independent variables (X) are the predictors that influence the outcome, while the dependent variable (y) is what we aim to predict: in this case, the car's miles per gallon (mpg).
X = df.drop(columns=['mpg'])
y = df['mpg']
We divide our data into two parts: training data (80%) and testing data (20%). The training set is used to fit the model, while the test set evaluates how well the model generalizes to unseen data. This helps prevent overfitting and gives us a realistic estimate of model performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Baseline Linear Regression model¶
We fit a baseline linear regression on the training data to establish reference performance. The model assumes a linear relationship between features and MPG, with no hyperparameter tuning. We evaluate RMSE and R² on the test set.
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
Accuracy¶
After training, we evaluate the model using:
- RMSE (Root Mean Squared Error): quantifies the average prediction error in MPG (lower is better).
\(\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)
- R² (Coefficient of Determination): indicates how much of the variance in the target is explained by the model (higher is better).
\(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\)
- Adjusted R²: adjusts R² for the number of predictors, penalizing unnecessary complexity.
\(\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\)
- MAPE (Mean Absolute Percentage Error): expresses prediction error as a percentage of the actual values, making the error easy to interpret in relative terms (lower is better).
\(\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|\)
These metrics help us understand both accuracy and model fit.
def calculate_adjusted_r_squared(X_test, r2):
    # n = number of observations, p = number of predictors
    n = X_test.shape[0]
    p = X_test.shape[1]
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adj_r2
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred) * 100
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")
print('Adjusted R-squared: {0}'.format(round(calculate_adjusted_r_squared(X_test, r2), 3)))
print(f"MAPE: {mape:.2f}%")
RMSE: 2.552
R²: 0.874
Adjusted R-squared: 0.832
MAPE: 9.25%
MAPE is about 9% (lower is better), and the model explains roughly 87% of the variance in MPG (R² = 0.874). Next, we examine the significance of the individual coefficients and of the overall model. These statistical tests help assess the reliability and explanatory power of the regression.
Detailed statistical summary¶
t-test for Individual Coefficients¶
Used to determine whether each independent variable significantly contributes to the model.
- Null Hypothesis (H₀): The coefficient of the predictor variable is equal to zero (i.e., the variable has no effect).
$$ H_0: \beta_i = 0 $$
- Alternative Hypothesis (H₁): The coefficient of the predictor variable is not equal to zero (i.e., the variable has a significant effect).
$$ H_1: \beta_i \ne 0 $$
F-test for Overall Model Significance¶
Used to test whether the regression model as a whole is statistically significant.
- Null Hypothesis (H₀): All regression coefficients are equal to zero (i.e., the model has no explanatory power).
$$ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 $$
- Alternative Hypothesis (H₁): At least one regression coefficient is not zero (i.e., the model explains some variability in the response).
$$ H_1: \text{At least one } \beta_i \ne 0 $$

Using Ordinary Least Squares (OLS) from statsmodels, we get a detailed statistical summary:
- p-values: test the significance of each feature. A low p-value (< 0.05) suggests the feature is statistically significant.
- t-statistics: measure how many standard deviations each coefficient is from zero.
- F-statistic: tests whether the overall regression model is a good fit.
- Confidence intervals: provide a range in which the true coefficient likely falls.

This analysis helps us interpret the model beyond just accuracy.
X_const = add_constant(X)
model = OLS(y, X_const).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.864
Model: OLS Adj. R-squared: 0.857
Method: Least Squares F-statistic: 119.5
Date: Sun, 26 Oct 2025 Prob (F-statistic): 5.23e-149
Time: 15:42:54 Log-Likelihood: -980.93
No. Observations: 396 AIC: 2004.
Df Residuals: 375 BIC: 2087.
Df Model: 20
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const 33.1314 2.397 13.822 0.000 28.418 37.845
displacement 0.0008 0.007 0.129 0.898 -0.012 0.014
horsepower -0.0236 0.013 -1.788 0.075 -0.050 0.002
weight -0.0057 0.001 -9.036 0.000 -0.007 -0.004
acceleration 0.0326 0.089 0.365 0.715 -0.143 0.208
cylinders_4 6.5792 1.573 4.184 0.000 3.487 9.671
cylinders_5 7.3758 2.368 3.115 0.002 2.719 12.032
cylinders_6 4.1555 1.760 2.361 0.019 0.695 7.616
cylinders_8 7.0737 2.032 3.480 0.001 3.077 11.070
model-year_71 1.0667 0.844 1.264 0.207 -0.592 2.726
model-year_72 -0.4355 0.831 -0.524 0.601 -2.070 1.199
model-year_73 -0.3859 0.749 -0.516 0.606 -1.858 1.086
model-year_74 1.4282 0.883 1.618 0.107 -0.308 3.164
model-year_75 1.1654 0.865 1.347 0.179 -0.536 2.867
model-year_76 1.6378 0.831 1.972 0.049 0.005 3.271
model-year_77 3.0367 0.845 3.592 0.000 1.374 4.699
model-year_78 3.0020 0.802 3.741 0.000 1.424 4.580
model-year_79 4.7447 0.850 5.579 0.000 3.073 6.417
model-year_80 9.3696 0.884 10.595 0.000 7.631 11.109
model-year_81 6.7887 0.874 7.767 0.000 5.070 8.507
model-year_82 7.4486 0.856 8.699 0.000 5.765 9.132
==============================================================================
Omnibus: 30.461 Durbin-Watson: 1.513
Prob(Omnibus): 0.000 Jarque-Bera (JB): 57.549
Skew: 0.460 Prob(JB): 3.19e-13
Kurtosis: 4.625 Cond. No. 7.88e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.88e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Feature Significance Summary¶
Statistically Significant Features¶
These features have p-values below 0.05, indicating they contribute significantly to predicting MPG:
- weight (negative relationship with mpg)
- cylinders_4, cylinders_5, cylinders_6, cylinders_8
- model-year_76 through model-year_82

These variables are strong predictors of fuel efficiency in this dataset.
Non-Significant Features¶
These features have p-values greater than 0.05, suggesting they do not significantly impact the target variable in this model:
- displacement
- horsepower
- acceleration
- model-year_71 through model-year_75
These may be candidates for removal or further investigation (e.g., multicollinearity or interaction effects).
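As a quick programmatic cross-check of the lists above, the same split can be derived from the fitted statsmodels results; this is a small sketch assuming `model` is the OLS fit from the summary cell and using the conventional 0.05 threshold.

# Split coefficients by p-value (sketch; `model` is the fitted OLS results above)
alpha_level = 0.05
pvals = model.pvalues.drop('const')
print('Significant (p < 0.05):', pvals[pvals < alpha_level].index.tolist())
print('Not significant (p >= 0.05):', pvals[pvals >= alpha_level].index.tolist())
print(f'Overall F-statistic: {model.fvalue:.1f} (p = {model.f_pvalue:.3g})')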
Overall Model Significance¶
- R-squared = 0.864: The model explains 86.4% of the variance in MPG, which indicates a strong fit.
- F-statistic = 119, with a p-value < 0.05: This means the overall regression model is highly statistically significant, and at least one predictor variable is meaningfully related to the target.
Assumptions of Linear Regression¶
- Linearity: The relationship between predictors and response is linear.
- Independence: Observations are independent.
- Homoscedasticity: Constant variance of residuals.
- Normality of residuals: Residuals are normally distributed.
- No autocorrelation: Residuals are not autocorrelated.
- No multicollinearity: Predictors are not highly correlated.
Testing residuals¶
For a linear regression, we make the following assumptions about the residuals (errors):
1. Randomly scattered (no patterns): indicates linearity and homoscedasticity.
2. Normally distributed: checked using a Q-Q plot.
3. Mean of zero: ensures unbiased predictions.
4. No autocorrelation: no lag or lead patterns in the residuals (checked with the Durbin-Watson test).
We use residual plots to visually inspect these assumptions. Violations may suggest model misspecification or the need for transformation.
residuals = y_test - y_pred
sns.histplot(residuals, kde=True)
plt.title("Residuals Distribution")
plt.show()
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted MPG")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")
plt.show()



Results
1. Residuals vs. Predicted Values plot
   - Residuals are scattered around the horizontal line at zero, suggesting the model captures the linear relationship reasonably well.
   - No clear pattern is visible, which supports the assumptions of linearity and homoscedasticity (constant variance of residuals).
   - Any slight funnel shape or curvature would hint at mild heteroscedasticity or non-linearity.
2. Q-Q plot of residuals
   - Most points lie close to the red reference line, indicating the residuals are approximately normally distributed, which satisfies a key assumption of linear regression.
   - Minor deviations at the tails suggest a few outliers or slight skewness.
3. Histogram of residuals
   - The distribution is roughly symmetric but slightly right-skewed.
   - Most residuals are centered around zero. The slight skew reflects a few larger positive errors, but overall the distribution is close to normal.
Durbin-Watson statistic
The Durbin-Watson (DW) statistic tests for autocorrelation in residuals:
- Value ≈ 2 → no autocorrelation
- Value < 2 → positive autocorrelation
- Value > 2 → negative autocorrelation
Interpretation of the DW test is limited for cross-sectional data: the test assumes a meaningful ordering of the observations, which this dataset lacks, so the result should be read with caution.
dw_stat = durbin_watson(residuals)
print('Durbin-Watson statistic:', dw_stat)
Durbin-Watson statistic: 2.2787422579479517
Conclusion
The residual diagnostics suggest that the model assumptions are mostly satisfied:
- Linearity: supported by the residuals vs. predicted plot.
- Homoscedasticity: no strong evidence of non-constant variance.
- Normality of residuals: mostly satisfied, with minor deviations at the tails.
- Autocorrelation: no evidence of autocorrelation in the residuals.
Multicollinearity¶
Multicollinearity occurs when two or more independent variables are highly correlated. One of the assumptions of linear regression is "no perfect multicollinearity": no independent variable should be an exact linear function of the others. The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated by multicollinearity; high multicollinearity makes coefficient estimates unstable and difficult to interpret.
Multicollinearity can lead to:
- Inflated standard errors
- Unreliable p-values
- Difficulty in identifying the true effect of each predictor
When multicollinearity is high, the coefficients (\(\beta\) values) and p-values become unstable, even though the model's predictions can remain reasonably accurate.
For each predictor \(X_i\), VIF is calculated as: $$ \text{VIF}_i = \frac{1}{1 - R_i^2} $$ Where \(R_i^2\) is the R-squared value from regressing \(X_i\) on all other predictors.
Interpreting VIF Values
| VIF Value | Interpretation |
|---|---|
| 1 | No multicollinearity |
| 1 – 5 | Moderate correlation (usually acceptable) |
| > 5 | High correlation (potential concern) |
| > 10 | Serious multicollinearity (consider action) |
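To make the formula concrete, the sketch below computes the VIF of a single predictor by hand: regress that predictor on all the others (with an intercept) and apply \(1/(1 - R_i^2)\). It reuses the dummy-encoded feature matrix `X` created for the train/test split and should roughly agree with the statsmodels values reported below; the helper name `manual_vif` is just for illustration.

# Manual VIF for one predictor, illustrating VIF_i = 1 / (1 - R_i^2)
# (sketch; X is the dummy-encoded feature matrix defined earlier)
def manual_vif(features, col):
    others = features.drop(columns=[col]).astype(float)
    target = features[col].astype(float)
    r2_i = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r2_i)

print('Manual VIF for weight:', round(manual_vif(X, 'weight'), 2))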
# df is already dummy-encoded, so we only add a constant for the intercept
X_vif = add_constant(df.drop(columns=['mpg']).astype(float))
vif_data = pd.DataFrame()
vif_data['feature'] = X_vif.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
print(vif_data.sort_values("VIF", ascending=False))
feature VIF
0 const 259.596012
8 cylinders_8 35.914435
5 cylinders_4 27.913953
7 cylinders_6 23.177861
1 displacement 21.175481
3 weight 12.802439
2 horsepower 11.594109
4 acceleration 2.734647
14 model-year_76 2.446511
16 model-year_78 2.404265
18 model-year_80 2.398209
20 model-year_82 2.390149
13 model-year_75 2.368613
19 model-year_81 2.342316
11 model-year_73 2.298801
17 model-year_79 2.217575
12 model-year_74 2.159981
15 model-year_77 2.122018
10 model-year_72 2.051657
9 model-year_71 2.043253
6 cylinders_5 1.904919
Several features show high VIF values (above 10), notably cylinders_8, cylinders_4, cylinders_6, displacement, weight, and horsepower, confirming that they are highly collinear.
Outlier detection¶
Outliers can distort model performance. We use z-scores to detect them: \(z = \frac{x - \mu}{\sigma}\)
A z-score above 3 or below -3 typically indicates an outlier. Identifying and handling outliers ensures the model is not overly influenced by extreme values.
continuous = ['displacement','horsepower','weight','acceleration']
numeric_df = df[continuous]
z_scores = np.abs(stats.zscore(numeric_df))
outliers = (z_scores > 3).sum(axis=1)
num_outliers = (outliers > 0).sum()
print('Number of outliers detected using Z-score method:', num_outliers)
Number of outliers detected using Z-score method: 7
Cross-validation accuracy scores¶
Cross-validation splits the data into multiple folds to train and test the model on different subsets. This gives a more robust estimate of model performance. We typically use k-fold cross-validation (e.g., k=5 or 10) to reduce variance and avoid overfitting.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(lr, X, y, cv=cv, scoring='neg_root_mean_squared_error')
print("CV RMSE (mean ± std):", -cv_scores.mean(), "±", cv_scores.std())
CV RMSE (mean ± std): 3.094602349591984 ± 0.5105627986552046
The cross-validation RMSE varies noticeably from fold to fold (standard deviation of about 0.5), which suggests the model is not yet fully robust on unseen data.
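To see the spread behind that statement, the per-fold errors can be printed directly (a small sketch reusing `cv_scores` from the cell above; the sign is flipped because scikit-learn returns negated RMSE for this scoring string).

# Per-fold RMSE behind the mean ± std reported above (sketch)
for fold, rmse_fold in enumerate(-cv_scores, start=1):
    print(f'Fold {fold}: RMSE = {rmse_fold:.3f}')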
Building a better model¶
We improve the model through a multi‑step approach: addressing multicollinearity, feature engineering, and regularization with cross‑validated hyperparameters:
- Address multicollinearity by removing high-VIF features, combining correlated variables, or applying dimensionality reduction.
- Enhance the model with feature engineering: create interaction terms, polynomial features, or log transformations guided by bivariate visualizations.
- Apply regularization techniques such as Lasso (for feature selection) and Ridge (for coefficient shrinkage) to improve generalization and reduce overfitting, tuning their hyperparameters with grid search.
- Validate the model using k-fold cross-validation to ensure stability and robustness across different data splits.
Visualizations¶
# Importing the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from ml_plottings import regression_bivariate_analysis, plot_correlation_matrix
df = pd.read_csv('../Data/auto-mpg.csv')
df['cylinders'] = df.cylinders.astype(str)
df['model-year'] = df['model-year'].astype(str)
df.dropna(inplace=True)
regression_bivariate_analysis(df, 'mpg')






Observations
1. Displacement and horsepower show a non-linear relationship with mpg, indicating that linear regression may not capture their effects accurately without transformation.
2. MPG varies significantly by cylinder count: vehicles with 4 cylinders have the highest average mpg, followed by those with 5 cylinders, suggesting a strong categorical influence.
3. Model year has a noticeable impact on mpg, with vehicles from 1977 onwards showing a higher average and a wider range of mpg values, likely due to improvements in fuel efficiency over time.
Correlation matrix¶
A correlation matrix is a table that shows the pairwise correlation coefficients between features in a dataset. It helps identify linear relationships between variables, where values close to +1 or -1 indicate strong positive or negative correlations, respectively. This is used to detect multicollinearity.
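`plot_correlation_matrix` comes from the local `ml_plottings` helper used above. If that helper is unavailable, a roughly equivalent heatmap can be produced with pandas and seaborn, as sketched here for the numeric columns only (cylinders and model-year were cast to strings above).

# Alternative correlation heatmap using pandas + seaborn (sketch, numeric columns only)
numeric_cols = ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix (numeric features)')
plt.show()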
col_for_corr = ['cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'model-year']
plot_correlation_matrix(df, col_for_corr)

From the correlation matrix, we observe that displacement, horsepower, cylinders_8, and weight are strongly positively correlated with each other, indicating that these features tend to increase together. In contrast, cylinders_4 and acceleration are negatively correlated with this group, suggesting that vehicles with fewer cylinders and longer 0 to 60 mph times typically have lower displacement and weight and are more fuel-efficient. These relationships are consistent with the high VIF scores (above 10) for the positively correlated features, which signal severe multicollinearity and potential redundancy in the model.
Feature engineering¶
Feature engineering is done to address the non-linear relationships and improve model performance:
1. The patterns suggest a hyperbolic (1/x) or logarithmic relationship, so we create \(1/\text{horsepower}\), \(1/\text{weight}\), and \(\log(\text{displacement})\), along with the other reciprocal and log variants below.
2. From domain knowledge, horsepower and weight jointly influence fuel efficiency. To capture this interaction we create \(\frac{1}{\text{horsepower} \times \text{weight}}\), which also helps reduce multicollinearity by combining highly correlated features.
# Creating 1/var
df['1_hp'] = 1/df['horsepower']
df['1_weight'] = 1/df['weight']
df['1_disp'] = 1/df['displacement']
# Creating log(var)
df['log_disp'] = np.log(df.displacement)
df['log_hp'] = np.log(df.horsepower)
df['log_weight'] = np.log(df.weight)
# Creating domain knowledge vars
df['1_hp_times_weight'] = 1/(df['horsepower']*df['weight'])
regression_bivariate_analysis(df[['1_hp', '1_weight', '1_disp', '1_hp_times_weight', 'log_hp','log_weight', 'log_disp', 'mpg']], 'mpg')







We can see that most of the newly created variables have an approximately linear relationship with mpg.
Instead of treating each model year as a separate category, we can group the years into three meaningful segments based on mpg trends:
1. 1970–1975: lower average mpg and a narrower range; represents older, less fuel-efficient vehicles.
2. 1976–1979: a moderate increase in mpg; a transitional period with improving fuel standards.
3. 1980–1982: a significant jump in average mpg and a wider range; reflects newer, more efficient models.
This can be due to regulatory changes or fuel efficiency improvements in these eras.
df['model_year_int'] = df['model-year'].astype(int)
bins = [1969, 1975, 1979, 1982]
labels = ['1970–1975','1976–1979','1980–1982']
df['model_year_group'] = pd.cut(1900 + df['model_year_int'], bins=bins, labels=labels)
regression_bivariate_analysis(df[['model_year_group', 'mpg']], 'mpg')

We can see that there is a significant difference between the groups.
Feature selection¶
Since several variables are highly correlated (as seen in the correlation matrix and VIF scores) or not statistically significant (from the p-values in the OLS summary), and multiple new variables have been created, it is important to apply feature selection to reduce redundancy, improve model interpretability, and prevent overfitting. Two effective techniques are:
1. Forward and backward selection: stepwise methods that identify a subset of features contributing most to predicting mpg.
2. Regularization (Lasso and Ridge regression): regularization stabilizes estimates under multicollinearity and improves generalization. Ridge (L2) shrinks coefficients; Lasso (L1) can set some to zero (feature selection). We select the regularization strength via cross-validated grid search in a leakage-safe pipeline.
Forward and backward selection¶
These are feature selection techniques: - Forward Selection: Start with no features, add one at a time based on performance improvement. - Backward Elimination: Start with all features, remove the least significant one iteratively.
These methods help in building a simpler, more interpretable model by retaining only the most relevant predictors.
df_fb_selection = pd.get_dummies(df, columns=['model_year_group', 'cylinders'], drop_first=True)
X_fb_selection = df_fb_selection.drop(['mpg', 'model-year'], axis=1, inplace=False)
y_fb_selection = df_fb_selection.mpg
cv5 = KFold(n_splits=5, shuffle=True, random_state=42)
sfs_forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='forward', cv=cv5)
sfs_forward.fit(X_fb_selection, y_fb_selection)
sfs_backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='backward', cv=cv5)
sfs_backward.fit(X_fb_selection, y_fb_selection)
forward_features = X_fb_selection.columns[sfs_forward.get_support()].tolist()
backward_features = X_fb_selection.columns[sfs_backward.get_support()].tolist()
print('Forward selected features:', forward_features)
print('Backward selected features:', backward_features)
Forward selected features: ['weight', '1_hp', 'log_weight', 'model_year_int', 'model_year_group_1980–1982']
Backward selected features: ['1_hp', '1_weight', 'model_year_int', 'model_year_group_1980–1982', 'cylinders_4']
Lasso and Ridge Regularization¶
Regularization techniques shrink the magnitude of the coefficients, mitigating the effects of multicollinearity.
- Lasso regression adds an L1 penalty equal to the absolute value of the coefficients, which can shrink some coefficients to exactly zero, effectively performing feature selection. \(\text{Loss} = \text{RSS} + \alpha \sum |\beta_i|\)
- Ridge regression adds an L2 penalty equal to the square of the coefficients, which reduces the magnitude of the coefficients but retains all features. \(\text{Loss} = \text{RSS} + \alpha \sum \beta_i^2\)

Because the penalty depends on the magnitude of the \(\beta\) values, the scale of the variables becomes important, so we bring all variables to a common scale before applying Lasso and Ridge regularization.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
# list of features
num_features = ['displacement', 'horsepower', 'weight', 'acceleration',
                '1_hp', '1_weight', '1_disp', 'log_disp', 'log_hp', 'log_weight',
                '1_hp_times_weight']
cat_features = ['model_year_group','cylinders']
# Scaling and one hot encoding
preprocess = ColumnTransformer([
    ('num', MinMaxScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_features)
])
# Test train split
X_lasso = df[num_features + cat_features]
y_lasso = df['mpg']
X_train, X_test, y_train, y_test = train_test_split(X_lasso, y_lasso, test_size=0.2, random_state=42)
# Creating pipelines
lasso_pipe = Pipeline([('preprocess', preprocess), ('reg', Lasso(alpha=0.1, max_iter=10000))])
ridge_pipe = Pipeline([('preprocess', preprocess), ('reg', Ridge(alpha=1.0))])
# Fitting the models
lasso_pipe.fit(X_train, y_train)
ridge_pipe.fit(X_train, y_train)
# Create DataFrame for coefficients
ohe = lasso_pipe.named_steps['preprocess'].transformers_[1][1]
cat_names = list(ohe.get_feature_names_out(cat_features))
feat_names = num_features + cat_names
coef_df = pd.DataFrame({
    'Feature': feat_names,
    'Lasso Coefficient': lasso_pipe.named_steps['reg'].coef_,
    'Ridge Coefficient': ridge_pipe.named_steps['reg'].coef_
})
# Plot
coef_df.set_index('Feature').plot(kind='bar', figsize=(10, 6))
plt.title('Lasso vs Ridge Coefficients')
plt.ylabel('Coefficient Value')
plt.grid(True)
plt.show();

These coefficient plots show how the two approaches differ: Lasso drives some coefficients to zero (performing feature selection), while Ridge shrinks all coefficients but keeps every feature.
Cross validation and Grid search¶
We have looked at various approaches that provide different feature selections and different results. We have also observed that within each model, different parameters can be fine-tuned to get different results. We can use cross validation and grid search algorithms to identify the best parameters and models across all the combinations possible. They improve the robustness and performance of a base machine learning model.
- Cross-validation helps evaluate a model's generalizability by splitting the training data into multiple folds, ensuring that the model is tested on different subsets rather than on a single train-test split. This reduces the risk of overfitting and gives a more reliable estimate of model performance.
- GridSearchCV builds on this by systematically searching a predefined set of hyperparameters (such as alpha in regularized regressions) using cross-validation to find the combination that yields the best performance metric (e.g., R² or RMSE).
Together, they help in selecting the most optimal model configuration, making the final model more accurate and stable when applied to unseen data.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# Define models
lr = LinearRegression()
lasso = Lasso(max_iter=10000)
ridge = Ridge()
elastic_net = ElasticNet(max_iter=10000)
# Features
num_features = ['displacement', 'horsepower', 'weight', 'acceleration', '1_hp', '1_weight', '1_disp',
                'log_disp', 'log_hp', 'log_weight', '1_hp_times_weight']
cat_features = ['model_year_group','cylinders']
# Ensure correct types
df['model_year_group'] = df['model_year_group'].astype('category')
df['cylinders'] = df['cylinders'].astype('category')
# Create a pipeline
preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_features)
    ],
    remainder='drop'
)
pipe = Pipeline([
    ('preprocess', preprocess),
    ('regressor', Lasso())  # placeholder; will be overridden by the parameter grid
])
# Define hyperparameter grids
param_lr = {'regressor': [lr]}
param_lasso = {'regressor': [lasso], 'regressor__alpha': [0.01, 0.1, 1, 10]}
param_ridge = {'regressor': [ridge], 'regressor__alpha': [0.01, 0.1, 1, 10]}
param_elastic_net = {'regressor': [elastic_net], 'regressor__alpha': [0.01, 0.1, 1, 10], 'regressor__l1_ratio':[0.25, 0.5, 0.75]}
# Combine all
param_grid = [param_lr, param_lasso, param_ridge, param_elastic_net]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df[num_features+cat_features], df.mpg, test_size=0.2, random_state=42)
# Grid Search CV
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)
# Best model and coefficients
best_model = grid_search.best_estimator_.named_steps['regressor']
feature_names = num_features + list(grid_search.best_estimator_.named_steps['preprocess'].transformers_[1][1].get_feature_names_out(cat_features))
coefficients = pd.Series(best_model.coef_, index=feature_names)
# Output
print("Best Parameters:", grid_search.best_params_)
print("Best(least) RMSE Error:", -grid_search.best_score_)
print("\nFeature Coefficients (β values): (for scaled variables)")
print(coefficients)
Best Parameters: {'regressor': Ridge(), 'regressor__alpha': 0.1}
Best(least) RMSE Error: 2.9181586103855244
Feature Coefficients (β values): (for scaled variables)
displacement 0.849895
horsepower -0.421171
weight 3.749970
acceleration -0.862817
1_hp 9.801057
1_weight 5.663238
1_disp 3.234586
log_disp 1.114903
log_hp 2.405622
log_weight -4.632756
1_hp_times_weight -8.933534
model_year_group_1976–1979 2.956526
model_year_group_1980–1982 7.097160
cylinders_4 7.177942
cylinders_5 8.560141
cylinders_6 6.633527
cylinders_8 7.123100
dtype: float64
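As a final check, not run in the original notebook, the tuned pipeline could also be scored on the held-out test split created just before the grid search; this sketch assumes `grid_search`, `X_test`, and `y_test` are still in scope.

# Hold-out evaluation of the tuned pipeline (sketch)
best_pipe = grid_search.best_estimator_
y_pred_best = best_pipe.predict(X_test)
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_best)):.3f}')
print(f'Test R²: {r2_score(y_test, y_pred_best):.3f}')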
We can see that the cross-validation RMSE drops from about 3.1 to 2.9. This indicates that linear regression, augmented with thoughtful feature engineering, preprocessing, and regularization, produces a strong and interpretable MPG model on historical data, while the diagnostics and cross-validation guard against over-confidence and point to further avenues for improvement.
References¶
Dataset: Auto MPG Data Set, UCI Machine Learning Repository (originally from StatLib / CMU; used in the 1983 ASA Exposition)