Linear regression
Date: 03-08-2019
Author: Achyuthuni Sri Harsha
Introduction
Regression problems are an important category of problems in analytics in which the response variable \(Y\) takes a continuous value. Predicting housing prices in an area is a typical example. In this blog post, I will attempt to solve a supervised regression problem using the well-known Boston housing price data set. Beyond location and square footage, the value of a house is determined by various other factors.
The data used in this blog is taken from Kaggle. It originates from the UCI Machine Learning Repository. The Boston housing data was collected in 1978, and each of the 506 entries represents aggregated data on 14 variables (13 features and the target) for homes from various suburbs of Boston, Massachusetts.
The data frame contains the following columns:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River categorical variable (tract bounds or otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centers.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
lstat: lower status of the population (percent).
The target variable is
medv: median value of owner-occupied homes in $1000s.
In particular, in this blog I want to use multiple linear regression for the analysis. A sample of the data is given below:
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2.92400 | 0.0 | 19.58 | otherwise | 0.6050 | 6.101 | 93.0 | 2.2834 | 5 | 403 | 14.7 | 240.16 | 9.81 | 25.0 |
0.12816 | 12.5 | 6.07 | otherwise | 0.4090 | 5.885 | 33.0 | 6.4980 | 4 | 345 | 18.9 | 396.90 | 8.79 | 20.9 |
0.08244 | 30.0 | 4.93 | otherwise | 0.4280 | 6.481 | 18.5 | 6.1899 | 6 | 300 | 16.6 | 379.41 | 6.36 | 23.7 |
0.06588 | 0.0 | 2.46 | otherwise | 0.4880 | 7.765 | 83.3 | 2.7410 | 3 | 193 | 17.8 | 395.56 | 7.56 | 39.8 |
0.02009 | 95.0 | 2.68 | otherwise | 0.4161 | 8.034 | 31.9 | 5.1180 | 4 | 224 | 14.7 | 390.55 | 2.88 | 50.0 |
The summary statistics for the data are:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Length:506
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 Class :character
## Median : 0.25651 Median : 0.00 Median : 9.69 Mode :character
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Data Cleaning and EDA
Zero and near-zero variance features do not help explain any variance in the response variable, so I check every column for them:
feature | freqRatio | percentUnique | zeroVar | nzv |
---|---|---|---|---|
crim | 1.000000 | 99.6047431 | FALSE | FALSE |
zn | 17.714286 | 5.1383399 | FALSE | FALSE |
indus | 4.400000 | 15.0197628 | FALSE | FALSE |
chas | 13.457143 | 0.3952569 | FALSE | FALSE |
nox | 1.277778 | 16.0079051 | FALSE | FALSE |
rm | 1.000000 | 88.1422925 | FALSE | FALSE |
age | 10.750000 | 70.3557312 | FALSE | FALSE |
dis | 1.250000 | 81.4229249 | FALSE | FALSE |
rad | 1.147826 | 1.7786561 | FALSE | FALSE |
tax | 3.300000 | 13.0434783 | FALSE | FALSE |
ptratio | 4.117647 | 9.0909091 | FALSE | FALSE |
black | 40.333333 | 70.5533597 | FALSE | FALSE |
lstat | 1.000000 | 89.9209486 | FALSE | FALSE |
medv | 2.000000 | 45.2569170 | FALSE | FALSE |
There are no zero or near-zero variance columns.
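To make the first two columns of the table concrete, here is a minimal sketch of how they can be computed. The post itself shows only R output (the column names resemble those of caret's `nearZeroVar`); the Python below is my own illustration, not the author's code. `nzv_metrics` is a hypothetical helper, and because the post does not state which cut-offs were used, `freq_cut` and `unique_cut` are placeholder values. The 471 / 35 split for chas is inferred from the table: 2 distinct values in 506 rows with a frequency ratio of 13.457143.

```python
from collections import Counter


def nzv_metrics(values, freq_cut, unique_cut):
    """Frequency ratio, percent unique and variance flags for one column."""
    # Counts of each distinct value, most common first.
    counts = [count for _, count in Counter(values).most_common()]
    # Most common count divided by the second most common count.
    # A column with one distinct value has no second count; report 0.0.
    freq_ratio = counts[0] / counts[1] if len(counts) > 1 else 0.0
    # Distinct values as a percentage of the number of rows.
    percent_unique = 100.0 * len(counts) / len(values)
    zero_var = len(counts) == 1
    # Flag a column only if it is both dominated by one value and has few
    # distinct values (cut-offs are assumptions supplied by the caller).
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {
        "freqRatio": freq_ratio,
        "percentUnique": percent_unique,
        "zeroVar": zero_var,
        "nzv": nzv,
    }


# chas split inferred from the table above; cut-offs are placeholders.
chas = ["otherwise"] * 471 + ["tract bounds"] * 35
chas_stats = nzv_metrics(chas, freq_cut=20.0, unique_cut=10.0)
print(chas_stats)
```

Whether a column such as chas or zn is flagged depends entirely on the chosen cut-offs, so the flags in the table should be read together with whatever thresholds were used.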
Similarly, I can check for linearly dependent columns among the continuous variables.
## $linearCombos
## list()
##
## $remove
## NULL
There are no linearly dependent columns.
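The idea behind this check can be sketched with a rank computation: if the rank of the data matrix is smaller than its number of columns, at least one column is a linear combination of the others. The sketch below uses plain Gaussian elimination in Python, which is a swapped-in technique and not necessarily what produced the output above. `matrix_rank` is a hypothetical helper, the first three columns are rm, age and dis from the five sample rows shown earlier, and the fourth column (rm + age) is made up purely to trigger a dependency.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of equal-length rows) via Gaussian elimination."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Partial pivoting: use the remaining row with the largest entry.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# rm, age, dis from the sample rows shown earlier in the post.
sample = [
    [6.101, 93.0, 2.2834],
    [5.885, 33.0, 6.4980],
    [6.481, 18.5, 6.1899],
    [7.765, 83.3, 2.7410],
    [8.034, 31.9, 5.1180],
]
# Hypothetical extra column rm + age, added only to create a dependency.
with_combo = [row + [row[0] + row[1]] for row in sample]

independent_rank = matrix_rank(sample)    # equals the number of columns
dependent_rank = matrix_rank(with_combo)  # one less than the number of columns
print(independent_rank, dependent_rank)
```

An empty result like the one above corresponds to the first case: the rank equals the number of columns, so nothing needs to be removed.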
Uni-variate analysis
Now, I want to do some basic EDA on each column. On each continuous column, I want to visually check the following:
1. Variation in the column
2. Its distribution
3. Any outliers
4. q-q plot with normal distribution
[Univariate plots (variation, distribution, outliers and q-q plot) for crim, zn, indus, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv]
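Two of the four checks listed above, outliers and the q-q plot, come down to simple arithmetic, sketched here in plain Python (my choice for illustration; the post's output comes from R). The 1.5 x IQR rule and the (i - 0.5) / n plotting positions are common conventions that I am assuming, not something the post specifies. `iqr_outliers` and `qq_points` are hypothetical helpers, and the medv values are the five sample rows shown earlier.

```python
from statistics import NormalDist, quantiles


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def qq_points(values):
    """Pair each sorted observation with a standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n, an assumed convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))


# medv from the five sample rows shown earlier in the post.
medv_sample = [25.0, 20.9, 23.7, 39.8, 50.0]
sample_outliers = iqr_outliers(medv_sample)
sample_qq = qq_points(medv_sample)
for theoretical, observed in sample_qq:
    print(f"{theoretical:+.3f}  {observed}")
```

If the points in a q-q plot fall close to a straight line, the column is approximately normal; systematic curvature at the ends indicates skew or heavy tails.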
For categorical variables, I want to look at the frequencies.
Observations:
1. Looking at rad and tax, there seem to be two groups of houses. Houses with rad < 10 look roughly normally distributed, and a separate group of houses has rad = 24. As rad is an index, it could also be treated as a categorical variable instead of a continuous one.
2. For data points with rad = 24, the location-based features behave differently. For example, indus, tax and ptratio show a different slope at the same points where rad is 24 (see the variation plots, top left).
Therefore, I am tempted to flag the houses with rad = 24 as a separate category and create interaction variables of that flag with rad, indus, ptratio and tax. The new variable is called rad_cat. I would also like to convert rad itself to a categorical variable and see whether it explains the response better than the continuous version.
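A minimal sketch of this feature construction, in plain Python for illustration (the post's own code is not shown). The interaction column names (`rad_cat_x_...`) and the helper `add_rad_features` are hypothetical; only rad_cat is named in the post. The first record is one of the sample rows shown earlier, and the second is a made-up record with rad = 24.

```python
def add_rad_features(row):
    """Return a copy of one record with rad_cat and its interaction terms."""
    out = dict(row)
    # 1 for the group of houses with rad = 24, 0 for everything else.
    out["rad_cat"] = 1 if row["rad"] == 24 else 0
    # Interaction terms: the feature value inside the rad = 24 group, else 0.
    for col in ("rad", "indus", "ptratio", "tax"):
        out[f"rad_cat_x_{col}"] = out["rad_cat"] * row[col]
    return out


# First record: a sample row from the post. Second: illustrative values only.
low_rad = {"rad": 5, "indus": 19.58, "ptratio": 14.7, "tax": 403}
high_rad = {"rad": 24, "indus": 18.10, "ptratio": 20.2, "tax": 666}

low_features = add_rad_features(low_rad)
high_features = add_rad_features(high_rad)
print(low_features)
print(high_features)
```

Since rad is constant within the flagged group, the rad interaction is simply 24 times rad_cat, so it is worth checking whether it adds anything beyond rad_cat itself.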
Additionally, from researching on the internet, I found that the price might have a different slope against the number of bedrooms for different classes of people. So, I want to visualize that interaction variable as well.
Bi-variate analysis
I want to understand the relationship of each continuous variable with the \(y\) variable. I will achieve that by doing the following:
1. A scatter plot to look at the relationship between the \(x\) and the \(y\) variables.
2. Draw a linear regression line and a smoothed means line to understand the curve fit.
3. Fit a linear regression on that variable alone and note the R-squared it achieves.
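The third step can be sketched with the closed-form least-squares estimates for a single predictor. This is a plain Python illustration, not the code behind the output that follows, and `fit_simple_ols` is a hypothetical helper. The data are rm and medv from the five sample rows shown earlier, so the numbers it prints will not match the summaries below, which are fitted on far more rows.

```python
def fit_simple_ols(x, y):
    """Least-squares slope, intercept and R-squared for y = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# rm and medv from the five sample rows shown earlier in the post.
rm = [6.101, 5.885, 6.481, 7.765, 8.034]
medv = [25.0, 20.9, 23.7, 39.8, 50.0]

slope, intercept, r_squared = fit_simple_ols(rm, medv)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R-squared = {r_squared:.3f}")
```

Repeating this for each predictor gives one R-squared per variable, which is what the summaries below report.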
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.816 -5.455 -1.970 2.633 29.615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.99736 0.45955 52.220 < 2e-16 ***
## crim -0.39123 0.04855 -8.059 8.75e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.595 on 405 degrees of freedom
## Multiple R-squared: 0.1382, Adjusted R-squared: 0.1361
## F-statistic: 64.94 on 1 and 405 DF, p-value: 8.748e-15
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9353 -5.6853 -0.9847 2.4653 29.0647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.93532 0.48739 42.954 < 2e-16 ***
## zn 0.14247 0.01818 7.835 4.15e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.812 on 405 degrees of freedom
## Multiple R-squared: 0.1316, Adjusted R-squared: 0.1295
## F-statistic: 61.39 on 1 and 405 DF, p-value: 4.155e-14
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.922 -5.144 -1.631 2.972 33.069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.04045 0.77385 38.82 <2e-16 ***
## indus -0.66951 0.05936 -11.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.249 on 405 degrees of freedom
## Multiple R-squared: 0.239, Adjusted R-squared: 0.2372
## F-statistic: 127.2 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.688 -5.146 -2.299 2.794 30.643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.020 2.081 20.195 <2e-16 ***
## nox -35.028 3.680 -9.518 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.548 on 405 degrees of freedom
## Multiple R-squared: 0.1828, Adjusted R-squared: 0.1808
## F-statistic: 90.6 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.886 -2.551 0.174 3.009 39.729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.6702 2.8680 -12.79 <2e-16 ***
## rm 9.4450 0.4538 20.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.574 on 405 degrees of freedom
## Multiple R-squared: 0.5168, Adjusted R-squared: 0.5156
## F-statistic: 433.1 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.138 -5.266 -2.033 2.333 31.332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.1468 1.1373 27.386 < 2e-16 ***
## age -0.1248 0.0154 -8.104 6.33e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.772 on 405 degrees of freedom
## Multiple R-squared: 0.1395, Adjusted R-squared: 0.1374
## F-statistic: 65.68 on 1 and 405 DF, p-value: 6.333e-15
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.010 -5.867 -1.968 2.297 30.275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.4243 0.9436 19.526 < 2e-16 ***
## dis 1.1127 0.2188 5.085 5.62e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.168 on 405 degrees of freedom
## Multiple R-squared: 0.06002, Adjusted R-squared: 0.0577
## F-statistic: 25.86 on 1 and 405 DF, p-value: 5.619e-07
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.022 -5.310 -2.298 3.375 33.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.7220 0.6412 41.677 <2e-16 ***
## rad -0.4249 0.0493 -8.619 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.693 on 405 degrees of freedom
## Multiple R-squared: 0.155, Adjusted R-squared: 0.1529
## F-statistic: 74.28 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.039 -5.235 -2.191 3.166 34.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.707058 1.080289 31.20 <2e-16 ***
## tax -0.026900 0.002427 -11.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.283 on 405 degrees of freedom
## Multiple R-squared: 0.2328, Adjusted R-squared: 0.2309
## F-statistic: 122.9 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4897 -4.9434 -0.7651 3.4363 31.2566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.8226 3.4209 18.95 <2e-16 ***
## ptratio -2.2811 0.1837 -12.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.047 on 405 degrees of freedom
## Multiple R-squared: 0.2758, Adjusted R-squared: 0.274
## F-statistic: 154.2 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.573 -5.028 -1.864 2.874 27.066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.505391 1.785873 5.882 8.47e-09 ***
## black 0.033945 0.004844 7.008 1.02e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.93 on 405 degrees of freedom
## Multiple R-squared: 0.1081, Adjusted R-squared: 0.1059
## F-statistic: 49.11 on 1 and 405 DF, p-value: 1.017e-11
##
## [1] "----------------------------------------------------------------------------------------------------"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.023 -4.173 -1.390 2.172 24.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.88480 0.63310 55.10 <2e-16 ***
## lstat -0.96665 0.04336 -22.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.336 on 405 degrees of freedom
## Multiple R-squared: 0.551, Adjusted R-squared: 0.5499
## F-statistic: 497 on 1 and 405 DF, p-value: < 2.2e-16
##
## [1] "----------------------------------------------------------------------------------------------------"
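The numbers in each summary above are linked by simple arithmetic, which is a handy way to sanity-check them. For a model with one predictor, the t value is the estimate divided by its standard error, the F-statistic is the square of that t value, and R-squared equals F / (F + residual degrees of freedom). The sketch below (plain Python, my own illustration with a hypothetical helper `check_summary`) recomputes these for three of the summaries, using values copied from the output.

```python
# (estimate, standard error, reported R-squared) copied from the summaries above.
reported = {
    "crim": (-0.39123, 0.04855, 0.1382),
    "rm": (9.4450, 0.4538, 0.5168),
    "lstat": (-0.96665, 0.04336, 0.5510),
}
RESIDUAL_DF = 405  # "on 405 degrees of freedom" in every summary


def check_summary(estimate, std_error, df=RESIDUAL_DF):
    """Recompute t, F and R-squared for a single-predictor model."""
    t_value = estimate / std_error
    f_stat = t_value ** 2               # holds only with one predictor
    r_squared = f_stat / (f_stat + df)  # rearranged definition of F
    return t_value, f_stat, r_squared


recomputed = {name: check_summary(est, se) for name, (est, se, _) in reported.items()}
for name, (t_value, f_stat, r_squared) in recomputed.items():
    print(f"{name}: t = {t_value:.2f}, F = {f_stat:.1f}, "
          f"R-squared = {r_squared:.4f} (reported {reported[name][2]})")
```

Taken alone, lstat and rm explain the most variance in medv (R-squared of about 0.55 and 0.52), while dis explains the least (about 0.06).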