KNN imputation
Missing value imputation using KNN
Harsha Achyuthuni
07/07/2019
Problem
Real-world data is not always clean. It's often messy and contains unexpected or missing values. In this post I will use a non-parametric algorithm, k-nearest neighbors (KNN), to replace missing values.
Data
The data contains the technical specifications of cars. I took this data set from the UCI Machine Learning Repository, which in turn took it from the StatLib library maintained at Carnegie Mellon University. The data set was used in the 1983 American Statistical Association Exposition.
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|
12.0 | 8 | 429 | 198 | 4952 | 11.5 | 73 | 1 | mercury marquis brougham |
19.0 | 6 | 225 | 95 | 3264 | 16.0 | 75 | 1 | plymouth valiant custom |
26.4 | 4 | 140 | 88 | 2870 | 18.1 | 80 | 1 | ford fairmont |
18.0 | 4 | 121 | 112 | 2933 | 14.5 | 72 | 2 | volvo 145e (sw) |
19.8 | 6 | 200 | 85 | 2990 | 18.2 | 79 | 1 | mercury zephyr 6 |
The data set contains the following columns:
1. mpg: continuous (miles per gallon)
2. cylinders: multi-valued discrete
3. displacement: continuous (cu. inches)
4. horsepower: continuous
5. weight: continuous (lbs.)
6. acceleration: continuous (sec.)
7. model year: multi-valued discrete (modulo 100)
8. origin: multi-valued discrete (1. American, 2. European, 3. Japanese)
9. car name: string (unique for each instance)
Now I want to find if this data set contains any abnormal values.
summary(cars_info)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.0 1st Qu.: 75.0
## Median :23.00 Median :4.000 Median :146.0 Median : 93.5
## Mean :23.52 Mean :5.458 Mean :193.5 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
## NA's :5
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 1:248
## 1st Qu.:2223 1st Qu.:13.80 1st Qu.:73.00 2: 70
## Median :2800 Median :15.50 Median :76.00 3: 79
## Mean :2970 Mean :15.56 Mean :75.99
## 3rd Qu.:3609 3rd Qu.:17.10 3rd Qu.:79.00
## Max. :5140 Max. :24.80 Max. :82.00
##
## name
## ford pinto : 6
## amc matador : 5
## ford maverick : 5
## toyota corolla: 5
## amc gremlin : 4
## amc hornet : 4
## (Other) :368
I find that horsepower contains 5 NA values. I could drop the data points with missing horsepower, or I could impute the missing values using KNN or other methods. Before imputing, I want to make a strong case that the imputation would be reasonable. The rows with missing horsepower are:

mpg | cylinders | displacement | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|
25.0 | 4 | 98 | 2046 | 19.0 | 71 | 1 | ford pinto |
21.0 | 6 | 200 | 2875 | 17.0 | 74 | 1 | ford maverick |
40.9 | 4 | 85 | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
23.6 | 4 | 140 | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
34.5 | 4 | 100 | 2320 | 15.8 | 81 | 2 | renault 18i |
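Rows like these can be located by counting NAs per column and then filtering on the affected column. The sketch below runs on a small stand-in data frame (`toy_cars` is a hypothetical name, holding a few rows copied from the tables above); on the real data the same calls would be applied to `cars_info`.

```r
# Small stand-in for cars_info, built from rows shown in the tables above
toy_cars <- data.frame(
  mpg        = c(25.0, 21.0, 18.0, 26.4),
  horsepower = c(NA, NA, 112, 88),
  weight     = c(2046, 2875, 2933, 2870)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_cars))
print(na_counts)

# Rows where horsepower is missing
missing_rows <- toy_cars[is.na(toy_cars$horsepower), ]
print(missing_rows)
```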
The assumption behind using KNN for missing values is that a point value can be approximated by the values of the points that are closest to it, based on other variables.
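To make the idea concrete, here is a minimal base-R sketch of KNN imputation for a single missing value. It is not a line-for-line copy of what caret does internally; the function name `knn_impute_one`, the toy data (rows taken from the tables above) and the choice of k = 2 are assumptions made purely for illustration.

```r
# Impute one missing value in `target` from the k nearest complete rows,
# with Euclidean distance measured on the standardized predictor columns.
knn_impute_one <- function(data, target, predictors, row, k) {
  # Standardize predictors so that no single variable dominates the distance
  scaled <- scale(data[, predictors, drop = FALSE])
  # Candidate neighbors: rows where the target is observed
  donors <- which(!is.na(data[[target]]))
  # Euclidean distance from the incomplete row to every donor
  diffs <- sweep(scaled[donors, , drop = FALSE], 2, scaled[row, ])
  dists <- sqrt(rowSums(diffs^2))
  # Average the target over the k closest donors
  nearest <- donors[order(dists)[seq_len(k)]]
  mean(data[[target]][nearest])
}

toy <- data.frame(
  mpg          = c(12.0, 19.0, 26.4, 19.8, 25.0),
  acceleration = c(11.5, 16.0, 18.1, 18.2, 19.0),
  horsepower   = c(198, 95, 88, 85, NA)
)

imputed <- knn_impute_one(toy, "horsepower", c("mpg", "acceleration"), row = 5, k = 2)
print(imputed)  # 86.5, the mean of the two closest donors (rows 3 and 4)
```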
Let me take three variables from the above data set, mpg, acceleration and horsepower. Intuitively, these variables seem to be related.
library(ggplot2)

# Scatter plot of mpg vs. acceleration, colored by horsepower (NA shown in blue)
ggplot(cars_info, aes(x = mpg, y = acceleration, color = horsepower)) +
  geom_point(show.legend = TRUE) +
  labs(x = 'Mpg', y = 'Acceleration', title = "Auto MPG",
       color = 'Horsepower') +
  scale_color_gradient(low = "green", high = "red",
                       na.value = "blue", guide = "legend") +
  theme_minimal() + theme(legend.position = "bottom")
In the above plot, the blue points are the cars with missing horsepower. I can infer that cars with similar mpg and acceleration have similar horsepower. For a given missing value, I can take the car's mpg and acceleration, find its k nearest neighbors, and estimate the car's horsepower from theirs.
I am using the preProcess function in the caret package to impute the NAs. The k value that I am taking is 20 (close to the square root of the number of observations).
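As a quick check of that value: the square-root rule of thumb is usually applied to the number of rows, and the summary above implies 397 of them (248 + 70 + 79 by origin). A one-line sketch, with `n_obs` as an illustrative variable name:

```r
n_obs <- 397             # row count implied by the origin counts in the summary
k <- round(sqrt(n_obs))  # square-root rule of thumb for choosing k
print(k)                 # 20
```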
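library(caret)
library(dplyr)  # provides %>% and select()

# Fit the imputation model; knnImpute also centers and scales the numeric columns
preProcValues <- preProcess(cars_info %>% dplyr::select(mpg, cylinders, displacement, weight, acceleration, origin, horsepower),
                            method = c("knnImpute"),
                            k = 20,
                            knnSummary = mean)
# Apply it: each missing horsepower is filled with the mean of its 20 nearest neighbors
impute_cars_info <- predict(preProcValues, cars_info, na.action = na.pass)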
Because knnImpute centers and scales the data, the numeric columns of impute_cars_info are normalized. To de-normalize and get back the original scale:
# Reverse the centering and scaling for every processed column: x = z * sd + mean
for (col in names(preProcValues$mean)) {
  impute_cars_info[[col]] <- impute_cars_info[[col]] * preProcValues$std[[col]] + preProcValues$mean[[col]]
}
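The de-normalization is simply the inverse of the z-score transform, x = z * sd + mean. A small self-contained check on a short vector (`hp` is an illustrative name; the values are horsepower figures from the sample rows above):

```r
hp <- c(198, 95, 88, 112, 85)

# Standardize: subtract the mean, divide by the standard deviation
hp_mean <- mean(hp)
hp_sd   <- sd(hp)
z <- (hp - hp_mean) / hp_sd

# Invert the transform to recover the original scale
restored <- z * hp_sd + hp_mean
print(all.equal(restored, hp))
```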
The imputed horsepower for the missing data points is:

name | year | origin | mpg | cylinders | displacement | weight | acceleration | horsepower |
---|---|---|---|---|---|---|---|---|
ford maverick | 74 | 1 | 21.0 | 6 | 200 | 2875 | 17.0 | 93.60 |
ford mustang cobra | 80 | 1 | 23.6 | 4 | 140 | 2905 | 14.3 | 94.95 |
ford pinto | 71 | 1 | 25.0 | 4 | 98 | 2046 | 19.0 | 72.45 |
renault 18i | 81 | 2 | 34.5 | 4 | 100 | 2320 | 15.8 | 73.75 |
renault lecar deluxe | 80 | 2 | 40.9 | 4 | 85 | 1835 | 17.3 | 65.10 |
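
Comparing the imputed values with the actual horsepower of these cars (actual_hp), where difference is the absolute gap in hp:

name | year | horsepower | actual_hp | difference |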
---|---|---|---|---|
ford maverick | 74 | 93.60 | 84 | 9.60 |
ford mustang cobra | 80 | 94.95 | 118 | 23.05 |
ford pinto | 71 | 72.45 | 100 | 27.55 |
renault 18i | 81 | 73.75 | 81 | 7.25 |
renault lecar deluxe | 80 | 65.10 | 51 | 14.10 |
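
The differences in the table above can be condensed into a single error figure. A minimal sketch using the imputed and actual values from that table (the variable names are mine, not part of the original analysis):

```r
# Values copied from the comparison table
imputed <- c(93.60, 94.95, 72.45, 73.75, 65.10)
actual  <- c(84, 118, 100, 81, 51)

abs_err <- abs(imputed - actual)
mae  <- mean(abs_err)                     # mean absolute error
rmse <- sqrt(mean((imputed - actual)^2))  # root mean squared error

print(round(c(MAE = mae, RMSE = rmse), 2))
```

On these five cars the mean absolute error works out to 16.31 hp.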
Out of the 5 cars, the imputed horsepower is within 10 hp of the actual value for two cars, within 15 hp for one car, and within 30 hp for the remaining two. To get better results, I should try other imputation techniques. These 5 cars are generally removed before any analysis. In R, a version of the data set with these rows removed is available as Auto in the ISLR package (the built-in mtcars is a different data set).