Skip to content

Null Value Imputation (R)

Problem

Real world data is not always clean. Its often messy and contains unexpected/missing values. In this post I will use a non-parametric algorithm called k-nearest-neighbors (KNN) to replace missing values.

Data

The data is technical spec of cars. I have taken this data set from UCI Machine learning repository which in turn took it from StatLib library which is maintained at Carnegie Mellon University. The data set was used in the 1983 American Statistical Association Exposition.

Sample Data
mpg cylinders displacement horsepower weight acceleration year origin name
20.0 6 156 122 2807 13.5 73 3 toyota mark ii
37.7 4 89 62 2050 17.3 81 3 toyota tercel
15.5 8 304 120 3962 13.9 76 1 amc matador
23.0 4 120 97 2506 14.5 72 3 toyouta corona mark ii (sw)
30.7 6 145 76 3160 19.6 81 2 volvo diesel

The data set contains the following columns:
1. mpg: continuous (miles per gallon)
2. cylinders: multivalued discrete 3. displacement: continuous (cu. inches)
4. horsepower: continuous
5. weight: continuous(lbs.)
6. acceleration: continuous (sec.)
7. model year: multivalued discrete (modulo 100)
8. origin: multivalued discrete (1. American, 2. European, 3. Japanese)
9. car name: string (unique for each instance)

Now I want to find if this data set contains any abnormal values.

summary(cars_info)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   1st Qu.: 75.0   1st Qu.:2223  
##  Median :23.00   Median :4.000   Median :146.0   Median : 93.5   Median :2800  
##  Mean   :23.52   Mean   :5.458   Mean   :193.5   Mean   :104.5   Mean   :2970  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0   3rd Qu.:3609  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                  NA's   :5                     
##   acceleration        year       origin      name          
##  Min.   : 8.00   Min.   :70.00   1:248   Length:397        
##  1st Qu.:13.80   1st Qu.:73.00   2: 70   Class :character  
##  Median :15.50   Median :76.00   3: 79   Mode  :character  
##  Mean   :15.56   Mean   :75.99                             
##  3rd Qu.:17.10   3rd Qu.:79.00                             
##  Max.   :24.80   Max.   :82.00                             
## 

KNN

I find that horsepower contains 5 NA values. I can ignore the data points with horsepower NA, or I could impute the NA values using KNN or other methods. Before imputing, I want to make a strong case that my imputation would be right.

Cars with missing horsepower
mpg cylinders displacement weight acceleration year origin name
25.0 4 98 2046 19.0 71 1 ford pinto
21.0 6 200 2875 17.0 74 1 ford maverick
40.9 4 85 1835 17.3 80 2 renault lecar deluxe
23.6 4 140 2905 14.3 80 1 ford mustang cobra
34.5 4 100 2320 15.8 81 2 renault 18i

The assumption behind using KNN for missing values is that a point value can be approximated by the values of the points that are closest to it, based on other variables.
Let me take three variables from the above data set, mpg, acceleration and horsepower. Intuitively, these variables seem to be related.

ggplot(cars_info, aes(x = mpg, y = acceleration, color = horsepower)) + 
  geom_point(show.legend = TRUE) +
  labs(x = 'Mpg', y='Acceleration',  title = "Auto MPG",
       color = 'Horsepower') + 
  scale_color_gradient(low = "green", high = "red",
                       na.value = "blue", guide = "legend") +
  theme_minimal()+theme(legend.position="bottom")

In the above plot, the blue colour points are null values. I can infer that cars of similar mpg and acceleration have similar horsepower. For a given missing value, I can look at the mpg of the car, its acceleration, look for its k nearest neighbours and get the car's horsepower.
I am using preprocess function in caret package for imputing NA's. The K value that I am taking is 20 (~ close to square root of number of variables)

Imputation using caret

library(caret)
preProcValues <- preProcess(cars_info %>% 
                          dplyr::select(mpg, cylinders, displacement, weight, acceleration, origin, horsepower),
                            method = c("knnImpute"),
                            k = 20,
                            knnSummary = mean)
impute_cars_info <- predict(preProcValues, cars_info,na.action = na.pass)

The impute_cars_info data set will be normalized. To de-normalize and get the original data back:

procNames <- data.frame(col = names(preProcValues$mean), mean = preProcValues$mean, sd = preProcValues$std)
for(i in procNames$col){
 impute_cars_info[i] <- impute_cars_info[i]*preProcValues$std[i]+preProcValues$mean[i] 
}

The imputed horsepower for the missing data points is:

Imputed data set
name year origin mpg cylinders displacement weight acceleration horsepower
ford maverick 74 1 21.0 6 200 2875 17.0 93.60
ford mustang cobra 80 1 23.6 4 140 2905 14.3 94.95
ford pinto 71 1 25.0 4 98 2046 19.0 72.45
renault 18i 81 2 34.5 4 100 2320 15.8 73.75
renault lecar deluxe 80 2 40.9 4 85 1835 17.3 65.10

The actual hp for the cars is as follows:

Comparison
name year horsepower actual_hp difference
ford maverick 74 93.60 84 9.60
ford mustang cobra 80 94.95 118 23.05
ford pinto 71 72.45 100 27.55
renault 18i 81 73.75 81 7.25
renault lecar deluxe 80 65.10 51 14.10

Out of the 5 cars, I was able to impute horsepower for 2 cars with less than 10hp difference, one car within 15hp and two cars within 30hp difference. To get better results, I should use other imputation techniques. Generally these 5 cars are removed while doing any analysis. In R, you could find the removed data set as mtcars.

Back to top