# Chi-Square test of independence (R)

In this post, I look at the chi-square test of independence. The data set I am going to use is published at https://smartcities.data.gov.in, a Government of India project under the National Data Sharing and Accessibility Policy.

I want to find the safest and deadliest ways to travel on Bangalore roads. The Injuries_and_Fatalities_Bengaluru_from_2016to2018.csv data set has the total number of injuries and fatalities in Bangalore from 2016 to 2018. I will use injuries as a proxy for the number of incidents that took place.

Because I want to test whether the share of fatalities differs significantly between types of transport, the null and alternative hypotheses are as follows:

$H_0$: The outcome of an incident (fatality or injury) is independent of the type of transport

$H_1$: The outcome of an incident (fatality or injury) depends on the type of transport

Sample data set:

##                                                                   instance
## 1 2017 - Total Injuries - Other modes of road transport (auto, bus, lorry)
## 2                                       2018 - Total Fatalities - Bicycles
## 3                                   2017 - Total Fatalities - Two-wheelers
## 4                                     2018 - Total Fatalities - Pedestrian
## 5                                       2017 - Total Fatalities - Bicycles
##   count year             type                                        transport
## 1  1380 2017   Total Injuries Other modes of road transport (auto, bus, lorry)
## 2     9 2018 Total Fatalities                                         Bicycles
## 3    98 2017 Total Fatalities                                     Two-wheelers
## 4   276 2018 Total Fatalities                                       Pedestrian
## 5     8 2017 Total Fatalities                                         Bicycles


The contingency table for the year 2017 is

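library(dplyr)
library(tidyr)
library(kableExtra)
# The reshaping step is missing from the original snippet; pivot_wider() is
# assumed here as one way to get one row per transport and one column per outcome
contingency_table <- data %>% filter(year == 2017) %>%
  dplyr::select(type, transport, count) %>%
  pivot_wider(names_from = type, values_from = count, names_sort = TRUE) %>%
  arrange(transport) %>%
  as.data.frame()  # plain data frame so that [i, j] returns a single value
knitr::kable(contingency_table,
             caption = 'Contingency Table') %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T) %>%
  collapse_rows(columns = 1:2, valign = "middle") %>%
  scroll_box()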

Contingency Table

| transport | Total Fatalities | Total Injuries |
|---|---|---|
| Bicycles | 8 | 31 |
| Other modes of road transport (auto, bus, lorry) | 252 | 1380 |
| Pedestrian | 284 | 1346 |
| Two-wheelers | 98 | 1499 |
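
Before plotting, it can help to look at the row-wise shares directly. The following is a minimal sketch in base R, assuming the 2017 counts are typed in by hand from the table above; the names `casualties_2017`, `counts_2017` and `row_shares` are my own and do not appear in the original analysis.

```r
# 2017 counts copied by hand from the contingency table (long format).
# All object names here are illustrative, not part of the original code.
casualties_2017 <- data.frame(
  transport = rep(c("Bicycles",
                    "Other modes of road transport (auto, bus, lorry)",
                    "Pedestrian",
                    "Two-wheelers"), times = 2),
  type  = rep(c("Total Fatalities", "Total Injuries"), each = 4),
  count = c(8, 252, 284, 98, 31, 1380, 1346, 1499)
)

# xtabs() sums `count` within each transport x type cell,
# giving a two-way contingency table
counts_2017 <- xtabs(count ~ transport + type, data = casualties_2017)

# Row-wise proportions: share of fatalities and injuries
# within each type of transport
row_shares <- round(prop.table(counts_2017, margin = 1), 3)
print(row_shares)
```

With these counts, the share of fatalities ranges from about 0.061 for two-wheelers to about 0.205 for bicycles, which are the same "N / Row Total" figures that a cross tabulation reports.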

A mosaic plot of the same data:

library(ggmosaic)
ggplot(data = data) +
  geom_mosaic(aes(weight = count, x = product(transport), fill = type), na.rm = TRUE) +
  labs(x = 'Type of transport', y = '%', title = 'What type of transport to use') +
  theme_minimal() + theme(legend.position = "bottom")


From the above plot, the share of fatalities appears to differ noticeably between types of transport. To check whether this difference is statistically significant, I will conduct a chi-square test of independence.

library(gmodels)
# Convert the contingency table to a flat table with one row per casualty.
# Two vectors hold the values of the two columns.
caseType <- c();  conditionType <- c()

# For each cell, repeat the (row name, column name) combination
# as many times as the count in that cell
for (i in 1:nrow(contingency_table)) {
  for (j in 2:ncol(contingency_table)) {
    numRepeats <- contingency_table[i, j]

    caseType <- append(caseType,
                       rep(contingency_table[i, 1],
                           numRepeats))
    conditionType <- append(conditionType,
                            rep(colnames(contingency_table)[j],
                                numRepeats))
  }
}

# Construct the flat table from the two vectors
flatTable <- data.frame(caseType, conditionType)
# expected = TRUE adds the expected counts and the chi-square test to the output
CrossTable(flatTable$caseType, flatTable$conditionType,
           dnn = c("Transportation Type", "Accident type"),
           expected = TRUE)

##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  4898
##
##
##                                                  | Accident type
##                              Transportation Type | Total Fatalities |   Total Injuries |        Row Total |
## -------------------------------------------------|------------------|------------------|------------------|
##                                         Bicycles |                8 |               31 |               39 |
##                                                  |            5.112 |           33.888 |                  |
##                                                  |            1.632 |            0.246 |                  |
##                                                  |            0.205 |            0.795 |            0.008 |
##                                                  |            0.012 |            0.007 |                  |
##                                                  |            0.002 |            0.006 |                  |
## -------------------------------------------------|------------------|------------------|------------------|
## Other modes of road transport (auto, bus, lorry) |              252 |             1380 |             1632 |
##                                                  |          213.913 |         1418.087 |                  |
##                                                  |            6.782 |            1.023 |                  |
##                                                  |            0.154 |            0.846 |            0.333 |
##                                                  |            0.393 |            0.324 |                  |
##                                                  |            0.051 |            0.282 |                  |
## -------------------------------------------------|------------------|------------------|------------------|
##                                       Pedestrian |              284 |             1346 |             1630 |
##                                                  |          213.650 |         1416.350 |                  |
##                                                  |           23.164 |            3.494 |                  |
##                                                  |            0.174 |            0.826 |            0.333 |
##                                                  |            0.442 |            0.316 |                  |
##                                                  |            0.058 |            0.275 |                  |
## -------------------------------------------------|------------------|------------------|------------------|
##                                     Two-wheelers |               98 |             1499 |             1597 |
##                                                  |          209.325 |         1387.675 |                  |
##                                                  |           59.206 |            8.931 |                  |
##                                                  |            0.061 |            0.939 |            0.326 |
##                                                  |            0.153 |            0.352 |                  |
##                                                  |            0.020 |            0.306 |                  |
## -------------------------------------------------|------------------|------------------|------------------|
##                                     Column Total |              642 |             4256 |             4898 |
##                                                  |            0.131 |            0.869 |                  |
## -------------------------------------------------|------------------|------------------|------------------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 =  104.4776     d.f. =  3     p =  1.692478e-22
##
##
##
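
To make the "Expected N" and "Chi-square contribution" rows of the output concrete, here is a short worked example for one cell. The notation is my own: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the table total, and $r$ and $c$ are the numbers of rows and columns.

```latex
% Expected count under independence, test statistic, degrees of freedom
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{d.f.} = (r - 1)(c - 1)

% Worked cell: Bicycles, Total Fatalities
E = \frac{39 \times 642}{4898} \approx 5.112, \qquad
\frac{(8 - 5.112)^2}{5.112} \approx 1.632

% Four transport types and two outcomes
\text{d.f.} = (4 - 1)(2 - 1) = 3
```

Summing the eight contributions in the table gives the reported statistic of about 104.48. The largest single contribution, 59.206, comes from two-wheeler fatalities.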

# Pearson's chi-square test on the two count columns
chi.test <- chisq.test(contingency_table[, 2:3])
print(chi.test)

##
##  Pearson's Chi-squared test
##
## data:  contingency_table[, 2:3]
## X-squared = 104.48, df = 3, p-value < 2.2e-16
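
The printed p-value is capped at `< 2.2e-16` because of how R formats very small p-values, while the underlying value is much smaller. As a rough sketch, the pieces of the test can be rebuilt by hand in base R. The counts are typed in from the 2017 contingency table, and names such as `counts_2017`, `test_2017` and `p_exact` are illustrative and not part of the original code.

```r
# 2017 counts copied by hand from the contingency table in this post.
# All object names here are illustrative.
counts_2017 <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

test_2017 <- chisq.test(counts_2017)

# Expected counts under independence: row total * column total / table total
expected_manual <- outer(rowSums(counts_2017), colSums(counts_2017)) /
  sum(counts_2017)
print(all.equal(unname(test_2017$expected), unname(expected_manual)))

# Degrees of freedom: (rows - 1) * (columns - 1)
df_manual <- (nrow(counts_2017) - 1) * (ncol(counts_2017) - 1)

# Unrounded upper-tail p-value for the statistic
p_exact <- pchisq(unname(test_2017$statistic), df = df_manual,
                  lower.tail = FALSE)
print(p_exact)

# Standardised residuals: sign and size show which cells deviate most
# from independence
print(round(test_2017$stdres, 2))
```

The standardised residual for two-wheeler fatalities is negative, because the observed 98 fatalities are far below the roughly 209 expected under independence.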

chi.sq.plot(chi.sq = chi.test$statistic, df = chi.test$parameter, title = 'Null hypothesis to test independence')


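As $p < \alpha$, where $\alpha = 0.05$, I reject the null hypothesis: in the 2017 data, the outcome of an incident is significantly associated with the type of transport. The test itself does not rank the transport types, but the row proportions do. Two-wheelers have the lowest share of fatalities among recorded casualties (6.1%) and so appear to be the safest, while bicycles have the highest share (20.5%) and appear to be the most dangerous, although the bicycle figure rests on only 39 cases.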