# Chi Square test of independence

#### Harsha Achyuthuni

#### July 24 2019

In this post, I look at the chi-square test of independence. The data set I use is published on https://smartcities.data.gov.in, a Government of India project under the National Data Sharing and Accessibility Policy.

I want to find the safest and deadliest ways to travel on Bangalore roads. The `Injuries_and_Fatalities_Bengaluru_from_2016to2018.csv` data set contains the total number of injuries and fatalities in Bangalore from 2016 to 2018, broken down by mode of transport. I use injuries as a proxy for the number of incidents that took place.

As I want to test whether the share of fatalities differs significantly between types of transport, the null and alternative hypotheses are as follows:

$H_{0}$: The outcome of an incident (fatality or injury) is independent of the type of transport

$H_{1}$: The outcome of an incident depends on the type of transport
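For reference, the test compares the observed cell counts with the counts expected under independence. Below is a sketch of the standard Pearson statistic; the symbols $O_{ij}$, $E_{ij}$, $R_i$, $C_j$, $N$, $r$ and $c$ are notation introduced here and do not come from the data set.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{d.f.} = (r - 1)(c - 1)
$$

Here $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total. Under $H_{0}$ the statistic approximately follows a chi-square distribution with the stated degrees of freedom.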

Sample data set:

instance | count | year | type | transport |
---|---|---|---|---|
2018 - Total Injuries - Bicycles | 43 | 2018 | Total Injuries | Bicycles |
2017 - Total Injuries - Other modes of road transport (auto, bus, lorry) | 1380 | 2017 | Total Injuries | Other modes of road transport (auto, bus, lorry) |
2016 - Total Fatalities - Two-wheelers | 374 | 2016 | Total Fatalities | Two-wheelers |
2017 - Total Injuries - Pedestrian | 1346 | 2017 | Total Injuries | Pedestrian |
2017 - Total Fatalities - Pedestrian | 284 | 2017 | Total Fatalities | Pedestrian |
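The `year`, `type` and `transport` columns in the sample carry the same information as the `instance` string. How they were actually derived is not shown in this post, so the following is only a minimal base R sketch of one way to split `instance`; the data frame name `raw` is hypothetical, and the rows are copied from the sample above.

```r
# Rows copied from the sample above; `raw` is a hypothetical name
raw <- data.frame(
  instance = c("2018 - Total Injuries - Bicycles",
               "2016 - Total Fatalities - Two-wheelers",
               "2017 - Total Fatalities - Pedestrian"),
  count = c(43, 374, 284),
  stringsAsFactors = FALSE
)

# Split on " - " (with spaces) so the hyphen in "Two-wheelers" is left alone
parts <- do.call(rbind, strsplit(raw$instance, " - ", fixed = TRUE))

raw$year      <- as.integer(parts[, 1])
raw$type      <- parts[, 2]
raw$transport <- parts[, 3]

print(raw)
```

Splitting on the three-character separator rather than on a bare hyphen matters here, because one of the transport labels contains a hyphen itself.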

The contingency table for the year 2017 is:

```
library(dplyr)       # filter(), select(), %>%
library(tidyr)       # spread()
library(kableExtra)  # table styling helpers

# Keep 2017 only and pivot to one row per transport type,
# with one column per accident type
contingency_table <- data %>%
  filter(year == 2017) %>%
  dplyr::select(type, transport, count) %>%
  spread(type, count)

kable(contingency_table, caption = 'Contingency Table') %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T) %>%
  collapse_rows(columns = 1:2, valign = "middle") %>%
  scroll_box()
```

transport | Total Fatalities | Total Injuries |
---|---|---|
Bicycles | 8 | 31 |
Other modes of road transport (auto, bus, lorry) | 252 | 1380 |
Pedestrian | 284 | 1346 |
Two-wheelers | 98 | 1499 |

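A mosaic plot of injuries and fatalities by type of transport:

```
library(ggplot2)
library(ggmosaic)  # geom_mosaic(), product()

# As written, `data` is not filtered by year, so the plot pools 2016 to 2018
ggplot(data = data) +
  geom_mosaic(aes(weight = count, x = product(transport), fill = type), na.rm = TRUE) +
  labs(x = 'Type of transport', y = '%', title = 'What type of transport to use') +
  theme_minimal() +
  theme(legend.position = "bottom")
```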

From the above plot, I can see that the share of fatalities differs between types of transport. To find out whether this difference is statistically significant, I will conduct a chi-square test of independence.

```
library(gmodels)  # CrossTable()

# Convert the contingency table to a flat table with one row per incident.
# Two vectors hold the values of the two columns.
caseType <- c(); conditionType <- c()

# For each cell, repeat the (row name, column name) combination
# as many times as the count in that cell
for (i in 1:nrow(contingency_table)) {
  for (j in 2:ncol(contingency_table)) {
    numRepeats <- contingency_table[i, j]
    caseType <- append(caseType,
                       rep(contingency_table[i, 1], numRepeats))
    conditionType <- append(conditionType,
                            rep(colnames(contingency_table)[j], numRepeats))
  }
}

# Construct the flat table from the vectors
flatTable <- data.frame(caseType, conditionType)

CrossTable(flatTable$caseType, flatTable$conditionType,
           dnn = c("Transportation Type", "Accident type"),
           expected = TRUE)
```

```
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4898
##
##
## | Accident type
## Transportation Type | Total Fatalities | Total Injuries | Row Total |
## -------------------------------------------------|------------------|------------------|------------------|
## Bicycles | 8 | 31 | 39 |
## | 5.112 | 33.888 | |
## | 1.632 | 0.246 | |
## | 0.205 | 0.795 | 0.008 |
## | 0.012 | 0.007 | |
## | 0.002 | 0.006 | |
## -------------------------------------------------|------------------|------------------|------------------|
## Other modes of road transport (auto, bus, lorry) | 252 | 1380 | 1632 |
## | 213.913 | 1418.087 | |
## | 6.782 | 1.023 | |
## | 0.154 | 0.846 | 0.333 |
## | 0.393 | 0.324 | |
## | 0.051 | 0.282 | |
## -------------------------------------------------|------------------|------------------|------------------|
## Pedestrian | 284 | 1346 | 1630 |
## | 213.650 | 1416.350 | |
## | 23.164 | 3.494 | |
## | 0.174 | 0.826 | 0.333 |
## | 0.442 | 0.316 | |
## | 0.058 | 0.275 | |
## -------------------------------------------------|------------------|------------------|------------------|
## Two-wheelers | 98 | 1499 | 1597 |
## | 209.325 | 1387.675 | |
## | 59.206 | 8.931 | |
## | 0.061 | 0.939 | 0.326 |
## | 0.153 | 0.352 | |
## | 0.020 | 0.306 | |
## -------------------------------------------------|------------------|------------------|------------------|
## Column Total | 642 | 4256 | 4898 |
## | 0.131 | 0.869 | |
## -------------------------------------------------|------------------|------------------|------------------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 104.4776 d.f. = 3 p = 1.692478e-22
##
##
##
```
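The expected counts, per-cell contributions and test statistic in the output above can be reproduced by hand in base R. This is a minimal sketch that hard-codes the 2017 counts from the contingency table; the variable names (`observed`, `expected`, `contribution`, `chi_sq`) are my own.

```r
# 2017 counts copied from the contingency table above
observed <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contribution to the statistic, then the statistic itself
contribution <- (observed - expected)^2 / expected
chi_sq <- sum(contribution)

# Degrees of freedom and upper-tail p-value
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(expected, 3))
print(round(contribution, 3))
cat("Chi^2 =", round(chi_sq, 4), " d.f. =", df, " p =", p_value, "\n")
```

Working from the matrix directly also avoids expanding the table into one row per incident, which is only needed because of how `CrossTable` is called above.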

```
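# rescale.p only applies when `p` is supplied (goodness-of-fit tests),
# so it is omitted for this test of independence
chi.test <- chisq.test(contingency_table[, 2:3])
print(chi.test)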
```

```
##
## Pearson's Chi-squared test
##
## data: contingency_table[, 2:3]
## X-squared = 104.48, df = 3, p-value < 2.2e-16
```
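The object returned by `chisq.test` also holds the expected counts and residuals, which help show which cells drive the result. A minimal sketch, again hard-coding the 2017 counts; the names `observed` and `result` are my own.

```r
# 2017 counts copied from the contingency table above
observed <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

result <- chisq.test(observed)

# Expected counts under independence
print(round(result$expected, 1))

# Standardized residuals: large absolute values mark the cells
# that deviate most from independence
print(round(result$stdres, 2))
```

A negative residual in a fatalities cell means fewer fatalities than independence would predict, and a positive one means more.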

`chi.sq.plot(chi.sq = chi.test$statistic, df = chi.test$parameter, title = 'Null hypothesis to test independence')`

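As p < α, where α = 0.05, I reject the null hypothesis. The share of fatalities differs significantly between types of transport. In the 2017 data, two-wheelers have the lowest share of fatalities among recorded casualties (6.1%) and bicycles the highest (20.5%), although the bicycle figure rests on only 39 cases. By this measure, travelling on a two-wheeler is the safest while cycling is the most dangerous.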
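The ranking in the conclusion can be checked by computing the share of fatalities for each type of transport. A minimal sketch using the 2017 counts; `observed` and `fatality_share` are names I introduce here.

```r
# 2017 counts copied from the contingency table above
observed <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

# Row-wise proportions: fatalities / (fatalities + injuries) per transport type
fatality_share <- prop.table(observed, margin = 1)[, "Total Fatalities"]

# Sorted from lowest to highest share of fatalities
print(round(sort(fatality_share), 3))
```

These shares are fatalities relative to recorded casualties, not relative to the number of people using each mode, so they describe how severe incidents are and not how likely they are.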

Created on 24th June 2019, Achyuthuni Sri Harsha
