# Multivariate Analysis

*Harsha Achyuthuni*

*December 18, 2018*

## Introduction

Multivariate EDA techniques show how two or more variables relate to each other, usually one or more explanatory variables against the dependent variable, through cross-tabulation, summary statistics or visualization. In the current problem, they help me see how each factor relates to my in-time.

This post is part of my in-time analysis problem. I want to analyse my entry time at the office and understand which factors affect it.

After integrating Google Maps data with the attendance dataset, I chose the following as my factors:

1. date (month / week day / season etc)

2. main_activity (means of transport)

3. hours.worked (of the previous day)

4. travelling.time (time it took to travel from house to office)

5. home.addr (the place of residence)

The dependent variable is diff.in.time (the difference between the policy in-time and my actual in-time, in minutes; positive values mean I came in early). A sample of the data is shown below:

row | diff.in.time | date | main_activity | hours.worked | travelling.time | home.addr | diff.out.time |
---|---|---|---|---|---|---|---|
88 | 23 | 2018-04-13 | ON_BICYCLE | 9.316667 | 918.324 | Old House | -4 |
31 | 38 | 2018-01-12 | ON_FOOT | 2.083333 | 1477.822 | Old House | -453 |
103 | 20 | 2018-05-09 | ON_FOOT | 9.716667 | 749.121 | Old House | 23 |
208 | -13 | 2018-10-24 | ON_BICYCLE | 9.800000 | 1683.958 | New House | 61 |
143 | -3 | 2018-07-10 | IN_VEHICLE | 9.766667 | 574.590 | Old House | 49 |
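
To experiment with the code in this post without the full dataset, the five sample rows can be typed in by hand. This is only a sketch: the column types (for example `Date` for `date` and plain character vectors for the categorical columns) are assumptions, and five rows are far too few to reproduce any of the results below.

```r
# The five sample rows from the table above, entered manually.
# Column types are assumed; the original dataset may store them differently.
travel <- data.frame(
  diff.in.time    = c(23, 38, 20, -13, -3),
  date            = as.Date(c("2018-04-13", "2018-01-12", "2018-05-09",
                              "2018-10-24", "2018-07-10")),
  main_activity   = c("ON_BICYCLE", "ON_FOOT", "ON_FOOT",
                      "ON_BICYCLE", "IN_VEHICLE"),
  hours.worked    = c(9.316667, 2.083333, 9.716667, 9.800000, 9.766667),
  travelling.time = c(918.324, 1477.822, 749.121, 1683.958, 574.590),
  home.addr       = c("Old House", "Old House", "Old House",
                      "New House", "Old House"),
  diff.out.time   = c(-4, -453, 23, 61, 49),
  stringsAsFactors = FALSE
)

# Inspect the structure of the toy data frame
str(travel)
```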

## Cross-tabulation

For categorical data cross-tabulation is very useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels. The two variables might be both explanatory, both outcome, or one of each.
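
A count-based two-way table of this kind can be sketched with base R's `table()`. The example uses only the two categorical columns of the five sample rows shown earlier, so the counts are illustrative and say nothing about the full dataset; the object name `counts` is just a placeholder.

```r
# Categorical columns of the five sample rows (not the full dataset)
travel <- data.frame(
  home.addr     = c("Old House", "Old House", "Old House",
                    "New House", "Old House"),
  main_activity = c("ON_BICYCLE", "ON_FOOT", "ON_FOOT",
                    "ON_BICYCLE", "IN_VEHICLE")
)

# Rows are the levels of home.addr, columns the levels of main_activity;
# each cell counts the observations sharing that pair of levels
counts <- table(travel$home.addr, travel$main_activity,
                dnn = c("home.addr", "main_activity"))
print(counts)

# Add row and column totals
print(addmargins(counts))
```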

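I am using kable (with the kableExtra package) to make cool tables. Instead of raw counts, the table below summarises the travelling time and the in-time difference for each combination of home.addr and main_activity.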

```
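library(dplyr)       # group_by(), summarise(), arrange(), %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), collapse_rows(), scroll_box()

# Mean travelling time and mean/median in-time difference
# for every residence and mode-of-transport combination
cross_table <- travel %>%
  group_by(home.addr, main_activity) %>%
  summarise(avg.travel.time = mean(travelling.time),
            avg.in.time.diff = mean(diff.in.time),
            median.in.time.diff = median(diff.in.time)) %>%
  arrange(home.addr, main_activity)

# Render the summary as a styled table, merging repeated home.addr cells
kable(cross_table, caption = 'Cross Tabulation') %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "middle") %>%
  scroll_box()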
```

home.addr | main_activity | avg.travel.time | avg.in.time.diff | median.in.time.diff |
---|---|---|---|---|
New House | IN_VEHICLE | 1334.0650 | -1.500000 | -1.0 |
New House | ON_BICYCLE | 1547.5557 | -4.000000 | -6.0 |
New House | ON_FOOT | 1695.7091 | 5.285714 | 5.0 |
Old House | IN_VEHICLE | 771.1752 | 2.857143 | -4.0 |
Old House | ON_BICYCLE | 997.2439 | 15.555556 | 19.5 |
Old House | ON_FOOT | 1176.9413 | 17.357143 | 17.0 |

## Scatter plots

Scatter plots show how one numeric variable changes with another.

Let’s see how the travelling time affects the in-time:

```
library(ggplot2)
# In-time difference on x, travelling time on y, one colour per mode of transport
ggplot(travel, aes(x = diff.in.time, y = travelling.time, color = main_activity)) +
  geom_point(show.legend = TRUE) +
  labs(x = 'In-time difference (minutes)', y = 'Travelling time (seconds)',
       title = "Travelling time vs in-time", color = 'Mode of transport') +
  theme_minimal() +
  theme(legend.position = "bottom")
```

From the above graph, I can see that:

1. By bicycle, as travelling time decreases (low traffic), the in-time difference increases (I come to the office earlier).

2. There seems to be no relationship between travelling time (traffic) and in-time difference when on foot.

3. Travelling time has little effect on in-time difference when travelling by vehicle.
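
One way to put a number on visual impressions like these is a Pearson correlation computed separately for each mode of transport. The sketch below is not part of the original analysis: `cor_by_mode` is a hypothetical helper, and the `toy` rows are made-up values used only to show that the function runs, not the data behind the plot.

```r
# Hypothetical helper: Pearson correlation between travelling time and
# in-time difference, computed within each mode of transport
cor_by_mode <- function(df) {
  sapply(split(df, df$main_activity), function(g) {
    # Too few points for a meaningful correlation
    if (nrow(g) < 3) return(NA_real_)
    cor(g$travelling.time, g$diff.in.time)
  })
}

# Made-up rows purely to exercise the function; NOT the data from this post
toy <- data.frame(
  main_activity   = rep(c("ON_BICYCLE", "ON_FOOT"), each = 4),
  travelling.time = c(900, 1000, 1100, 1200, 1100, 1200, 1300, 1400),
  diff.in.time    = c(20, 15, 10, 5, 10, 12, 10, 12)
)

res <- cor_by_mode(toy)
print(round(res, 2))
```

A value near -1 for a mode would match the bicycle pattern described above (shorter trips going with earlier arrival), while a value near 0 would match the "no relationship" reading.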

Let’s see how the hours worked (on the previous day) affect the in-time:
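
A plot of this kind could be drawn along the following lines. This is a sketch that mirrors the previous scatter plot with `hours.worked` on the x-axis; the axis choice, labels and title are assumptions, and the data frame holds only the five sample rows shown earlier rather than the full dataset.

```r
library(ggplot2)

# Relevant columns of the five sample rows (not the full dataset)
travel <- data.frame(
  diff.in.time  = c(23, 38, 20, -13, -3),
  hours.worked  = c(9.316667, 2.083333, 9.716667, 9.800000, 9.766667),
  main_activity = c("ON_BICYCLE", "ON_FOOT", "ON_FOOT",
                    "ON_BICYCLE", "IN_VEHICLE")
)

# Hours worked on the previous day vs in-time difference,
# coloured by mode of transport
p <- ggplot(travel, aes(x = hours.worked, y = diff.in.time,
                        color = main_activity)) +
  geom_point() +
  labs(x = "Hours worked on the previous day",
       y = "In-time difference (minutes)",
       title = "Hours worked vs in-time",
       color = "Mode of transport") +
  theme_minimal() +
  theme(legend.position = "bottom")
print(p)
```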

From the above graph, I can observe that, irrespective of the mode of transport, my in-time difference increases (I come to the office earlier) as the hours worked on the previous day increase.

## Box plots

Similarly, I want to see how the mode of transport affects the in-time difference. For a categorical variable, box plots display this information well.

```
# One box of in-time differences per mode of transport
ggplot(travel, aes(x = main_activity, y = diff.in.time, group = main_activity)) +
  geom_boxplot() +
  labs(x = 'Mode of transport', y = 'In-time difference (min)') +
  theme_minimal()
```

From the above graph, I can observe that:

1. By vehicle, I reached the office, on average, ~12 minutes after the policy in-time (in-time difference of -12).

2. By cycle, I reached the office close to the policy in-time.

3. While walking, I almost always reached the office before the policy in-time.
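
The line in the middle of each box is the group median, so readings like these can be cross-checked numerically. A minimal sketch using base R's `tapply()` on the five sample rows shown earlier; with so few rows the numbers will not match the observations above, and `medians` is just a placeholder name.

```r
# Relevant columns of the five sample rows (not the full dataset)
travel <- data.frame(
  diff.in.time  = c(23, 38, 20, -13, -3),
  main_activity = c("ON_BICYCLE", "ON_FOOT", "ON_FOOT",
                    "ON_BICYCLE", "IN_VEHICLE")
)

# Median in-time difference (minutes) for each mode of transport
medians <- tapply(travel$diff.in.time, travel$main_activity, median)
print(medians)
```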

Similarly for place of residence.
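
Such a plot could be produced by swapping `home.addr` in as the grouping variable. The sketch below assumes the same column names as before and uses only the five sample rows shown earlier, so the single New House row is drawn as a flat line rather than a box.

```r
library(ggplot2)

# Relevant columns of the five sample rows (not the full dataset)
travel <- data.frame(
  diff.in.time = c(23, 38, 20, -13, -3),
  home.addr    = c("Old House", "Old House", "Old House",
                   "New House", "Old House")
)

# One box of in-time differences per place of residence
p <- ggplot(travel, aes(x = home.addr, y = diff.in.time)) +
  geom_boxplot() +
  labs(x = "Place of residence", y = "In-time difference (min)") +
  theme_minimal()
print(p)
```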

From this graph, I can see that from the New House I arrived about 5 minutes after the policy in-time, while I used to be on time when living in the Old House.

Created using R Markdown.
