
Multivariate Analysis (R)

Introduction

Multivariate EDA techniques show the relationship between two or more variables and the dependent variable, using cross-tabulation, summary statistics, or plots. In the current problem, they help me look at the relationships in my data.

This blog is part of my in-time analysis problem. I want to analyse my entry time at the office and understand what factors affect it.
After integrating Google Maps data with the attendance dataset, I currently have the following factors:
1. date (month / week day / season etc)
2. main_activity (means of transport)
3. hours.worked (of the previous day)
4. travelling.time (time it took to travel from house to office)
5. home.addr (the place of residence)

The dependent variable is diff.in.time (the difference between my actual in-time and the policy in-time). Negative values mean I arrived after the policy in-time. A sample of the data is shown below.

Sample Data
diff.in.time  date        main_activity  hours.worked  travelling.time  home.addr  diff.out.time
-9            2018-08-14  IN_VEHICLE     8.933333      900.719          Old House  5
17            2018-03-16  ON_FOOT        9.116667      930.126          Old House  -10
-14           2018-09-10  ON_FOOT        4.583333      1179.873         Old House  -251
-7            2018-10-19  ON_BICYCLE     9.583333      1501.060         New House  42
-9            2018-06-28  IN_VEHICLE     9.783333      670.700          Old House  56

Cross-tabulation

For categorical data, cross-tabulation is very useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels. The two variables might be both explanatory, both outcome, or one of each.
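As a minimal sketch of the count-based table described above, base R's `table()` can be applied to two categorical columns. The data frame below holds only the five sample rows shown earlier, and the name `travel_sample` is made up for this illustration; the full `travel` data has many more rows, so its counts will differ.

```r
# Five sample rows from the post; `travel_sample` is a hypothetical name
travel_sample <- data.frame(
  home.addr     = c("Old House", "Old House", "Old House", "New House", "Old House"),
  main_activity = c("IN_VEHICLE", "ON_FOOT", "ON_FOOT", "ON_BICYCLE", "IN_VEHICLE"),
  stringsAsFactors = FALSE
)

# Two-way table of counts: rows are residences, columns are modes of transport
counts <- table(travel_sample$home.addr, travel_sample$main_activity)
print(counts)

# The same table with row and column totals added as margins
print(addmargins(counts))
```

Each cell holds the number of days that share a given residence and mode of transport, which is the two-way table of counts in its plainest form.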

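I am using kable (with kableExtra) to render the tables. Instead of counts, the table below reports the average travelling time and the mean and median in-time difference for each combination of residence and mode of transport.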

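library(dplyr)       # group_by(), summarise(), arrange(), %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), collapse_rows(), scroll_box()

# Summarise travelling time and in-time difference for each residence / transport pair
cross_table <- travel %>%
  group_by(home.addr, main_activity) %>%
  summarise(avg.travel.time = mean(travelling.time),
            avg.in.time.diff = mean(diff.in.time),
            median.in.time.diff = median(diff.in.time)) %>%
  arrange(home.addr, main_activity)

# Render the summary as a styled table, merging repeated home.addr cells
kable(cross_table, caption = 'Cross Tabulation') %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "middle") %>%
  scroll_box()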
Cross Tabulation
home.addr  main_activity  avg.travel.time  avg.in.time.diff  median.in.time.diff
New House  IN_VEHICLE     1285.0264        -1.800000         -3
New House  ON_BICYCLE     1547.5557        -4.000000         -6
New House  ON_FOOT        1695.7091        5.285714          5
Old House  IN_VEHICLE     771.1752         2.857143          -4
Old House  ON_BICYCLE     1029.6329        14.941176         18
Old House  ON_FOOT        1170.4783        17.433628         17

Scatter plots

Scatter plots show how one numeric variable varies with another.

To see how travelling time affects in-time:

library(ggplot2)

# In-time difference on the x-axis, travelling time on the y-axis, coloured by mode of transport
ggplot(travel, aes(x = diff.in.time, y = travelling.time, color = main_activity)) +
  geom_point(show.legend = TRUE) +
  labs(x = 'In-time difference (minutes)', y = 'Travelling time (seconds)',
       title = "Travelling time vs in-time", color = 'Mode of transport') +
  theme_minimal() +
  theme(legend.position = "bottom")

From the above graph, I can see that:
1. By bicycle, as travelling time decreases (low traffic), the in-time difference increases (I come to office earlier).
2. On foot, there seems to be no relationship between travelling time (traffic) and in-time difference.
3. By vehicle, travelling time has little effect on in-time difference.

To see how hours worked (on the previous day) affects in-time:
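The code for this plot is not included in the post. A minimal sketch, assuming it mirrors the travelling-time scatter plot with `hours.worked` in place of `travelling.time`, could look like the following. The five sample rows stand in for the full `travel` data, and the name `travel_sample`, the axis assignment and the labels are all assumptions.

```r
library(ggplot2)

# Five sample rows from the post; `travel_sample` is a hypothetical name
travel_sample <- data.frame(
  diff.in.time  = c(-9, 17, -14, -7, -9),
  main_activity = c("IN_VEHICLE", "ON_FOOT", "ON_FOOT", "ON_BICYCLE", "IN_VEHICLE"),
  hours.worked  = c(8.933333, 9.116667, 4.583333, 9.583333, 9.783333)
)

# In-time difference against hours worked on the previous day, coloured by mode of transport
p_hours <- ggplot(travel_sample,
                  aes(x = diff.in.time, y = hours.worked, color = main_activity)) +
  geom_point() +
  labs(x = "In-time difference (minutes)",
       y = "Hours worked on the previous day",
       title = "Hours worked vs in-time",
       color = "Mode of transport") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p_hours)
```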

From the above graph, I can observe that, irrespective of mode of transport, my in-time difference increases (I come to office earlier) as the hours worked on the previous day increase.

Box plots

Similarly, I want to see how mode of transport affects in-time difference. For a categorical variable, box plots display this information best.

ggplot(travel, aes(x=main_activity, y= diff.in.time, group = main_activity)) + 
  geom_boxplot() +
  labs(x='Mode of transport', y='In time difference (min)') + 
  theme_minimal()

From the above graph, I can observe that:
1. By vehicle, I reached the office ~12 minutes after the policy in-time on average (in-time difference is -12).
2. By bicycle, I reached the office close to the policy in-time.
3. On foot, I almost always arrived before the policy in-time.
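The line inside each box is the median, so readings taken from a box plot can be cross-checked numerically. A minimal sketch using only the five sample rows shown earlier (`travel_sample` is a hypothetical name, and these values will differ from those of the full data):

```r
# Five sample rows from the post; `travel_sample` is a hypothetical name
travel_sample <- data.frame(
  diff.in.time  = c(-9, 17, -14, -7, -9),
  main_activity = c("IN_VEHICLE", "ON_FOOT", "ON_FOOT", "ON_BICYCLE", "IN_VEHICLE")
)

# Median in-time difference per mode of transport (the line inside each box)
med_by_mode <- tapply(travel_sample$diff.in.time, travel_sample$main_activity, median)
print(med_by_mode)

# Quartiles per mode of transport (lower edge, middle line and upper edge of each box)
q_by_mode <- tapply(travel_sample$diff.in.time, travel_sample$main_activity,
                    quantile, probs = c(0.25, 0.5, 0.75))
print(q_by_mode)
```

Running the same two calls on the full `travel` data frame would give the numbers behind each box.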

Similarly, for place of residence:
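The code for the residence box plot is not included in the post. A minimal sketch, assuming it mirrors the mode-of-transport box plot with `home.addr` on the x-axis, is given here. The five sample rows stand in for the full `travel` data, and the name `travel_sample` and the axis labels are assumptions.

```r
library(ggplot2)

# Five sample rows from the post; `travel_sample` is a hypothetical name
travel_sample <- data.frame(
  diff.in.time = c(-9, 17, -14, -7, -9),
  home.addr    = c("Old House", "Old House", "Old House", "New House", "Old House")
)

# One box per place of residence, showing the spread of in-time difference
p_home <- ggplot(travel_sample, aes(x = home.addr, y = diff.in.time)) +
  geom_boxplot() +
  labs(x = "Place of residence", y = "In-time difference (min)") +
  theme_minimal()

print(p_home)
```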

From this graph, I can see that from the New House I arrived ~5 minutes after the policy in-time, while I used to be on time when living in the Old House.

Created using R Markdown.

Credits:
Thinkstats
Experimental Design and Analysis
