From previous EDA steps, we recognize that the number of car accidents is associated to many factors, such as the time in the day, borough, and workday or weekend. Also, the severity level relates to weather conditions. In this section, we will examine all these four associations.
First, we want to examine whether the time in the day when the accident happend(day/night) relates to the number of accidents.
We divide a day into two parts, day and night, to see whether there is a linear relationship between car accidents number and day or night time of a day. We set 7:00 am to 7:00 pm as day, and 7:00 pm to 7:00 am as night.
reg_day_night_data = accidents1 %>%
mutate(hour = as.numeric(hour),
day_night = ifelse(hour %in% c(7,8,9,10,11,12,13,14,15,16,17,18), "day", "night")) %>%
group_by(crash_date, day_night) %>%
dplyr::summarize(n_obs = n())
reg_day_night = lm(n_obs ~ day_night, reg_day_night_data)
summary(reg_day_night)
##
## Call:
## lm(formula = n_obs ~ day_night, data = reg_day_night_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140.93 -52.74 -9.99 30.65 377.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.934 4.884 39.91 <2e-16 ***
## day_nightnight -80.442 6.907 -11.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.98 on 482 degrees of freedom
## Multiple R-squared: 0.2196, Adjusted R-squared: 0.218
## F-statistic: 135.6 on 1 and 482 DF, p-value: < 2.2e-16
From the regression result, we can see that the \(p-value\) is less than 0.01, so we have
enough evidence to say that there is a linear association between the
number of accidents and the variable day_night. The
regression line is:
\[E(accidents) = 137.669 - 72.281I(day\_night
= night)\] Therefore, there are 72 more accidents expected at
night than in the daytime. More cars on streets in the daytime may
contribute to this result, as car accidents are more likely to happen in
heavy traffic.
Second, we will figure out whether there is a linear relationship between borough and the number of accidents.
reg_borough_data = accidents1 %>%
filter(borough %in% c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")) %>%
group_by(crash_date) %>%
count(borough)
reg_borough = lm(n ~ borough, data = reg_borough_data)
summary(reg_borough)
##
## Call:
## lm(formula = n ~ borough, data = reg_borough_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.864 -12.913 -1.384 8.124 124.079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.913 1.352 28.791 < 2e-16 ***
## boroughBrooklyn 30.950 1.911 16.192 < 2e-16 ***
## boroughManhattan -8.529 1.911 -4.462 8.87e-06 ***
## boroughQueens 19.008 1.911 9.945 < 2e-16 ***
## boroughStaten Island -32.938 1.911 -17.232 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.03 on 1205 degrees of freedom
## Multiple R-squared: 0.528, Adjusted R-squared: 0.5264
## F-statistic: 337 on 4 and 1205 DF, p-value: < 2.2e-16
As the \(p-value\) of coefficients
is less than the predetermined significance level of 0.05, we have
evidence to reject the null hypothesis that the number of accidents are
the same in different boroughs. The regression model is: \[E(accidents) = 38.913 + 30.950I(borough =
Brooklyn) - 8.529I(borough = Manhattan) + 19.008I(borough = Queens) -
32.938I(borough = Staten Island)\]
From the summary of the regression result, the reference borough is
Bronx. The average number of accidents in Bronx is 38.91. In Brooklyn,
the average number of accidents is 30.95 higher than Bronx. The average
number of accidents in Manhattan is 8.53 lower than Bronx. For Queens,
the average number of accidents is 19 less than Bronx. The average
number of accidents in Staten Island is 32.94 lower than Bronx. Brooklyn
has the highest average number of accidents and Staten Island has the
least average number of accidents.
Next, we want to examine whether workdays/weekends relates to the number of accidents.
As weekdays is a categorical variable so we take
workdays as the reference category. To test whether there is a linear
association between the number of car accidents and workdays, we conduct
the hypothesis test and set \(\alpha\)
= 0.1.
reg_workdays_weekends_data =
accidents1 %>%
filter(!is.na(borough)) %>%
mutate(weekdays_type = if_else(weekday %in% c("Saturday", "Sunday"), "Weekends", "Workdays")) %>%
dplyr::select(-borough) %>%
group_by(crash_date, weekdays_type) %>%
count(crash_date, weekdays_type) %>%
mutate(weekdays_type = as.factor(weekdays_type))
reg_workdays_weekends = lm(n ~ weekdays_type, data = reg_workdays_weekends_data)
summary(reg_workdays_weekends)
##
## Call:
## lm(formula = n ~ weekdays_type, data = reg_workdays_weekends_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.95 -68.95 -11.45 69.22 327.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 209.948 6.639 31.623 <2e-16 ***
## weekdays_typeWeekends -24.165 12.433 -1.944 0.0531 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.32 on 240 degrees of freedom
## Multiple R-squared: 0.0155, Adjusted R-squared: 0.01139
## F-statistic: 3.778 on 1 and 240 DF, p-value: 0.05311
According to the results, \(p-value\) = 0.05311 < 0.1, so we can
reject the null and conclude that we are 90% confident that there is a
significant linear association between the number of car accidents and
weekdays. The regression line is:
\[E(accidents) = 209.948 - 24.165I(weekdays =
weekends)\] On weekends, the average number of car accidents will
decrease by approximately 24.
Finally, we want to explore the association between severity and weather conditions: visibility, temperature, and pressure.
This is the correlation plot between the number of accidents and weather conditions.
chart.Correlation(as.matrix(accidents2[, c(2, 25:29, 31:32)]), histogram=TRUE, pch=19)

From the plot, we can see that there is a potential linear relationship between severity and weather conditions. Then, we investigate the interaction of temperature and pressure.
reg_weather = lm(severity ~ visibility_mi + temperature_f * pressure_in, data = accidents2)
summary(reg_weather)
##
## Call:
## lm(formula = severity ~ visibility_mi + temperature_f * pressure_in,
## data = accidents2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6245 -0.3832 -0.2667 0.3783 1.8275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.170686 8.219341 -2.211 0.027210 *
## visibility_mi -0.015607 0.007020 -2.223 0.026363 *
## temperature_f 0.554606 0.156171 3.551 0.000396 ***
## pressure_in 0.674596 0.274756 2.455 0.014195 *
## temperature_f:pressure_in -0.018231 0.005216 -3.495 0.000488 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6405 on 1454 degrees of freedom
## (30 observations deleted due to missingness)
## Multiple R-squared: 0.05344, Adjusted R-squared: 0.05084
## F-statistic: 20.52 on 4 and 1454 DF, p-value: < 2.2e-16
Based on the summary output of the model, it is indicated that the main effect of temperature and its interaction with pressure have the most significance. The main effects of visibility and pressure are also significant with \(p-values\) less than the predetermined level of 0.05. The regression line is: \[E(severity) = -18.170 - 0.016visibility + 0.555temperature + 0.675pressure - 0.018temperature*pressure\]