Regression Analysis on NYC Accidents

From previous EDA steps, we recognize that the number of car accidents is associated to many factors, such as the time in the day, borough, and workday or weekend. Also, the severity level relates to weather conditions. In this section, we will examine all these four associations.

Regression of accidents against day/night

First, we want to examine whether the time in the day when the accident happend(day/night) relates to the number of accidents.

We divide a day into two parts, day and night, to see whether there is a linear relationship between car accidents number and day or night time of a day. We set 7:00 am to 7:00 pm as day, and 7:00 pm to 7:00 am as night.

reg_day_night_data = accidents1 %>%
  mutate(hour = as.numeric(hour),
         day_night = ifelse(hour %in% c(7,8,9,10,11,12,13,14,15,16,17,18), "day", "night")) %>%
  group_by(crash_date, day_night) %>% 
  dplyr::summarize(n_obs = n())

reg_day_night = lm(n_obs ~ day_night, reg_day_night_data)
summary(reg_day_night)

## 
## Call:
## lm(formula = n_obs ~ day_night, data = reg_day_night_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -140.93  -52.74   -9.99   30.65  377.07 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     194.934      4.884   39.91   <2e-16 ***
## day_nightnight  -80.442      6.907  -11.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.98 on 482 degrees of freedom
## Multiple R-squared:  0.2196, Adjusted R-squared:  0.218 
## F-statistic: 135.6 on 1 and 482 DF,  p-value: < 2.2e-16

From the regression result, we can see that the \(p-value\) is less than 0.01, so we have enough evidence to say that there is a linear association between the number of accidents and the variable day_night. The regression line is:
\[E(accidents) = 137.669 - 72.281I(day\_night = night)\] Therefore, there are 72 more accidents expected at night than in the daytime. More cars on streets in the daytime may contribute to this result, as car accidents are more likely to happen in heavy traffic.

Regression of accidents against borough

Second, we will figure out whether there is a linear relationship between borough and the number of accidents.

reg_borough_data = accidents1 %>% 
  filter(borough %in% c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")) %>% 
  group_by(crash_date) %>% 
  count(borough)

reg_borough = lm(n ~ borough, data = reg_borough_data)
summary(reg_borough)

## 
## Call:
## lm(formula = n ~ borough, data = reg_borough_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.864 -12.913  -1.384   8.124 124.079 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            38.913      1.352  28.791  < 2e-16 ***
## boroughBrooklyn        30.950      1.911  16.192  < 2e-16 ***
## boroughManhattan       -8.529      1.911  -4.462 8.87e-06 ***
## boroughQueens          19.008      1.911   9.945  < 2e-16 ***
## boroughStaten Island  -32.938      1.911 -17.232  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.03 on 1205 degrees of freedom
## Multiple R-squared:  0.528,  Adjusted R-squared:  0.5264 
## F-statistic:   337 on 4 and 1205 DF,  p-value: < 2.2e-16

As the \(p-value\) of coefficients is less than the predetermined significance level of 0.05, we have evidence to reject the null hypothesis that the number of accidents are the same in different boroughs. The regression model is: \[E(accidents) = 38.913 + 30.950I(borough = Brooklyn) - 8.529I(borough = Manhattan) + 19.008I(borough = Queens) - 32.938I(borough = Staten Island)\]
From the summary of the regression result, the reference borough is Bronx. The average number of accidents in Bronx is 38.91. In Brooklyn, the average number of accidents is 30.95 higher than Bronx. The average number of accidents in Manhattan is 8.53 lower than Bronx. For Queens, the average number of accidents is 19 less than Bronx. The average number of accidents in Staten Island is 32.94 lower than Bronx. Brooklyn has the highest average number of accidents and Staten Island has the least average number of accidents.

Regression of accidents against workday/weekend

Next, we want to examine whether workdays/weekends relates to the number of accidents.

As weekdays is a categorical variable so we take workdays as the reference category. To test whether there is a linear association between the number of car accidents and workdays, we conduct the hypothesis test and set \(\alpha\) = 0.1.

reg_workdays_weekends_data = 
  accidents1 %>% 
  filter(!is.na(borough)) %>%
  mutate(weekdays_type = if_else(weekday %in% c("Saturday", "Sunday"), "Weekends", "Workdays")) %>%
  dplyr::select(-borough) %>%
  group_by(crash_date, weekdays_type) %>%
  count(crash_date, weekdays_type) %>%
  mutate(weekdays_type = as.factor(weekdays_type))

reg_workdays_weekends = lm(n ~ weekdays_type, data = reg_workdays_weekends_data)
summary(reg_workdays_weekends)

## 
## Call:
## lm(formula = n ~ weekdays_type, data = reg_workdays_weekends_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -145.95  -68.95  -11.45   69.22  327.22 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            209.948      6.639  31.623   <2e-16 ***
## weekdays_typeWeekends  -24.165     12.433  -1.944   0.0531 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.32 on 240 degrees of freedom
## Multiple R-squared:  0.0155, Adjusted R-squared:  0.01139 
## F-statistic: 3.778 on 1 and 240 DF,  p-value: 0.05311

According to the results, \(p-value\) = 0.05311 < 0.1, so we can reject the null and conclude that we are 90% confident that there is a significant linear association between the number of car accidents and weekdays. The regression line is:
\[E(accidents) = 209.948 - 24.165I(weekdays = weekends)\] On weekends, the average number of car accidents will decrease by approximately 24.

Regression of accidents against weather condition

Finally, we want to explore the association between severity and weather conditions: visibility, temperature, and pressure.

This is the correlation plot between the number of accidents and weather conditions.

chart.Correlation(as.matrix(accidents2[, c(2, 25:29, 31:32)]), histogram=TRUE, pch=19)

From the plot, we can see that there is a potential linear relationship between severity and weather conditions. Then, we investigate the interaction of temperature and pressure.

reg_weather = lm(severity ~ visibility_mi + temperature_f * pressure_in, data = accidents2) 
summary(reg_weather)

## 
## Call:
## lm(formula = severity ~ visibility_mi + temperature_f * pressure_in, 
##     data = accidents2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6245 -0.3832 -0.2667  0.3783  1.8275 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -18.170686   8.219341  -2.211 0.027210 *  
## visibility_mi              -0.015607   0.007020  -2.223 0.026363 *  
## temperature_f               0.554606   0.156171   3.551 0.000396 ***
## pressure_in                 0.674596   0.274756   2.455 0.014195 *  
## temperature_f:pressure_in  -0.018231   0.005216  -3.495 0.000488 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6405 on 1454 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.05344,    Adjusted R-squared:  0.05084 
## F-statistic: 20.52 on 4 and 1454 DF,  p-value: < 2.2e-16

Based on the summary output of the model, it is indicated that the main effect of temperature and its interaction with pressure have the most significance. The main effects of visibility and pressure are also significant with \(p-values\) less than the predetermined level of 0.05. The regression line is: \[E(severity) = -18.170 - 0.016visibility + 0.555temperature + 0.675pressure - 0.018temperature*pressure\]