During the first couple of months in New York, one of our group members witnessed a car accident which caused a huge traffic jam on her way to class. This experience motivates us to do some research and investigate car accidents in New York. According to the Department of Health in New York, there were 1,098 deaths each year due to unintentional motor vehicle traffic-related injuries and 12,093 hospitalizations each year due to motor vehicle traffic-related injuries.
Therefore, we are interested in exploring some factors related to car accidents. Through data analysis, we will obtain the peak time period, concentrated areas, and main causes of accidents, aiming to effectively decrease the rate of car accidents taking place and further ensure personal safety.
We used two datasets from Kaggle to conduct our analysis of the car accidents in New York City:
Motor vehicle collisions reported by the New York City Police Department from January-August 2020. Each record represents an individual collision, including the date, time and location of the accident (borough, zip code, street name, latitude/longitude), vehicles and victims involved, and contributing factors.
For the process of data cleaning, we focused on five boroughs in NY for our analysis, and we changed the borough name from all caps to only the first letter in capital. Then we separated the crash date variable into year, month, and day, and separated the crash time variable into hour, minute, and second. Also, we releveled the month and weekday into their standard name. Furthermore, we filtered out all the missing values in the borough.
These are the important variables in our analysis:
crash_data: date the car accident happened (split
into year, month, day)
crash_time: time the car accident happened during a
day (split into hour, minute, second)
borough: borough in which the car accident
happened
contributing_factor_vehicle: the contributing factor
of the car accident
vehicle_type_code: type of cars in the car
accident
weekdays: weekdays from Monday to Sunday
weekdays_type: type of weekdays (workdays and
weekends)
Lat: latitude of the accidents
Long: longitude of the accidents
This is a countrywide car accident dataset, which covers 49 states of the USA. The accident data are collected from February 2016 to Dec 2021, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 2.8 million accident records in this dataset.
To be consistent with the other dataset, we only focus on the data points that are in 2020 and New York. In the filtered dataset, these are the variables that we are interested in for analysis:
severity: severity of the accident, a number between
1 and 4. Where 1 indicates the least impact on traffic and 4 represents
the most serious impact
wind_speed _mph: the wind speed when the accidents
happened, measured in miles per hour
First, we will focus on factors that relate to the number of accidents. We use bar charts to illustrate the number of accidents by contributing factors, street, borough and month, and day.
contr_factors =
accidents1 %>%
count(contributing_factor_vehicle_1) %>%
mutate(
contributing_factor_vehicle_1 = fct_reorder(contributing_factor_vehicle_1, n),
ranking = min_rank(desc(n))
) %>%
filter(ranking <= 10) %>%
arrange(n)
contr_factors %>%
ggplot(aes(x = contributing_factor_vehicle_1, y = n, fill = contributing_factor_vehicle_1)) +
geom_col() +
labs(title = "10 Most Common Contributing Factors of Car Accidents",
x = "Contributing Factors",
y = "Number of Car Accidents") +
coord_flip() +
theme(legend.position = "none", plot.title = element_text(hjust = 1))

We first tried to figure out which factors are the leading factors of car accidents. We constructed a bar plot illustrating the top 10 most common contributing factors of car accidents. Except accidents without specified reasons, drivers’ inattention and distraction are shown to be the top common factor resulting in car accidents. Other common reasons include following the front car too closely, failure to yield right-of-way, and so forth.
streets =
accidents1 %>%
filter(on_street_name != "NA") %>%
count(on_street_name) %>%
mutate(
on_street_name = fct_reorder(on_street_name, n),
ranking = min_rank(desc(n))
) %>%
filter(ranking <= 10) %>%
arrange(n)
streets %>%
ggplot(aes(x = on_street_name, y = n, fill = on_street_name)) +
geom_col() +
labs(title = "Top 10 Streets of Car Accidents",
x = "Street Name",
y = "Number of Car Accidents") +
coord_flip() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

Then, we explored the streets where most car accidents have taken place. The bar graph indicates that Belt Parkway has the most car accidents. Long Island Expressway, Brooklyn Queens Expressway, and FDR Drive also have a relatively great amount of accidents.
accidents1 %>%
filter(borough != "NA") %>%
group_by(borough) %>%
count(month) %>%
mutate(month = month.abb[month],
month = fct_relevel(month, c("Jan", "Feb", "Mar", "Apr","May","Jun", "Jul", "Aug"))) %>%
ggplot(aes(x = month, y = n, fill = borough)) +
geom_bar(stat = "identity",
position = "dodge") +
labs(
title = "Number of Car Accidents in Boroughs Over Month",
x = "Month",
y = "Number of Car Accidents",
fill = "Borough") +
theme(plot.title = element_text(hjust = 0.5))

We also investigated the overall pattern of car accidents in different boroughs over months. There tend to be most car accidents in January and February while least car accidents took place in April. Overall, the number of car accidents decreases from January to April and increases from April to July. Brooklyn seems to have the most car accidents while least car accidents occurred in Staten Island over the 8 months in New York City.
accidents1 %>%
group_by(crash_date) %>%
count(crash_date) %>%
ggplot(aes(x = crash_date, y = n, color = n)) +
geom_col() +
labs(title = "Overall Number of Accidents by Day",
x = "Day",
y = "Number of Car Accidents") +
theme(legend.position = "right", plot.title = element_text(hjust = 0.5))

This bar plot represents the number of car accidents each day over the 8 months. Most car accidents took place from January to March. The month with the least number of car accidents is April. This graph also matches the pattern demonstrated by the previous plot.
In this section, we will explore the trend of car accidents over time. We will analyze the pattern of number of car accidents followed through hour, day, and month in five boroughs.
accidents1 %>%
filter(!is.na(borough)) %>%
separate(crash_date, into = c('year', 'month', 'day'), sep = "-") %>%
mutate(
year = as.numeric(year),
month = as.numeric(month),
day = as.numeric(day)
) %>%
mutate(month = month.name[month],
month = fct_relevel(month, c("January", "February", "March", "April",
"May", "June", "July", "August"))) %>%
relocate(month) %>%
group_by(borough, month) %>%
dplyr::summarize(n_obs = n()) %>%
ggplot(aes(x = month, y = n_obs, group = 1, color = borough)) +
geom_point() +
geom_line() +
labs(
title = "Total Accident Numbers over Month by Borough",
x = "Month",
y = "Number of Accident in Each Month"
) +
theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 5) ) +
facet_wrap(~ borough, nrow = 1)

This graph shows the total number of accidents in 5 boroughs from January to August in 2020. We can see that in January, February, and March, the accident numbers in these boroughs are much higher than in other months, and the accident numbers in April are the lowest for 5 boroughs. Also, Brooklyn has the largest accident number and Staten Island has the smallest accident number among all boroughs in every month in New York.
accidents1 %>%
filter(!is.na(borough)) %>%
mutate(crash_date = as.Date(crash_date),
weekday = weekdays(crash_date)) %>%
mutate(weekday = as.factor(weekday),
weekday = fct_relevel(weekday, "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")) %>%
relocate(weekday) %>%
group_by(borough, weekday) %>%
dplyr::summarize(n_obs = n()) %>%
ggplot(aes(x = weekday, y = n_obs, group = 1, color = borough)) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 60, hjust = 1) ) +
facet_wrap(~ borough, nrow = 1) +
labs(
title = "Total Accident Number over Weekday by Borough",
x = "Weekday",
y = "Number of Accident",
color = "Borough"
)

Next, this plot shows the total accident number over weekday by borough. We can tell that 5 boroughs have a similar trend, and the accident number is the highest on Friday and lowest on Sunday. Brooklyn has the highest daily accident number and Staten Island has the lowest in 5 boroughs.
accidents1 %>%
filter(!is.na(borough)) %>%
separate(crash_time, into = c('hour', 'minute', 'second'), sep = ":") %>%
mutate(hour = as.numeric(hour)) %>%
group_by(borough, hour) %>%
dplyr::summarize(n_obs = n()) %>%
ggplot(aes(x = hour, y = n_obs, color = borough)) +
geom_point() +
geom_line() +
labs(
title = "Total Number of Accident During 24 Hours of a Day by Borough",
x = "Hour",
y = "Number of Accident"
) +
facet_wrap(~ borough, nrow = 2)

This plot shows the total accident number during 24 hours of a day by borough. We can see that 5 boroughs have a similar daily trend. The number of accidents is relatively low from 12:00 am to 6:00 am, then increases and keeps high from 8:00 am to 5:00 pm, and drops until 12:00 am.
accidents1 %>%
group_by(hour, weekday) %>%
count(hour) %>%
mutate(weekday = as.factor(weekday),
weekday = fct_relevel(weekday, "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday" ),
hour = as.numeric(hour)) %>%
ggplot(aes(x = hour, y = n, group = 1, color = hour)) +
geom_point() +
theme(axis.text.x = element_text(hjust = 1),
legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(
title = "Number of Accidents in Different Parts of Day by Weekday",
x = "Hour",
y = "Number of Accidents") +
scale_x_continuous(limits = c(0, 23)) +
facet_wrap(~ weekday, nrow = 2)

By stratifying the number of car accidents by weekdays and hours, we found out the number of car accidents tends to increase and then decrease during the day. The number of accidents reached the peak from 1:00 pm to 7:00 pm. During the week, the number of accidents is higher on weekdays than on weekends, and the maximum number of accidents is higher on weekdays. The reason for this phenomenon might be that people are more likely to be involved in accidents on weekdays between 1:00 pm and 7:00 pm, when they are commuting in rush hours. On weekends, people will spend more time at home, so the number of car accidents is relatively lower.
Then, we want to know how types of cars are associated with the number of accidents. A bar plot of the number of accidents by car type clearly represents the relationship.
car_type_df = accidents1 %>%
group_by(vehicle_type_code_1) %>%
dplyr::summarize(obs = n()) %>%
arrange(desc(obs))
accidents_cartype_bar =
car_type_df %>%
filter(obs > 100) %>%
filter(!is.na(vehicle_type_code_1)) %>%
filter(obs > 500) %>%
mutate(vehicle_type = fct_reorder(vehicle_type_code_1, obs)) %>%
ggplot(aes(x = vehicle_type, y = obs, fill = vehicle_type)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(title = "Number of Accidents by Car Type",
y = "Number of Accidents",
x = "Car Type") +
geom_text(size = 3, aes(label = obs), position = position_dodge(width=0.9), vjust=-0.25) +
ylim(0, 45000)
accidents_cartype_bar + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

The plot shows the frequency (>500) of accidents by car type. There are mainly two types of cars with much higher frequencies than other types: sedan and station wagon/sport utility vehicle. The frequencies of these two types in 2020 are more than 25000. Other common types are taxi, pick-up truck, and box truck. The frequencies of these three types are more than 1000. Other types are bus, bike, tractor truck diesel, and motorcycles also have a frequency higher than 500.
We also want to investigate how weather conditions make impacts on car accidents. Weather conditions and wind speed will be used to examine the relationship between weather and accidents’ severity.
accident_weather_table = accidents2 %>%
group_by(weather_condition) %>%
dplyr::summarize(obs = n()) %>%
arrange(desc(obs))
accident_weather_bar = accident_weather_table %>%
drop_na() %>%
filter(obs > 10) %>%
mutate(weather_condition = fct_reorder(weather_condition, obs)) %>%
ggplot(aes(x = weather_condition, y = obs, fill = weather_condition)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(title = "Number of Accidents by Weather Condition",
y = "Number of Accidents",
x = "Weather Condition")
accident_weather_bar + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

The plot shows the frequency of accidents by weather. The most common weather condition of accidents is fair, which may deviate from common sense. There are two possible reasons. The first one is that there are many more days of fair than other weather in 2020 in New York. Therefore, it is reasonable that the frequency in fair weather is higher. The other reason is that during fair weather, people tend to behave with less caution. Thus the probability of accidents increases. This question is worth further research, which has been listed in our group’s further research plan. Other common types of weather are cloudy, mostly cloudy, light rain, partly cloudy, rain, and fog.
accidents_windspeed = accidents2 %>%
ggplot(aes(x = wind_speed_mph, y = ..density..)) +
geom_histogram(color = "darkblue", fill = "lightblue") +
labs(
title = "Number of Accidents by Wind Speed",
x = "Wind Speed(mph)"
) +
geom_density()
accidents_windspeed

Next, this plot shows the frequency of accidents by wind speed. As shown in the plot, the distribution of frequency is right-skewed. The most common wind speed is from 0 to 7.5 miles per hour. The frequency of this interval is higher than 150. From 7.5 to 12.5 miles per hour, the frequency is moderate. The frequency of car accidents with wind speed larger than 12.5 miles per hour is less than 25.
ggplot(accidents2, aes(x = wind_speed_mph, y = severity)) +
geom_violin(trim = FALSE, fill = "light blue") +
theme_bw() +
coord_flip() +
labs(
title = "Distribution of Severity vs Wind Speed",
y = "Wind Speed(mph)",
x = "Severity Level"
)

This violin plot shows the range of wind speed in each severity level. The most common severity is severity 2. For severity 2, the range of wind speed is from 0 to 20 miles per hour. For severity 3, the range is from 8 to 12.5 miles per hour. The range of wind speed of severity 4 is from 9 to 11.25 miles per hour. For severity 1, there are only 4 observations, so there is only a line for severity 1.
In this section, we will focus on the casualty of car accidents. More specifically, we will visualize the distribution of killed or injured people among different boroughs.
injured_borough_box =
accidents1 %>%
filter(!is.na(borough)) %>%
plot_ly(x = ~number_of_persons_injured, color = ~borough, type = "box", colors = "viridis") %>%
layout(title = "<b>Number of Injuries by Borough</b>",
xaxis = list(title = "Number of People Injured"))
injured_borough_box
killed_borough_box =
accidents1 %>%
filter(!is.na(borough)) %>%
plot_ly(x = ~number_of_persons_killed, color = ~borough, type = "box", colors = "viridis") %>%
layout(title = "<b>Number of People Killed by Borough</b>",
xaxis = list(title = "Number of People Killed"))
killed_borough_box
In each crash accident, Manhattan has the least serious injuries, and the other boroughs have similar injury amounts. However, Brooklyn has the largest number of injuries which is 15. The number of deaths in 5 boroughs are similar and not large. The median number of fatalities per crash is 0. However, Brooklyn also has the largest number of fatalities which is 3. Based on the information above, it can be concluded that Brooklyn has the most serious car accident fatality.
injured_killed_bar =
accidents1 %>%
dplyr::select(borough, number_of_persons_injured, number_of_persons_killed) %>%
group_by(borough) %>%
mutate(injuries_total = sum(number_of_persons_injured),
killed_total = sum(number_of_persons_killed)) %>%
dplyr::select(borough, injuries_total, killed_total) %>%
unique() %>%
filter(!is.na(borough)) %>%
pivot_longer(
injuries_total:killed_total,
names_to = "types_of_casualties",
values_to = "number_of_casualties")
injured_killed_bar_plot =
injured_killed_bar %>%
mutate(borough = as.factor(borough),
borough = fct_relevel(borough, "NA", "Brooklyn", "Queens", "Bronx", "Manhattan", "Staten island")) %>%
ggplot(aes(x = borough, y = number_of_casualties, fill = types_of_casualties)) +
geom_bar(stat = "identity",
position = "dodge",
width = 0.8) +
geom_text(aes(label = number_of_casualties),
size = 4,
position = position_dodge(width = 0.8),
vjust = -0.3) +
labs(
title = "Frequency of Casualties in Boroughs",
x = "Borough",
y = "Number of Casualties in Each Borough",
fill = "Types") +
theme_bw(base_size = 12) +
theme(axis.text = element_text(colour = 'black'),
plot.title = element_text(hjust = 0.5)) +
ylim(0, 6500)
injured_killed_bar_plot

From January 2020 to August 2020, Brooklyn had the largest total number of injuries and fatalities. Conversely, Staten Island had the lowest total number of injuries and fatalities in car accidents.
The following two plots show the number of injured and killed people separately.
casualties_table =
accidents1 %>%
group_by(borough) %>%
dplyr::summarize(injuries_total = sum(number_of_persons_injured),
killed_total = sum(number_of_persons_killed)) %>%
filter(!is.na(borough))
injuries_borough_plot =
casualties_table %>%
mutate(borough = forcats::fct_reorder(borough, injuries_total, .desc = TRUE)) %>%
ggplot(aes(x = borough, y = injuries_total, fill = borough)) +
geom_bar(stat = "identity",
width = 0.8) +
geom_text(aes(label = injuries_total),
size = 4,
position = position_dodge(width = 0.8),
vjust = -0.3) +
labs(
title = "Frequency of People Injured in Boroughs",
x = "Borough",
y = "Number of People Injured in Each Borough") +
theme_bw(base_size = 12) +
theme(axis.text = element_text(colour = 'black'),
plot.title = element_text(hjust = 0.5)) +
ylim(0, 6500)
injuries_borough_plot

killed_borough_plot =
casualties_table %>%
mutate(borough = forcats::fct_reorder(borough, killed_total, .desc = TRUE)) %>%
ggplot(aes(x = borough, y = killed_total, fill = borough)) +
geom_bar(stat = "identity",
width = 0.8) +
geom_text(aes(label = killed_total),
size = 4,
position = position_dodge(width = 0.8),
vjust = -0.3) +
labs(
title = "Frequency of People Killed in Boroughs",
x = "Borough",
y = "Number of People Killed in Each Borough") +
theme_bw(base_size = 12) +
theme(axis.text = element_text(colour = 'black'),
plot.title = element_text(hjust = 0.5)) +
ylim(0, 28)
killed_borough_plot

To test whether the proportions of car accidents in each weekday among boroughs are equal, we perform the Chi-Squared test.
H0: The proportions of car accidents on weekdays among boroughs are equal.
H1: Not all proportions of car accidents on weekdays among boroughs are equal.
week_accidents =
accidents1 %>%
dplyr::select(crash_date, borough) %>%
mutate(weekdays = weekdays(accidents1$crash_date, abbreviate = T)) %>%
filter(!is.na(borough)) %>%
mutate(weekdays = as.factor(weekdays),
weekdays = fct_relevel(weekdays, "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
table(week_accidents$borough, week_accidents$weekdays)
##
## Mon Tue Wed Thu Fri Sat Sun
## Bronx 1347 1339 1296 1448 1511 1378 1098
## Brooklyn 2337 2450 2391 2557 2758 2363 2051
## Manhattan 993 1103 1084 1184 1286 934 769
## Queens 1995 1939 2014 2011 2222 2046 1790
## Staten Island 178 221 198 209 250 205 185
chisq.test(table(week_accidents$borough, week_accidents$weekdays))
##
## Pearson's Chi-squared test
##
## data: table(week_accidents$borough, week_accidents$weekdays)
## X-squared = 73.531, df = 24, p-value = 6.303e-07
x_crit = qchisq(0.95, 24)
x_crit
## [1] 36.41503
Interpretation: At significant level \(\alpha\) = 0.05, \(p-value\) = 6.303e-07 < 0.05, so we reject the null hypothesis and conclude that there is at least one borough’s proportion of car accidents for weekdays different from others.
To test whether the proportions of car accidents in five car types among boroughs are equal, we performed the Chi-square test.
H0: Proportions of accident amounts for
five car types are equal among boroughs.
H1: Not all proportions of accident amounts
for five car types are not equal among boroughs.
five_common_cartype =
accidents1 %>%
dplyr::select(borough, vehicle_type_code_1) %>%
filter(vehicle_type_code_1 %in%
c("Sedan",
"Station Wagon/Sport Utility Vehicle",
"Taxi",
"Pick-up Truck",
"Box Truck")) %>%
count(vehicle_type_code_1, borough) %>%
pivot_wider(
names_from = "vehicle_type_code_1",
values_from = "n"
) %>%
data.matrix() %>%
subset(select = -c(borough))
rownames(five_common_cartype) <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island", "Others")
five_common_cartype %>%
knitr::kable(caption = "Table of Top Five Car Type", caption.pos = "top")
| Box Truck | Pick-up Truck | Sedan | Station Wagon/Sport Utility Vehicle | Taxi | |
|---|---|---|---|---|---|
| Bronx | 158 | 187 | 4451 | 3288 | 411 |
| Brooklyn | 309 | 417 | 7865 | 6329 | 425 |
| Manhattan | 275 | 213 | 2828 | 2279 | 749 |
| Queens | 188 | 359 | 6460 | 5721 | 250 |
| Staten Island | 7 | 49 | 847 | 450 | 4 |
| Others | 480 | 657 | 11898 | 9474 | 929 |
chisq.test(five_common_cartype)
##
## Pearson's Chi-squared test
##
## data: five_common_cartype
## X-squared = 1614.5, df = 20, p-value < 2.2e-16
Interpretation: At significant level \(\alpha\) = 0.05, the result of chi-square shows that \(\chi^2\) > \(\chi_{crit}\), so we reject the null hypothesis and conclude that there is at least one car type’s proportion of accident amounts different from others.
We want to see whether the car accident rates are the same among boroughs, so we conduct a proportion test. We obtained the population of each borough from the most recent census.
H0: The proportions of the car accidents are equal among boroughs.
H1: The proportions of the car accidents are not equal among boroughs.
url = "https://www.citypopulation.de/en/usa/newyorkcity/"
nyc_population_html = read_html(url)
population = nyc_population_html %>%
html_elements(".rname .prio2") %>%
html_text()
boro = nyc_population_html %>%
html_elements(".rname a span") %>%
html_text()
nyc_population = tibble(
borough = boro,
population = population %>% str_remove_all(",") %>% as.numeric()
)
car_accident = accidents1 %>%
filter(!is.na(borough)) %>%
count(borough) %>%
mutate(borough = str_to_title(borough))
acci_popu_boro = left_join(car_accident, nyc_population)
acci_popu_boro %>%
knitr::kable(caption = "Results Table", caption.pos = "top")
| borough | n | population |
|---|---|---|
| Bronx | 9417 | 1472654 |
| Brooklyn | 16907 | 2736074 |
| Manhattan | 7353 | 1694251 |
| Queens | 14017 | 2405464 |
| Staten Island | 1446 | 495747 |
prop.test(acci_popu_boro$n, acci_popu_boro$population)
##
## 5-sample test for equality of proportions without continuity correction
##
## data: acci_popu_boro$n out of acci_popu_boro$population
## X-squared = 1482.5, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.006394577 0.006179292 0.004339971 0.005827150 0.002916810
Interpretation: From the test result, we can see that the \(p-value\) is smaller than 0.01, so we have enough evidence to conclude that the proportions of car accidents are different across boroughs.
In order to study how month is associated with the number of car accidents, we try to use an ANOVA test across months.
H0: The average numbers of accidents are equal across months.
H1: The average numbers of accidents are not equal across months.
fit_accidents =
accidents1 %>%
mutate(month = as.factor(month)) %>%
group_by(month, weekday, day) %>%
dplyr::summarize(num_accidents = n())
fit_accidents_month = lm(num_accidents ~ month, data = fit_accidents)
anova(fit_accidents_month) %>%
knitr::kable(caption = "One way anova of number of accidents and month", caption.pos = "top")
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| month | 7 | 2916412 | 416630.245 | 75.89771 | 0 |
| Residuals | 234 | 1284511 | 5489.365 | NA | NA |
Interpretation: As indicated by the result of the ANOVA test, the \(p-value\) is very small. Therefore, the null hypothesis is rejected and we can conclude that the average numbers of accidents are different across months in New York City in 2020.
From previous EDA steps, we recognize that the number of car accidents is associated to many factors, such as the time in the day, borough, and workday or weekend. Also, the severity level relates to weather conditions. In this section, we will examine all these four associations.
First, we want to examine whether the time in the day when the accident happend(day/night) relates to the number of accidents.
We divide a day into two parts, day and night, to see whether there is a linear relationship between car accidents number and day or night time of a day. We set 7:00 am to 7:00 pm as day, and 7:00 pm to 7:00 am as night.
reg_day_night_data = accidents1 %>%
mutate(hour = as.numeric(hour),
day_night = ifelse(hour %in% c(7,8,9,10,11,12,13,14,15,16,17,18), "day", "night")) %>%
group_by(crash_date, day_night) %>%
dplyr::summarize(n_obs = n())
reg_day_night = lm(n_obs ~ day_night, reg_day_night_data)
summary(reg_day_night)
##
## Call:
## lm(formula = n_obs ~ day_night, data = reg_day_night_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140.93 -52.74 -9.99 30.65 377.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.934 4.884 39.91 <2e-16 ***
## day_nightnight -80.442 6.907 -11.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.98 on 482 degrees of freedom
## Multiple R-squared: 0.2196, Adjusted R-squared: 0.218
## F-statistic: 135.6 on 1 and 482 DF, p-value: < 2.2e-16
From the regression result, we can see that the \(p-value\) is less than 0.01, so we have
enough evidence to say that there is a linear association between the
number of accidents and the variable day_night. The
regression line is:
\[E(accidents) = 137.669 - 72.281I(day\_night
= night)\] Therefore, there are 72 more accidents expected at
night than in the daytime. More cars on streets in the daytime may
contribute to this result, as car accidents are more likely to happen in
heavy traffic.
Second, we will figure out whether there is a linear relationship between borough and the number of accidents.
reg_borough_data = accidents1 %>%
filter(borough %in% c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")) %>%
group_by(crash_date) %>%
count(borough)
reg_borough = lm(n ~ borough, data = reg_borough_data)
summary(reg_borough)
##
## Call:
## lm(formula = n ~ borough, data = reg_borough_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.864 -12.913 -1.384 8.124 124.079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.913 1.352 28.791 < 2e-16 ***
## boroughBrooklyn 30.950 1.911 16.192 < 2e-16 ***
## boroughManhattan -8.529 1.911 -4.462 8.87e-06 ***
## boroughQueens 19.008 1.911 9.945 < 2e-16 ***
## boroughStaten Island -32.938 1.911 -17.232 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.03 on 1205 degrees of freedom
## Multiple R-squared: 0.528, Adjusted R-squared: 0.5264
## F-statistic: 337 on 4 and 1205 DF, p-value: < 2.2e-16
As the \(p-value\) of coefficients
is less than the predetermined significance level of 0.05, we have
evidence to reject the null hypothesis that the number of accidents are
the same in different boroughs. The regression model is: \[E(accidents) = 38.913 + 30.950I(borough =
Brooklyn) - 8.529I(borough = Manhattan) + 19.008I(borough = Queens) -
32.938I(borough = Staten Island)\]
From the summary of the regression result, the reference borough is
Bronx. The average number of accidents in Bronx is 38.91. In Brooklyn,
the average number of accidents is 30.95 higher than Bronx. The average
number of accidents in Manhattan is 8.53 lower than Bronx. For Queens,
the average number of accidents is 19 less than Bronx. The average
number of accidents in Staten Island is 32.94 lower than Bronx. Brooklyn
has the highest average number of accidents and Staten Island has the
least average number of accidents.
Next, we want to examine whether workdays/weekends relates to the number of accidents.
As weekdays is a categorical variable so we take
workdays as the reference category. To test whether there is a linear
association between the number of car accidents and workdays, we conduct
the hypothesis test and set \(\alpha\)
= 0.1.
reg_workdays_weekends_data =
accidents1 %>%
filter(!is.na(borough)) %>%
mutate(weekdays_type = if_else(weekday %in% c("Saturday", "Sunday"), "Weekends", "Workdays")) %>%
dplyr::select(-borough) %>%
group_by(crash_date, weekdays_type) %>%
count(crash_date, weekdays_type) %>%
mutate(weekdays_type = as.factor(weekdays_type))
reg_workdays_weekends = lm(n ~ weekdays_type, data = reg_workdays_weekends_data)
summary(reg_workdays_weekends)
##
## Call:
## lm(formula = n ~ weekdays_type, data = reg_workdays_weekends_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.95 -68.95 -11.45 69.22 327.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 209.948 6.639 31.623 <2e-16 ***
## weekdays_typeWeekends -24.165 12.433 -1.944 0.0531 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.32 on 240 degrees of freedom
## Multiple R-squared: 0.0155, Adjusted R-squared: 0.01139
## F-statistic: 3.778 on 1 and 240 DF, p-value: 0.05311
According to the results, \(p-value\) = 0.05311 < 0.1, so we can
reject the null and conclude that we are 90% confident that there is a
significant linear association between the number of car accidents and
weekdays. The regression line is:
\[E(accidents) = 209.948 - 24.165I(weekdays =
weekends)\] On weekends, the average number of car accidents will
decrease by approximately 24.
Finally, we want to explore the association between the number of accidents and weather conditions: visibility, temperature, and pressure.
This is the correlation plot between the number of accidents and weather conditions.
chart.Correlation(as.matrix(accidents2[, c(2, 25:29, 31:32)]), histogram=TRUE, pch=19)

From the plot, we can see that there is a potential linear relationship between severity and weather conditions. Then, we investigate the interaction of temperature and pressure.
reg_weather = lm(severity ~ visibility_mi + temperature_f * pressure_in, data = accidents2)
summary(reg_weather)
##
## Call:
## lm(formula = severity ~ visibility_mi + temperature_f * pressure_in,
## data = accidents2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6245 -0.3832 -0.2667 0.3783 1.8275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.170686 8.219341 -2.211 0.027210 *
## visibility_mi -0.015607 0.007020 -2.223 0.026363 *
## temperature_f 0.554606 0.156171 3.551 0.000396 ***
## pressure_in 0.674596 0.274756 2.455 0.014195 *
## temperature_f:pressure_in -0.018231 0.005216 -3.495 0.000488 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6405 on 1454 degrees of freedom
## (30 observations deleted due to missingness)
## Multiple R-squared: 0.05344, Adjusted R-squared: 0.05084
## F-statistic: 20.52 on 4 and 1454 DF, p-value: < 2.2e-16
Based on the summary output of the model, it is indicated that the main effect of temperature and its interaction with pressure have the most significance. The main effects of visibility and pressure are also significant with \(p-values\) less than the predetermined level of 0.05. The regression line is: \[E(severity) = -18.170 - 0.016visibility + 0.555temperature + 0.675pressure - 0.018temperature*pressure\]
In the previous statistical analysis, we conducted a proportion test to see whether the car accidents rates are the same in each borough. The results show that the proportions of car accidents are different across boroughs. Now we are still interested in the relationship between the location and the number of accidents, so we created a map to visualize to further explore this relationship.
In order to clearly see the difference in each borough, we first cleaned the data by removing rows with missing values for the borough, latitude, and longitude variables, aggregated the latitude and longitude and kept two decimal places.
map_data =
accidents1 %>%
filter(!is.na(borough)) %>%
filter(!is.na(latitude)) %>%
filter(!is.na(longitude)) %>%
mutate(
latitude = round(latitude, digits = 2),
longitude = round(longitude, digits = 2)) %>%
group_by(latitude, longitude) %>%
count(latitude, longitude, borough) %>%
arrange(desc(n)) %>%
mutate(text_label = str_c("</b>Lat: ", latitude, "°", "</b><br>Lng: ", longitude, "°", "</b><br>Borough: ", borough, "</b><br>Number_of_accidents: ", n))
pal = colorNumeric(
palette = "YlGnBu",
domain = map_data$n)
map_data %>%
leaflet() %>%
setView(lng = -74.00666, lat = 40.71643, zoom = 10) %>%
addTiles() %>%
leaflet::addLegend("bottomright",
pal = pal,
values = ~n,
title = "Number of Car Accidents</br>",
bins = 10,
opacity = 1,
labFormat = labelFormat(suffix = " cases")
) %>%
addCircleMarkers(
lng = ~longitude,
lat = ~latitude,
color = ~pal(n),
radius = 4,
popup = ~ text_label,
fillOpacity = 0.8,
stroke = FALSE,
opacity = 1)
According to the plot, the color scale indicated the number of car accidents at each location, with warmer colors (yellow) indicating a lower number of accidents and cooler colors (blue and green) indicating a higher number of accidents. From the mapping plot, we can see that the accidents occurred more in Bronx and Manhattan from January 2020 to August 2020, followed by Brooklyn. Fewer accidents occurred in Queens and Staten Island.
All of the information can be used to identify areas of the city with higher numbers of car accidents and develop strategies for reducing the number of accidents in those areas. For example, one potential strategy might be deploying more traffic police to areas with high numbers of accidents in order to maintain traffic order and reduce congestion.
We generated an interactive map to show the locations of car accidents in New York City in 2020 and the number of injuries resulting from each accident.
accidents_2020_map = accidents1 %>%
filter(!is.na(borough)) %>%
filter(!is.na(latitude)) %>%
filter(!is.na(longitude)) %>%
mutate(
total_injured = number_of_persons_injured + number_of_pedestrians_injured + number_of_cyclist_injured + number_of_motorist_injured,
textlab = paste0("Number of Injuries: ", total_injured, "\nAddress: ", on_street_name)
)
total_injuried_map = accidents_2020_map %>%
plot_ly(
lat = ~latitude,
lon = ~longitude,
type = "scattermapbox",
mode = "markers",
alpha = 0.5,
color = ~total_injured,
text = ~textlab,
colors = "YlOrRd")
total_injuried_map %>% layout(
mapbox = list(
style = 'carto-positron',
zoom = 9,
center = list(lon = -73.9, lat = 40.7)),
title = "<b> Total Number of Injuries </b>",
legend = list(title = list(text = "Borough of Accidents", size = 9),
orientation = "h",
font = list(size = 9)))
From the plot, we can tell that the maximum injury in one accident is 30 people. However, most accidents have less than 5 injuries. A few accidents have more than 5 injuries.
This information can be used to identify areas of the city with higher numbers of injuries and develop strategies for reducing the number of injuries in those areas. For example, one potential strategy might be increasing public awareness of road safety and encouraging drivers to be more careful in areas with high numbers of injuries.
In recent years, the COVID-19 epidemic has been raging in the United States, bringing great impact to all areas of the country, and transportation is no exception. The pandemic limits people’s travel and makes the overall atmosphere of the society more chaotic where people are under greater pressure. All of the outcomes may contribute to car accidents.
At the present stage, our regression on car accidents focused on the borough, car type, time, and weather condition, and we didn’t take into account the impact of COVID-19. Therefore, we want to combine data about the pandemic such as the number of positive patients or the number of deaths to conduct our analysis to see whether there is a relationship between COVID-19 epidemic and car accidents.
The current bar chart of the number of accidents by weather condition shows the most common weather condition is fair. This finding actually is a little away from common sense. There are two possible reasons. The first one is that there are much more days of fair than other weather in 2020 in New York, so it is reasonable that the frequency in fair weather is higher. The other reason is that during fair weather, people tend to behave with less caution. Thus the probability of accidents increases. It is possible to explore this question by doing some research on weather conditions in New York in 2020.
Several limitations exist for our project. Our first dataset on car accidents in New York City only contained information from January to August of 2020, meaning the sample size is relatively limited and the dataset could be a little out-of-date. The study may not be representative of car accidents taken place in other time periods or locations, indicating a possibly low generalizability.
This dataset also does not contain a fair amount of continuous variables, making it hard to conduct related regression analysis. Most of our regression analysis involved using factors and categorical variables. This also leads to why we introduced our second dataset containing information of all car accidents in the United States from 2016 to 2021. This is a relatively complex dataset, which was also seen as one of our limitations for the data manipulation step. To better correspond to the first dataset, we only kept the observations in New York City in 2020 for further analysis. This may also result in another constraint, as the selected sample has a limited size.
Our project aims to help people better understand what commonly causes car accidents by utilizing a scientific approach such that more accidents may be avoided in the future. Based on the above findings, potential strategies for reducing the number of car accidents in New York City include:
Advising drivers to be on the alert and maintain focus when on the road, such as setting cellphones to the “Do Not Disturb” mode during the drive, which could provide them with necessary time to react and ideally so as to prevent an accident from taking place.
Encouraging people to use alternative modes of transportation, such as public transit, cycling, or walking, whenever possible. This can help to reduce the number of cars on the roads and decrease the risk of accidents.
Targeting preventive efforts such as deploying more traffic police to high-risk areas such as Brooklyn to maintain traffic order and safety in areas. This can help to enforce traffic laws, monitor dangerous driving behaviors, and intervene in potential accidents to prevent them from occurring.
Implementing measures to improve visibility on the roads, such as better lighting or clearer road signs, to reduce the number of accidents during times of low visibility.
Monitoring and responding to changes in weather conditions, such as pressure and temperature, to anticipate and mitigate the potential impact on car accidents.