NYC Traffic Collisions Analysis

This is a breakdown of every collision in New York City by location and injury. The data is collected because the NYC Council passed Local Law #11 in 2011.

The motor vehicle collision database includes the date and time, location (borough, street names, ZIP code, and latitude/longitude coordinates), injuries and fatalities, the number and types of vehicles involved, and contributing factors for the roughly 1.5 million collisions recorded in New York City since 2012.

Data source: NYPD-Motor-Vehicle-Collisions-Crashes
Jupyter notebook: NYPD-Motor-Vehicle-Collisions
CSV source file: NYPD_Motor_Vehicle_Collisions.csv

In [1]:
import pandas as pd
In [3]:
# Load NYPD Motor Vehicle Collisions file
df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv")
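Loading all 29 columns costs ~340 MB of memory. If that is a concern, `read_csv` can restrict the load to the columns the analysis actually uses; a toy sketch with an in-memory CSV carrying a few of the dataset's column names:

```python
import io
import pandas as pd

# A tiny stand-in for the real CSV, with a few of the dataset's columns
csv = io.StringIO(
    "DATE,TIME,BOROUGH,NUMBER OF PERSONS INJURED\n"
    "07/01/2012,0:05,BRONX,1\n"
    "07/01/2012,9:30,,0\n"
)

# usecols limits parsing (and memory) to just the listed columns
slim = pd.read_csv(csv, usecols=["DATE", "TIME", "BOROUGH"])
print(list(slim.columns))  # ['DATE', 'TIME', 'BOROUGH']
```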
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1542850 entries, 0 to 1542849
Data columns (total 29 columns):
DATE                             1542850 non-null object
TIME                             1542850 non-null object
BOROUGH                          1078103 non-null object
ZIP CODE                         1077920 non-null object
LATITUDE                         1351719 non-null float64
LONGITUDE                        1351719 non-null float64
LOCATION                         1351719 non-null object
ON STREET NAME                   1243443 non-null object
CROSS STREET NAME                1037526 non-null object
OFF STREET NAME                  207250 non-null object
NUMBER OF PERSONS INJURED        1542833 non-null float64
NUMBER OF PERSONS KILLED         1542819 non-null float64
NUMBER OF PEDESTRIANS INJURED    1542850 non-null int64
NUMBER OF PEDESTRIANS KILLED     1542850 non-null int64
NUMBER OF CYCLIST INJURED        1542850 non-null int64
NUMBER OF CYCLIST KILLED         1542850 non-null int64
NUMBER OF MOTORIST INJURED       1542850 non-null int64
NUMBER OF MOTORIST KILLED        1542850 non-null int64
CONTRIBUTING FACTOR VEHICLE 1    1538826 non-null object
CONTRIBUTING FACTOR VEHICLE 2    1336790 non-null object
CONTRIBUTING FACTOR VEHICLE 3    99648 non-null object
CONTRIBUTING FACTOR VEHICLE 4    20695 non-null object
CONTRIBUTING FACTOR VEHICLE 5    5274 non-null object
UNIQUE KEY                       1542850 non-null int64
VEHICLE TYPE CODE 1              1537861 non-null object
VEHICLE TYPE CODE 2              1291281 non-null object
VEHICLE TYPE CODE 3              128943 non-null object
VEHICLE TYPE CODE 4              48520 non-null object
VEHICLE TYPE CODE 5              10386 non-null object
dtypes: float64(4), int64(7), object(18)
memory usage: 341.4+ MB
In [5]:
# Let's quickly review what we've got here ^
# ~1.5 million records
# ~1 million with Borough and ZIP code info >> in other words, we're missing ~500k location records
# Where LOCATION is present, we can reverse-geocode with the GeoPy library to recover the address (borough + ZIP code)

Cleaning

In [6]:
# Clean/rename the column header titles

# NUMBER OF PERSONS INJURED is the total of injured (pedestrians + cyclists + motorists)
# A value of 0 means no injuries and no deaths in an incident, but it's still a record

df.rename(columns = {'ZIP CODE'          : 'ZIP_CODE',
                       'ON STREET NAME'    : 'STREET_ON',
                       'CROSS STREET NAME' : 'STREET_CROSS',
                       'OFF STREET NAME'   : 'STREET_OFF',
                       'NUMBER OF PERSONS INJURED'     : 'TOTAL_INJURED',
                       'NUMBER OF PERSONS KILLED'      : 'TOTAL_KILLED',
                       'NUMBER OF PEDESTRIANS INJURED' : 'PED_INJURED',
                       'NUMBER OF PEDESTRIANS KILLED'  : 'PED_KILLED',
                       'NUMBER OF CYCLIST INJURED'     : 'CYC_INJURED',
                       'NUMBER OF CYCLIST KILLED'      : 'CYC_KILLED',
                       'NUMBER OF MOTORIST INJURED'    : 'MOTO_INJURED',
                       'NUMBER OF MOTORIST KILLED'     : 'MOTO_KILLED',
                       'CONTRIBUTING FACTOR VEHICLE 1' : 'VEH_FACTOR_1',
                       'CONTRIBUTING FACTOR VEHICLE 2' : 'VEH_FACTOR_2',
                       'CONTRIBUTING FACTOR VEHICLE 3' : 'VEH_FACTOR_3',
                       'CONTRIBUTING FACTOR VEHICLE 4' : 'VEH_FACTOR_4',
                       'CONTRIBUTING FACTOR VEHICLE 5' : 'VEH_FACTOR_5',
                       'UNIQUE KEY' : 'UNIQUE_KEY',
                       'VEHICLE TYPE CODE 1' : 'VEH_TYPE_1',
                       'VEHICLE TYPE CODE 2' : 'VEH_TYPE_2',
                       'VEHICLE TYPE CODE 3' : 'VEH_TYPE_3',
                       'VEHICLE TYPE CODE 4' : 'VEH_TYPE_4',
                       'VEHICLE TYPE CODE 5' : 'VEH_TYPE_5'},
           inplace = True) 
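Most of the rename map just swaps spaces for underscores; that generic part can be generated in one pass (a toy sketch on a throwaway frame, not applied to the notebook's `df`, which still needs the explicit abbreviations above):

```python
import pandas as pd

# Throwaway frame with a few of the dataset's original column names
df = pd.DataFrame(columns=['ZIP CODE', 'UNIQUE KEY', 'ON STREET NAME'])

# Replace spaces with underscores across every column name at once
df.columns = df.columns.str.replace(' ', '_')
print(list(df.columns))  # ['ZIP_CODE', 'UNIQUE_KEY', 'ON_STREET_NAME']
```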
In [7]:
# Missing values in columns
df.isna().sum()
Out[7]:
DATE                   0
TIME                   0
BOROUGH           464747
ZIP_CODE          464930
LATITUDE          191131
LONGITUDE         191131
LOCATION          191131
STREET_ON         299407
STREET_CROSS      505324
STREET_OFF       1335600
TOTAL_INJURED         17
TOTAL_KILLED          31
PED_INJURED            0
PED_KILLED             0
CYC_INJURED            0
CYC_KILLED             0
MOTO_INJURED           0
MOTO_KILLED            0
VEH_FACTOR_1        4024
VEH_FACTOR_2      206060
VEH_FACTOR_3     1443202
VEH_FACTOR_4     1522155
VEH_FACTOR_5     1537576
UNIQUE_KEY             0
VEH_TYPE_1          4989
VEH_TYPE_2        251569
VEH_TYPE_3       1413907
VEH_TYPE_4       1494330
VEH_TYPE_5       1532464
dtype: int64
In [8]:
# Borough and ZIP code are missing ~500k records >> ~30%, which is significant, so we can't disregard it
# I'll assign missing Borough records to NYC: five boroughs, plus NYC to collect what's unassigned

# Remove the NaN values from TOTAL_INJURED and TOTAL_KILLED
# We only want records where TOTAL_INJURED or TOTAL_KILLED is > 0; otherwise it's just a record with no casualties
In [9]:
# Fill all blank values in column Borough
# If a value is NaN it will be NYC
df.loc[df['BOROUGH'].isnull(), 'BOROUGH'] = 'NYC'
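The `.loc` assignment above behaves the same as `fillna`; a toy check on a small Series:

```python
import pandas as pd

s = pd.Series(['BRONX', None, 'QUEENS'])

# The notebook's approach: assign through a boolean mask
via_loc = s.copy()
via_loc[via_loc.isnull()] = 'NYC'

# The equivalent one-liner
via_fillna = s.fillna('NYC')

print(via_fillna.tolist())  # ['BRONX', 'NYC', 'QUEENS']
```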
In [10]:
# Let's check it... BOROUGH should have 0 NaN values
df.isna().sum()
Out[10]:
DATE                   0
TIME                   0
BOROUGH                0
ZIP_CODE          464930
LATITUDE          191131
LONGITUDE         191131
LOCATION          191131
STREET_ON         299407
STREET_CROSS      505324
STREET_OFF       1335600
TOTAL_INJURED         17
TOTAL_KILLED          31
PED_INJURED            0
PED_KILLED             0
CYC_INJURED            0
CYC_KILLED             0
MOTO_INJURED           0
MOTO_KILLED            0
VEH_FACTOR_1        4024
VEH_FACTOR_2      206060
VEH_FACTOR_3     1443202
VEH_FACTOR_4     1522155
VEH_FACTOR_5     1537576
UNIQUE_KEY             0
VEH_TYPE_1          4989
VEH_TYPE_2        251569
VEH_TYPE_3       1413907
VEH_TYPE_4       1494330
VEH_TYPE_5       1532464
dtype: int64
In [11]:
# Remove NaN from TOTAL INJURED
df = df.dropna(axis=0, subset=['TOTAL_INJURED'])
In [12]:
# Remove NaN from TOTAL KILLED
df = df.dropna(axis=0, subset=['TOTAL_KILLED'])
In [13]:
# Keep only > 0 values as df1
df1 = df[(df['TOTAL_INJURED'] > 0)]
In [14]:
# Keep only > 0 values as df2
df2 = df[(df['TOTAL_KILLED'] > 0)]
In [15]:
# Concatenate df1 and df2 and put the result back as df; zero-casualty records are now out
# Note: records with both injuries and fatalities appear in both df1 and df2, so they are duplicated here
df = pd.concat([df1, df2])
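The three cells above can also be written as one boolean mask. A toy sketch showing the single-filter form, and why the concat variant double-counts a crash that has both injuries and fatalities:

```python
import pandas as pd

toy = pd.DataFrame({
    'UNIQUE_KEY':    [1, 2, 3, 4],
    'TOTAL_INJURED': [0, 2, 0, 1],
    'TOTAL_KILLED':  [0, 0, 1, 1],
})

# Single boolean filter: keep any row with an injury or a fatality, once
kept = toy[(toy['TOTAL_INJURED'] > 0) | (toy['TOTAL_KILLED'] > 0)]
print(kept['UNIQUE_KEY'].tolist())  # [2, 3, 4]

# Concat of two filters: key 4 matches both and therefore appears twice
both = pd.concat([toy[toy['TOTAL_INJURED'] > 0], toy[toy['TOTAL_KILLED'] > 0]])
print(len(both))  # 4
```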
In [16]:
# Combine the DATE and TIME columns into one string so it can be parsed as a DateTime for further analysis
df['DATE'] = df['DATE'] + ' ' + df['TIME']
In [17]:
# Convert string to DateTime
df['DATE'] = pd.to_datetime(df.DATE)
In [18]:
# Year filter
df['DATE_YEAR'] = df['DATE'].dt.year
In [28]:
# Quarter filter
df['DATE_QUARTER'] = df['DATE'].dt.quarter
In [19]:
# Month filter
df['DATE_MONTH'] = df['DATE'].dt.month
In [20]:
# Day of the week filter
df['WEEKDAY'] = df['DATE'].dt.weekday
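Since DATE is already datetime64 after the conversion above, the `.dt` accessor yields each part directly, with no need to re-run `pd.to_datetime` per feature. A toy sketch (note that pandas counts weekdays from Monday = 0):

```python
import pandas as pd

d = pd.to_datetime(pd.Series(['07/04/2013 14:30', '12/31/2018 23:59']))

print(d.dt.year.tolist())     # [2013, 2018]
print(d.dt.quarter.tolist())  # [3, 4]
print(d.dt.month.tolist())    # [7, 12]
print(d.dt.weekday.tolist())  # [3, 0] -> Thursday, Monday (0 = Monday)
```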
In [21]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 296563 entries, 7 to 1541976
Data columns (total 32 columns):
DATE             296563 non-null datetime64[ns]
TIME             296563 non-null object
BOROUGH          296563 non-null object
ZIP_CODE         204446 non-null object
LATITUDE         261996 non-null float64
LONGITUDE        261996 non-null float64
LOCATION         261996 non-null object
STREET_ON        252167 non-null object
STREET_CROSS     207731 non-null object
STREET_OFF       26534 non-null object
TOTAL_INJURED    296563 non-null float64
TOTAL_KILLED     296563 non-null float64
PED_INJURED      296563 non-null int64
PED_KILLED       296563 non-null int64
CYC_INJURED      296563 non-null int64
CYC_KILLED       296563 non-null int64
MOTO_INJURED     296563 non-null int64
MOTO_KILLED      296563 non-null int64
VEH_FACTOR_1     294636 non-null object
VEH_FACTOR_2     205290 non-null object
VEH_FACTOR_3     34657 non-null object
VEH_FACTOR_4     8965 non-null object
VEH_FACTOR_5     2559 non-null object
UNIQUE_KEY       296563 non-null int64
VEH_TYPE_1       293383 non-null object
VEH_TYPE_2       203153 non-null object
VEH_TYPE_3       39470 non-null object
VEH_TYPE_4       14346 non-null object
VEH_TYPE_5       4178 non-null object
DATE_YEAR        296563 non-null int64
DATE_MONTH       296563 non-null int64
WEEKDAY          296563 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(10), object(17)
memory usage: 74.7+ MB
In [22]:
# We now have 296,563 relevant records instead of 1.5 million, and the DataFrame uses ~75 MB of memory, down from ~341 MB
# A file this size is even readable with Excel

Data Analysis & Visualisation

In [23]:
import matplotlib.pyplot as plt
%matplotlib inline
In [35]:
# Year 2012 starts in July, so it's incomplete and we can't use it in our analysis.
# Let's filter out 2012 and keep 2019 just as a reference for the trend (today is mid-August 2019)
df = df[df['DATE'] >= '2013-01-01']

Injured per year

In [41]:
plt.figure(figsize=(20, 25)).subplots_adjust(hspace = 0.4)

# Total number of PERSONS injured
plt.subplot(4, 2 ,1)
df.groupby('DATE_YEAR').TOTAL_INJURED.sum().plot.bar()
plt.title('Total number of PERSONS INJURED', fontsize=16)
plt.xlabel('Year', fontsize=13)

# Total number of MOTORISTS injured
plt.subplot(4, 2, 2)
df.groupby('DATE_YEAR').MOTO_INJURED.sum().plot.bar()
plt.title('Total number of MOTORISTS INJURED', fontsize=16)
plt.xlabel('Year', fontsize=13)

# Total number of CYCLISTS injury
plt.subplot(4, 2 ,3)
df.groupby('DATE_YEAR').CYC_INJURED.sum().plot.bar()
plt.title('Total number of CYCLISTS INJURED', fontsize=16)
plt.xlabel('Year', fontsize=13)

# Total number of PEDESTRIANS injured
plt.subplot(4, 2, 4)
df.groupby('DATE_YEAR').PED_INJURED.sum().plot.bar()
plt.title('Total number of PEDESTRIANS INJURED', fontsize=16)
plt.xlabel('Year', fontsize=13)

plt.show()

Killed per year

In [43]:
plt.figure(figsize=(20, 25)).subplots_adjust(hspace = 0.4)

# Total number of PERSONS killed
plt.subplot(4, 2 ,1)
df.groupby('DATE_YEAR').TOTAL_KILLED.sum().plot.bar()
plt.title('Total number of PERSONS KILLED', fontsize=16)
plt.xlabel('Year', fontsize=13)

# Total number of MOTORISTS killed
plt.subplot(4, 2, 2)
df.groupby('DATE_YEAR').MOTO_KILLED.sum().plot.bar()
plt.title('Total number of MOTORISTS KILLED', fontsize=16)
plt.xlabel('Year', fontsize=13)

# Total number of CYCLISTS killed
plt.subplot(4, 2 ,3)
df.groupby('DATE_YEAR').CYC_KILLED.sum().plot.bar()
plt.title('Total number of CYCLISTS KILLED', fontsize=16)
plt.xlabel('Year', fontsize=13)

# Total number of PEDESTRIANS killed
plt.subplot(4, 2, 4)
df.groupby('DATE_YEAR').PED_KILLED.sum().plot.bar()
plt.title('Total number of PEDESTRIANS KILLED', fontsize=16)
plt.xlabel('Year', fontsize=13)

plt.show()

Number of people injured and killed per borough

In [74]:
fig, ax = plt.subplots(1, figsize=(25, 15))

plt.subplot(2, 2 ,1)
df.groupby('BOROUGH').TOTAL_INJURED.sum().sort_values(ascending=False).plot.bar()
plt.title('Number of people injured per borough', fontsize=18)
plt.xlabel('Borough,   *NYC = unknown location incidents', fontsize=14)

plt.subplot(2, 2 ,2)
df.groupby('BOROUGH').TOTAL_KILLED.sum().sort_values(ascending=False).plot.bar()
plt.title('Number of people killed per borough', fontsize=18)
plt.xlabel('Borough,   *NYC = unknown location incidents', fontsize=14)

plt.show()
In [ ]:
 

Per quarter analysis

In [73]:
# Total number of injured and killed per quarter
fig, ax = plt.subplots(1, figsize=(25, 15))

plt.subplot(2, 2 ,1)
df.groupby('DATE_QUARTER').TOTAL_INJURED.sum().plot.bar()
plt.title('Total number of PERSONS INJURED', fontsize=18)
plt.xlabel('Quarter', fontsize=14)

plt.subplot(2, 2 ,2)
df.groupby('DATE_QUARTER').TOTAL_KILLED.sum().plot.bar()
plt.title('Total number of PERSONS KILLED', fontsize=18)
plt.xlabel('Quarter', fontsize=14)

plt.show()

Day of the week analysis

In [76]:
# Total number of injured and killed per day of the week
fig, ax = plt.subplots(1, figsize=(25, 15))
plt.subplot(2, 2 ,1)
df.groupby('WEEKDAY').TOTAL_INJURED.sum().plot.bar()
plt.title('Total number of PERSONS INJURED per day of the week', fontsize=18)
plt.xlabel('Weekday,    0 = Monday', fontsize=14)

plt.subplot(2, 2 ,2)
df.groupby('WEEKDAY').TOTAL_KILLED.sum().plot.bar()
plt.title('Total number of PERSONS KILLED per day of the week', fontsize=18)
plt.xlabel('Weekday,    0 = Monday', fontsize=14)

plt.show()

Filling Zip code & Borough data – Reverse Geocoding

  • We can reverse-geocode to obtain the address; all we need are the coordinates from the LOCATION column, e.g. (40.869335, -73.8255)
In [31]:
#--------------------------------------------------------------------------------
# Example code that works (note: reverse geocoding uses geolocator.reverse, not geocode):
# from geopy.geocoders import Nominatim
# from geopy.extra.rate_limiter import RateLimiter
# from tqdm import tqdm
# tqdm.pandas()
# geolocator = Nominatim(user_agent="specify_your_app_name_here")
# reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1.0, max_retries=2,
#                       error_wait_seconds=5.0, swallow_exceptions=True,
#                       return_value_on_exception=None)
# # Nominatim's usage policy allows about one request per second;
# # LOCATION strings may need parsing into (lat, lon) tuples first
# df['ADDRESS'] = df['LOCATION'].progress_apply(reverse)

# The downside: in practice it returns only ~1,000 addresses per day
# With GeoPy it is possible to fill all the NaN values in ZIP_CODE and BOROUGH
# Example:
# Input:  40.88939, -73.89839
# Output: Broadway, Fieldston, The Bronx, Bronx County, NYC, New York, 10463, USA
#--------------------------------------------------------------------------------
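`geolocator.reverse` expects a `(lat, lon)` pair, while LOCATION is stored as a string like `(40.869335, -73.8255)`. A minimal parser sketch (`parse_location` is a hypothetical helper, not part of the notebook):

```python
def parse_location(loc):
    """Turn a '(lat, lon)' string from the LOCATION column into a float tuple.

    parse_location is a hypothetical helper for illustration only.
    """
    lat, lon = loc.strip('()').split(',')
    return float(lat), float(lon)

print(parse_location('(40.88939, -73.89839)'))  # (40.88939, -73.89839)
```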
New York City traffic collision analysis by Milan Tomin