
In this tutorial, we will load and clean UCR (Uniform Crime Report) data for 2017 for the city of Houston, TX.

The UCR data contained in these reports are presented as a monthly breakdown of Part I crimes for which HPD wrote police reports. The data is further broken down by police districts and beats, and covers the following types of crimes:

  • Murder
  • Rape
  • Robbery
  • Aggravated assault
  • Burglary
  • Theft
  • Auto theft

Breakdown of the data

| Column | Type | Description | Example |
| --- | --- | --- | --- |
| beat | str | Territory that a police officer patrols | 10H10 |
| date | datetime | Date when the offense took place | 4/11/2017 0:00 |
| hour | int | Time frame when the offense took place | 15 |
| number of offenses | int | Number of offenses, if multiple took place | 1 |
| block range | str | Range value of the street | 2300-2399 |
| offense type | str | Type of offense from a list | Theft |
| premise | str | Location type where the offense took place | Restaurant or Cafeteria Parking Lot |
| street name | str | Name of the street | Canal |
| type | str | Type of street | ST |
| suffix | str | Direction the street faces | blank |

Tools & libraries

  • Jupyter notebook
  • pandas
  • numpy
  • seaborn
  • matplotlib
  • glob
  • os

Step 1: Download dataset

Go to https://www.houstontx.gov/police/cs/crime-stats-archives.htm, download the Excel files for 2017, and save them in your working directory.

Your directory structure should look something like this:

notebooks/
data/
    clean_data/
        crime/
    crime_data/
        2017/
            jan17.xls
            feb17.xls
            ...
  • notebooks/ is where our Jupyter notebook will live
  • data/clean_data/ is where we will save our clean data
  • data/crime_data/2017/ is where the Excel files for the given year will reside
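If the folders don't exist yet, a quick way to create the layout above from Python (a sketch, assuming you run it from the project root) is:

```python
import os

# Create the nested folders from the layout above;
# exist_ok=True means no error if they already exist
for path in [os.path.join('data', 'clean_data', 'crime'),
             os.path.join('data', 'crime_data', '2017'),
             'notebooks']:
    os.makedirs(path, exist_ok=True)
```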

Now that we have our data, let's start analyzing.

Step 2: Load data

Now we will create a notebook and analyze our data.

Load libraries

import pandas as pd
import glob, os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Combine data and create the data frame

Since we are in a nested directory, we will create a path to our data.

year = '2017'
data_directory = os.path.join('..', 'data', 'crime_data', year)
data_directory_saves = os.path.join( '..','data','clean_data','crime/')
  • data_directory is the path to our crime data folder for the given year
  • data_directory_saves is the path where our finalized clean data will be saved

Combine multiple files

all_files = glob.glob(os.path.join(data_directory, "*.xls")) 
df_from_each_file = (pd.read_excel(f) for f in all_files)
df = pd.concat(df_from_each_file, ignore_index=True)
  • We use glob to find all files in data_directory that match the Excel extension .xls
  • create a generator that reads each Excel file into a DataFrame
  • concatenate the DataFrames into one

Inspect data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119727 entries, 0 to 119726
Data columns (total 13 columns):
# offenses      40378 non-null float64
Beat            119727 non-null object
Block Range     79349 non-null object
BlockRange      40378 non-null object
Date            119727 non-null object
Hour            119727 non-null int64
Offense Type    119727 non-null object
Offenses        79349 non-null float64
Premise         119240 non-null object
Street Name     79349 non-null object
StreetName      40378 non-null object
Suffix          119727 non-null object
Type            119727 non-null object
dtypes: float64(2), int64(1), object(10)
memory usage: 11.9+ MB

We can see that we have many columns with similar names; let's check for missing values.

Step 3: Clean data

Check and display missing values

df.apply(lambda x: sum(x.isnull()))
# offenses      79349
Beat                0
Block Range     40378
BlockRange      79349
Date                0
Hour                0
Offense Type        0
Offenses        40378
Premise           487
Street Name     40378
StreetName      79349
Suffix              0
Type                0
dtype: int64
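The apply/lambda call above works, but pandas also has a built-in shorthand that produces the same Series. A small sketch on a toy frame (hypothetical data, not the real dataset):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing premise value
df = pd.DataFrame({'premise': ['Apartment', np.nan, 'Residence or House'],
                   'beat': ['6B20', '10H10', '20G40']})

# Equivalent to df.apply(lambda x: sum(x.isnull())), but vectorized
null_counts = df.isnull().sum()
print(null_counts['premise'])  # 1
```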

Display missing values

Create a heatmap of our dataset

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.title('Null dataset display')
plt.show()

Results

These column pairs are each missing data exactly where the other has it:

  • # offenses & Offenses
  • Block Range & BlockRange
  • Street Name & StreetName

Combine similar columns

df['BlockRange'] = pd.concat([df['Block Range'].dropna(),
                              df['BlockRange'].dropna()]).reindex_like(df)

df['StreetName'] = pd.concat([df['Street Name'].dropna(),
                              df['StreetName'].dropna()]).reindex_like(df)

df['# offenses'] = pd.concat([df['# offenses'].dropna(),
                              df['Offenses'].dropna()]).reindex_like(df)
  • concatenate each pair of similar columns
  • dropna() keeps only the non-NaN values
  • reindex_like(df) restores the original index

We now see the columns filled; the old duplicate columns will be dropped later.
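The concat/dropna/reindex_like pattern above can also be written with `combine_first`, which does the same row alignment in one call. A hedged sketch on a toy pair of columns mimicking the Block Range / BlockRange split (not the real dataset):

```python
import pandas as pd
import numpy as np

# Each row has a value in exactly one of the two columns
df = pd.DataFrame({'Block Range': ['100-199', np.nan, '300-399'],
                   'BlockRange':  [np.nan, '200-299', np.nan]})

# Take 'Block Range' where present, otherwise fall back to 'BlockRange'
df['BlockRange'] = df['Block Range'].combine_first(df['BlockRange'])
print(df['BlockRange'].tolist())  # ['100-199', '200-299', '300-399']
```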

Lowercase, rename & drop columns

  • Lowercase column names are easier to work with
# Map the lowering function to all column names
df.columns = map(str.lower, df.columns)
  • renaming columns also makes them easier to work with, especially when the old names contain a space.. ugh!
df.rename(columns={
    '# offenses': 'num_offenses',
     'offense type': 'offense_type',
    'blockrange': 'block_range',
    'streetname': 'street_name'
}, inplace=True)
  • here we drop our old duplicate columns and some unused ones

  • using inplace=True allows us to do it in one line

df.drop(['suffix','street name','block range','offenses'],axis=1, inplace=True)

Let’s display our null values now

df.apply(lambda x: sum(x.isnull()))

num_offenses      0
beat              0
block_range       0
date              0
hour              0
offense_type      0
premise         487
street_name       0
type              0
dtype: int64
  • we see here that premise has 487 null values
  • for now, let’s replace them with UNK
df.premise.fillna('UNK',inplace=True)
df.head()
|   | num_offenses | beat | block_range | date | hour | offense_type | premise | street_name | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 10H10 | 200-299 | 2017-04-10 00:00:00 | 15 | Burglary | Residence or House | CLIFTON | - |
| 1 | 1 | 10H10 | 2300-2399 | 2017-04-11 00:00:00 | 15 | Theft | Restaurant or Cafeteria Parking Lot | CANAL | ST |
| 2 | 1 | 10H10 | 2300-2399 | 2017-04-11 00:00:00 | 17 | Theft | Restaurant or Cafeteria Parking Lot | CANAL | ST |
| 3 | 1 | 10H10 | 4600-4699 | 2017-04-12 00:00:00 | 9 | Burglary | Miscellaneous Business (Non-Specific) | CANAL | ST |
| 4 | 1 | 10H10 | 100-199 | 2017-04-12 00:00:00 | 19 | Theft | Other, Unknown, or Not Listed | ADAM | LN |

Step 4: Clean... some more

Now we will check some columns and validate the data.

num_offenses

df.num_offenses.dtype

dtype('float64')
df.num_offenses.value_counts(dropna=False)

1.0     117727
2.0       1661
3.0        227
4.0         70
5.0         21
6.0         10
7.0          7
10.0         2
8.0          2
Name: num_offenses, dtype: int64
  • We can see that our values are showing up as floats
  • let’s change them to int
df.num_offenses = df.num_offenses.astype('int')
df.num_offenses.dtype
dtype('int64')

beat

len(df.beat.value_counts(dropna=False))
237
  • we can see that we have 237 unique values for beat
  • that is almost double the actual number of beats
  • let’s check
df.beat.unique()

...
      '21I10 ', '21I20 ', '21I40 ', '21I50 ', '21I60 ', '21I70 ',
       '23J50 ', '24C10 ', '24C20 ', '24C30 ', '24C40 ', '24C50 ',
       '24C60 ', '2A10  ', '2A20  ', '2A30  ', '2A40  ', '2A50  ',
       '2A60  ', '3B10  ', '3B30  ', '3B40  ', '3B50  ', '4F10  ',
       '4F20  ', '4F30  ', '5F10  ', '5F20  ', '5F30  ', '5F40  ',
       '6B10  ', '6B20  ', '6B30  ', '6B40  ', '6B50  ', '6B60  ',
       '7C10  ', '7C20  ', '7C30  ', '7C40  ', '7C50  ', '8C10  ',
       '8C20  ', '8C30  ', '8C40  ', '8C50  ', '8C60  ', '9C10  ',
       '9C20  ', '9C30  ', '9C40  ', 'UNK   ']
  • many values have trailing whitespace
  • let’s strip it
df.beat = df.beat.str.strip()
len(df.beat.unique())
120

date

df['date'].dtype
dtype('O')
  • let’s convert it to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'].dtype
dtype('<M8[ns]')
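With date now a proper datetime, comparisons against date strings become easy, which is handy for sanity checks such as finding records dated before the report year. A sketch on toy dates (hypothetical data, mimicking the occasional late-2016 record in the 2017 files):

```python
import pandas as pd

# Toy dates: mostly 2017, one late-2016 record
df = pd.DataFrame({'date': pd.to_datetime(['2017-04-10',
                                           '2016-12-31',
                                           '2017-07-20'])})

# Records dated before 2017; pandas coerces the string to a Timestamp
out_of_range = df[df['date'] < '2017-01-01']
print(len(out_of_range))  # 1
```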

offense_type

df.offense_type.value_counts(dropna=False)

Theft                        61411
Burglary                     15713
Aggravated Assault           11156
Robbery                       8956
AutoTheft                     6649
Theft                         6011
Auto Theft                    3874
Burglary                      1371
Rape                          1239
Aggravated Assault            1158
AutoTheft                      978
Robbery                        822
Murder                         235
Rape                           129
Murder                          23
1                                2
Name: offense_type, dtype: int64

We can see that we have multiple versions of the same value, let’s check deeper.

df.offense_type.unique()
array(['Burglary', 'Theft', 'Robbery', 'Auto Theft', 'Aggravated Assault',
       'Rape', 'Murder', 'AutoTheft', 1, 'Burglary                 ',
       'Robbery                  ', 'Theft                    ',
       'AutoTheft                ', 'Aggravated Assault       ',
       'Rape                     ', 'Murder                   '],
      dtype=object)
  • we see that trailing whitespace again; let’s strip it
df.offense_type = df.offense_type.str.strip()
df.offense_type.unique()

array(['Burglary', 'Theft', 'Robbery', 'Auto Theft', 'Aggravated Assault',
       'Rape', 'Murder', 'AutoTheft', nan], dtype=object)
df.offense_type.value_counts(dropna=False)

Theft                 67422
Burglary              17084
Aggravated Assault    12314
Robbery                9778
AutoTheft              7627
Auto Theft             3874
Rape                   1368
Murder                  258
NaN                       2
Name: offense_type, dtype: int64
  • we have two versions of “Auto Theft”
  • let’s replace the incorrect one
df.offense_type.replace({'AutoTheft':'Auto Theft'}, inplace=True)
df.offense_type.value_counts(dropna=False)

Theft                 67422
Burglary              17084
Aggravated Assault    12314
Auto Theft            11501
Robbery                9778
Rape                   1368
Murder                  258
NaN                       2
Name: offense_type, dtype: int64

street_name

len(df.street_name.value_counts(dropna=False))
21587
df.street_name = df.street_name.str.strip()
len(df.street_name.value_counts(dropna=False))
18672

Organize columns

# list of column names in ideal order
cols = ['date','hour','beat','offense_type','block_range','street_name','premise','num_offenses','type']

df = df.reindex(columns=cols)
df.head()
|   | date | hour | beat | offense_type | block_range | street_name | premise | num_offenses | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-04-10 00:00:00 | 15 | 10H10 | Burglary | 200-299 | CLIFTON | Residence or House | 1 | - |
| 1 | 2017-04-11 00:00:00 | 15 | 10H10 | Theft | 2300-2399 | CANAL | Restaurant or Cafeteria Parking Lot | 1 | ST |
| 2 | 2017-04-11 00:00:00 | 17 | 10H10 | Theft | 2300-2399 | CANAL | Restaurant or Cafeteria Parking Lot | 1 | ST |
| 3 | 2017-04-12 00:00:00 | 9 | 10H10 | Burglary | 4600-4699 | CANAL | Miscellaneous Business (Non-Specific) | 1 | ST |
| 4 | 2017-04-12 00:00:00 | 19 | 10H10 | Theft | 100-199 | ADAM | Other, Unknown, or Not Listed | 1 | LN |

Check for null values

  • 9 null street_name values
  • 2 null offense_type values
df.apply(lambda x: sum(x.isnull()))
date            0
hour            0
beat            0
offense_type    2
block_range     0
street_name     9
premise         0
num_offenses    0
type            0
dtype: int64

Drop null street_name

df.dropna(subset=['street_name'],inplace=True)

Investigate offense_type

df = df.reset_index(drop=True)  # reset index

df[df.isnull().any(axis=1)]  # display null value rows
|   | date | hour | beat | offense_type | block_range | street_name | premise | num_offenses | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 35772 | 2017-02-15 | 23 | 20G40 | NaN | 1400-1499 | DAIRY ASHFORD | Apartment | 1 | RD |
| 38352 | 2016-12-31 | 1 | 6B20 | NaN | 4200-4299 | OAK SHADOWS | Residence or House | 1 | DR |

Let’s analyze all the crimes at these two locations.

Location 1

da = df.street_name =='DAIRY ASHFORD'
br = df.block_range == '1400-1499'

df[da&br].offense_type.value_counts()


Theft       5
Burglary    1
Name: offense_type, dtype: int64
  • Change the null value to Theft by index location
df.iloc[35772, df.columns.get_loc('offense_type')] = 'Theft'
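Since the index was just reset, positional iloc and label-based loc coincide here; the same single-cell update can be written either way. A hedged sketch on a toy frame (hypothetical data, not the real dataset):

```python
import pandas as pd

# Toy frame with one missing offense_type in row 0
df = pd.DataFrame({'offense_type': [None, 'Theft', 'Burglary']})

# Positional: row 0, column located by position
df.iloc[0, df.columns.get_loc('offense_type')] = 'Theft'

# Label-based equivalent, valid because the index was reset
df.loc[0, 'offense_type'] = 'Theft'
print(df.offense_type.tolist())  # ['Theft', 'Theft', 'Burglary']
```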

Location 2

br = df.block_range == '4200-4299'

oak = df.street_name == 'OAK SHADOWS'  # named 'oak' to avoid shadowing the os module
df[oak&br].offense_type.value_counts()

Aggravated Assault    1
Robbery               1
Name: offense_type, dtype: int64
df[oak&br]
|   | date | hour | beat | offense_type | block_range | street_name | premise | num_offenses | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 38352 | 2016-12-31 00:00:00 | 1 | 6B20 | NaN | 4200-4299 | OAK SHADOWS | Residence or House | 1 | DR |
| 52209 | 2017-07-20 00:00:00 | 22 | 14D10 | Robbery | 4200-4299 | OAK SHADOWS | Service or Gas Station | 1 | DR |
| 79668 | 2017-03-09 00:00:00 | 13 | 6B20 | Aggravated Assault | 4200-4299 | OAK SHADOWS | Residence or House | 1 | DR |
  • the null row shares its beat and premise with the Aggravated Assault row
  • change the null value to Aggravated Assault
df.iloc[38352, df.columns.get_loc('offense_type')] = 'Aggravated Assault'

Results

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119718 entries, 0 to 119717
Data columns (total 9 columns):
date            119718 non-null datetime64[ns]
hour            119718 non-null int64
beat            119718 non-null object
offense_type    119718 non-null object
block_range     119718 non-null object
street_name     119718 non-null object
premise         119718 non-null object
num_offenses    119718 non-null int64
type            119718 non-null object
dtypes: datetime64[ns](1), int64(2), object(6)
memory usage: 8.2+ MB

Step 5: Save data

Check our save directory

data_directory_saves

'../data/clean_data/crime/'
df.to_csv(data_directory_saves+'crime_2017.csv')
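Note that to_csv writes the DataFrame index as an extra unnamed column by default; passing index=False avoids that. A quick round-trip sketch on a toy frame (using an in-memory buffer instead of the real file on disk):

```python
import io
import pandas as pd

df = pd.DataFrame({'beat': ['10H10', '6B20'], 'num_offenses': [1, 2]})

# index=False keeps the CSV free of the extra unnamed index column
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

restored = pd.read_csv(buf)
print(list(restored.columns))  # ['beat', 'num_offenses']
```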
