Crime Analysis
In this tutorial, we will load and clean UCR, Uniform Crime Report data for 2017 for the city of Houston, Tx.
The UCR data contained in these reports are presented in a monthly breakdown of Part I crimes for which HPD wrote police reports. The data is furthered broken down by police districts and beats and shows descriptions of the following types of crimes:
- Murder
- Rape
- Robbery
- Aggravated assault
- Burglary
- Theft
- Auto theft
Breakdown of the data
Column | Value | Description | example |
---|---|---|---|
beat | string |
Territory that a police officer patrols | 10H10 |
date | DateTime |
date when the offense took place | 4/11/2017 0:00 |
hour | int |
time frame when the offense took place | 15 |
number of offenses | int |
if multiple offenses took place | 1 |
block range | str |
range value of street | 2300-2399 |
offense type | str |
type of offense form a list | Theft |
premise | str |
Location type where the offense took place | Restaurant or Cafeteria Parking Lot |
street name | str |
Name of the street | Canal |
type | str |
Type of street | ST |
suffix | str |
Direction of the street facing | blank |
Tools & libraries
- Jupyter notebook
- pandas
- numpy
- seaborn
- matplotlib
- glob
- os
Step 1: Download dataset
Go to https://www.houstontx.gov/police/cs/crime-stats-archives.htm and download the excel files for 2017 and save them in your working directory.
It should look something similar to this:
|
notebook/
directory is where our jupyter notebook will bedata/clean_data/
is where we will save our clean datadata/crime_data/2017/
is where the excel files given the year will reside.
Now that we have our data lets start analyzing.
Step 2 Load Data
Now we will create a notebook and analyze our data
Load libraries
|
Combine data and create the data frame
Since we are in a nested directory, we will create a path to our data.
|
data_directory
creates a path to our crime data folder given the yeardata_directory_saves
also creates a path but it will be for our finalized clean data
Combine multiple files
|
- We use
glob
to find all the files from ourdata_directory
that match the excel extension.xls
- create a generator object that reads each excel files into a dataframe
- concat each dataframe into one
Inspect data
|
We can see that we have many columns that have similar names, let’s check for missing values.
Step 3 Clean data
Check and display missing values
|
|
Display missing values
Create a heatmap of our dataset
|
Results
These columns all have mutual data missing
- # offenses
& offenses
- Block Range
& BlockRange
- Street Name
& StreetName
Combine similar columns
|
- concat similar columns
- only selecting non-nan values
- keeping similar index
We now see the dataset filled and old columns will be dropped later.
Lowercase, drop & null value columns
- Lowercase columns make it better to work with
|
- renaming columns again makes it easier to work on, especially if the old columns contain a space.. ugh!
|
here we drop our old columns and some unused
using
inplace=True
allowes us to do it in one line
|
Let’s display or null values now
|
- we see here that
premise
has around 487 null values - for now lets replace them with
UNK
|
|
num_offenses | beat | block_range | date | hour | offense_type | premise | street_name | type | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 10H10 | 200-299 | 2017-04-10 00:00:00 | 15 | Burglary | Residence or House | CLIFTON | - |
1 | 1 | 10H10 | 2300-2399 | 2017-04-11 00:00:00 | 15 | Theft | Restaurant or Cafeteria Parking Lot | CANAL | ST |
2 | 1 | 10H10 | 2300-2399 | 2017-04-11 00:00:00 | 17 | Theft | Restaurant or Cafeteria Parking Lot | CANAL | ST |
3 | 1 | 10H10 | 4600-4699 | 2017-04-12 00:00:00 | 9 | Burglary | Miscellaneous Business (Non-Specific) | CANAL | ST |
4 | 1 | 10H10 | 100-199 | 2017-04-12 00:00:00 | 19 | Theft | Other, Unknown, or Not Listed | ADAM | LN |
Step 4 Clean.. some more
Now we will check some column and validate the data
num_offenses
|
|
- We can see that our values are showing up as
float
- let’s change it to
int
|
|
beat
|
- we can see that we have 237 values for beat
- that is almost double
- let’s check
|
- Many values have extra space
- lets strip empty space
|
|
date
|
- lets change it to datetime
|
|
offense_type
|
We can see that we have multiple versions of the same value, let’s check deeper.
|
- we see that empty space
|
|
|
- we have two versions of “auto theft”
- let’s replace the incorrect
|
|
street_name
|
|
|
Organize columns
|
date | hour | beat | offense_type | block_range | street_name | premise | num_offenses | type | |
---|---|---|---|---|---|---|---|---|---|
0 | 2017-04-10 00:00:00 | 15 | 10H10 | Burglary | 200-299 | CLIFTON | Residence or House | 1 | - |
1 | 2017-04-11 00:00:00 | 15 | 10H10 | Theft | 2300-2399 | CANAL | Restaurant or Cafeteria Parking Lot | 1 | ST |
2 | 2017-04-11 00:00:00 | 17 | 10H10 | Theft | 2300-2399 | CANAL | Restaurant or Cafeteria Parking Lot | 1 | ST |
3 | 2017-04-12 00:00:00 | 9 | 10H10 | Burglary | 4600-4699 | CANAL | Miscellaneous Business (Non-Specific) | 1 | ST |
4 | 2017-04-12 00:00:00 | 19 | 10H10 | Theft | 100-199 | ADAM | Other, Unknown, or Not Listed | 1 | LN |
Check for null values
- 9
street_names
- 2
offense_type
|
|
Drop null street_name
|
investigate offense_type
|
date | hour | beat | offense_type | block_range | street_name | premise | num_offenses | type | |
---|---|---|---|---|---|---|---|---|---|
35772 | 2017-02-15 | 23 | 20G40 | NaN | 1400-1499 | DAIRY ASHFORD | Apartment | 1 | RD |
38352 | 2016-12-31 | 1 | 6B20 | NaN | 4200-4299 | OAK SHADOWS | Residence or House | 1 | DR |
Let’s analyze all the crimes on these two locations.
Location 1
|
- Change the null value to Theft by index location
|
Location 2
|
|
date | hour | beat | offense_type | block_range | street_name | premise | num_offenses | type | |
---|---|---|---|---|---|---|---|---|---|
38352 | 2016-12-31 00:00:00 | 1 | 6B20 | nan | 4200-4299 | OAK SHADOWS | Residence or House | 1 | DR |
52209 | 2017-07-20 00:00:00 | 22 | 14D10 | Robbery | 4200-4299 | OAK SHADOWS | Service or Gas Station | 1 | DR |
79668 | 2017-03-09 00:00:00 | 13 | 6B20 | Aggravated Assault | 4200-4299 | OAK SHADOWS | Residence or House | 1 | DR |
- we can see that two locations happen on the same beat and premise
- change the null value to AA
|
Results
|
Step 4: save data
check our save directory
|
|