Introduction

This project aims to build a machine learning model for assault prediction in Chicago, IL using 2022 crime data. The assault events used as the dependent variable are extracted from the Chicago Data Portal’s 2022 crime data, and the independent variables used in the model come from the same site, using the year closest to 2022 that is available.

Data Wrangling

Crime Data Gathering

This project uses the RSocrata package to pull data available through the Chicago Data Portal’s API. For the Chicago boundary, this project intentionally omits the O’Hare airport area for a cleaner study area.
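The exact query code is not shown here; a minimal sketch of how the assault points might be pulled and clipped, assuming the RSocrata package, the portal’s Crimes dataset endpoint (the resource id ijzp-q8t2 is an assumption), and a chicagoBoundary object already read in with the O’Hare area removed:

library(RSocrata)  # assumed package for Socrata API access
library(sf)
library(dplyr)

# Assumed endpoint and field names for the Chicago Data Portal "Crimes" dataset
assaults_raw <- read.socrata(
  "https://data.cityofchicago.org/resource/ijzp-q8t2.json?primary_type=ASSAULT&year=2022")

assault_clipped <- assaults_raw %>%
  filter(!is.na(latitude), !is.na(longitude)) %>%                 # drop records without coordinates
  mutate(across(c(latitude, longitude), as.numeric)) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%   # lon/lat to sf points
  st_transform(st_crs(chicagoBoundary)) %>%                       # match the project's planar CRS
  .[chicagoBoundary, ]                                            # clip to the boundary (O'Hare removed)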

Visualizing 2022 Assault Locations

The map on the left shows the actual locations of assault incidents; the map on the right shows the density of assaults, with contours overlaid on the Chicago boundary.

# Define your custom color gradient with 3 colors
my_colors <- c("#033E56", "#518984", "#9DC9A3", "#F5CD42", "#F5A70A", "#D05A30")

# Uses grid.arrange to organize independent plots
grid.arrange(
  ncol = 2,
  
  # Plot 1: assault overlaid on Chicago boundary
  ggplot() + 
    geom_sf(data = chicagoBoundary, fill = "#e9e9e9", col = "grey") +  
    geom_sf(data = assault_clipped, size = 0.05, color = "#F5A70A") + 
    labs(title = "Assaults in Chicago in 2022") +  
    theme_void(), 
  
  # Plot 2: Density of assaults with contours overlaid on Chicago boundary
  ggplot() + 
    geom_sf(data = chicagoBoundary, fill = "#e9e9e9", col = "grey") + 
    stat_density2d(data = data.frame(st_coordinates(assault_clipped)),  # Compute 2D kernel density estimate
                   aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),  # Define aesthetics for density contours
                   size = 0.01, bins = 60, geom = 'polygon') +  # Set size and number of bins for contours
    scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                      name = "Kernel Density") +
    scale_alpha(range = c(0.00, 0.25), guide = "none") +  # Set transparency range for contours
    labs(title = "Density of Assaults",
       caption = "Data: Chicago Data Portal Crimes 2022") +  # Set plot title and caption
    theme_void() + theme(legend.position = "none")  # Use a blank theme and remove legend
)

Creating a fishnet grid

Create a fishnet of grid cells with a cell size of 500 meters and aggregate the assault points into each cell.

# Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())


# add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(assault_clipped) %>% 
  mutate(countAssault = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countAssault = replace_na(countAssault, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countAssault), color = NA) +
  scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                      name = "Count of Assaults") +
  labs(title = "Count of Assaults for the fishnet",
       caption = "Data: Chicago Data Portal Crimes 2022") +
  theme_void()

A histogram shows the distribution of assault counts across the grid cells.

ggplot(crime_net, aes(x = countAssault)) +
  geom_histogram(fill = "#F5A70A", color = "#e9e9e9") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # applied after theme_minimal so it is not overridden
  labs(title = "Distribution of Assaults in Chicago 2022",
       caption = "Data: Chicago Data Portal Crimes 2022") +
  xlab("Assault Incidents") +
  ylab("Count") 

Modeling Spatial Features

Independent Variables Gathering

Read data for the independent variables in the prediction model (a representative pull is sketched after this list):
1. Abandoned Vehicles
2. Abandoned Buildings
3. Graffiti
4. Street Lights Out
5. Sanitation Complaints
6. Liquor Retail
7. Business Licenses
8. Liquor and Public Places Licenses
9. ShotSpotter Alerts
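Each of these is pulled with the same Socrata pattern; as a representative sketch (the 311 Abandoned Vehicles endpoint and its field names are assumptions), abandoned vehicles might be read and prepared like this:

# Assumed endpoint: 311 Service Requests - Abandoned Vehicles (resource id is an assumption)
abandonCars <- read.socrata(
  "https://data.cityofchicago.org/resource/3c9v-pnva.json") %>%
  mutate(year = substr(creation_date, 1, 4)) %>%
  filter(year == "2022") %>%                                      # use the year closest to the crime data
  filter(!is.na(latitude), !is.na(longitude)) %>%
  mutate(across(c(latitude, longitude), as.numeric)) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(st_crs(chicagoBoundary)) %>%
  mutate(Legend = "Abandoned_Cars") %>%                           # Legend drives the fishnet aggregation below
  dplyr::select(Legend)

The remaining eight variables would follow the same pattern, each tagged with its own Legend value.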

Aggregate each feature to fishnet

This part counts the number of points of each independent variable that fall within each grid cell.

# Join data with the fishnet grid based on spatial intersection

vars_net <- 
  rbind(abandonCars,abandonBuildings, graffiti, streetLightsOut, sanitation, liquorRetail, businessLicense, liquorLicense, shotSpotter) %>%
  st_join(., fishnet, join=st_within) %>% # Perform spatial join with fishnet grid, keeping only points within grid cells
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>% # Group data by uniqueID and Legend
  summarize(count = n()) %>% # Calculate count of points within each group (grid cell)
    left_join(fishnet, ., by = "uniqueID") %>% # Left join fishnet grid with summarized counts, using uniqueID as key
    spread(Legend, count, fill=0) %>% # Spread Legend values into separate columns, filling missing values with 0
    dplyr::select(-`<NA>`) %>% # Drop the `<NA>` column spread() creates for cells with no points
    ungroup() # Ungroup the data frame (remove grouping structure)

# plot
vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

# Build the nine count maps with one loop instead of nine near-identical blocks
count_vars <- c("Abandoned_Cars", "Abandoned_Buildings", "Graffiti", "Street_Lights_Out",
                "Sanitation", "Liquor_Retail", "Business_License", "Liquor_License", "Shot_Spotter")

countPlots <- lapply(count_vars, function(v){
  ggplot() +
    geom_sf(data = vars_net.long %>% filter(Variable == v), aes(fill = value), colour = NA) +
    scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                         name = "Count") + 
    labs(title = gsub("_", " ", v)) +
    theme_void()
})

grid.arrange(grobs = countPlots, ncol = 3)

Nearest Neighbor Feature

This part calculates, for each fishnet grid-cell centroid, the average distance to its k nearest points of each independent variable. The number of nearest neighbors (k) used for each variable depends on how many points that variable has.
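The nn_function() helper used below is not a CRAN function; a minimal sketch of it, assuming it follows the Public Policy Analytics convention of returning the mean distance from each origin point to its k nearest destination points via FNN::get.knnx():

library(FNN)  # assumed dependency for fast k-nearest-neighbor search

# Hypothetical re-implementation of nn_function(): for every row of
# measureFrom (x/y coordinates), return the average distance to the
# k nearest rows of measureTo.
nn_function <- function(measureFrom, measureTo, k) {
  nn_dist <- get.knnx(measureTo, measureFrom, k = k)$nn.dist  # one row per origin, k distance columns
  rowMeans(nn_dist)                                           # average over the k neighbors
}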

# Convenience aliases to reduce the length of function names
st_c    <- st_coordinates  # Alias for st_coordinates function
st_coid <- st_centroid     # Alias for st_centroid function

# Create nearest neighbor (NN) relationship to fishnet grid cells
vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),1),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),10),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),5),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),1),
      Business_License.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(businessLicense),3),
      Liquor_License.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorLicense),1),
      Shot_Spotter.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(shotSpotter),3))

## Visualize the NN feature
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)


# Map the nine NN features with a loop, preserving the panel titles
nn_titles <- c(Abandoned_Cars.nn      = "Abandoned Car NN Distance",
               Abandoned_Buildings.nn = "Abandoned Building NN Distance",
               Graffiti.nn            = "Graffiti NN Distance",
               Street_Lights_Out.nn   = "Street Lights Out NN Distance",
               Sanitation.nn          = "Sanitation NN Distance",
               Liquor_Retail.nn       = "Liquor Retail NN Distance",
               Business_License.nn    = "Business License NN Distance",
               Liquor_License.nn      = "Liquor License NN Distance",
               Shot_Spotter.nn        = "Shot Spotter NN Distance")

nnPlots <- lapply(names(nn_titles), function(v){
  ggplot() +
    geom_sf(data = vars_net.long.nn %>% filter(Variable == v), aes(fill = value), colour = NA) +
    scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                         name = "Distance") + 
    labs(title = nn_titles[[v]]) +
    theme_void()
})

grid.arrange(grobs = nnPlots, ncol = 3)

Join crime and areal data to fishnet

Since the counts were aggregated to each cell by uniqueID, we can use that key to join the counts to the fishnet. We also use spatial joins on the grid-cell centroids to attach the neighborhood and police-district polygons.

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      full_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

# for live demo
# mapview::mapview(final_net, zcol = "District")

Local Moran’s I for fishnet grid cells

We use the {spdep} package to build a spatial weights list from the grid-cell neighbors and to calculate local Moran’s I.

High local Moran’s I values occur where high values cluster near other high values, or where low values cluster near other low values.

The code below computes each cell’s local Moran’s I value and its p-value, and flags significant hotspots based on that p-value.
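For reference, local Moran’s I for grid cell i, in Anselin’s formulation (which is essentially what spdep::localmoran implements, up to the exact variance scaling), is:

$$
I_i = \frac{x_i - \bar{x}}{m_2} \sum_{j} w_{ij}\,(x_j - \bar{x}), \qquad m_2 = \frac{1}{n}\sum_{k}(x_k - \bar{x})^2
$$

where $x_i$ is the assault count in cell $i$, $\bar{x}$ is the mean count across cells, and $w_{ij}$ are the row-standardized spatial weights.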

## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to a list of weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# join local Moran's I results to fishnet
local_morans <- localmoran(final_net$countAssault, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Assault_Count = countAssault, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
  gather(Variable, Value, -geometry)

# This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                      name = "") +
      labs(title=i) +
      theme_void()}

grid.arrange(varList[[1]], varList[[2]], varList[[3]], varList[[4]],
  ncol = 2, top = "Local Moran's I statistics, Assaults")

NN distance to hotspot

Next, we flag grid cells that are significant hotspots, calculate each cell’s distance to the nearest hotspot, and plot that NN distance.

# nn_function() generates a warning here
final_net <-
  final_net %>% 
  mutate(assault.isSig = ifelse(localmoran(final_net$countAssault, 
                             final_net.weights)[,5] <= 0.001, 1, 0)) %>%
  mutate(assault.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                          st_coordinates(st_centroid(
                                            filter(final_net, assault.isSig == 1))), 1))

ggplot() +
      geom_sf(data = final_net, aes(fill=assault.isSig.dist), colour=NA) +
      scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                      name = "NN Distance") +
      labs(title="Assault NN Distance") +
      theme_void()

Correlation analysis

final_net2 <- st_drop_geometry(final_net) %>%
  dplyr::select(-c(uniqueID, cvID, name, District, Business_License, Liquor_License, Shot_Spotter.nn, assault.isSig))

final_net_long <- gather(final_net2, Variable, Value, -countAssault)

# Create scatterplot with linear regression lines for each variable
ggplot(final_net_long, aes(Value, countAssault)) + 
  geom_point(size = .25, color = "#033E56") +
  geom_smooth(method = "lm", se = FALSE, color = "#D05A30", size = 0.8) +
  facet_wrap(~Variable, ncol = 4, scales = "free") + 
  labs(title = "countAssault as a function of continuous variables") + 
  theme_minimal()

Modeling and CV

Random k-fold CV

## define the variables we want
reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars", "Business_License.nn","Graffiti", 
                 "Liquor_License.nn", "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", 
                 "Shot_Spotter")

reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars", "Business_License.nn", "Graffiti", 
                 "Liquor_License.nn", "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", 
                 "Shot_Spotter", "assault.isSig", "assault.isSig.dist")

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countAssault",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countAssault, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countAssault",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countAssault, Prediction, geometry)

Spatial LOGO CV

Leave-One-Group-Out (LOGO) CV, holding out one neighborhood at a time; the neighborhood name serves as the fold id.

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countAssault",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countAssault, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countAssault",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countAssault, Prediction, geometry)

Compare regressions

Comparing the errors across the different regressions, we see that adding the spatial process features improves the model.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countAssault,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countAssault,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countAssault,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countAssault,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countAssault, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>% ungroup()

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(col.name=c("Regression", 'Mean Absolute Error','Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = "striped", full_width = T, position = "left") %>%
    footnote(general_title = "\n", general = "Table 1")
Regression                            Mean Absolute Error   Standard Deviation MAE
Random k-fold CV: Just Risk Factors   1.45                  1.18
Random k-fold CV: Spatial Process     1.31                  0.98
Spatial LOGO-CV: Just Risk Factors    4.13                  4.20
Spatial LOGO-CV: Spatial Process      3.38                  3.48

Table 1

The small multiple maps below visualize the distribution of errors by regression. For the random k-fold CV method, the errors are fairly evenly distributed across space. For the spatial LOGO-CV method, however, higher errors occur at the spatial hotspots.

error_by_reg_and_fold %>% 
  ggplot() +
  geom_sf(aes(fill=MAE), color="transparent") +
  scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                      name = "MAE") + 
  facet_wrap(~Regression, ncol = 4) +
   labs(title = "MAE by Fold and Regression",
       caption = "Data: Chicago Data Portal") +
  theme_void() 

Raw errors by race context

Chicago remains a very segregated city, with clear racial boundaries between places. In particular, the northern and northeastern parts of Chicago are majority white, while much of the rest of the city is majority non-white.

tracts22 <- 
  get_acs(geography = "tract", 
          variables = c("B02001_001E", # total population
            "B02001_002E"), # white population
          year=2022, state=17, county=031, 
          geometry=TRUE, output="wide") %>%
  st_transform('ESRI:102271') %>% 
  rename(TotalPop = B02001_001E, 
         Whites = B02001_002E) %>% 
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop, 0), 
         majority = ifelse(pctWhite > 0.5, "White Dominant", "Not White Dominant")) %>%
  .[neighborhoods, ]

ggplot() + geom_sf(data = na.omit(tracts22), aes(fill = majority), color = NA) +
    scale_fill_manual(values = c("#033E56", "#D05A30"), name = "Race Context") +
    labs(title = "Race Context",
         caption = "Data: American Community Survey 2022") +
  theme_void() 

In terms of race context, the model performs reasonably well: it does not over-predict in areas that are mostly home to minority groups.

joinrace <- st_centroid(reg.summary) %>% 
  st_intersection(tracts22 %>% dplyr::select(pctWhite)) %>% 
  st_drop_geometry() %>% 
  group_by(cvID) %>% 
  summarize(meanMajor = mean(pctWhite))


reg.summary <- reg.summary %>% 
  left_join(joinrace, by = "cvID") 


reg.summary %>% 
  mutate(raceContext = ifelse(meanMajor > .5, "White Dominant Neighborhood", "Not White Dominant Neighborhood")) %>% 
  st_drop_geometry() %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = T, position = "left") %>%
    footnote(general_title = "\n", general = "Table 2")
Regression                            Not White Dominant Neighborhood   White Dominant Neighborhood
Random k-fold CV: Just Risk Factors    0.0042561                         3.1644085
Random k-fold CV: Spatial Process      0.0016994                         1.2339475
Spatial LOGO-CV: Just Risk Factors    -0.4895975                         1.0581259
Spatial LOGO-CV: Spatial Process      -0.1057703                         0.1807363

Table 2

Density vs predictions

Kernel density map

The spatstat package’s density.ppp() function gives us kernel density estimates with varying search radii. The code block below creates kernel density maps of assaults with 1000-, 1500-, and 2000-foot search radii.

# demo of kernel width
assault_ppp <- as.ppp(st_coordinates(assault_clipped), W = st_bbox(final_net))
assault_KD.1000 <- density.ppp(assault_ppp, 1000)
assault_KD.1500 <- density.ppp(assault_ppp, 1500)
assault_KD.2000 <- density.ppp(assault_ppp, 2000)
assault_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(assault_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(assault_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(assault_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

assault_KD.df$Legend <- factor(assault_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=assault_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_gradientn(colors = my_colors, na.value = "#e9e9e9",
                      name = "Density") +
  labs(title = "Kernel density with 3 different search radii") +
  theme_void()

Get 2023 crime data

Let’s see how our model performs relative to the kernel density estimate on the following year’s (2023) assault data.
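The chunk that pulls the 2023 assaults is not shown; a minimal sketch that mirrors the 2022 query (same assumed endpoint and clipping step) and produces assault_clipped23:

# Assumed: same Socrata endpoint as the 2022 pull, filtered to 2023
assaults23_raw <- read.socrata(
  "https://data.cityofchicago.org/resource/ijzp-q8t2.json?primary_type=ASSAULT&year=2023")

assault_clipped23 <- assaults23_raw %>%
  filter(!is.na(latitude), !is.na(longitude)) %>%
  mutate(across(c(latitude, longitude), as.numeric)) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(st_crs(final_net)) %>%
  .[chicagoBoundary, ]                      # clip to the study boundary (O'Hare removed)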

assault_KDE_sum <- as.data.frame(assault_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(assault_KDE_sum$value, 
                             n = 5, "fisher")
assault_KDE_sf <- assault_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(assault_clipped23) %>% mutate(assaultCount = 1), ., sum) %>%
    mutate(assaultCount = replace_na(assaultCount, 0))) %>%
  dplyr::select(label, Risk_Category, assaultCount)
ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
assault_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions Spatial",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(assault_clipped23) %>% mutate(assaultCount = 1), ., sum) %>%
      mutate(assaultCount = replace_na(assaultCount, 0))) %>%
  dplyr::select(label,Risk_Category, assaultCount)



ml_breaks_kfold <- classIntervals(reg.ss.cv$Prediction, 
                             n = 5, "fisher")
assault_risk_sf_kfold <-
  reg.ss.cv %>%
  mutate(label = "Risk Predictions K-fold",
         Risk_Category =classInt::findCols(ml_breaks_kfold),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(assault_clipped23) %>% mutate(assaultCount = 1), ., sum) %>%
      mutate(assaultCount = replace_na(assaultCount, 0))) %>%
  dplyr::select(label,Risk_Category, assaultCount)

Overall, the predicted risk categories line up well with where the 2023 assaults actually occurred.

rbind(assault_KDE_sf, assault_risk_sf_kfold, assault_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label) +
    geom_sf(data = sample_n(assault_clipped23, 3000), size = .03, colour = "white") +
    scale_fill_manual(values = c("#033E56", "#518984", "#9DC9A3", "#F5CD42", "#F5A70A"),
                     name = "Risk Category") +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2022 assault risk predictions; 2023 assaults") + 
  theme_void()

The kernel density model edges out both risk prediction models in the highest-risk areas. However, the risk prediction models do a better job of capturing assault incidents in the lower-risk areas that the kernel density model had ignored.

rbind(assault_KDE_sf, assault_risk_sf_kfold, assault_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countAssault = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countAssault / sum(countAssault)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_manual(values = c("#033E56", "#9DC9A3", "#F5CD42"),
                     name = "Model") +
      labs(title = "Risk prediction vs. Kernel density, 2023 Assaults",
           y = "% of Test Set Assault (per model)",
           x = "Risk Category") +
  theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Summary

This algorithm does a reasonably good job at prediction, but I do not recommend putting it into production. The main reason is that predicting police activity is controversial, and predictions trained on reported crime may deviate from the true underlying pattern of incidents.

There are still ways this model could be improved. For example, some of the independent variables are not from the same year as the crime data, because the Chicago Data Portal does not have the most up-to-date datasets for them.