Introduction

This project builds a machine learning model for recidivism prediction in Georgia using NIJ’s 2021 Recidivism Challenge Full Dataset, compiled by the Georgia Department of Community Supervision and the Georgia Crime Information Center and published on the Office of Justice Programs portal. The project also analyzes the model’s performance and asks whether it overpredicts recidivism for certain social groups.

Data Exploration

The trade-off between sensitivity and specificity

In recidivism prediction, prioritizing sensitivity means correctly identifying as many individuals who will reoffend as possible, at the cost of more false positives, which can direct supervision resources toward low-risk individuals. Prioritizing specificity means correctly identifying individuals who won’t reoffend, at the cost of more false negatives, allowing high-risk individuals to go undetected.

The trade-off in recidivism prediction is a balance between resources and the risk of misclassification. If the model prioritizes sensitivity, more people who would not reoffend are unfairly monitored or kept in prison, which limits individual rights, raises social and economic costs, and calls the fairness of the legal system into question. If the model prioritizes specificity, more individuals at risk of recidivism are released, which raises safety concerns in the community, risks social unrest, and increases the money and effort needed to deal with their reoffending in the future, but reduces unfair treatment.
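
To make the trade-off concrete, the sketch below uses made-up predicted probabilities (simulated data, not the Challenge dataset) to show how moving the classification threshold trades sensitivity against specificity:

# toy illustration of the sensitivity/specificity trade-off;
# the outcomes and probabilities here are simulated, not real data
set.seed(1)
toy <- data.frame(actual = rbinom(1000, 1, 0.5), prob = runif(1000))
toy$prob <- ifelse(toy$actual == 1, pmin(toy$prob + 0.2, 1), toy$prob)  # weakly informative scores

for (t in c(0.3, 0.5, 0.7)) {
  pred <- as.numeric(toy$prob > t)
  sens <- sum(pred == 1 & toy$actual == 1) / sum(toy$actual == 1)
  spec <- sum(pred == 0 & toy$actual == 0) / sum(toy$actual == 0)
  cat(sprintf("threshold %.2f: sensitivity %.3f, specificity %.3f\n", t, sens, spec))
}

Lowering the threshold flags more people as likely reoffenders, raising sensitivity at the cost of specificity, and vice versa.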

Variables Examination

The first part of this project focuses on examining the available variables in the dataset. First, we recode the outcome variable (recidivism within 3 years) as a binary variable: 1 means recidivism was observed and 0 means no recidivism.

# recode the outcome variable as 0/1 numeric

Recidivism <- Recidivism %>%
  mutate(Recidivism_numeric = ifelse(recidivism_within_3years == "true", 1, 0))
# glimpse(Recidivism)

Continuous variables

There are nine continuous variables in the dataset (shown below). The bar chart below divides the data into two groups (recidivism within 3 years and no recidivism) and plots the average value of each variable.

Recidivism %>%
  dplyr::select(recidivism_within_3years, jobs_per_year, percent_days_employed, supervision_risk_score_first, avg_days_per_drugtest, drugtests_thc_positive, drugtests_cocaine_positive, drugtests_meth_positive, drugtests_other_positive, residence_puma) %>%
  gather(Variable, value, -recidivism_within_3years) %>%
    ggplot(aes(recidivism_within_3years, value, fill=recidivism_within_3years)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free", ncol = 3) +
      scale_fill_manual(values = palette2) +
      labs(x="Recidivism within 3 years", y="Value", 
           title = "Feature associations with the likelihood of recidivism within 3 years",
           subtitle = "(Continous outcomes)") +
      theme_minimal() + theme(legend.position = "none")

The density chart below shows the same continuous variables but focuses on their distributions. Only three of them show a noticeably different distribution between the recidivism and no-recidivism groups.

Recidivism %>%
    dplyr::select(recidivism_within_3years, jobs_per_year, percent_days_employed, supervision_risk_score_first, avg_days_per_drugtest, drugtests_thc_positive, drugtests_cocaine_positive, drugtests_meth_positive, drugtests_other_positive, residence_puma) %>%
    gather(Variable, value, -recidivism_within_3years) %>%
    ggplot() + 
    geom_density(aes(value, color=recidivism_within_3years), fill = "transparent", linewidth = 0.55) + 
    facet_wrap(~Variable, scales = "free", ncol = 3) +
    scale_colour_manual(values = palette2, name = "Recidivism") +
    labs(x="Value", y="Density",
         title = "Feature Distribution with the likelihood of recidivism within 3 years",
         subtitle = "(Continous outcomes)") +
      theme_minimal()

Categorical variables

Recidivism %>%
    dplyr::select(recidivism_within_3years, gender, race, age_at_release, gang_affiliated, supervision_level_first, education_level, dependents, prison_offense, prison_years, prior_arrest_episodes, prior_conviction_episodes, residence_changes) %>%
    gather(Variable, value, -recidivism_within_3years) %>%
    count(Variable, value, recidivism_within_3years) %>%
      ggplot(., aes(value, n, fill = recidivism_within_3years)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="recidivism_within_3years", y="Value",
             title = "Feature associations with the likelihood of recidivism within 3 years",
             subtitle = "Categorical features") +
        theme_minimal() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

Create a Logistic Regression Model

The dataset is first split into a 50% training set and a 50% test set. We then fit a logistic regression on the training set and take an overview of the model’s coefficients.

set.seed(3456)
trainIndex <- createDataPartition(Recidivism$recidivism_within_3years, p = .50,
                                  list = FALSE,
                                  times = 1)
RecidivismTrain <- Recidivism[ trainIndex,]
RecidivismTest  <- Recidivism[-trainIndex,]


RecidivismModel <- glm(Recidivism_numeric ~ .,
                  data=RecidivismTrain %>% 
                    dplyr::select(Recidivism_numeric,jobs_per_year,percent_days_employed, supervision_risk_score_first, prison_years, condition_cog_ed,condition_mh_sa, condition_other, gang_affiliated,prior_arrest_episodes_violent, prior_arrest_episodes_property, violations, violations_instruction, violations_failtoreport, violations_1, delinquency_reports,program_attendances, program_unexcusedabsences, residence_changes,  prior_arrest_episodes_drug,prior_arrest_episodes,prior_revocations_parole,prior_revocations_probation, avg_days_per_drugtest,drugtests_thc_positive,drugtests_cocaine_positive,drugtests_meth_positive,drugtests_other_positive,gender,race, age_at_release, residence_puma,education_level, dependents, prison_offense),
                  family="binomial" (link="logit"))

Recidivism_sum <- summary(RecidivismModel)

coefficients_table <- as.data.frame(Recidivism_sum$coefficients)

coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|z|)` < 0.001, '***',
                                         ifelse(coefficients_table$`Pr(>|z|)` < 0.01, '**',
                                                ifelse(coefficients_table$`Pr(>|z|)` < 0.05, '*',
                                                       ifelse(coefficients_table$`Pr(>|z|)` < 0.1, '.', ''))))

coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|z|)`, digits = 3), coefficients_table$significance)

coefficients_table %>%
  select(-significance, -`Pr(>|z|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))  %>%
  footnote(general_title = "\n", general = "Table 1")
Estimate Std. Error z value p_value
(Intercept) -0.8784461 0.1890311 -4.6470989 0***
jobs_per_year 0.4623796 0.0392253 11.7877987 0***
percent_days_employed -1.9555342 0.0775564 -25.2143659 0***
supervision_risk_score_first 0.0336102 0.0130368 2.5781063 0.01**
prison_yearsGreater than 2 to 3 years -0.1800728 0.0722036 -2.4939600 0.013*
prison_yearsLess than 1 year 0.1599014 0.0669325 2.3889942 0.017*
prison_yearsMore than 3 years -0.0195576 0.0741412 -0.2637887 0.792
condition_cog_edtrue -0.0183859 0.0530013 -0.3468954 0.729
condition_mh_satrue 0.3588647 0.0566231 6.3377811 0***
condition_othertrue -0.0310762 0.0598448 -0.5192806 0.604
gang_affiliatedfalse 0.5651538 0.0754423 7.4912065 0***
gang_affiliatedtrue 1.3158422 0.1027182 12.8102178 0***
prior_arrest_episodes_violent1 0.0737675 0.0598217 1.2331224 0.218
prior_arrest_episodes_violent2 0.2675999 0.0779423 3.4333093 0.001***
prior_arrest_episodes_violent3 or more 0.2134832 0.0828295 2.5773807 0.01**
prior_arrest_episodes_property1 0.1966578 0.0713686 2.7555232 0.006**
prior_arrest_episodes_property2 0.4086290 0.0797314 5.1250697 0***
prior_arrest_episodes_property3 0.3720791 0.0924320 4.0254349 0***
prior_arrest_episodes_property4 0.4608734 0.1071962 4.2993428 0***
prior_arrest_episodes_property5 or more 0.7974360 0.0898633 8.8738806 0***
violationstrue 0.3955063 0.0932012 4.2435758 0***
violations_instructiontrue 0.1925954 0.0680219 2.8313752 0.005**
violations_failtoreporttrue 0.0501159 0.0941792 0.5321339 0.595
violations_1true -0.1683558 0.0777116 -2.1664182 0.03*
delinquency_reports1 0.5213898 0.1229419 4.2409457 0***
delinquency_reports2 -0.1407398 0.1107683 -1.2705779 0.204
delinquency_reports3 -0.3751349 0.1111540 -3.3749116 0.001***
delinquency_reports4 or more -0.5632829 0.0740423 -7.6075870 0***
program_attendances1 -0.1436903 0.1266184 -1.1348293 0.256
program_attendances10 or more -0.3775369 0.0834174 -4.5258748 0***
program_attendances2 -0.0596856 0.1357218 -0.4397640 0.66
program_attendances3 0.0042865 0.1501560 0.0285468 0.977
program_attendances4 -0.1218934 0.1382853 -0.8814635 0.378
program_attendances5 0.1743027 0.1112211 1.5671731 0.117
program_attendances6 -0.1434023 0.0762256 -1.8812877 0.06.
program_attendances7 -0.1351295 0.1299037 -1.0402281 0.298
program_attendances8 -0.0211937 0.1711855 -0.1238053 0.901
program_attendances9 -0.1498123 0.1726546 -0.8676992 0.386
program_unexcusedabsences1 0.1858447 0.0966956 1.9219568 0.055.
program_unexcusedabsences2 0.1653455 0.1181284 1.3997104 0.162
program_unexcusedabsences3 or more 0.1105294 0.0921730 1.1991513 0.23
residence_changes1 0.1158049 0.0585297 1.9785668 0.048*
residence_changes2 0.1232990 0.0729418 1.6903752 0.091.
residence_changes3 or more 0.3785775 0.0800881 4.7270157 0***
prior_arrest_episodes_drug1 0.0540949 0.0697180 0.7759096 0.438
prior_arrest_episodes_drug2 0.2197247 0.0786714 2.7929429 0.005**
prior_arrest_episodes_drug3 0.3162809 0.0921623 3.4317803 0.001***
prior_arrest_episodes_drug4 0.1943453 0.1055838 1.8406732 0.066.
prior_arrest_episodes_drug5 or more 0.2744439 0.0959939 2.8589727 0.004**
prior_arrest_episodes1 0.2282423 0.0757574 3.0128065 0.003**
prior_arrest_episodes2 0.4121426 0.0835732 4.9315139 0***
prior_arrest_episodes3 0.4568621 0.0907495 5.0343185 0***
prior_arrest_episodes4 0.6543624 0.0999763 6.5451779 0***
prior_arrest_episodes5 or more 0.7660882 0.0851203 9.0000597 0***
prior_revocations_paroletrue 0.4591914 0.0885278 5.1869726 0***
prior_revocations_probationtrue -0.1419800 0.0696072 -2.0397316 0.041*
avg_days_per_drugtest 0.0003659 0.0002158 1.6959089 0.09.
drugtests_thc_positive 0.9245708 0.2111671 4.3783858 0***
drugtests_cocaine_positive 0.1892071 0.3970664 0.4765124 0.634
drugtests_meth_positive 3.1605122 0.6077703 5.2001757 0***
drugtests_other_positive 1.4141523 0.6642662 2.1288941 0.033*
raceWHITE 0.1466630 0.0555534 2.6400349 0.008**
age_at_release23-27 -0.2120736 0.1021573 -2.0759511 0.038*
age_at_release28-32 -0.5169631 0.1089386 -4.7454519 0***
age_at_release33-37 -0.6842437 0.1167416 -5.8611811 0***
age_at_release38-42 -0.8624687 0.1273557 -6.7721259 0***
age_at_release43-47 -0.9273167 0.1334014 -6.9513269 0***
age_at_release48 or older -1.3359129 0.1346790 -9.9192379 0***
residence_puma 0.0104192 0.0033690 3.0926963 0.002**
education_levelHigh School Diploma 0.1679275 0.0674960 2.4879620 0.013*
education_levelLess than HS diploma 0.0147698 0.0704998 0.2095010 0.834
dependents1 0.1038193 0.0668149 1.5538342 0.12
dependents2 0.0696932 0.0711952 0.9789038 0.328
dependents3 or more 0.0967184 0.0626891 1.5428272 0.123
prison_offenseDrug 0.0083215 0.0858005 0.0969864 0.923
prison_offenseOther 0.1470066 0.0992806 1.4807188 0.139
prison_offenseProperty 0.0501837 0.0889608 0.5641098 0.573
prison_offenseViolent/Non-Sex 0.1116873 0.0989390 1.1288502 0.259
prison_offenseViolent/Sex -0.0801199 0.1629383 -0.4917194 0.623

Table 1

Fitting the model for pseudo R2

# compute pseudo R2 metrics (llh, llhNull, G2, McFadden, r2ML, r2CU);
# the metric names below match the output of pscl::pR2, which we assume here
RecidivismModel_pseudor2 <- pscl::pR2(RecidivismModel)

RecidivismModel_pseudor2_df <- data.frame(metric = names(RecidivismModel_pseudor2), value = round(as.numeric(RecidivismModel_pseudor2), digits = 3)) %>%
  t()

colnames(RecidivismModel_pseudor2_df) <- as.character(RecidivismModel_pseudor2_df[1, ])  # Set column names to values in the first row
RecidivismModel_pseudor2_df <- RecidivismModel_pseudor2_df[-1, , drop = FALSE]  # Remove the first row

RecidivismModel_pseudor2_df %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 2")
llh llhNull G2 McFadden r2ML r2CU
value -5378.551 -6637.760 2518.417 0.190 0.228 0.306

Table 2

The metrics above summarize the model’s fit. The log-likelihood (llh) and the G2 likelihood-ratio statistic compare the model to the null (intercept-only) model: a higher (less negative) llh and a larger G2 indicate a better fit. The McFadden pseudo R2 and the r2ML/r2CU values measure explanatory power, with higher values suggesting better performance. Overall, the model shows moderate explanatory power and a clear improvement over the null model in predicting recidivism.
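
As a quick check, the McFadden pseudo R2 and G2 values can be recovered directly from the two log-likelihoods reported in Table 2:

# recompute McFadden's pseudo R2 and the G2 statistic from Table 2
llh     <- -5378.551   # log-likelihood of the fitted model
llhNull <- -6637.760   # log-likelihood of the intercept-only (null) model
1 - llh / llhNull      # McFadden pseudo R2, ~0.190
2 * (llh - llhNull)    # G2 likelihood-ratio statistic, ~2518.4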

Make Predictions

Distribution of probabilities

We create a dataframe of predictions for the 9738 observations in our test set, called testProbs.

These predictions are the estimated probabilities of recidivism. We can compare them to the observed outcome.

testProbs <- data.frame(Outcome = as.factor(RecidivismTest$Recidivism_numeric),
                        Probs = predict(RecidivismModel, RecidivismTest, type= "response"),
                        gender = RecidivismTest$gender,
                        race = RecidivismTest$race)

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probabilities", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
             subtitle = "no recidivism = 0, recidivism = 1") +
      theme_minimal() +
      theme(legend.position = "none")

In summary, the chart above shows a promising degree of predictive power. However, the model is better at predicting recidivism than no recidivism: the predicted probabilities for the recidivism group are tightly concentrated, while those for the no-recidivism group are more spread out along the axis.

Predictability on different groups

From the chart above we see that the model’s overall predictive power is good. For now, we choose probability > 0.5 as the threshold for predicting recidivism. This section compares prediction errors across race and gender groups to see whether the model overpredicts recidivism for minority groups.

testProbs <- 
  testProbs %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0)),
         error = ifelse(predOutcome == Outcome, 0, 1))

race_difference <- testProbs %>% 
  group_by(race, gender) %>%
  summarize(total_error = sum(error),
            total_sample = n()) %>%
  mutate(percent_error = round(total_error / total_sample, digits = 3)) 

race_difference %>% 
  kable(col.names = c("Race", "Gender", "Number of Error Predictions", "Number of Samples", "Error Rate")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), position = "left")  %>%
  footnote(general_title = "\n", general = "Table 3")
Race Gender Number of Error Predictions Number of Samples Error Rate
BLACK F 114 398 0.286
BLACK M 1515 5164 0.293
WHITE F 242 827 0.293
WHITE M 950 3349 0.284

Table 3

From the table above we can see that the model does not show a strong preference for any particular group. The error rates are similar across groups, though Black males and White females have slightly higher error rates than the other two groups.
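
Raw error rates do not distinguish the direction of the errors. A natural follow-up (a sketch extending the analysis above, not part of the original tables) is to split each group’s errors into false positive and false negative rates, which speaks more directly to overprediction:

# false positive rate (predicted recidivism, observed none) and false
# negative rate per race/gender group -- a more direct overprediction check
testProbs %>%
  group_by(race, gender) %>%
  summarize(fpr = round(sum(predOutcome == "1" & Outcome == "0") / sum(Outcome == "0"), digits = 3),
            fnr = round(sum(predOutcome == "0" & Outcome == "1") / sum(Outcome == "1"), digits = 3),
            .groups = "drop") %>%
  kable(col.names = c("Race", "Gender", "False Positive Rate", "False Negative Rate")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))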

Confusion Matrix

Each threshold (e.g., a probability above which a prediction is “recidivism” and below which it is “no recidivism”) has its own rate of error. These errors can be classified in four ways for a binary model.

A “confusion matrix” for the threshold of 50% shows us the counts of True Positives, False Positives, True Negatives and False Negatives at that threshold, from which Sensitivity and Specificity are computed.

cm <- caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

mosaicplot(cm$table, color=palette2, main = "Mosaic Plot for Confusion Matrix",
           xlab = "Prediction", ylab = "Reference")

# print(cm)

cm_df <- as.data.frame(cm$table)
cm_df %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 4")
Prediction Reference Freq
0 0 2515
1 0 1612
0 1 1209
1 1 4402

Table 4
cm_df2 <- as.data.frame(round(cm$byClass, digits = 3)) %>% 
  head(5) %>%
  t()

rownames(cm_df2) <- NULL

cm_df2 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 5")
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
0.785 0.609 0.732 0.675 0.732

Table 5

Sensitivity measures the model’s ability to accurately detect true positive cases, specifically recidivism in this context. Our model has a sensitivity score of 0.785, or 78.5%, which means that around 78.5% of individuals who actually experienced recidivism were correctly identified by the model. This high sensitivity rate highlights the model’s effectiveness in recognizing individuals at risk of recidivism.

On the other hand, specificity gauges the model’s capability to correctly identify true negative cases, indicating individuals who do not experience recidivism. With a specificity of 0.609, or 60.9%, the model accurately identified approximately 60.9% of individuals who did not recidivate. Notably, the specificity rate is lower than sensitivity, suggesting that the model may be less proficient at identifying individuals who will not experience recidivism.

In conclusion, the model is more effective at identifying individuals who will recidivate than at identifying those who will not, matching the trend we observed in the distribution chart above.
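
These rates follow directly from the counts in Table 4:

# recompute the headline rates from the Table 4 counts
TP <- 4402; TN <- 2515; FP <- 1612; FN <- 1209
TP / (TP + FN)   # sensitivity = 0.785
TN / (TN + FP)   # specificity = 0.609
TP / (TP + FP)   # precision / positive predictive value = 0.732
TN / (TN + FN)   # negative predictive value = 0.675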

ROC Curve

The ROC curve gives us another visual “goodness of fit” metric. The y = x line shows where our prediction rates for positives and negatives are no better than random. On the other hand, if the ROC curve is too “square”, the model is probably overfit.

auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.7747

The area under the curve (AUC) of our model is 0.7747, which means that there is a 77.47% chance that the model ranks a randomly chosen positive instance (one that actually recidivated) above a randomly chosen negative instance (one that did not).
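
That interpretation can be verified with a rough Monte Carlo check on the test set: sample random positive/negative pairs and count how often the positive case gets the higher predicted probability.

# empirical check of the AUC's probabilistic interpretation
set.seed(3456)
pos <- testProbs$Probs[testProbs$Outcome == "1"]
neg <- testProbs$Probs[testProbs$Outcome == "0"]
mean(sample(pos, 10000, replace = TRUE) > sample(neg, 10000, replace = TRUE))  # ~0.77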

ggplot(testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#bb7154") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, linewidth = 1, color = '#5c6e6c') +
  labs(title = "ROC Curve") +
      theme_minimal()

The curve rises steeply towards the upper-left corner of the plot, which shows that the model has a strong true positive rate before accruing false positives.

Cross validation

We run 100-fold cross-validation and look at the ROC’s area under the curve (AUC), sensitivity, and specificity across this series of predictions.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(recidivism_within_3years ~ .,
                  data=Recidivism %>% 
                  dplyr::select(recidivism_within_3years,jobs_per_year,percent_days_employed, supervision_risk_score_first, prison_years, condition_cog_ed,condition_mh_sa, condition_other, gang_affiliated,prior_arrest_episodes_violent, prior_arrest_episodes_property, violations, violations_instruction, violations_failtoreport, violations_1, delinquency_reports,program_attendances, program_unexcusedabsences, residence_changes,  prior_arrest_episodes_drug,prior_arrest_episodes,prior_revocations_parole,prior_revocations_probation, avg_days_per_drugtest,drugtests_thc_positive,drugtests_cocaine_positive,drugtests_meth_positive,drugtests_other_positive,gender,race, age_at_release, residence_puma,education_level, dependents, prison_offense), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit
cvFit_df <- as.data.frame(cvFit$results)[, 2:7]
cvFit_df %>% 
  kable(col.names = c('ROC', 'Sensitivity', 'Specificity', 'ROC SD', 'Sensitivity SD', 'Specificity SD')) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 6")
ROC Sensitivity Specificity ROC SD Sensitivity SD Specificity SD
0.7755273 0.6070555 0.7949881 0.0349404 0.0520803 0.0417299

Table 6

The Area Under the Curve (AUC) from the cross-validated model is about 0.7755, which is slightly higher than the previously mentioned AUC of 0.7747. This suggests that the model’s ability to distinguish between the positive and negative classes is consistent and robust across different subsets of the data.

The cross-validated sensitivity (0.607) is lower than the previously reported 0.785, while the cross-validated specificity (0.795) is higher than the earlier 0.609. Notably, the two pairs of values nearly mirror each other; the note and sketch below suggest why.

Cross-validated results generally offer a more dependable evaluation of the model’s efficacy because they mitigate bias from overfitting to a single test set, and the across-fold results here indicate that the model behaves consistently across diverse subsamples of the dataset.
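
One caveat worth flagging: caret’s twoClassSummary computes sensitivity for the first level of the outcome factor. If recidivism_within_3years is stored with levels c("false", "true"), the cross-validated “Sensitivity” above actually describes the no-recidivism class, which would explain why it mirrors the earlier test-set specificity. A minimal sketch of the fix, assuming those level names, is to relevel the outcome before training:

# make "true" (recidivism) the first factor level so that caret's
# twoClassSummary reports sensitivity for the recidivism class
# (assumes the outcome is coded "true"/"false" as in the earlier recode)
Recidivism <- Recidivism %>%
  mutate(recidivism_within_3years = factor(recidivism_within_3years,
                                           levels = c("true", "false")))

Re-running the train() call above after this change makes the cross-validated sensitivity directly comparable to the test-set value.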

dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#bb7154") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#5c6e6c", linetype = 2, linewidth = 0.9) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines") +
      theme_minimal() + theme(legend.position = "none")

The chart above shows that the ROC, sensitivity, and specificity values are all concentrated around their across-fold means, which indicates stable performance across folds.

Cost-Benefit Calculation

A rigorous cost-benefit calculation for recidivism is difficult, so this section only includes some general economic estimates. According to USAFacts, states spend an average of $45,771 per prisoner per year to keep one individual in prison. According to the CSG Justice Center, the cost of recidivism is about $41,450.78 per individual per year (over $8 billion to incarcerate more than 193,000 people; the short check after the cost definitions below recovers this figure).

Based on these two values, we define the economic cost as below:
* True Negative: $0 * Count (no need to spend money on keeping someone in prison or on costs associated with recidivism)
* True Positive: $45,771 * Count (spend money on keeping someone in prison)
* False Negative: $41,450.78 * Count (spend money on costs associated with recidivism)
* False Positive: $45,771 * Count (spend money on keeping someone in prison)
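
As promised above, the per-person recidivism cost is just the CSG total divided by the population it covers:

# per-person, per-year cost figures used in the revenue calculations below
cost_prison     <- 45771           # USAFacts: average state cost per prisoner per year
cost_recidivism <- 8.0e9 / 193000  # CSG Justice Center: ~$8 billion over ~193,000 people
cost_recidivism                    # 41450.78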

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((-45771) * Count),
               ifelse(Variable == "False_Negative", (-41450.78) * Count,
               ifelse(Variable == "False_Positive", (-45771) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted no recidivism",
              "We correctly predicted recidivism",
              "We predicted no recidivism but get recidivism",
              "We predicted recidivism but get no recidivism")))

cost_benefit_table %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 7")
Variable Count Revenue Description
True_Negative 2515 0 We correctly predicted no recidivism
True_Positive 4402 -201483942 We correctly predicted recidivism
False_Negative 1209 -50113993 We predicted no recidivism but get recidivism
False_Positive 1612 -73782852 We predicted recidivism but get no recidivism

Table 7

Optimize Thresholds

Previously we chose 0.5 as the line above which a prediction is classified as “recidivism”. We can look at the confusion matrix for each candidate threshold and choose the one that returns the highest revenue (i.e., the lowest total cost).

x = .01
all_threshold <- data.frame()

while (x <= 1) {
threshold<- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > x , 1, 0))) %>% 
  count(predOutcome, Outcome) %>% 
  summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>% 
  gather(Variable, Count) %>% 
  mutate(threshold = x )

all_threshold <- rbind(all_threshold, threshold)
 x <- x + .01
}

all_threshold %>% 
  ggplot() +
  geom_line(aes(x = threshold, y = Count, color = Variable), linewidth = 1.5, linetype = 1) + 
  scale_color_manual(values = palette4, guide = "none") + 
  facet_wrap(~Variable) + 
  labs(title = "Confusion Metric Outcomes for Each Threshold") +
  xlab("Threshold") + 
  theme_minimal()

The code below bakes our cost-revenue calculations into the same threshold loop. From the resulting chart we can see that the optimal threshold for our model is about 0.9.

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>%  # use the function's argument rather than the global testProbs
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((-45771) * Count),
               ifelse(Variable == "False_Negative", (-41450.78) * Count,
               ifelse(Variable == "False_Positive", (-45771) * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs)

whichThreshold_revenue <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Revenue), linewidth = 1.5, color = "#bb7154")+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]), color = "#5c6e6c", linewidth = 1)+
    labs(title = "Model Revenues By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold") +
      theme_minimal()
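
For reference, the revenue-maximizing threshold marked by the vertical line can be pulled out directly:

# report the revenue-maximizing (cost-minimizing) threshold explicitly
whichThreshold_revenue %>%
  arrange(-Revenue) %>%
  slice(1) %>%
  kable(col.names = c("Optimal Threshold", "Revenue")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))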

Conclusion

In summary, our model performs well and does not show a strong preference for particular race or gender groups. On the single test set it achieves higher sensitivity than specificity; the cross-validated metrics appear to reverse this, though that reversal likely reflects caret’s positive-class convention (noted above) rather than a real change in model behavior. When considering revenue, favoring specificity minimizes the total cost.