Philadelphia Census Block Group Gentrification Prediction Using Machine Learning

Introduction

This project aims to use machine learning methods to forecast patterns of gentrification in the unit of census block groups within Philadelphia. Gentrification, a complex socioeconomic phenomenon, has significant implications for urban development, housing affordability, and community dynamics. From the research of the Urban Displacement Project, gentrification is usually associated with real estate speculation, increased investment in neighborhood amenities such as transit and parks, change in land use, and changes in the character of the neighborhood such as business type change.” These new investments often lead to displacement of existing low-income residents through rent increase, eviction, and other displacement pressure. Therefore, this project seeks to empower policymakers, urban planners, and community advocates to make informed decisions and implement equitable strategies for sustainable growth and inclusive development.

This project will use the 2018 and 2019 data to train the machine learning model and use the 2017 and 2018 data to do the validation. There are two reasons for this decision. Firstly, the COVID pandemic in 2020 largely impacted the normal pattern of people’s daily life, so we don’t want our model to span across the year 2020. Secondly, the area of each census block group gets updated every decade, so we are not able to get the same census block group before and after 2020.

Calculate Gentrification (Dependent Variable)

Since gentrification is a vague and variable term, we need to first define gentrification in this project and calculate the area of gentrification within Philadelphia by using related geospatial datasets.

Based on the dictionary definition, the term gentrification means “a process in which a poor area (as of a city) experiences an influx of middle-class or wealthy people who renovate and rebuild homes and businesses and which often results in an increase in property values and the displacement of earlier, usually poorer residents.” In this way, there are two main characteristics of gentrification: census composition change and property value change.

Firstly, for the census composition change, we define a census block group as gentrifying if its percentage of minority group (non-white) decreases by more than 4% in a 1-year interval and its median household income before is below Philadelphia County’s median. In a research about “Gentrification Definitions and Racial Change: Considering the Evidence,” it analyzes racial change compared to the status of gentrification. Extracting the information from this study and briefly simplifying the analysis, the gentrification area usually has a net loss of black people between 1.1 to 8.4 percent, so we will use the rough average of this scope, which is 4 percent. Besides, this research also points to an online Gentrification Comparison Tool that “takes 3 popular approaches to measure gentrification - by Freeman (2005), Ellen & O’Regan (2008), and McKinnish et al (2010).” From there, the census data change they focus on is average household income. The 3 approaches have slightly different standards to define gentrification based on average household income: median household income below metro area median, average household income below 70% of metro area median, or average household income in the bottom 20% among all evaluated tracts nationally. In order to fit our analysis of Philadelphia County, this project will use median household income below metro area median.

Secondly, for the property value change, we define it as a more than 5% increase in a 1-year interval. Based on the research from Drexel University Urban Heath Collaborative’s study of “A Measure of Gentrification For Use in Longitudinal Public Health Studies in the US,” it used 50% - 75% increase in home value in a 10-year interval to define gentrification and more than 75% home value increase to define intense gentrification. Therefore, transferring this criterion to our 1-year interval will be a more than 5% increase.

Since many of our criteria involves money, we need to adjust for inflation. Based on Macrotrends data, the inflation rate for 2017 is 2.13% and for 2018 is 2.44%.

In summary, our method to calculate gentrification is:
1. Percentage of minority group (non-white) decreases by more than 4% in a 1-year interval.
2. Median household income 1 years ago was below Philadelphia County’s median.
3. More than 5% increase of property value in a 1-year interval

Census Data

The census data we need to collect for calculating gentrification is ACS 5-year data from 2017 to 2019, and the geography is census block groups in Philadelphia. The demographic data we need to collect are B01001_001E Total Population, B02001_002E Race, and B19013_001E Median Household Income in the past 12 Months. In this way, we can calculate the percentage of non-white population.

census2017 <- get_acs(geography = "block group", 
          survey = "acs5",
          variables = c("B01001_001E", "B02001_002E", "B19013_001E"), 
          year = 2017, 
          state = "42", 
          county = "101", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2272) %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B02001_002E,
         MedHhIncome = B19013_001E) %>%
  mutate(pctNonWhite = 1 - NumberWhites / TotalPop)%>%
  select(-NAME, -B01001_001M, -B02001_002M, -B19013_001M)

census2018 <- get_acs(geography = "block group", 
          survey = "acs5",
          variables = c("B01001_001E", "B02001_002E", "B19013_001E"), 
          year = 2018, 
          state = "42", 
          county = "101", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2272) %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B02001_002E,
         MedHhIncome = B19013_001E) %>%
  mutate(pctNonWhite = 1 - NumberWhites / TotalPop)%>%
  select(-NAME, -B01001_001M, -B02001_002M, -B19013_001M)

census2019 <- get_acs(geography = "block group", 
          survey = "acs5",
          variables = c("B01001_001E", "B02001_002E", "B19013_001E"), 
          year = 2019, 
          state = "42", 
          county = "101", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2272) %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B02001_002E,
         MedHhIncome = B19013_001E) %>%
  mutate(pctNonWhite = 1 - NumberWhites / TotalPop)%>%
  select(-NAME, -B01001_001M, -B02001_002M, -B19013_001M)

Since one of our criteria needs to compare median income to Philadelphia County’s value. We need to calculate the median difference between each block group and the larger geography (Philadelphia) for the selected three years. Because some of the block groups do not have median household income value data from the census data, we will keep the household income difference for those as NA.

census2017 <- census2017 %>%
  mutate(MedHhIncomeDif = ifelse(is.na(census2017$MedHhIncome), NA, census2017$MedHhIncome - median(na.omit(census2017$MedHhIncome))))

census2018 <- census2018 %>%
  mutate(MedHhIncomeDif = ifelse(is.na(census2018$MedHhIncome), NA, census2018$MedHhIncome - median(na.omit(census2018$MedHhIncome))))

census2019 <- census2019 %>%
  mutate(MedHhIncomeDif = ifelse(is.na(census2019$MedHhIncome), NA, census2019$MedHhIncome - median(na.omit(census2019$MedHhIncome))))

Property Value Data

The property value data used in this project comes from OpenDataPhilly’s Philadelphia Properties and Assessment History dataset. Because this dataset does not have geometry, we need to join this to the Philadelphia Water Department’s parcel data, which is originally used to calculate parcel-based storm water charges for PWD customers under the new parcel-based storm water billing program.

Since our analysis needs 2017 to 2019’s data, we will first extract these three years’ property data from the history dataset.

In order to match the properties with the parcels, we use certain IDs to join them, and those are parcel_number in the property dataset and BRT_ID in the PWD parcel dataset. After joining these two datasets together, we will calculate the median property value in each census block group and add that as a new column in the corresponding census dataframe. This process will repeat for 3 times since we need to manipulate 3 years’ data.

# 2017

assess_pwd_join_2017 <- inner_join(PWDparcel, Assess2017, join_by(BRT_ID == parcel_number))

# Perform a spatial join between the census block groups and property value data
joined_data_2017 <- st_join(census2017, assess_pwd_join_2017, join = st_intersects)

# Aggregate median property value by census block group
median_property_value_2017 <- aggregate(joined_data_2017["market_value"], by = list(joined_data_2017$GEOID), FUN = median) %>%
  st_drop_geometry()

# Rename the aggregated column
names(median_property_value_2017)[2] <- "median_property_value"

# Merge the aggregated median property value back to the census block groups sf object
census2017 <- merge(census2017, median_property_value_2017, by.x = "GEOID", by.y = "Group.1", all.x = TRUE)


# 2019

assess_pwd_join_2019 <- inner_join(PWDparcel, Assess2019, join_by(BRT_ID == parcel_number))

# Perform a spatial join between the census block groups and property value data
joined_data_2019 <- st_join(census2019, assess_pwd_join_2019, join = st_intersects)

# Aggregate median property value by census block group
median_property_value_2019 <- aggregate(joined_data_2019["market_value"], by = list(joined_data_2019$GEOID), FUN = median) %>%
  st_drop_geometry()

# Rename the aggregated column
names(median_property_value_2019)[2] <- "median_property_value"

# Merge the aggregated median property value back to the census block groups sf object
census2019 <- merge(census2019, median_property_value_2019, by.x = "GEOID", by.y = "Group.1", all.x = TRUE)


# 2018

assess_pwd_join_2018 <- inner_join(PWDparcel, Assess2018, join_by(BRT_ID == parcel_number))

# Perform a spatial join between the census block groups and property value data
joined_data_2018 <- st_join(census2018, assess_pwd_join_2018, join = st_intersects)

# Aggregate median property value by census block group
median_property_value_2018 <- aggregate(joined_data_2018["market_value"], by = list(joined_data_2018$GEOID), FUN = median) %>%
  st_drop_geometry()

# Rename the aggregated column
names(median_property_value_2018)[2] <- "median_property_value"

# Merge the aggregated median property value back to the census block groups sf object
census2018 <- merge(census2018, median_property_value_2018, by.x = "GEOID", by.y = "Group.1", all.x = TRUE)

Calculate change

Because the standard for calculating gentrification involves change between two years, we need to transfer the 2017 data to the 2018 dataframe and the 2018 data to the 2019 dataframe. Firstly, we transfer the percentage of minority(non-white) group and calculate the percent change within 1 year for 2018 and 2019.

# 2017 to 2018

census2017_pctNonwhite <- census2017 %>%
  select(GEOID, pctNonWhite) %>%
  st_drop_geometry() %>% 
  rename(pctNW_2017 = pctNonWhite)

census2018 <- merge(census2018, census2017_pctNonwhite, by.x = "GEOID", by.y = "GEOID", all.x = TRUE) %>%
  mutate(NW_change_pct = (pctNonWhite - pctNW_2017) / pctNW_2017)


# 2018 to 2019

census2018_pctNonwhite <- census2018 %>%
  select(GEOID, pctNonWhite) %>%
  st_drop_geometry() %>% 
  rename(pctNW_2018 = pctNonWhite)

census2019 <- merge(census2019, census2018_pctNonwhite, by.x = "GEOID", by.y = "GEOID", all.x = TRUE) %>%
  mutate(NW_change_pct = (pctNonWhite - pctNW_2018) / pctNW_2018)


# Plots

# Calculate quantiles for class breaks
breaks_quantiles <- classIntervals(census2018$NW_change_pct, n = 5, na.rm=TRUE, style = "quantile")

# Generate custom labels showing the range of each class
labels <- paste0(formatC(breaks_quantiles$brks[-length(breaks_quantiles$brks)], format = "f", digits = 2, big.mark = ","), 
                 " - ", 
                 formatC(breaks_quantiles$brks[-1], format = "f", digits = 2, big.mark = ","))

plot2018 <- ggplot() +
  geom_sf(data = census2018, aes(fill = cut(NW_change_pct, breaks = breaks_quantiles$brks, include.lowest = TRUE)), color = "white", linewidth = 0.01) +
  scale_fill_manual(values = palette5, na.value = "#e9e9e9",
                     name = "Change Percentage",
                     labels = labels) +  # Use custom range labels
  labs(title = "Philadelphia Percent Non-white Change by Block Group in 2018") +
  theme_void()


# Calculate quantiles for class breaks
breaks_quantiles <- classIntervals(census2019$NW_change_pct, n = 5, na.rm=TRUE, style = "quantile")

# Generate custom labels showing the range of each class
labels <- paste0(formatC(breaks_quantiles$brks[-length(breaks_quantiles$brks)], format = "f", digits = 2, big.mark = ","), 
                 " - ", 
                 formatC(breaks_quantiles$brks[-1], format = "f", digits = 2, big.mark = ","))

plot2019 <- ggplot() +
  geom_sf(data = census2019, aes(fill = cut(NW_change_pct, breaks = breaks_quantiles$brks, include.lowest = TRUE)), color = "white", linewidth = 0.01) +
  scale_fill_manual(values = palette5, na.value = "#e9e9e9",
                     name = "Change Percentage",
                     labels = labels) +  # Use custom range labels
  labs(title = "Philadelphia Percent Non-white Change by Block Group in 2019") +
  theme_void()

grid.arrange(
  plot2018, plot2019,
  ncol = 2 
)

Secondly, we transfer the median household income difference and categorize this value. If the median household income one year ago is greater than the Philadelphia median, the census block group will get a 1, and if the median household income one year ago is less than the Philadelphia median, the census block group will get a 0. Besides, since we originally have some NA household income value, we will keep this new categorized value in those census block groups as NA.

# 2017 to 2018

census2017_medianHHI <- census2017 %>%
  select(GEOID, MedHhIncomeDif) %>%
  st_drop_geometry() %>% 
  mutate(MedHhIBool_2017 = ifelse(is.na(census2017$MedHhIncomeDif), NA, ifelse(census2017$MedHhIncomeDif < 0, 1, 0))) %>%
  select(GEOID, MedHhIBool_2017)

census2018 <- merge(census2018, census2017_medianHHI, by.x = "GEOID", by.y = "GEOID", all.x = TRUE)

census2018$MedHhIBool_2017Factor <- factor(census2018$MedHhIBool_2017,
                                                   levels = c(0, 1),
                                                   labels = c("More than PHL Median", "Less than PHL Median"))

# 2018 to 2019

census2018_medianHHI <- census2018 %>%
  select(GEOID, MedHhIncomeDif) %>%
  st_drop_geometry() %>% 
  mutate(MedHhIBool_2018 = ifelse(is.na(census2018$MedHhIncomeDif), NA, ifelse(census2018$MedHhIncomeDif < 0, 1, 0))) %>%
  select(GEOID, MedHhIBool_2018)

census2019 <- merge(census2019, census2018_medianHHI, by.x = "GEOID", by.y = "GEOID", all.x = TRUE)

census2019$MedHhIBool_2018Factor <- factor(census2019$MedHhIBool_2018,
                                                   levels = c(0, 1),
                                                   labels = c("More than PHL Median", "Less than PHL Median"))


# Plots

plot2018 <- ggplot() +
  geom_sf(data = census2018, aes(fill=MedHhIBool_2017Factor), colour='white', linewidth = 0.01) +
  scale_fill_manual(values = palette2, na.value = "#e9e9e9",
                     name = "Household Income") +
  labs(title = "Philadelphia Household Income by Block Group in 2017") +
  theme_void() 

plot2019 <- ggplot() +
  geom_sf(data = census2019, aes(fill=MedHhIBool_2018Factor), colour='white', linewidth = 0.01) +
  scale_fill_manual(values = palette2, na.value = "#e9e9e9",
                     name = "Household Income") +
  labs(title = "Philadelphia Household Income by Block Group in 2018") +
  theme_void()

grid.arrange(
  plot2018, plot2019,
  ncol = 2 
)

We also need to calculate the median property value change within 1 year. In this way, we need to add the property value of 2017 in the 2018 dataframe and the property value of 2018 in the 2019 dataframe in order to calculate the difference. Besides, we also need to adjust for inflation.

# 2017 to 2018

census2017_prop <- census2017 %>%
  select(GEOID, median_property_value) %>%
  st_drop_geometry() %>% 
  rename(median_pv_2017 = median_property_value)

census2018 <- merge(census2018, census2017_prop, by.x = "GEOID", by.y = "GEOID", all.x = TRUE) %>%
  mutate(PV_change_pct = (median_property_value - median_pv_2017 * inflationAdjust2017) / median_pv_2017 * inflationAdjust2017)


# 2018 to 2019

census2018_prop <- census2018 %>%
  select(GEOID, median_property_value) %>%
  st_drop_geometry() %>% 
  rename(median_pv_2018 = median_property_value)

census2019 <- merge(census2019, census2018_prop, by.x = "GEOID", by.y = "GEOID", all.x = TRUE) %>%
  mutate(PV_change_pct = (median_property_value - median_pv_2018 * inflationAdjust2018) / median_pv_2018 * inflationAdjust2018)


# Plots

# Calculate quantiles for class breaks
breaks_quantiles <- classIntervals(census2018$PV_change_pct, n = 5, na.rm=TRUE, style = "quantile")

# Generate custom labels showing the range of each class
labels <- paste0(formatC(breaks_quantiles$brks[-length(breaks_quantiles$brks)], format = "f", digits = 2, big.mark = ","), 
                 " - ", 
                 formatC(breaks_quantiles$brks[-1], format = "f", digits = 2, big.mark = ","))

plot2018 <- ggplot() +
  geom_sf(data = census2018, aes(fill = cut(PV_change_pct, breaks = breaks_quantiles$brks, include.lowest = TRUE)), color = "white", linewidth = 0.01) +
  scale_fill_manual(values = palette5, na.value = "#e9e9e9",
                     name = "Change Percentage",
                     labels = labels) +  # Use custom range labels
  labs(title = "Philadelphia Property Value Change by Block Group in 2018") +
  theme_void()


# Calculate quantiles for class breaks
breaks_quantiles <- classIntervals(census2019$PV_change_pct, n = 5, na.rm=TRUE, style = "quantile")

# Generate custom labels showing the range of each class
labels <- paste0(formatC(breaks_quantiles$brks[-length(breaks_quantiles$brks)], format = "f", digits = 2, big.mark = ","), 
                 " - ", 
                 formatC(breaks_quantiles$brks[-1], format = "f", digits = 2, big.mark = ","))

plot2019 <- ggplot() +
  geom_sf(data = census2019, aes(fill = cut(PV_change_pct, breaks = breaks_quantiles$brks, include.lowest = TRUE)), color = "white", linewidth = 0.01) +
  scale_fill_manual(values = palette5, na.value = "#e9e9e9",
                     name = "Change Percentage",
                     labels = labels) +  # Use custom range labels
  labs(title = "Philadelphia Property Value Change by Block Group in 2019") +
  theme_void()

grid.arrange(
  plot2018, plot2019,
  ncol = 2 
)

Combine and Visualize

After all the manipulations above, we will finally calculate whether a census block group is gentrifying. If a census block group’s percentage of minority group (non-white) decreases by more than 4% in 1 year, median household income one year ago was below Philadelphia’s median, and property value increased by more than 5%, it will be categorized as 1, meaning that it is gentrifying. Otherwise, it will receive a value of 0, meaning that it is not gentrifying.

The two plots below show the result of gentrification in the year 2018 and 2019. We can see more census block groups are gentrifying as time goes by, and most of them are located around Center City and University City.

census2018_Gentrification <- census2018 %>%
  select(GEOID, NW_change_pct, MedHhIBool_2017, PV_change_pct) %>%
  mutate(Gentrification = ifelse(NW_change_pct < -0.04 & MedHhIBool_2017 == 1 & PV_change_pct > 0.05, 1, 0))

census2018_Gentrification$GentrificationFactor <- factor(census2018_Gentrification$Gentrification,
                                                   levels = c(0, 1),
                                                   labels = c("Non-gentrified", "Gentrified"))

census2019_Gentrification <- census2019 %>%
  select(GEOID, NW_change_pct, MedHhIBool_2018, PV_change_pct) %>%
  mutate(Gentrification = ifelse(NW_change_pct < -0.04 & MedHhIBool_2018 == 1 & PV_change_pct > 0.05, 1, 0))

census2019_Gentrification$GentrificationFactor <- factor(census2019_Gentrification$Gentrification,
                                                   levels = c(0, 1),
                                                   labels = c("Non-gentrified", "Gentrified"))


plot2018Gen <- ggplot() +
  geom_sf(data = census2018_Gentrification, aes(fill=GentrificationFactor), colour='white', linewidth = 0.01) +
  scale_fill_manual(values = palette2, na.value = "#e9e9e9",
                     name = "Gentrification") +
  labs(title = "Philadelphia Gentrification by Block Group in 2018") +
  theme_void() 

plot2019Gen <- ggplot() +
  geom_sf(data = census2019_Gentrification, aes(fill=GentrificationFactor), colour='white', linewidth = 0.01) +
  scale_fill_manual(values = palette2, na.value = "#e9e9e9",
                     name = "Gentrification") +
  labs(title = "Philadelphia Gentrification by Block Group in 2019") +
  theme_void()

grid.arrange(
  plot2018Gen, plot2019Gen,
  ncol = 2 
)

Data Wrangling (Independent Variables)

We collected data for independent variables from a diverse range of sources, encompassing both spatial and non-spatial data. Non-spatial data consists of demographic and housing characteristics of block groups, including age distribution, family structure, population density, etc. Spatial data comprises nearby amenities, such as parks and schools, and physical environmental elements, such as safety measures. Additionally, our analysis incorporates Philadelphia’s historical redlining data to investigate the enduring impact of racial inequity on contemporary gentrification trends. The list includes:

Demographic and Housing characteristics

Family structure
Bachelor Degree Attainment
Occupied Rate/ Renter Rate
Housing Structure
Population Density

Amenities

School
Parks
Metro Station

Safety

Burglary

Redlining

Redlining

Data Collection

The first step here is to collect and read those data.

Demographic and Housing characteristics

The demographic and housing data provides information of basic demographic and housing characteristics at our analysis units, including the residents’ profiles and their living conditions within these communities. The data is obtained from the 2017 and 2018 American Community Survey (ACS) 5-year datasets, focusing on census block groups within Philadelphia. This dataset

#2017
Demo2017 <- get_acs(geography = "block group", 
          survey = "acs5",
          variables = c("B01001_001E",  
                        # TOTALPOP
                        
                        "B11016_001E","B11016_009E",
                        #hh type, non-family hh
                        
                        "B15002_015E","B15002_016E","B15002_017E","B15002_018E",
                        "B15002_032E","B15002_033E","B15002_034E","B15002_035E",
                        #Education attainment, Bachelor Degree
                        
                        "B25001_001E",
                        #HU
                        "B25002_002E","B25003_003E",
                        #Occupied, RENTER OCCUPIED
                        
                        "B25018_001E"
                        #MEDIAN NUMBER OF ROOMS
                        ), 
          year = 2017, 
          state = "42", 
          county = "101", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2272) %>%
  rename(TotalPop = B01001_001E,
         TotalHH = B11016_001E,
         NumbernonFamilyHH = B11016_009E,
         TotalHU=B25001_001E,
         NumberOccupied=B25002_002E,
         NumberRenter = B25003_003E,
         MedNumRooms = B25018_001E) %>%
  mutate(NumberBachelor = B15002_015E+B15002_016E+B15002_017E+B15002_018E+
                        B15002_032E+B15002_033E+B15002_034E+B15002_035E,
         area_m2 = (as.numeric(st_area(.))*0.0929))%>%
  mutate(pctnonFamilyHH= NumbernonFamilyHH / TotalHH,
         pctOccupied = NumberOccupied/TotalHU,
         pctRent = NumberRenter / NumberOccupied,
         pctBachelor = NumberBachelor / TotalPop,
         PopDensity=TotalPop/area_m2)%>%
  select(-starts_with("B"),-NAME,-TotalPop,-TotalHH,-NumbernonFamilyHH,-TotalHU,
         -NumberOccupied,-NumberRenter,-NumberBachelor,-area_m2)


#2018
Demo2018 <- get_acs(geography = "block group", 
          survey = "acs5",
          variables = c("B01001_001E",  
                        # TOTALPOP
                        
                        "B11016_001E","B11016_009E",
                        #hh type, non-family hh
                        
                        "B15002_015E","B15002_016E","B15002_017E","B15002_018E",
                        "B15002_032E","B15002_033E","B15002_034E","B15002_035E",
                        #Education attainment, Bachelor Degree
                        
                        "B25001_001E",
                        #HU
                        "B25002_002E","B25003_003E",
                        #Occupied, RENTER OCCUPIED
                        
                        "B25018_001E"
                        #MEDIAN NUMBER OF ROOMS
                        ), 
          year = 2018, 
          state = "42", 
          county = "101", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2272) %>%
  rename(TotalPop = B01001_001E,
         TotalHH = B11016_001E,
         NumbernonFamilyHH = B11016_009E,
         TotalHU=B25001_001E,
         NumberOccupied=B25002_002E,
         NumberRenter = B25003_003E,
         MedNumRooms = B25018_001E) %>%
  mutate(NumberBachelor = B15002_015E+B15002_016E+B15002_017E+B15002_018E+
                        B15002_032E+B15002_033E+B15002_034E+B15002_035E,
         area_m2 = (as.numeric(st_area(.))*0.0929))%>%
  mutate(pctnonFamilyHH= NumbernonFamilyHH / TotalHH,
         pctOccupied = NumberOccupied/TotalHU,
         pctRent = NumberRenter / NumberOccupied,
         pctBachelor = NumberBachelor / TotalPop,
         PopDensity=TotalPop/area_m2)%>%
  select(-starts_with("B"),-NAME,-TotalPop,-TotalHH,-NumbernonFamilyHH,-TotalHU,
         -NumberOccupied,-NumberRenter,-NumberBachelor,-area_m2)

School

Education amenities serve as an important independent variable in our analysis. When a block group offers quality educational resources, it often attracts individuals seeking access to such amenities. This influx of residents seeking quality education can contribute to an increase in housing values and may consequently lead to gentrification within the area. To quantify this effect, we calculate the total count of schools within each block group as a variable of interest. The school data comes from The School District of Philadelphia’s School Lists dataset.

#2017
school2017<-school2017%>%
  select(School.Level, GPS.Location)%>%
  mutate(latitude = as.numeric(substring(GPS.Location, 1, regexpr(",", GPS.Location) - 1)),
         longitude = as.numeric(substring(GPS.Location, regexpr(",", GPS.Location) + 1)))%>%
  st_as_sf(., coords = c("longitude", "latitude"), crs = 4326)%>%
  st_transform(crs = 2272)
  
#2018
school2018<-school2018%>%
  select(School.Level, GPS.Location)%>%
  mutate(latitude = as.numeric(substring(GPS.Location, 1, regexpr(",", GPS.Location) - 1)),
         longitude = as.numeric(substring(GPS.Location, regexpr(",", GPS.Location) + 1)))%>%
  st_as_sf(., coords = c("longitude", "latitude"), crs = 4326)%>%
  st_transform(crs = 2272)


#2017
school_block2017 <- st_join(school2017, Demo2017) %>%
  group_by(GEOID) %>%
  summarise(school = n()) %>%
  st_drop_geometry() 

#2018
school_block2018 <- st_join(school2018, Demo2018) %>%
  group_by(GEOID) %>%
  summarise(school = n())%>%
  st_drop_geometry()

Park

Proximity to parks reflects the physical environment of a community. Individuals often prefer to reside near green spaces and parks for a more desirable living environment. When a block group offers such favorable conditions, it tends to attract new residents, potentially leading to gentrification. To quantify this impact, we calculate the count of parks located within 500 feet of each unit as a variable of interest in our analysis. The park data comes from OpenDataPhilly Parks & Recreation Program Sites dataset.

#2017
park_block2017 <- st_join(park, 
                             st_buffer(Demo2017, dist = 500)) %>%
  group_by(GEOID) %>%
  summarise(park = n())%>%
  st_drop_geometry() 

#2018

park_block2018 <- st_join(park, 
                             st_buffer(Demo2017, dist = 500)) %>%
  group_by(GEOID) %>%
  summarise(park = n())%>%
  st_drop_geometry()

Metro Station

Areas with convenient access to public transportation, especially metro stations, are often at a higher risk of gentrification. Residents are inclined to live near public transportation hubs to facilitate their daily commute and access other amenities. To quantify this factor, we have counted the number of metro stations within each block group to gauge its accessibility and convenience. The station data comes from STEPTA Highspeed Stations dataset.

#2017
station_block2017 <- st_join(station, 
                             Demo2017) %>%
  group_by(GEOID) %>%
  summarise(station = n())%>%
  st_drop_geometry() 

#2018
station_block2018 <- st_join(station, 
                             Demo2018) %>%
  group_by(GEOID) %>%
  summarise(station = n())%>%
  st_drop_geometry()

Burglary

Residents are drawn to areas with enhanced safety measures. To assess the safety level of each block group, we utilize burglary data as a metric. The crime data comes from Opendataphilly Crime Incidents dataset.

#2017
crime_block2017 <- st_join(crime2017, Demo2017) %>%
  group_by(GEOID) %>%
  summarise(Burglary = n())%>%
  st_drop_geometry()

#2018
crime_block2018 <- st_join(crime2018, Demo2018) %>%
  group_by(GEOID) %>%
  summarise(Burglary = n())%>%
  st_drop_geometry()

Redlining

Moreover, we want to explore the influence of historic redlining on current gentrification trends. For example, whether gentrification tends to occur more frequently in areas that historically received better grades. The historic Redlining data comes from Philadelphia Office of the Controller .

Redline_block <- st_join(Demo2017, Redline) %>%
  select(GEOID, holc_grade)%>%
  mutate(holc_number = case_when(
  holc_grade == "Commercial" ~ 0,
  holc_grade == "A" ~ 1,
  holc_grade == "B" ~ 2,
  holc_grade == "C" ~ 3,
  holc_grade == "D" ~ 4,
  TRUE ~ NA
))%>%
  group_by(GEOID) %>%
  summarise(holc_number = max(holc_number))%>%
  st_drop_geometry()%>%
  mutate(holc_grade = case_when(
  holc_number == 0 ~ "Commercial",
  holc_number == 1 ~ "A",
  holc_number == 2 ~ "B",
  holc_number == 3 ~ "C",
  holc_number == 4 ~ "D",
  TRUE ~ NA
))%>%
  select(-holc_number)

Summarize and preview the variables

The table below shows a summary of all the possible independent variables.

#2017
data2017<-Demo2017%>%
  left_join(.,school_block2017, by ="GEOID")%>% 
  left_join(.,park_block2017, by ="GEOID")%>%  
  left_join(.,station_block2017, by ="GEOID")%>% 
  left_join(.,crime_block2017, by ="GEOID")%>% 
  left_join(.,Redline_block, by ="GEOID")
  
#2018
data2018<-Demo2018%>%
  left_join(.,school_block2018, by ="GEOID")%>%  
  left_join(.,park_block2018, by ="GEOID")%>% 
  left_join(.,station_block2018, by ="GEOID")%>%  
  left_join(.,crime_block2018, by ="GEOID")%>% 
  left_join(.,Redline_block, by ="GEOID")
  

#2017

data2017 <- data2017%>%
  mutate(school = replace_na(school, 0)) %>%
  mutate(school.factor= case_when(  school == 0 ~ "0",  school == 1 ~ "1",  school == 2 ~ "2",  school > 2 ~"over 2"))%>%
  mutate(park = replace_na(park, 0)) %>%
  mutate(park.factor= case_when(  park == 0 ~ "0",  park == 1 ~ "1",  park == 2 ~ "2",  park > 2 ~"over 2")) %>%
  mutate(station = replace_na(station, 0) )%>%
  mutate(station.factor= case_when(  station == 0 ~ "0",  station == 1 ~ "1",  station > 1 ~"over 1")) %>%
  mutate(Burglary = replace_na(Burglary, 0) )


#2018
data2018 <- data2018%>%
  mutate(school = replace_na(school, 0)) %>%
  mutate(school.factor= case_when(  school == 0 ~ "0",  school == 1 ~ "1",  school == 2 ~ "2",  school > 2 ~"over 2"))%>%
  mutate(park = replace_na(park, 0)) %>%
  mutate(park.factor= case_when(  park == 0 ~ "0",  park == 1 ~ "1",  park == 2 ~ "2",  park > 2 ~"over 2")) %>%
  mutate(station = replace_na(station, 0) )%>%
  mutate(station.factor= case_when(  station == 0 ~ "0",  station == 1 ~ "1",  station > 1 ~"over 1")) %>%
  mutate(Burglary = replace_na(Burglary, 0) )


data2018_summary<- data2018 %>%
  st_drop_geometry() %>%
  select(-GEOID, -holc_grade, -school.factor, -park.factor, -station.factor) 

# stargazer(data2018_summary, type = "text", digits = 1, title = "Summary Statistics")

summary_data <- round(do.call(cbind, lapply(data2018_summary, summary)), digits = 2)

as.data.frame(t(summary_data)) %>%
  select(-"NA's") %>%
  rename('Minimum' = 'Min.',
         '1st Quantile' = '1st Qu.',
         '3rd Quantile' = '3rd Qu.',
         'Maximun' = 'Max.') %>%
  kable(caption = "Summary of Quantitative Independent Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 1")

Summary of Quantitative Independent Variables
	Minimum	1st Quantile	Median	Mean	3rd Quantile	Maximun
MedNumRooms	1.50	5.00	5.70	5.49	6.10	10.00
pctYouth	0.00	0.12	0.18	0.20	0.24	1.00
prcMove	0.00	0.05	0.11	0.14	0.19	0.70
pctnonFamilyHH	0.00	0.32	0.44	0.45	0.56	1.00
pctOccupied	0.36	0.81	0.89	0.87	0.94	1.00
pctRent	0.00	0.30	0.44	0.46	0.61	1.00
pctBachelor	0.00	0.06	0.13	0.19	0.27	0.99
PopDensity	0.00	0.00	0.01	0.01	0.01	0.04
school	0.00	0.00	0.00	0.25	0.00	7.00
park	0.00	0.00	0.00	0.38	1.00	8.00
station	0.00	0.00	0.00	0.04	0.00	5.00
Burglary	0.00	2.00	4.00	4.83	6.25	35.00

Table 1

Continous Features

The plot below shows an initial exploration of the relationship between our continuous independent variables and gentrification, our dependent variable.

data2018 <-data2018%>%
  left_join(.,
            census2019_Gentrification%>%
              st_drop_geometry()%>%
              select(GEOID,Gentrification), 
            by ="GEOID")%>%
  na.omit(Gentrification)

#plot

data2018 %>%
    st_drop_geometry()%>%
    dplyr::select(Gentrification, 
                  MedNumRooms,pctnonFamilyHH,pctOccupied,pctRent,pctBachelor,PopDensity) %>%
    gather(Variable, value, -Gentrification) %>%
    ggplot() + 
    geom_density(aes(value, color=as.factor(Gentrification)), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions Gentrification vs. no Gentrification",
         subtitle = "(continous features)")+ theme_minimal()+
  theme(legend.position = "bottom")

Categorical Features

And also the relationship between our categorical independent variables and gentrification.

plot<-data2018 %>%
    st_drop_geometry()%>%
    dplyr::select(Gentrification, 
                  school.factor,park.factor,station.factor,holc_grade) 
  
plot %>%
    gather(Variable, value, -Gentrification) %>%
    count(Variable, value, Gentrification)%>%
      ggplot(., aes(value, n, fill = as.factor(Gentrification))) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="Categories", y="Value",
             title = "Feature associations with Gentrification",
             subtitle = "Categorical features") +
  theme_minimal()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1,size=8))

Model development

Data splitting

In our analysis, a logistic regression model has been employed for gentrification prediction. This model estimates the probability of gentrification for each block group based on the independent variables we have identified. The dataset was divided into a 50-50 split between training and test datasets.

inTrain <- createDataPartition(data2018$Gentrification, p = .50,
                                  list = FALSE,
                                  times = 1)

Gentrification.training <- data2018[inTrain,] 
Gentrification.test <- data2018[-inTrain,]

Analyzing associations

Below is the table showcasing our final model. For specific details on model selection criteria, please refer to the appendix. Because there are only 1336 census block groups in Philadelphia, our dataset does not have a lot of data to work with. In this way, after we randomly splitting the original dataset into training and testing data, the summary of the variables is fluctuated. However, the percentage of individuals with Bachelor’s degrees or higher continuously to be statistically significant predictors in the model.

model <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station.factor+Burglary+holc_grade, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

fit_metrics <- pR2(model)

model_sum <- summary(model)

coefficients_table <- as.data.frame(model_sum$coefficients)

coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|z|)` < 0.01, '***',
                                                ifelse(coefficients_table$`Pr(>|z|)` < 0.05, '**',
                                                       ifelse(coefficients_table$`Pr(>|z|)` < 0.1, '*', '')))

coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|z|)`, digits = 3), coefficients_table$significance)

coefficients_table %>%
  select(-significance, -`Pr(>|z|)`) %>% 
  kable(caption = "Model Summary", align = "r", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))  %>%
  footnote(general_title = "\n", general = "Table 2")

Model Summary
	Estimate	Std. Error	z value	p_value
(Intercept)	-24.34	2098.03	-0.01	0.991
MedNumRooms	-0.03	0.34	-0.09	0.925
pctnonFamilyHH	4.77	1.59	3.00	0.003***
pctOccupied	3.06	2.13	1.44	0.149
pctRent	0.06	1.36	0.04	0.966
pctBachelor	-4.77	1.60	-2.97	0.003***
PopDensity	111.90	36.34	3.08	0.002***
school	0.71	0.35	2.04	0.041**
park	-0.28	0.34	-0.83	0.407
station.factor1	0.30	0.75	0.40	0.69
station.factorover 1	-15.34	3381.12	0.00	0.996
Burglary	0.05	0.04	1.33	0.185
holc_gradeB	14.71	2098.03	0.01	0.994
holc_gradeC	15.09	2098.03	0.01	0.994
holc_gradeCommercial	-0.45	2781.29	0.00	1
holc_gradeD	16.33	2098.03	0.01	0.994

Table 2

as.data.frame(t(fit_metrics)) %>%
  kable(caption = "Fit Metrics for Logistic Regression Model", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 3")

Fit Metrics for Logistic Regression Model
llh	llhNull	G2	McFadden	r2ML	r2CU
-101.3268	-128.847	55.0404	0.2136	0.0851	0.2499

Table 3

Optimize Threshold

To order to have a better predict the gentrification block group, we want to optimize the threshold. This involves a trade-off between specificity and sensitivity, prioritizing the identification of true positive cases (gentrification) while maintaining overall accuracy. We determine that a threshold of 0.1 strikes a balance between sensitivity and accuracy, ensuring effective identification of gentrification while preserving a satisfactory level of overall performance.

options(yardstick.event_first = FALSE)

testProbs <- data.frame(Outcome = as.factor(Gentrification.test$Gentrification),
                        Probs = predict(model, Gentrification.test, type= "response"))

testProbs <- 
  testProbs %>% 
  mutate(
         predClass_30 = as.factor(ifelse(testProbs$Probs > 0.3 , 1, 0)),
         predClass_20 = as.factor(ifelse(testProbs$Probs > 0.2 , 1, 0)),
         predClass_15 = as.factor(ifelse(testProbs$Probs > 0.15 , 1, 0)),
         predClass_10 = as.factor(ifelse(testProbs$Probs > 0.1 , 1, 0)),
         predClass_05 = as.factor(ifelse(testProbs$Probs > 0.05 , 1, 0))) 

testProbs %>%
  dplyr::select(-Probs) %>%
  gather(Variable, Value, -Outcome) %>%
  group_by(Variable) %>%
  summarize(Specificity = round(yardstick::sens_vec(Outcome,factor(Value)),2),
            Sensitivity = round(yardstick::spec_vec(Outcome,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(Outcome,factor(Value)),2)) %>%
  kable(caption = "Optimize Threshold") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 4")

Optimize Threshold
Variable	Specificity	Sensitivity	Accuracy
predClass_05	0.70	0.56	0.69
predClass_10	0.85	0.28	0.83
predClass_15	0.90	0.20	0.87
predClass_20	0.93	0.16	0.90
predClass_30	0.96	0.04	0.92

Table 4

Cross validation

data2018$Gentrification <- as.factor(data2018$Gentrification)

levels(data2018$Gentrification) <- make.names(levels(data2018$Gentrification))


ctrl <- trainControl(
  method = "cv",
  number = 100,
  classProbs = TRUE, 
  summaryFunction = twoClassSummary
)

cvFit <- train(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station.factor+Burglary+holc_grade,
  data = data2018 %>%
    dplyr::select(-GEOID, -geometry),
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
)

cvFit

cvFit_df <- as.data.frame(cvFit$results)[, 2:7]
cvFit_df %>% 
  kable(col.name=c('ROC', 'Sensitivity', 'Specificity', 'ROC SD', 'Sensitivity SD', 'Specificity SD'),caption = "Cross Validation", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 5")

Cross Validation
ROC	Sensitivity	Specificity	ROC SD	Sensitivity SD	Specificity SD
0.7772	0.9983	0	0.2311	0.0123	0

Table 5

Make Predictions

testProbs <- Gentrification.test %>% mutate(Outcome = as.factor(Gentrification.test$Gentrification),
                        Probs = predict(model, Gentrification.test, type= "response"))

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(Probs > 0.1 , 1, 0)),
         over.under = ifelse(predOutcome == 1 & Outcome == 0 , 'Overpredict',
                             ifelse(predOutcome == 0 & Outcome == 1, 'Underpredict', NA)))

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probabilities", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
             subtitle = "No gentrification = 0, Gentrification = 1") +
  theme_minimal() +
  theme(legend.position = "none")

ROC Curve

ggplot(testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#BF4146") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1, color = '#6683a9') +
  labs(title = "ROC Curve") +
  theme_minimal()

Confusion Matrix

We employ a threshold of 0.1 to optimize the overall accuracy of the model. In other words, we want to maximum the total number of census block groups that are accurately predicted (0 as 0; 1 as 1). If two models can accurately predict similar number of block groups, we favor the one with higher sensitivity, meaning that it can identify more gentrification. This is because effective sensitivity empowers governments to implement targeted interventions in these areas, mitigating or reducing future gentrification.

cm <- caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

mosaicplot(cm$table, color=palette2, main = "Mosaic Plot for Confusion Matrix",
           xlab = "Prediction", ylab = "Reference")

cm_df <- as.data.frame(round(cm$byClass, digit = 3)) %>% 
  head(5) %>%
  t()

rownames(cm_df) <- NULL

cm_df %>%
  kable(caption = "Confusion Matrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 6")

Confusion Matrix
Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision
0.28	0.85	0.073	0.966	0.073

Table 6

Over and Under Prediction Location

We also want to know which census block group gets over or under-predicted in our testing dataset. Over prediction means the block group is not gentrifying but we predict it as gentrifying, and under prediction means the block group is gentrifying but we predict it as not gentrifying. From the map below we can see that most over-prediction happens around University City and Center City, but there is no strong correlation between under-prediction and block group location. However, census block groups on the south side of University City have a slightly higher tendency to be underpredicted.

ggplot() +
  geom_sf(data = testProbs, aes(fill=over.under), colour='white', linewidth = 0.01) +
  scale_fill_manual(values = palette2, na.value = "#e9e9e9",
                     name = "Prediction") +
  labs(title = "Over and Under Prediction of Testing Dataset in 2019") +
  theme_void()

Validation

Here, we will use Philadelphia city data from 2017-2018 to evaluate the model’s performance.

data2017 <- data2017%>%
  left_join(.,
            census2018_Gentrification%>%
              st_drop_geometry()%>%
              select(GEOID,Gentrification), 
            by ="GEOID")%>%
  na.omit(Gentrification)

data2017$Gentrification <- as.factor(data2017$Gentrification)

levels(data2017$Gentrification) <- make.names(levels(data2017$Gentrification))

validationProbs <- data.frame(
  Outcome = as.factor(data2017$Gentrification),
  Probs = predict(model, data2017, type = "response")
)

validationProbs <- 
  validationProbs %>%
  mutate(predOutcome  = as.factor(ifelse(validationProbs$Probs > 0.1 , 1, 0)))

validationProbs <- validationProbs %>%
  mutate(Outcome = as.numeric(gsub("X", "", Outcome)))
validationProbs$predOutcome <- as.numeric(as.character(validationProbs$predOutcome))

# Convert to factors ensuring both have the same levels
levels <- union(levels(factor(validationProbs$Outcome)), levels(factor(validationProbs$predOutcome)))

validationProbs$Outcome <- factor(validationProbs$Outcome, levels = levels)
validationProbs$predOutcome <- factor(validationProbs$predOutcome, levels = levels)

2017 Confusion Matrix

# Now, run the confusion matrix
confusionMatrix <- caret::confusionMatrix(validationProbs$predOutcome, validationProbs$Outcome, positive = "1")

confusionMatrix_df <- as.data.frame(round(confusionMatrix$byClass, digit = 3)) %>% 
  head(5) %>%
  t()

rownames(confusionMatrix_df) <- NULL

confusionMatrix_df %>%
  kable(caption = "Confusion Matrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 7")

Confusion Matrix
Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision
0.5	0.82	0.018	0.996	0.018

Table 7

A sensitivity of 0.375 indicates that the model correctly identifies only 37.5% of gentrified areas, which is relatively low. Combined with a low positive predictive value, this suggests that the model is relatively ineffective at accurately predicting gentrified areas. However, a specificity around 0.9 suggests that it is relatively good at identifying areas that are not gentrified.

2017 Cross Validation

cvFit <- train(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station.factor+Burglary+holc_grade,
  data = data2017 %>%
    dplyr::select(-GEOID, -geometry),
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
)

cvFit

cvFit_df <- as.data.frame(cvFit$results)[, 2:7]
cvFit_df %>% 
  kable(col.name=c('ROC', 'Sensitivity', 'Specificity', 'ROC SD', 'Sensitivity SD', 'Specificity SD'), caption = "Cross Validation", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 8")

Cross Validation
ROC	Sensitivity	Specificity	ROC SD	Sensitivity SD	Specificity SD
0.6803	1	0	0.2352	0	0

Table 8

Here, the ROC value is 0.65, which is slightly lower compared to the model that used 2018 data for cross-validation, which had an ROC of 0.77. Despite this slight decrease, the model still demonstrates a relatively good ability to differentiate between positive and negative outcomes.

After cross-validation, the sensitivity improved dramatically to 1, meaning the model predicted all the gentrified areas correctly. Conversely, specificity dropping to 0 means that the model failed to correctly identify any of the negative cases post-cross-validation.

ggplot(validationProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probabilities", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
             subtitle = "No gentrification = 0, Gentrification = 1") +
  theme_minimal() +
  theme(legend.position = "none")

The blue plot shows a high density of predicted probabilities close to 0, which indicates that the model confidently predicts a large number of areas as not gentrified. This is consistent with the relatively high specificity observed, as the model successfully identifies a significant proportion of true negatives. The red plot, however, shows most predicted probabilities clustered around the lower end (below 0.2), with a very small portion exceeding this range. This distribution explains the low sensitivity because the model assigns low probabilities to most actual gentrified areas, failing to recognize them as such.

The 2018 model appears slightly better calibrated for predicting gentrified areas, as indicated by a broader spread of probabilities for gentrified cases. It may be detecting gentrification indicators more effectively than the 2017 model.

2017 ROC Curve

ggplot(validationProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#BF4146") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1, color = '#6683a9') +
  labs(title = "ROC Curve") +
  theme_minimal()

When compared to the 2018 model’s ROC curve, several differences are notable. The ROC curve from 2018 showed a more stepped progression with sharper increases in the true positive rate at specific threshold levels. This might imply that the 2018 model had certain thresholds where its sensitivity dramatically increased, but these were balanced by intervals where the performance plateaued, indicating less consistency across threshold adjustments. In contrast, the newer model’s smoother curve suggests a more consistent sensitivity to changes in the threshold, which could indicate a more stable model that generalizes better across different data scenarios.

Summary

To summarize, this model employs demographic and economic indicators such as the percentage change of minority groups, median household income, and change in property value to analyze gentrification in Philadelphia. It incorporates additional socio-economic and urban amenities data to predict areas undergoing gentrification. The model focuses on accurately identifying true positive and true negative areas, aiming to pinpoint actual gentrified and non-gentrified locations effectively. It has been tested with data from the previous years and successfully identified most of the gentrified areas noted.

Moving forward, this model serves as a valuable tool for city planners and policymakers to identify neighborhoods at risk of gentrification. It enables the development of strategies that promote balanced growth, including affordable housing initiatives and support for existing communities. The model’s flexibility allows for ongoing refinement and incorporation of new data and trends, improving both its accuracy and applicability.

Additionally, by tracking the effects of gentrification, the model can help mitigate adverse outcomes such as the displacement of lower-income families and the erosion of cultural heritage. It also provides community advocates and local stakeholders with data-driven insights, supporting efforts to advocate for inclusive urban policies.

Looking ahead, future iterations of the model could integrate real-time data feeds, enhancing its responsiveness to rapid urban changes. This would facilitate more dynamic assessments of gentrification risks and enable more timely and effective interventions.

Appendix

Model specification

Correlation Matrix

To ensure that multicollinearity does not affect our analysis, we create a correlation matrix. This matrix allows us to examine the relationships between variables and identify any instances of high correlation.

numericVars <- data2018 %>%
  st_drop_geometry() %>%  
  select_if(is.numeric) %>% 
  na.omit() 

correlation_matrix <- cor(numericVars)


numericVars %>% 
  correlate() %>% 
  autoplot() +
  geom_tile(aes(fill = r), color = "#e9e9e9") +
  geom_text(aes(label = round(r,digits=2)), size = 3) +
  scale_fill_gradient2(low = "#6683a9", mid = "white", high = "#BF4146",  # Specify color gradient
                       midpoint = 0, limits = c(-1, 1),
                       breaks = seq(-1, 1, by = 0.2)) +
  labs(title = "Correlation across numeric variables")  # Set plot title

Model Comparison

We chose Model 6 as the final model because it performed well on several evaluation metrics with higher stability. Because our sample size is small, the result can be highly fluctuated based on the random sample we get. One model that performs well with one random split may not perform well another time. In this way, we want to choose a model that can stably performs well. When comparing McFadden R-squared, AIC, and AUC together, Model 6 showed good predictive power and model fit. In addition, Model 6’s AUC demonstrated greater stability during cross-validation, further solidifying its position as the model of choice for the analysis.

model1 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity,
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model2 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+log(PopDensity), 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model3 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model4 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model5 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station+Burglary+holc_grade, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model6 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station.factor+Burglary+holc_grade, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model7 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school.factor+park+station.factor+Burglary+holc_grade, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))

model8 <- glm(Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school.factor+park.factor+station.factor+Burglary+holc_grade, 
             data = as.data.frame(Gentrification.training) %>% 
                     dplyr::select(-GEOID, -geometry),
                  family="binomial" (link="logit"))


modelList <- paste0("model", 1:8)
bar <- map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("model",1:8)) %>%
  gather(model,McFadden)

McFadden R-Squared

bar %>%
  ggplot(aes(model, McFadden)) +
    geom_bar(stat="identity", fill="#BF4146") +
    labs(title= "McFadden R-Squared by Model") +
    theme_minimal()

AIC

aic_values <- c(AIC(model1), AIC(model2), AIC(model3), AIC(model4), AIC(model5), AIC(model6), AIC(model7), AIC(model8))

# Create a data frame to store model numbers and corresponding AIC values
model_comparison <- data.frame(Model = c("Model 1", "Model 2", "Model 3", "Model 4","Model 5", "Model 6", "Model 7", "Model 8"),
                                AIC = aic_values) %>%
                                t()

colnames(model_comparison) <- as.character(model_comparison[1, ])  # Set column names to values in the first row
model_comparison <- model_comparison[-1, , drop = FALSE]  # Remove the first row

# Create a table using kable
kable(model_comparison, caption = "Comparison of AIC values") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 9")

Comparison of AIC values
	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 7	Model 8
AIC	237.2377	238.1296	236.6689	238.6539	233.0515	234.6536	237.7535	241.0488

Table 9

AUC

Because previous analysis already shows model 1-4 perform worse than model 5-8, this section only focuses on model 5-8.

data2018$Gentrification <- as.factor(data2018$Gentrification)

levels(data2018$Gentrification) <- make.names(levels(data2018$Gentrification))


ctrl <- trainControl(
  method = "cv",
  number = 100,
  classProbs = TRUE, 
  summaryFunction = twoClassSummary
)

#model5
cvFit5 <- train(
  Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station+Burglary+holc_grade,
  data = data2018 %>%
    dplyr::select(-GEOID, -geometry),
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
)

#model6
cvFit6 <- train(
  Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school+park+station.factor+Burglary+holc_grade,
  data = data2018 %>%
    dplyr::select(-GEOID, -geometry),
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
)

#model7
cvFit7 <- train(
  Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school.factor+park+station.factor+Burglary+holc_grade,
  data = data2018 %>%
    dplyr::select(-GEOID, -geometry),
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
)

#model8
cvFit8 <- train(
  Gentrification ~ MedNumRooms+pctnonFamilyHH+pctOccupied+pctRent+pctBachelor+PopDensity+
                school.factor+park.factor+station.factor+Burglary+holc_grade,
  data = data2018 %>%
    dplyr::select(-GEOID, -geometry),
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
)

cvFit_df5 <- as.data.frame(cvFit5$results)[, 2:7]
cvFit_df6 <- as.data.frame(cvFit6$results)[, 2:7]
cvFit_df7 <- as.data.frame(cvFit7$results)[, 2:7]
cvFit_df8 <- as.data.frame(cvFit8$results)[, 2:7]

cvFit_combine <- rbind(cvFit_df5, cvFit_df6, cvFit_df7, cvFit_df8)
row.names(cvFit_combine) <- c("Model 5", "Model 6", "Model 7", "Model 8")

cvFit_combine %>% 
  kable(col.name=c('ROC', 'Sensitivity', 'Specificity', 'ROC SD', 'Sensitivity SD', 'Specificity SD'),caption = "Cross Validation", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))  %>%
  footnote(general_title = "\n", general = "Table 10")

Cross Validation
	ROC	Sensitivity	Specificity	ROC SD	Sensitivity SD	Specificity SD
Model 5	0.7325	0.9983	0	0.2472	0.0117	0
Model 6	0.7453	0.9983	0	0.2463	0.0117	0
Model 7	0.7593	0.9974	0	0.2199	0.0147	0
Model 8	0.7577	0.9983	0	0.2579	0.0117	0

Table 10

References

“Definition of GENTRIFICATION.” In Merriam-Webster, April 29, 2024. https://www.merriam-webster.com/dictionary/gentrification.

“Gentrification Comparison Tool | Enterprise Community Partners.” Accessed May 5, 2024. https://www.enterprisecommunity.org/resources/gentrification-comparison-tool.

Methods Brief. “A Measure of Gentrification for Use in Longitudinal Public Health Studies in the US.” Drexel University Urban Health Collaborative, August 2019. https://drexel.edu/uhc/resources/briefs/Measure-of-Gentrification-for-Use-in-Longitudinal-Public-Health-Studies-in-the-US/.

Rachel Bogardus Drew. “Gentrification Definitions and Racial Change: Considering the Evidence.” Enterprise Community Partners, May 1, 2020. https://www.enterprisecommunity.org/resources/gentrification-definitions-and-racial-change-considering-evidence.

“What Are Gentrification and Displacement – Urban Displacement.” Accessed May 5, 2024. https://www.urbandisplacement.org/about/what-are-gentrification-and-displacement/.