Introduction

This project builds a machine learning model to predict single-family housing prices in Seattle, Washington, using both structural characteristics of the houses and spatial characteristics of their surroundings. The housing prices used as the dependent variable are extracted from King County's 2015 sales data, and the independent variables come from the City of Seattle Open Data Portal. The model was developed by Junyi and Ziyi, and this report was written by Junyi.

Data Wrangling

Data Gathering

The model needs multiple spatial datasets as candidate independent variables, so the first step is to collect and read those data (a reading sketch follows the list). The list includes:
1. Housing structural characteristics that come with the 2015 housing price data
2. Seattle census tracts
3. Seattle neighborhood boundaries
4. Seattle crime data from 2008 to present
5. Amenities such as parks, schools, hospitals, and crosswalks
6. Seattle streetcar stations
7. Major roads in Seattle
8. Percent of canopy coverage by census tract
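
As an illustration of this step only, the sketch below reads each layer with sf and projects it to EPSG:2926, the coordinate system used later in this report. The file paths are hypothetical placeholders, not the actual data locations; only the object names match those used in the rest of the report.

# Hedged sketch: read each layer and project to EPSG:2926 (file paths are placeholders)
library(sf)
library(tidyverse)

tracts       <- st_read("data/seattle_tracts.shp") %>% st_transform(2926)
neighborhood <- st_read("data/seattle_neighborhoods.shp") %>% st_transform(2926)
Crime.sf     <- st_read("data/seattle_crime.geojson") %>% st_transform(2926)
School       <- st_read("data/schools.shp") %>% st_transform(2926)
Hospital     <- st_read("data/hospitals.shp") %>% st_transform(2926)
streetcar    <- st_read("data/streetcar_stations.shp") %>% st_transform(2926)
Parks        <- st_read("data/parks.shp") %>% st_transform(2926)
crosswalk    <- st_read("data/crosswalks.shp") %>% st_transform(2926)
traffic      <- st_read("data/major_roads.shp") %>% st_transform(2926)
canopy       <- st_read("data/canopy_by_tract.shp") %>% st_transform(2926)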

Data Processing

Safety data

The original crime data we gathered includes the location of each crime event, which allows two calculations. The first is a nearest neighbor analysis that computes, for each housing location, the average distance to its nearest 1 / 2 / 3 / 4 / 5 crime spots. In addition, we create a 660-foot (one-eighth mile) buffer around each housing site and count the number of crime events within that buffer.

## Nearest Neighbor Feature
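# nn_function() is assumed to be defined elsewhere (e.g., a sourced helper script);
# for each point in the first coordinate matrix it returns the average distance to
# its k nearest neighbors in the second coordinate matrix.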
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      crime_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 1),
      
      crime_nn2 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 2), 
      
      crime_nn3 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 3), 
      
      crime_nn4 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 4), 
      
      crime_nn5 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 5)) 

# Count crime events within the 660 ft buffer of each housing site
Seattle.sf_tracts2$crime.Buffer <- Seattle.sf_tracts2 %>%
    st_buffer(660) %>%
    aggregate(dplyr::select(Crime.sf) %>%
                mutate(counter = 1),
              ., sum)

Amenities Data

The amenities data we use includes schools, parks, hospitals, crosswalks, streetcar stations, and major roads. Because the school, hospital, and streetcar station data are point data, we first use the nearest neighbor feature to calculate each house's distance to its nearest facility.

# Nearest Neighbor Feature for school
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      school_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(School), k = 1))
# Nearest Neighbor Feature for hospital
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      hosp_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Hospital), k = 1))

# Nearest Neighbor Feature for streetcar
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      streetcar_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(streetcar), k = 1))

# School
School_clipped <- st_intersection(School, tracts)
school_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = School_clipped, size = 1.5, color = "#6F927F") +
  labs(title = "Schools in Seattle") +
  theme_void()

# Hospital
Hospital_clipped <- st_intersection(Hospital, tracts)
hospital_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Hospital_clipped, size = 1.5, color = "#B0522A") +
  labs(title = "Hospitals in Seattle") +
  theme_void()

# streetcar
streetcar_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = streetcar, size = 1.5, color = "#e07d45") +
  labs(title = "Streetcar Stations in Seattle") +
  theme_void()

# Arrange plots side by side
grid.arrange(school_plot, hospital_plot, streetcar_plot, ncol = 3)

The park and crosswalk data, on the other hand, are polygons and lines. For these, we create a 1,320-foot (quarter mile) buffer around each housing site and count the number of parks / crosswalks within that buffer.

# Parks buffer: count parks within 1,320 ft of each housing site
Seattle.sf_tracts2$parks.Buffer <- Seattle.sf_tracts2 %>% 
    st_buffer(1320) %>% 
    aggregate(dplyr::select(Parks) %>% 
                mutate(counter = 1),
              ., sum)

# Crosswalk buffer: count crosswalks within 1,320 ft of each housing site
Seattle.sf_tracts2$crosswalk.Buffer <- Seattle.sf_tracts2 %>% 
    st_buffer(1320) %>% 
    aggregate(dplyr::select(crosswalk) %>% 
                mutate(counter = 1),
              ., sum)

# Parks
Park_clipped <- st_intersection(Parks, tracts)
Park_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Park_clipped, size = 1.5, fill = "#6F927F", col = "transparent") +
  labs(title = "Parks in Seattle") +
  theme_void()

# Crosswalk
Crosswalk_clipped <- st_intersection(crosswalk, tracts)
Crosswalk_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Crosswalk_clipped, size = 0.5, color = "#B0522A") +
  labs(title = "Crosswalks in Seattle") +
  theme_void()

# Arrange plots side by side
grid.arrange(Park_plot, Crosswalk_plot, ncol = 2)

Canopy Data

The canopy data reports percent canopy coverage (Can_P) at the census tract level, so we spatially join each housing point to the value of the tract it falls in.

# join canopy to housing points
Seattle.sf_tracts3 <- st_join(Seattle.sf_tracts2, canopy, left = TRUE)

Seattle.sf_tracts3 <- Seattle.sf_tracts3 %>%
  st_drop_geometry() %>%
  select(id, Can_P)

# put the column into the overall dataframe
Seattle.sf_tracts2$canopy <- Seattle.sf_tracts3["Can_P"]
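# Note: this assignment stores a one-column data frame in `canopy`;
# the percentage is later read back as canopy$Can_P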


# Calculate quantiles for class breaks
breaks_quantiles <- classIntervals(canopy$Can_P, n = 6, style = "quantile")

# Generate custom labels showing the range of each class
labels <- paste0(formatC(breaks_quantiles$brks[-length(breaks_quantiles$brks)], format = "f", digits = 2, big.mark = ","), 
                 " - ", 
                 formatC(breaks_quantiles$brks[-1], format = "f", digits = 2, big.mark = ","))

ggplot() +
  geom_sf(data = canopy, aes(fill = cut(Can_P, breaks = breaks_quantiles$brks, include.lowest = TRUE)), color = "#e9e9e9") +
  scale_fill_manual(values = c("#f4f4f4", "#DDE3E0", "#BFCDC5", "#A0B6AA", "#87A494", "#6F927F"),
                     name = "Canopy Percentage",
                     labels = labels) +  # Use custom range labels
  labs(title = "Average Canopy Percentage by Census Tracts in Seattle") +
  theme_void()

Traffic Data

Another processing step is to calculate each house's distance to its closest major road.

# Convert Seattle.sf_tracts2 to a point pattern, specifying mark column(s)
Seattle_points <- as.ppp(Seattle.sf_tracts2)

# Convert traffic to a line segment pattern
traffic_lines <- as.psp(traffic)

# Find nearest line segment to each point
nearest_segments <- nncross(Seattle_points, traffic_lines)
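# nncross() returns a data frame whose `dist` column holds the distance from each
# point to its nearest segment (`which` gives the segment's index)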

# put the column into the overall dataframe
Seattle.sf_tracts2$distance <- nearest_segments['dist']

# road plot
ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = traffic, aes(geometry = geometry), color = "#e07d45", size = 1) +
  labs(title = "Major Roads in Seattle") +
  theme_void()

Demographic Data

The census data we use for this analysis is 2015 ACS 5-year data at the census tract level. The demographic variables we include are total population, percentage of white population, and median income.

Demo <- get_acs(geography = "tract", 
          survey = "acs5",
          variables = c("B01001_001E", "B01001A_001E", "B07011_001E"), 
          year = 2015, 
          state = "53", 
          county = "033", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2926) %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         MedIncome = B07011_001E) %>%
  mutate(pctWhite = NumberWhites / TotalPop)%>%
  select(-NAME, -GEOID)
# clip to Seattle
Demo <- st_transform(Demo, st_crs(Seattle.sf_tracts2)) %>%
  select("TotalPop", "MedIncome", "pctWhite")
Demo_clipped <- st_intersection(Demo, tracts) %>%
  select("TotalPop", "MedIncome", "pctWhite")

Seattle.sf_tracts2 <- st_join(Seattle.sf_tracts2, Demo_clipped, left = TRUE)
Seattle.sf_tracts2 <- Seattle.sf_tracts2 %>%
  select(-contains("B01"), -contains("B06")) %>%
  mutate(Age = 2024 - yr_built,
         parksCount = parks.Buffer$counter,
         crimeCount = crime.Buffer$counter,
         crosswalkCount = crosswalk.Buffer$counter,
         canopyPercent = canopy$Can_P,
         TrafficDis = distance$dist)



# Total
tpop_plot <- ggplot(data = Demo_clipped) +
  geom_sf(aes(fill = TotalPop), col = "white") +
  scale_fill_gradient(low = "#f4f4f4", high = "#6F927F", na.value = "#e9e9e9",
                      name = "Total Population") +
  labs(title = "Total Population",
       subtitle = "Seattle, WA") +
    theme_void()

# Percent white
white_plot <- ggplot(data = Demo_clipped) +
  geom_sf(aes(fill = pctWhite), col = "white") +
  scale_fill_gradient(low = "#f4f4f4", high = "#B0522A", na.value = "#e9e9e9",
                      name = "Percent White") +
  labs(title = "Percentage of White Population",
       subtitle = "Seattle, WA") +
    theme_void()

# MedIncome
MedIncome_plot <- ggplot(data = Demo_clipped) +
  geom_sf(aes(fill = MedIncome), col = "white") +
  scale_fill_gradient(low = "#f4f4f4", high = "#e07d45", na.value = "#e9e9e9",
                      name = "Median Income") +
  labs(title = "Median Income",
       subtitle = "Seattle, WA") +
    theme_void()

# Arrange plots side by side
grid.arrange(tpop_plot, white_plot, MedIncome_plot, ncol = 3)

Summarize and preview the variables

The table below summarizes all the independent variables by category.

## delete the geometry column
table_Seattle <- Seattle.sf_tracts2 %>% st_drop_geometry()

#category of the data

#internal_character
internal_character_df <- table_Seattle[, c("bedrooms", "bathrooms", "sqft_living","sqft_lot", "floors", "grade")]
internal_character <- round(do.call(cbind, lapply(internal_character_df, summary)), digits = 2)

#Structure_character
Structure_character_df <- table_Seattle[, c("condition", "grade", "Age", "yr_renovated")]
Structure_character <- round(do.call(cbind, lapply(Structure_character_df, summary)), digits = 2)

#amenities
amenities_df <- table_Seattle[, c("parksCount", "crosswalkCount", "hosp_nn1", "school_nn1", "streetcar_nn1")]
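# `[1:6, ]` keeps only the six standard summary statistics
# (Min., 1st Qu., Median, Mean, 3rd Qu., Max.), dropping the NA's count row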
amenities <- round(do.call(cbind, lapply(amenities_df, summary))[1:6,], digits = 2)

# the distance column is a data-frame column (from nncross), so it is summarized separately
distance_extract_df <- table_Seattle[, c("distance")]
distance_extract <- round(do.call(cbind, lapply(distance_extract_df, summary)), digits = 2)

amenities_all <- cbind(amenities, distance_extract)


#spatial_character
spatial_character_df <- table_Seattle[, c( "waterfront", "view", "canopyPercent")]
spatial_character <- round(do.call(cbind, lapply(spatial_character_df, summary)), digits = 2)

#safety
safety_df <- table_Seattle[, c( "crime_nn1","crime_nn2","crime_nn3","crime_nn4","crime_nn5", "crimeCount")]
safety <- round(do.call(cbind, lapply(safety_df, summary)), digits = 2)

#Demographics
Demographics_df <- table_Seattle[, c( "TotalPop", "MedIncome", "pctWhite")]
Demographics <- round(do.call(cbind, lapply(Demographics_df, summary))[1:6,], digits = 2)


# Combine the summaries into one data frame
summary_data <- cbind(internal_character, Structure_character, amenities_all, 
                      spatial_character, safety, Demographics)

long_summary_data <- as.data.frame(t(summary_data)) %>%
  rename('Minimum' = 'Min.',
         '1st Quantile' = '1st Qu.',
         '3rd Quantile' = '3rd Qu.',
         'Maximum' = 'Max.')

#generate table
kbl(long_summary_data, caption = "Statistics Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  pack_rows("Internal Character", 1, 6) %>%
  pack_rows("Structure Character", 7, 10) %>%
  pack_rows("Amenities", 13, 16) %>%
  pack_rows("Spatial character", 17, 19) %>%
  pack_rows("Safety", 20, 24) %>%
  pack_rows("Demographics", 25, 27) %>%
  footnote(general_title = "\n", general = "Table 1")
Statistics Summary

| Variable | Minimum | 1st Quantile | Median | Mean | 3rd Quantile | Maximum |
|---|---|---|---|---|---|---|
| **Internal Character** | | | | | | |
| bedrooms | 0.00 | 2.00 | 3.00 | 3.14 | 4.00 | 33.00 |
| bathrooms | 0.00 | 1.00 | 1.75 | 1.94 | 2.50 | 8.00 |
| sqft_living | 370.00 | 1240.00 | 1626.00 | 1801.79 | 2200.00 | 12050.00 |
| sqft_lot | 520.00 | 3240.00 | 4950.00 | 5124.03 | 6330.00 | 91681.00 |
| floors | 1.00 | 1.00 | 1.50 | 1.54 | 2.00 | 3.50 |
| grade | 4.00 | 7.00 | 7.00 | 7.43 | 8.00 | 13.00 |
| **Structure Character** | | | | | | |
| condition | 1.00 | 3.00 | 3.00 | 3.45 | 4.00 | 5.00 |
| grade.1 | 4.00 | 7.00 | 7.00 | 7.43 | 8.00 | 13.00 |
| Age | 9.00 | 37.00 | 77.00 | 70.68 | 99.00 | 124.00 |
| yr_renovated | 0.00 | 0.00 | 0.00 | 132.46 | 0.00 | 2015.00 |
| **Amenities** | | | | | | |
| parksCount | 1.00 | 1.00 | 2.00 | 2.31 | 3.00 | 16.00 |
| crosswalkCount | 1.00 | 5.00 | 10.00 | 12.12 | 16.00 | 83.00 |
| hosp_nn1 | 292.19 | 5682.65 | 8541.55 | 9942.49 | 12788.58 | 25810.66 |
| school_nn1 | 52.05 | 1525.20 | 2449.55 | 2954.18 | 3800.21 | 17319.46 |
| streetcar_nn1 | 439.60 | 13760.81 | 20597.46 | 20000.87 | 26048.54 | 40119.01 |
| dist | 0.04 | 157.94 | 319.35 | 402.60 | 551.22 | 2840.94 |
| **Spatial character** | | | | | | |
| waterfront | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| view | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 4.00 |
| canopyPercent | 4.23 | 23.12 | 27.72 | 29.08 | 34.04 | 60.40 |
| **Safety** | | | | | | |
| crime_nn1 | 3.68 | 75.27 | 106.82 | 113.57 | 145.36 | 603.42 |
| crime_nn2 | 3.68 | 77.55 | 109.34 | 115.69 | 148.09 | 603.42 |
| crime_nn3 | 3.68 | 78.99 | 111.54 | 117.61 | 150.03 | 603.42 |
| crime_nn4 | 3.96 | 80.13 | 112.90 | 119.32 | 152.13 | 609.48 |
| crime_nn5 | 3.96 | 81.80 | 114.51 | 121.03 | 153.74 | 613.11 |
| crimeCount | 11.00 | 227.00 | 371.00 | 512.31 | 632.00 | 6538.00 |
| **Demographics** | | | | | | |
| TotalPop | 1055.00 | 4389.00 | 5217.00 | 5413.82 | 6572.00 | 8395.00 |
| MedIncome | 7765.00 | 33642.00 | 45115.00 | 43881.28 | 52441.00 | 68909.00 |
| pctWhite | 0.08 | 0.64 | 0.80 | 0.72 | 0.87 | 0.94 |

Table 1

Visualize Housing Price

To get a general sense of the model's dependent variable, we map the 2015 Seattle housing prices.

## Current housing price map
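# ntile() assigns each sale to one of five equal-sized price groups (quintiles)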
Seattle.sf_tracts2$quintiles <- ntile(Seattle.sf_tracts2$price, 5)

# Convert quintiles to a factor if not already done
Seattle.sf_tracts2$quintiles <- as.factor(Seattle.sf_tracts2$quintiles)


# Generate the plot
ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Seattle.sf_tracts2, aes(colour = quintiles), show.legend = "point", size = .15) +
  scale_colour_manual(values = c("#6F927F", "#877A60", "#956D4E", "#A2603C", "#b0522a"),
                      labels = c("Low", "Relatively Low", "Medium", "Relatively High", "High"),
                      name = "Price Quintiles") +
  labs(title = "Housing Price Quintiles, Seattle") +
  theme_void()

Analysis Variables

Analyzing associations

Below is an overview of the associations between housing price and several of the independent variables.

# Remove the geometry column and select price plus a set of continuous predictors
Seattle <- st_drop_geometry(Seattle.sf_tracts2) %>%
  select(price, sqft_living, bedrooms, bathrooms, crimeCount, crime_nn1, school_nn1, crosswalkCount, TotalPop, pctWhite)
 

Seattle_long <- gather(Seattle, Variable, Value, -price)

# Create scatterplot with linear regression lines for each variable
ggplot(Seattle_long, aes(Value, price)) + 
  geom_point(size = .25, color = "#5c6e6c") +
  geom_smooth(method = "lm", se = FALSE, color = "#b0522a", size = 0.8) +
  facet_wrap(~Variable, ncol = 3, scales = "free") + 
  labs(title = "Price as a function of continuous variables") + 
  theme_minimal()

Correlation matrix

A correlation matrix gives us the pairwise correlation of each set of features in our data. It is usually advisable to include the target/outcome variable in this so we can understand which features are related to it.

# Select only numeric variables and remove rows with missing values
numericVars <- Seattle.sf_tracts2 %>%
  st_drop_geometry() %>%  # Remove geometry column if present
  select_if(is.numeric) %>%  # Select only numeric variables
  na.omit()  # Remove rows with missing values

# Compute correlations
cor_result <- numericVars %>%
  correlate()

# Replace NA values with 0
cor_result[is.na(cor_result)] <- 0

cor_result %>%
  autoplot() +
  geom_tile(aes(fill = r), color = "#e9e9e9") +
  geom_text(aes(label = round(r,digits=2)), size = 2) +
  scale_fill_gradient2(low = "#6F927F", mid = "white", high = "#b0522a",  # Specify color gradient
                       midpoint = 0, limits = c(-1, 1),
                       breaks = seq(-1, 1, by = 0.2)) +
  labs(title = "Correlation across numeric variables")  # Set plot title

Multivariate Regression

Construct baseline model

Seattle_final_0 <- st_drop_geometry(Seattle.sf_tracts2)%>%
  select(-contains("Buffer"), -contains("distance"), -contains("date"), -yr_built, -canopy, -quintiles)%>% 
  filter(price <= 2000000) 


reg0 <- lm(price ~ ., data = Seattle_final_0)

summary_reg0 <- summary(reg0)

# Kable show summary of the model Reg0
coef_table0 <- as.data.frame(summary_reg0$coefficients)
coef_table0$significance <- ifelse(coef_table0$`Pr(>|t|)` < 0.001, '***',
                                    ifelse(coef_table0$`Pr(>|t|)` < 0.01, '**',
                                      ifelse(coef_table0$`Pr(>|t|)` < 0.05, '*',
                                        ifelse(coef_table0$`Pr(>|t|)` < 0.1, '.', ''))))
coef_table0$p_value <- paste0(round(coef_table0$`Pr(>|t|)`, digits = 3), coef_table0$significance)
coef_table0$'t value' <- round(coef_table0$'t value', digits = 2)
coef_table0$'Std. Error' <- round(coef_table0$'Std. Error', digits = 2)
coef_table0$Estimate <- round(coef_table0$Estimate, digits = 2)

coef_table0 %>%
  select(-significance, -`Pr(>|t|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n", general = "Table 2") 
| Term | Estimate | Std. Error | t value | p_value |
|---|---|---|---|---|
| (Intercept) | 61837341.35 | 12555337.77 | 4.93 | 0*** |
| id | 0.00 | 0.00 | 0.39 | 0.696 |
| bedrooms | -18701.87 | 4500.02 | -4.16 | 0*** |
| bathrooms | 45.42 | 6845.14 | 0.01 | 0.995 |
| sqft_living | 182.44 | 8.11 | 22.49 | 0*** |
| sqft_lot | 27.61 | 1.99 | 13.91 | 0*** |
| floors | -28497.38 | 8287.01 | -3.44 | 0.001*** |
| view | 22111.43 | 4714.43 | 4.69 | 0*** |
| condition | 13255.58 | 5191.62 | 2.55 | 0.011* |
| grade | 69929.60 | 5600.09 | 12.49 | 0*** |
| yr_renovated | -10.97 | 7.10 | -1.54 | 0.123 |
| zipcode | -638.14 | 127.94 | -4.99 | 0*** |
| price_per_sqft | 839.41 | 52.45 | 16.00 | 0*** |
| crime_nn1 | 266.87 | 482.38 | 0.55 | 0.58 |
| crime_nn2 | -282.87 | 1065.92 | -0.27 | 0.791 |
| crime_nn3 | -775.01 | 1647.09 | -0.47 | 0.638 |
| crime_nn4 | 3129.78 | 1833.46 | 1.71 | 0.088. |
| crime_nn5 | -2241.90 | 978.18 | -2.29 | 0.022* |
| school_nn1 | 1.68 | 2.67 | 0.63 | 0.53 |
| hosp_nn1 | -3.06 | 0.85 | -3.60 | 0*** |
| streetcar_nn1 | -3.58 | 0.58 | -6.16 | 0*** |
| TotalPop | 4.23 | 2.84 | 1.49 | 0.136 |
| MedIncome | 3.45 | 0.51 | 6.80 | 0*** |
| pctWhite | 126226.29 | 31575.07 | 4.00 | 0*** |
| Age | 1273.64 | 148.95 | 8.55 | 0*** |
| parksCount | 3973.39 | 2128.56 | 1.87 | 0.062. |
| crimeCount | -26.45 | 8.57 | -3.09 | 0.002** |
| crosswalkCount | 65.79 | 452.46 | 0.15 | 0.884 |
| canopyPercent | 594.89 | 500.34 | 1.19 | 0.235 |
| TrafficDis | 69.34 | 11.64 | 5.95 | 0*** |

Table 2
# Baseline summary
model_summary <- data.frame(
  Statistic = c("Multiple R-squared", "Adjusted R-squared", "F-statistic"),
  Value = c(
    summary_reg0$r.squared,
    summary_reg0$adj.r.squared,
    summary_reg0$fstatistic[1]
  )
)

model_summary %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 3")
| Statistic | Value |
|---|---|
| Multiple R-squared | 0.8170151 |
| Adjusted R-squared | 0.8137293 |
| F-statistic | 248.6505718 |

Table 3

The baseline model suffers from some multicollinearity, for example among the five crime distance variables, so only one of them should enter the final model. The final model also keeps only the variables strongly associated with price, screened by their low p-values.

That the baseline model shows a higher R-squared than the final model can be explained by this multicollinearity and by the larger number of predictors inflating the fit.

Note that the id column is kept in the final dataset for identification later, when predictions are joined back to the housing points.
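
As a hedged check on the multicollinearity point (a sketch assuming the car package is installed; it is not otherwise used in this report), variance inflation factors on the baseline model would make the redundancy among the crime distance variables explicit:

# Hypothetical multicollinearity check on the baseline model (assumes the car package)
library(car)
vif(reg0)  # very large VIFs, e.g. among crime_nn1-crime_nn5, flag redundant predictors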

# Finalized Model Reg1
Seattle_final <- Seattle_final_0 %>%
  select(id, Age, price, bedrooms, bathrooms, sqft_living, waterfront, view, condition, grade,
         crimeCount, hosp_nn1, canopyPercent, parksCount, crosswalkCount, TrafficDis,
         pctWhite, MedIncome)%>%
  mutate_all(~replace(., is.na(.), 0))

reg1 <- lm(price ~ ., data = Seattle_final)

summary_reg1 <- summary(reg1)

## Kable show summary of the model Reg1

coef_table1 <- as.data.frame(summary_reg1$coefficients)
coef_table1$significance <- ifelse(coef_table1$`Pr(>|t|)` < 0.001, '***',
                                    ifelse(coef_table1$`Pr(>|t|)` < 0.01, '**',
                                      ifelse(coef_table1$`Pr(>|t|)` < 0.05, '*',
                                        ifelse(coef_table1$`Pr(>|t|)` < 0.1, '.', ''))))
coef_table1$p_value <- paste0(round(coef_table1$`Pr(>|t|)`, digits = 3), coef_table1$significance)
coef_table1$'t value' <- round(coef_table1$'t value', digits = 2)
coef_table1$'Std. Error' <- round(coef_table1$'Std. Error', digits = 2)
coef_table1$Estimate <- round(coef_table1$Estimate, digits = 2)

coef_table1 %>%
  select(-significance, -`Pr(>|t|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n", general = "Table 4")
| Term | Estimate | Std. Error | t value | p_value |
|---|---|---|---|---|
| (Intercept) | -855864.38 | 21772.83 | -39.31 | 0*** |
| id | 0.00 | 0.00 | -1.97 | 0.049* |
| Age | 1451.40 | 65.17 | 22.27 | 0*** |
| bedrooms | -10405.96 | 2025.43 | -5.14 | 0*** |
| bathrooms | 10179.81 | 3407.48 | 2.99 | 0.003** |
| sqft_living | 155.07 | 4.09 | 37.88 | 0*** |
| waterfront | 386272.56 | 30051.57 | 12.85 | 0*** |
| view | 35980.16 | 2364.54 | 15.22 | 0*** |
| condition | 16351.60 | 2569.57 | 6.36 | 0*** |
| grade | 101917.72 | 2667.04 | 38.21 | 0*** |
| crimeCount | -21.46 | 4.75 | -4.52 | 0*** |
| hosp_nn1 | -4.56 | 0.30 | -15.06 | 0*** |
| canopyPercent | 786.56 | 212.88 | 3.69 | 0*** |
| parksCount | 14691.79 | 992.05 | 14.81 | 0*** |
| crosswalkCount | 961.69 | 226.52 | 4.25 | 0*** |
| TrafficDis | 40.98 | 5.30 | 7.74 | 0*** |
| pctWhite | 300946.67 | 9695.46 | 31.04 | 0*** |
| MedIncome | 0.59 | 0.08 | 7.59 | 0*** |

Table 4
# Model summary

model_summary <- data.frame(
  Statistic = c("Multiple R-squared", "Adjusted R-squared", "F-statistic"),
  Value = c(
    summary_reg1$r.squared,
    summary_reg1$adj.r.squared,
    summary_reg1$fstatistic[1]
  )
)

model_summary %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 5")
| Statistic | Value |
|---|---|
| Multiple R-squared | 0.7592488 |
| Adjusted R-squared | 0.7586334 |
| F-statistic | 1233.8252165 |

Table 5

Residual Plot

The residual plot shows the points scattered fairly evenly around the horizontal zero line, with no strong pattern in the errors.

residuals_df <- data.frame(Residuals = resid(reg1), Fitted = fitted(reg1))
ggplot(residuals_df, aes(x = Fitted, y = Residuals)) +
  geom_point(size = 0.5, color = "#5c6e6c") +
  geom_hline(yintercept = 0, color = "#b0522a") +
  labs(title = "Residual Plot for Regression",
       subtitle = "Each dot represent one property ",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

Training and Testing Datasets

To evaluate how well the model generalizes, we split the housing data into a training set containing 60% of the original data and a testing set containing the remaining 40%.

# Split the dataset into a training set and a test set using stratified sampling
# (stratify on view and grade, which are both present in Seattle_final)
inTrain <- createDataPartition(
              y = paste(Seattle_final$view, Seattle_final$grade), 
              p = .60, list = FALSE)  # Create a vector of indices for the training set

# Subset the dataset to create the training set
Seattle.training <- Seattle_final[inTrain,]  # Training set
# Subset the dataset to create the test set
Seattle.test <- Seattle_final[-inTrain,]     # Test set
 
# Fit a linear regression model to predict Sales Price using selected predictors
reg.training <- 
  lm(price ~ ., data = as.data.frame(Seattle.training))

summary_training <- summary(reg.training)

# Make predictions on the test set and evaluate model performance

Seattle.test <-
  Seattle.test %>%  # Pipe the test set into the following operations
  # Add a column indicating the type of regression model used
  mutate(Regression = "Baseline Regression",
         # Predict sale prices using the trained regression model
         SalePrice.Predict = predict(reg.training, Seattle.test),
         # Calculate the difference between predicted and actual sale prices
         SalePrice.Error = SalePrice.Predict - price,
         # Calculate the absolute difference between predicted and actual sale prices
         SalePrice.AbsError = abs(SalePrice.Predict - price),
         # Calculate the absolute percentage error
         SalePrice.APE = (abs(SalePrice.Predict - price)) / price) %>%
  filter(price < 5000000)  # Filter out records with SalePrice greater than $5,000,000

Examine the model (diagnostics)

Coefficients

We plot the coefficients of the baseline model and the final model side by side for comparison.

## Plot coefficients
p0 <- plot_summs(reg0, colors = "#6F927F")
p1 <- plot_summs(reg1, colors = "#b0522a")
# Arrange plots side by side
grid.arrange(p0, p1, ncol = 2)

Distribution of Sale Price Error Testing Set

Seattle.test_join <- left_join(Seattle.test, housingData, join_by(id == id))
Seattle.test_withgeo <- 
  Seattle.test_join %>% 
  st_as_sf(coords = c("long", "lat"), crs = 4326, agr = "constant") %>%
  st_transform(crs = 2926)

Seattle.test_withgeo$SalePrice.Error <- round(Seattle.test_withgeo$SalePrice.Error, digits = 2)

ggplot()+
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Seattle.test_withgeo, aes(colour = SalePrice.Error), size = 0.5) +
  scale_color_continuous(high = "#6F927F", low = "#b0522a", name= "Sale Price Error ") +
  labs(title = "Distribution of Sale Price Error Testing Set") +
  theme_void()

Mean Absolute Percentage Error By Neighborhood

Our prediction error, measured as the mean absolute percentage error (MAPE, the mean of |predicted - actual| / actual within each neighborhood), is distributed unevenly across Seattle's neighborhoods.

to_plot <- st_intersection(Seattle.test_withgeo, neighborhood %>% dplyr::select("S_HOOD")) %>% 
  st_drop_geometry() %>% 
  group_by(S_HOOD) %>%
  summarise(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>% 
  left_join(neighborhood) %>% 
  st_sf()
to_plot %>%
  ggplot() + 
      geom_sf(aes(fill = mean.MAPE), col = "white") +
  scale_fill_continuous(low = "#6F927F", high = "#b0522a", na.value = "#e9e9e9", name= "MAPE") +
  labs(title = "Mean Absolute Percentage Error By Neighborhood",
       subtitle = "Seattle, WA") +
    theme_void()

Cross validation

k-fold cross-validation

# Define the control parameters for k-fold cross-validation
control <- trainControl(method = "cv",    # Use k-fold cross-validation
                        number = 10,      # Number of folds
                        verboseIter = TRUE,  # Show verbose output
                        returnData = FALSE,  # Don't return resampled data
                        savePredictions = TRUE,  # Save predictions
                        classProbs = FALSE,  # Don't compute class probabilities
                        summaryFunction = defaultSummary)  # Use default summary function

# Train the linear regression model using k-fold cross-validation
lm_cv <- train(price ~ .,  # Formula for the linear regression model
               data = Seattle_final,  # Dataset
               method = "lm",           # Specify "lm" for linear regression
               trControl = control)     # Use the defined control parameters
# View the cross-validation results
lm_cv_results <- lm_cv['results']
lm_cv_statistic_summary <- do.call(rbind, lm_cv_results) %>% select(-contains("intercept"))


kbl(lm_cv_statistic_summary) %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left") %>%
  footnote(general_title = "\n", general = "Table 6")
| | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| results | 135579.4 | 0.757575 | 96878.09 | 5577.113 | 0.0171431 | 2686.396 |

Table 6
# Plot observed versus predicted values

ggplot(lm_cv$pred, aes(x = obs, y = pred)) +
  geom_point(size = 0.5, color = "#5c6e6c") +
  geom_smooth(method = "lm", se = FALSE, color = "#b0522a", size = 0.5) +
  labs(title = "Observed vs Predicted Housing Values",
       x = "Observed",
       y = "Predicted") +
  theme_minimal()

Leave-one-out Cross-validation

# Set up leave-one-out cross-validation
control <- trainControl(method = "LOOCV",     # Use leave-one-out cross-validation
                        number = nrow(Seattle_final),  # Number of folds = number of observations
                        verboseIter = TRUE,  # Show verbose output
                        returnData = FALSE,  # Don't return resampled data
                        savePredictions = TRUE,  # Save predictions
                        classProbs = FALSE,  # Don't compute class probabilities
                        summaryFunction = defaultSummary)  # Use default summary function

# Train the linear regression model using leave-one-out cross-validation
lm_loocv <- train(price ~ .,  # Formula for the linear regression model
                  data = Seattle_final,  # Dataset
                  method = "lm",           # Specify "lm" for linear regression
                  trControl = control)     # Use the defined control parameters


# 6669 folds in total (one per observation)
# View the cross-validation results
loocv_results <- lm_loocv['results']
statistic_summary <- do.call(rbind, loocv_results) %>% select(-contains("intercept"))


kbl(statistic_summary) %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left") %>%
  footnote(general_title = "\n", general = "Table 7")
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| results | 135662.4 | 0.7571996 | 96882.43 |

Table 7

Summary

The R-squared values from the evaluations above indicate that around 75.8% of the variability in sale price is explained by the model's predictor variables, which is reasonably strong performance. Comparing methods, the model fit on the full, unsplit data has the highest R-squared, while leave-one-out cross-validation has the lowest.
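
As a small sketch (reusing the summary_reg1, lm_cv, and lm_loocv objects created above), the three R-squared values can be gathered into one data frame for a side-by-side look:

# Gather the R-squared values from the three evaluations above
data.frame(
  Method   = c("Full data (reg1)", "10-fold CV", "LOOCV"),
  Rsquared = c(summary_reg1$r.squared,
               lm_cv$results$Rsquared,
               lm_loocv$results$Rsquared)
)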

However, there are still ways this model could be improved. For example, some of the independent variables, such as the crime data, do not come from the same year as the housing price data. In addition, the mean absolute percentage error is not distributed evenly across Seattle's neighborhoods.