Introduction

This project builds a machine learning model to predict single-family housing prices in Seattle, Washington, using both structural characteristics of the houses and spatial characteristics of their surroundings. The housing prices used as the dependent variable are extracted from King County's 2015 sales data, and the independent variables come from the City of Seattle Open Data Portal. The model was developed by Junyi and Ziyi, and this report was written by Junyi.

Data Wrangling

Data Gathering

The model needs multiple spatial datasets as candidate independent variables, so the first step is to collect and read those data (a reading sketch follows the list). The list includes:
1. Housing structural characteristics that come with the 2015 housing price data
2. Seattle census tracts
3. Seattle neighborhood boundaries
4. Seattle crime data from 2008 to present
5. Amenities such as parks, schools, hospitals, and crosswalks
6. Seattle streetcar stations
7. Major roads in Seattle
8. Percent of canopy coverage by census tract
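
As an illustration of this step only, the sketch below reads each layer with sf and projects it to EPSG:2926, the coordinate system used later in this report. The file paths are hypothetical placeholders, not the actual data locations; only the object names match those used in the rest of the report.

# Hedged sketch: read each layer and project to EPSG:2926 (file paths are placeholders)
library(sf)
library(tidyverse)

tracts       <- st_read("data/seattle_tracts.shp") %>% st_transform(2926)
neighborhood <- st_read("data/seattle_neighborhoods.shp") %>% st_transform(2926)
Crime.sf     <- st_read("data/seattle_crime.geojson") %>% st_transform(2926)
School       <- st_read("data/schools.shp") %>% st_transform(2926)
Hospital     <- st_read("data/hospitals.shp") %>% st_transform(2926)
streetcar    <- st_read("data/streetcar_stations.shp") %>% st_transform(2926)
Parks        <- st_read("data/parks.shp") %>% st_transform(2926)
crosswalk    <- st_read("data/crosswalks.shp") %>% st_transform(2926)
traffic      <- st_read("data/major_roads.shp") %>% st_transform(2926)
canopy       <- st_read("data/canopy_by_tract.shp") %>% st_transform(2926)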

Data Processing

Safety data

The original crime data we gathered includes the location of each crime event, which allows two calculations. The first is a nearest neighbor analysis that computes, for each housing location, the average distance to its nearest 1 / 2 / 3 / 4 / 5 crime spots. In addition, we create a 660-foot (one-eighth mile) buffer around each housing site and count the number of crime events within that buffer.

## Nearest Neighbor Feature
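# nn_function() is assumed to be defined elsewhere (e.g., a sourced helper script);
# for each point in the first coordinate matrix it returns the average distance to
# its k nearest neighbors in the second coordinate matrix.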
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      crime_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 1),
      
      crime_nn2 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 2), 
      
      crime_nn3 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 3), 
      
      crime_nn4 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 4), 
      
      crime_nn5 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Crime.sf), k = 5)) 

# Count crime events within the 660 ft buffer of each housing site
Seattle.sf_tracts2$crime.Buffer <- Seattle.sf_tracts2 %>%
    st_buffer(660) %>%
    aggregate(dplyr::select(Crime.sf) %>%
                mutate(counter = 1),
              ., sum)

Amenities Data

The amenities data we use includes schools, parks, hospitals, crosswalks, streetcar stations, and major roads. Because the school, hospital, and streetcar station data are point data, we first use the nearest neighbor feature to calculate each house's distance to its nearest facility.

# Nearest Neighbor Feature for school
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      school_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(School), k = 1))
# Nearest Neighbor Feature for hospital
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      hosp_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(Hospital), k = 1))

# Nearest Neighbor Feature for streetcar
Seattle.sf_tracts2 <-
  Seattle.sf_tracts2 %>% 
    mutate(
      streetcar_nn1 = nn_function(st_coordinates(Seattle.sf_tracts2), 
                              st_coordinates(streetcar), k = 1))

# School
School_clipped <- st_intersection(School, tracts)
school_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = School_clipped, size = 1.5, color = "#6F927F") +
  labs(title = "Schools in Seattle") +
  theme_void()

# Hospital
Hospital_clipped <- st_intersection(Hospital, tracts)
hospital_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Hospital_clipped, size = 1.5, color = "#B0522A") +
  labs(title = "Hospitals in Seattle") +
  theme_void()

# streetcar
streetcar_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = streetcar, size = 1.5, color = "#e07d45") +
  labs(title = "Streetcar Stations in Seattle") +
  theme_void()

# Arrange plots side by side
grid.arrange(school_plot, hospital_plot, streetcar_plot, ncol = 3)

The park and crosswalk data, on the other hand, are polygons and lines. For these, we create a 1,320-foot (quarter mile) buffer around each housing site and count the number of parks / crosswalks within that buffer.

# Parks buffer: count parks within 1,320 ft of each housing site
Seattle.sf_tracts2$parks.Buffer <- Seattle.sf_tracts2 %>% 
    st_buffer(1320) %>% 
    aggregate(dplyr::select(Parks) %>% 
                mutate(counter = 1),
              ., sum)

# Crosswalk buffer: count crosswalks within 1,320 ft of each housing site
Seattle.sf_tracts2$crosswalk.Buffer <- Seattle.sf_tracts2 %>% 
    st_buffer(1320) %>% 
    aggregate(dplyr::select(crosswalk) %>% 
                mutate(counter = 1),
              ., sum)

# Parks
Park_clipped <- st_intersection(Parks, tracts)
Park_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Park_clipped, size = 1.5, fill = "#6F927F", col = "transparent") +
  labs(title = "Parks in Seattle") +
  theme_void()

# Crosswalk
Crosswalk_clipped <- st_intersection(crosswalk, tracts)
Crosswalk_plot <- ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Crosswalk_clipped, size = 0.5, color = "#B0522A") +
  labs(title = "Crosswalks in Seattle") +
  theme_void()

# Arrange plots side by side
grid.arrange(Park_plot, Crosswalk_plot, ncol = 2)

Canopy Data

The canopy data reports percent canopy coverage (Can_P) at the census tract level, so we spatially join each housing point to the value of the tract it falls in.

# join canopy to housing points
Seattle.sf_tracts3 <- st_join(Seattle.sf_tracts2, canopy, left = TRUE)

Seattle.sf_tracts3 <- Seattle.sf_tracts3 %>%
  st_drop_geometry() %>%
  select(id, Can_P)

# put the column into the overall dataframe
Seattle.sf_tracts2$canopy <- Seattle.sf_tracts3["Can_P"]
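# Note: this assignment stores a one-column data frame in `canopy`;
# the percentage is later read back as canopy$Can_P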


# Calculate quantiles for class breaks
breaks_quantiles <- classIntervals(canopy$Can_P, n = 6, style = "quantile")

# Generate custom labels showing the range of each class
labels <- paste0(formatC(breaks_quantiles$brks[-length(breaks_quantiles$brks)], format = "f", digits = 2, big.mark = ","), 
                 " - ", 
                 formatC(breaks_quantiles$brks[-1], format = "f", digits = 2, big.mark = ","))

ggplot() +
  geom_sf(data = canopy, aes(fill = cut(Can_P, breaks = breaks_quantiles$brks, include.lowest = TRUE)), color = "#e9e9e9") +
  scale_fill_manual(values = c("#f4f4f4", "#DDE3E0", "#BFCDC5", "#A0B6AA", "#87A494", "#6F927F"),
                     name = "Canopy Percentage",
                     labels = labels) +  # Use custom range labels
  labs(title = "Average Canopy Percentage by Census Tracts in Seattle") +
  theme_void()

Traffic Data

Another processing step is to calculate each house's distance to its closest major road.

# Convert Seattle.sf_tracts2 to a point pattern, specifying mark column(s)
Seattle_points <- as.ppp(Seattle.sf_tracts2)

# Convert traffic to a line segment pattern
traffic_lines <- as.psp(traffic)

# Find nearest line segment to each point
nearest_segments <- nncross(Seattle_points, traffic_lines)
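# nncross() returns a data frame whose `dist` column holds the distance from each
# point to its nearest segment (`which` gives the segment's index)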

# put the column into the overall dataframe
Seattle.sf_tracts2$distance <- nearest_segments['dist']

# road plot
ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = traffic, aes(geometry = geometry), color = "#e07d45", size = 1) +
  labs(title = "Major Roads in Seattle") +
  theme_void()

Demographic Data

The census data we use for this analysis is 2015 ACS 5-year data at the census tract level. The demographic variables we include are total population, percentage of white population, and median income.

Demo <- get_acs(geography = "tract", 
          survey = "acs5",
          variables = c("B01001_001E", "B01001A_001E", "B07011_001E"), 
          year = 2015, 
          state = "53", 
          county = "033", 
          geometry = T, 
          output = "wide") %>%
  st_transform(crs = 2926) %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         MedIncome = B07011_001E) %>%
  mutate(pctWhite = NumberWhites / TotalPop)%>%
  select(-NAME, -GEOID)
# clip to Seattle
Demo <- st_transform(Demo, st_crs(Seattle.sf_tracts2)) %>%
  select("TotalPop", "MedIncome", "pctWhite")
Demo_clipped <- st_intersection(Demo, tracts) %>%
  select("TotalPop", "MedIncome", "pctWhite")

Seattle.sf_tracts2 <- st_join(Seattle.sf_tracts2, Demo_clipped, left = TRUE)
Seattle.sf_tracts2 <- Seattle.sf_tracts2 %>%
  select(-contains("B01"), -contains("B06")) %>%
  mutate(Age = 2024 - yr_built,
         parksCount = parks.Buffer$counter,
         crimeCount = crime.Buffer$counter,
         crosswalkCount = crosswalk.Buffer$counter,
         canopyPercent = canopy$Can_P,
         TrafficDis = distance$dist)



# Total
tpop_plot <- ggplot(data = Demo_clipped) +
  geom_sf(aes(fill = TotalPop), col = "white") +
  scale_fill_gradient(low = "#f4f4f4", high = "#6F927F", na.value = "#e9e9e9",
                      name = "Total Population") +
  labs(title = "Total Population",
       subtitle = "Seattle, WA") +
    theme_void()

# Percent white
white_plot <- ggplot(data = Demo_clipped) +
  geom_sf(aes(fill = pctWhite), col = "white") +
  scale_fill_gradient(low = "#f4f4f4", high = "#B0522A", na.value = "#e9e9e9",
                      name = "Percent White") +
  labs(title = "Percentage of White Population",
       subtitle = "Seattle, WA") +
    theme_void()

# MedIncome
MedIncome_plot <- ggplot(data = Demo_clipped) +
  geom_sf(aes(fill = MedIncome), col = "white") +
  scale_fill_gradient(low = "#f4f4f4", high = "#e07d45", na.value = "#e9e9e9",
                      name = "Median Income") +
  labs(title = "Median Income",
       subtitle = "Seattle, WA") +
    theme_void()

# Arrange plots side by side
grid.arrange(tpop_plot, white_plot, MedIncome_plot, ncol = 3)

Summarize and preview the variables

The table below summarizes all the independent variables by category.

## delete the geometry column
table_Seattle <- Seattle.sf_tracts2 %>% st_drop_geometry()

#category of the data

#internal_character
internal_character_df <- table_Seattle[, c("bedrooms", "bathrooms", "sqft_living","sqft_lot", "floors", "grade")]
internal_character <- round(do.call(cbind, lapply(internal_character_df, summary)), digits = 2)

#Structure_character
Structure_character_df <- table_Seattle[, c("condition", "grade", "Age", "yr_renovated")]
Structure_character <- round(do.call(cbind, lapply(Structure_character_df, summary)), digits = 2)

#amenities
amenities_df <- table_Seattle[, c("parksCount", "crosswalkCount", "hosp_nn1", "school_nn1", "streetcar_nn1")]
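# `[1:6, ]` keeps only the six standard summary statistics
# (Min., 1st Qu., Median, Mean, 3rd Qu., Max.), dropping the NA's count row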
amenities <- round(do.call(cbind, lapply(amenities_df, summary))[1:6,], digits = 2)

# the distance column is a data-frame column (from nncross), so it is summarized separately
distance_extract_df <- table_Seattle[, c("distance")]
distance_extract <- round(do.call(cbind, lapply(distance_extract_df, summary)), digits = 2)

amenities_all <- cbind(amenities, distance_extract)


#spatial_character
spatial_character_df <- table_Seattle[, c( "waterfront", "view", "canopyPercent")]
spatial_character <- round(do.call(cbind, lapply(spatial_character_df, summary)), digits = 2)

#safety
safety_df <- table_Seattle[, c( "crime_nn1","crime_nn2","crime_nn3","crime_nn4","crime_nn5", "crimeCount")]
safety <- round(do.call(cbind, lapply(safety_df, summary)), digits = 2)

#Demographics
Demographics_df <- table_Seattle[, c( "TotalPop", "MedIncome", "pctWhite")]
Demographics <- round(do.call(cbind, lapply(Demographics_df, summary))[1:6,], digits = 2)


# Combine the summaries into one data frame
summary_data <- cbind(internal_character, Structure_character, amenities_all, 
                      spatial_character, safety, Demographics)

long_summary_data <- as.data.frame(t(summary_data)) %>%
  rename('Minimum' = 'Min.',
         '1st Quantile' = '1st Qu.',
         '3rd Quantile' = '3rd Qu.',
         'Maximum' = 'Max.')

#generate table
kbl(long_summary_data, caption = "Statistics Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  pack_rows("Internal Character", 1, 6) %>%
  pack_rows("Structure Character", 7, 10) %>%
  pack_rows("Amenities", 13, 16) %>%
  pack_rows("Spatial character", 17, 19) %>%
  pack_rows("Safety", 20, 24) %>%
  pack_rows("Demographics", 25, 27) %>%
  footnote(general_title = "\n", general = "Table 1")
Statistics Summary

| Variable | Minimum | 1st Quantile | Median | Mean | 3rd Quantile | Maximum |
|---|---|---|---|---|---|---|
| **Internal Character** | | | | | | |
| bedrooms | 0.00 | 2.00 | 3.00 | 3.14 | 4.00 | 33.00 |
| bathrooms | 0.00 | 1.00 | 1.75 | 1.94 | 2.50 | 8.00 |
| sqft_living | 370.00 | 1240.00 | 1626.00 | 1801.79 | 2200.00 | 12050.00 |
| sqft_lot | 520.00 | 3240.00 | 4950.00 | 5124.03 | 6330.00 | 91681.00 |
| floors | 1.00 | 1.00 | 1.50 | 1.54 | 2.00 | 3.50 |
| grade | 4.00 | 7.00 | 7.00 | 7.43 | 8.00 | 13.00 |
| **Structure Character** | | | | | | |
| condition | 1.00 | 3.00 | 3.00 | 3.45 | 4.00 | 5.00 |
| grade.1 | 4.00 | 7.00 | 7.00 | 7.43 | 8.00 | 13.00 |
| Age | 9.00 | 37.00 | 77.00 | 70.68 | 99.00 | 124.00 |
| yr_renovated | 0.00 | 0.00 | 0.00 | 132.46 | 0.00 | 2015.00 |
| **Amenities** | | | | | | |
| parksCount | 1.00 | 1.00 | 2.00 | 2.31 | 3.00 | 16.00 |
| crosswalkCount | 1.00 | 5.00 | 10.00 | 12.12 | 16.00 | 83.00 |
| hosp_nn1 | 292.19 | 5682.65 | 8541.55 | 9942.49 | 12788.58 | 25810.66 |
| school_nn1 | 52.05 | 1525.20 | 2449.55 | 2954.18 | 3800.21 | 17319.46 |
| streetcar_nn1 | 439.60 | 13760.81 | 20597.46 | 20000.87 | 26048.54 | 40119.01 |
| dist | 0.04 | 157.94 | 319.35 | 402.60 | 551.22 | 2840.94 |
| **Spatial character** | | | | | | |
| waterfront | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| view | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 4.00 |
| canopyPercent | 4.23 | 23.12 | 27.72 | 29.08 | 34.04 | 60.40 |
| **Safety** | | | | | | |
| crime_nn1 | 3.68 | 75.27 | 106.82 | 113.57 | 145.36 | 603.42 |
| crime_nn2 | 3.68 | 77.55 | 109.34 | 115.69 | 148.09 | 603.42 |
| crime_nn3 | 3.68 | 78.99 | 111.54 | 117.61 | 150.03 | 603.42 |
| crime_nn4 | 3.96 | 80.13 | 112.90 | 119.32 | 152.13 | 609.48 |
| crime_nn5 | 3.96 | 81.80 | 114.51 | 121.03 | 153.74 | 613.11 |
| crimeCount | 11.00 | 227.00 | 371.00 | 512.31 | 632.00 | 6538.00 |
| **Demographics** | | | | | | |
| TotalPop | 1055.00 | 4389.00 | 5217.00 | 5413.82 | 6572.00 | 8395.00 |
| MedIncome | 7765.00 | 33642.00 | 45115.00 | 43881.28 | 52441.00 | 68909.00 |
| pctWhite | 0.08 | 0.64 | 0.80 | 0.72 | 0.87 | 0.94 |

Table 1

Visualize Housing Price

To get a general sense of the model's dependent variable, we map the 2015 Seattle housing prices.

## Current housing price map
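# ntile() assigns each sale to one of five equal-sized price groups (quintiles)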
Seattle.sf_tracts2$quintiles <- ntile(Seattle.sf_tracts2$price, 5)

# Convert quintiles to a factor if not already done
Seattle.sf_tracts2$quintiles <- as.factor(Seattle.sf_tracts2$quintiles)


# Generate the plot
ggplot() +
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Seattle.sf_tracts2, aes(colour = quintiles), show.legend = "point", size = .15) +
  scale_colour_manual(values = c("#6F927F", "#877A60", "#956D4E", "#A2603C", "#b0522a"),
                      labels = c("Low", "Relatively Low", "Medium", "Relatively High", "High"),
                      name = "Price Quintiles") +
  labs(title = "Housing Price Quintiles, Seattle") +
  theme_void()

Analysis Variables

Analyzing associations

Below is an overview of the associations between housing price and several of the independent variables.

# Remove the geometry column and select price plus a set of continuous predictors
Seattle <- st_drop_geometry(Seattle.sf_tracts2) %>%
  select(price, sqft_living, bedrooms, bathrooms, crimeCount, crime_nn1, school_nn1, crosswalkCount, TotalPop, pctWhite)
 

Seattle_long <- gather(Seattle, Variable, Value, -price)

# Create scatterplot with linear regression lines for each variable
ggplot(Seattle_long, aes(Value, price)) + 
  geom_point(size = .25, color = "#5c6e6c") +
  geom_smooth(method = "lm", se = FALSE, color = "#b0522a", size = 0.8) +
  facet_wrap(~Variable, ncol = 3, scales = "free") + 
  labs(title = "Price as a function of continuous variables") + 
  theme_minimal()

Correlation matrix

A correlation matrix gives us the pairwise correlation of each set of features in our data. It is usually advisable to include the target/outcome variable in this so we can understand which features are related to it.

# Select only numeric variables and remove rows with missing values
numericVars <- Seattle.sf_tracts2 %>%
  st_drop_geometry() %>%  # Remove geometry column if present
  select_if(is.numeric) %>%  # Select only numeric variables
  na.omit()  # Remove rows with missing values

# Compute correlations
cor_result <- numericVars %>%
  correlate()

# Replace NA values with 0
cor_result[is.na(cor_result)] <- 0

cor_result %>%
  autoplot() +
  geom_tile(aes(fill = r), color = "#e9e9e9") +
  geom_text(aes(label = round(r,digits=2)), size = 2) +
  scale_fill_gradient2(low = "#6F927F", mid = "white", high = "#b0522a",  # Specify color gradient
                       midpoint = 0, limits = c(-1, 1),
                       breaks = seq(-1, 1, by = 0.2)) +
  labs(title = "Correlation across numeric variables")  # Set plot title

Multivariate Regression

Construct baseline model

Seattle_final_0 <- st_drop_geometry(Seattle.sf_tracts2)%>%
  select(-contains("Buffer"), -contains("distance"), -contains("date"), -yr_built, -canopy, -quintiles)%>% 
  filter(price <= 2000000) 


reg0 <- lm(price ~ ., data = Seattle_final_0)

summary_reg0 <- summary(reg0)

# Kable show summary of the model Reg0
coef_table0 <- as.data.frame(summary_reg0$coefficients)
coef_table0$significance <- ifelse(coef_table0$`Pr(>|t|)` < 0.001, '***',
                                    ifelse(coef_table0$`Pr(>|t|)` < 0.01, '**',
                                      ifelse(coef_table0$`Pr(>|t|)` < 0.05, '*',
                                        ifelse(coef_table0$`Pr(>|t|)` < 0.1, '.', ''))))
coef_table0$p_value <- paste0(round(coef_table0$`Pr(>|t|)`, digits = 3), coef_table0$significance)
coef_table0$'t value' <- round(coef_table0$'t value', digits = 2)
coef_table0$'Std. Error' <- round(coef_table0$'Std. Error', digits = 2)
coef_table0$Estimate <- round(coef_table0$Estimate, digits = 2)

coef_table0 %>%
  select(-significance, -`Pr(>|t|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n", general = "Table 2") 
| Term | Estimate | Std. Error | t value | p_value |
|---|---|---|---|---|
| (Intercept) | 61837341.35 | 12555337.77 | 4.93 | 0*** |
| id | 0.00 | 0.00 | 0.39 | 0.696 |
| bedrooms | -18701.87 | 4500.02 | -4.16 | 0*** |
| bathrooms | 45.42 | 6845.14 | 0.01 | 0.995 |
| sqft_living | 182.44 | 8.11 | 22.49 | 0*** |
| sqft_lot | 27.61 | 1.99 | 13.91 | 0*** |
| floors | -28497.38 | 8287.01 | -3.44 | 0.001*** |
| view | 22111.43 | 4714.43 | 4.69 | 0*** |
| condition | 13255.58 | 5191.62 | 2.55 | 0.011* |
| grade | 69929.60 | 5600.09 | 12.49 | 0*** |
| yr_renovated | -10.97 | 7.10 | -1.54 | 0.123 |
| zipcode | -638.14 | 127.94 | -4.99 | 0*** |
| price_per_sqft | 839.41 | 52.45 | 16.00 | 0*** |
| crime_nn1 | 266.87 | 482.38 | 0.55 | 0.58 |
| crime_nn2 | -282.87 | 1065.92 | -0.27 | 0.791 |
| crime_nn3 | -775.01 | 1647.09 | -0.47 | 0.638 |
| crime_nn4 | 3129.78 | 1833.46 | 1.71 | 0.088. |
| crime_nn5 | -2241.90 | 978.18 | -2.29 | 0.022* |
| school_nn1 | 1.68 | 2.67 | 0.63 | 0.53 |
| hosp_nn1 | -3.06 | 0.85 | -3.60 | 0*** |
| streetcar_nn1 | -3.58 | 0.58 | -6.16 | 0*** |
| TotalPop | 4.23 | 2.84 | 1.49 | 0.136 |
| MedIncome | 3.45 | 0.51 | 6.80 | 0*** |
| pctWhite | 126226.29 | 31575.07 | 4.00 | 0*** |
| Age | 1273.64 | 148.95 | 8.55 | 0*** |
| parksCount | 3973.39 | 2128.56 | 1.87 | 0.062. |
| crimeCount | -26.45 | 8.57 | -3.09 | 0.002** |
| crosswalkCount | 65.79 | 452.46 | 0.15 | 0.884 |
| canopyPercent | 594.89 | 500.34 | 1.19 | 0.235 |
| TrafficDis | 69.34 | 11.64 | 5.95 | 0*** |

Table 2
# Baseline summary
model_summary <- data.frame(
  Statistic = c("Multiple R-squared", "Adjusted R-squared", "F-statistic"),
  Value = c(
    summary_reg0$r.squared,
    summary_reg0$adj.r.squared,
    summary_reg0$fstatistic[1]
  )
)

model_summary %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 3")
| Statistic | Value |
|---|---|
| Multiple R-squared | 0.8170151 |
| Adjusted R-squared | 0.8137293 |
| F-statistic | 248.6505718 |

Table 3

The baseline model suffers from some multicollinearity, for example among the five crime distance variables, so only one of them should enter the final model. The final model also keeps only the variables strongly associated with price, screened by their low p-values.

That the baseline model shows a higher R-squared than the final model can be explained by this multicollinearity and by the larger number of predictors inflating the fit.

Note that the id column is kept in the final dataset for identification later, when predictions are joined back to the housing points.
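
As a hedged check on the multicollinearity point (a sketch assuming the car package is installed; it is not otherwise used in this report), variance inflation factors on the baseline model would make the redundancy among the crime distance variables explicit:

# Hypothetical multicollinearity check on the baseline model (assumes the car package)
library(car)
vif(reg0)  # very large VIFs, e.g. among crime_nn1-crime_nn5, flag redundant predictors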

# Finalized Model Reg1
Seattle_final <- Seattle_final_0 %>%
  select(id, Age, price, bedrooms, bathrooms, sqft_living, waterfront, view, condition, grade,
         crimeCount, hosp_nn1, canopyPercent, parksCount, crosswalkCount, TrafficDis,
         pctWhite, MedIncome)%>%
  mutate_all(~replace(., is.na(.), 0))

reg1 <- lm(price ~ ., data = Seattle_final)

summary_reg1 <- summary(reg1)

## Kable show summary of the model Reg1

coef_table1 <- as.data.frame(summary_reg1$coefficients)
coef_table1$significance <- ifelse(coef_table1$`Pr(>|t|)` < 0.001, '***',
                                    ifelse(coef_table1$`Pr(>|t|)` < 0.01, '**',
                                      ifelse(coef_table1$`Pr(>|t|)` < 0.05, '*',
                                        ifelse(coef_table1$`Pr(>|t|)` < 0.1, '.', ''))))
coef_table1$p_value <- paste0(round(coef_table1$`Pr(>|t|)`, digits = 3), coef_table1$significance)
coef_table1$'t value' <- round(coef_table1$'t value', digits = 2)
coef_table1$'Std. Error' <- round(coef_table1$'Std. Error', digits = 2)
coef_table1$Estimate <- round(coef_table1$Estimate, digits = 2)

coef_table1 %>%
  select(-significance, -`Pr(>|t|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n", general = "Table 4")
| Term | Estimate | Std. Error | t value | p_value |
|---|---|---|---|---|
| (Intercept) | -855864.38 | 21772.83 | -39.31 | 0*** |
| id | 0.00 | 0.00 | -1.97 | 0.049* |
| Age | 1451.40 | 65.17 | 22.27 | 0*** |
| bedrooms | -10405.96 | 2025.43 | -5.14 | 0*** |
| bathrooms | 10179.81 | 3407.48 | 2.99 | 0.003** |
| sqft_living | 155.07 | 4.09 | 37.88 | 0*** |
| waterfront | 386272.56 | 30051.57 | 12.85 | 0*** |
| view | 35980.16 | 2364.54 | 15.22 | 0*** |
| condition | 16351.60 | 2569.57 | 6.36 | 0*** |
| grade | 101917.72 | 2667.04 | 38.21 | 0*** |
| crimeCount | -21.46 | 4.75 | -4.52 | 0*** |
| hosp_nn1 | -4.56 | 0.30 | -15.06 | 0*** |
| canopyPercent | 786.56 | 212.88 | 3.69 | 0*** |
| parksCount | 14691.79 | 992.05 | 14.81 | 0*** |
| crosswalkCount | 961.69 | 226.52 | 4.25 | 0*** |
| TrafficDis | 40.98 | 5.30 | 7.74 | 0*** |
| pctWhite | 300946.67 | 9695.46 | 31.04 | 0*** |
| MedIncome | 0.59 | 0.08 | 7.59 | 0*** |

Table 4
# Model summary

model_summary <- data.frame(
  Statistic = c("Multiple R-squared", "Adjusted R-squared", "F-statistic"),
  Value = c(
    summary_reg1$r.squared,
    summary_reg1$adj.r.squared,
    summary_reg1$fstatistic[1]
  )
)

model_summary %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general_title = "\n", general = "Table 5")
| Statistic | Value |
|---|---|
| Multiple R-squared | 0.7592488 |
| Adjusted R-squared | 0.7586334 |
| F-statistic | 1233.8252165 |

Table 5

Residual Plot

The residual plot shows the points scattered fairly evenly around the horizontal zero line, with no strong pattern in the errors.

residuals_df <- data.frame(Residuals = resid(reg1), Fitted = fitted(reg1))
ggplot(residuals_df, aes(x = Fitted, y = Residuals)) +
  geom_point(size = 0.5, color = "#5c6e6c") +
  geom_hline(yintercept = 0, color = "#b0522a") +
  labs(title = "Residual Plot for Regression",
       subtitle = "Each dot represent one property ",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

Training and Testing Datasets

To evaluate how well the model generalizes, we split the housing data into a training set containing 60% of the original data and a testing set containing the remaining 40%.

# Split the dataset into a training set and a test set using stratified sampling
# (stratify on view and grade, which are both present in Seattle_final)
inTrain <- createDataPartition(
              y = paste(Seattle_final$view, Seattle_final$grade), 
              p = .60, list = FALSE)  # Create a vector of indices for the training set

# Subset the dataset to create the training set
Seattle.training <- Seattle_final[inTrain,]  # Training set
# Subset the dataset to create the test set
Seattle.test <- Seattle_final[-inTrain,]     # Test set
 
# Fit a linear regression model to predict Sales Price using selected predictors
reg.training <- 
  lm(price ~ ., data = as.data.frame(Seattle.training))

summary_training <- summary(reg.training)

# Make predictions on the test set and evaluate model performance

Seattle.test <-
  Seattle.test %>%  # Pipe the test set into the following operations
  # Add a column indicating the type of regression model used
  mutate(Regression = "Baseline Regression",
         # Predict sale prices using the trained regression model
         SalePrice.Predict = predict(reg.training, Seattle.test),
         # Calculate the difference between predicted and actual sale prices
         SalePrice.Error = SalePrice.Predict - price,
         # Calculate the absolute difference between predicted and actual sale prices
         SalePrice.AbsError = abs(SalePrice.Predict - price),
         # Calculate the absolute percentage error
         SalePrice.APE = (abs(SalePrice.Predict - price)) / price) %>%
  filter(price < 5000000)  # Filter out records with SalePrice greater than $5,000,000

Examine the model (diagnostics)

Coefficients

We plot the coefficients of the baseline model and the final model side by side for comparison.

## Plot coefficients
p0 <- plot_summs(reg0, colors = "#6F927F")
p1 <- plot_summs(reg1, colors = "#b0522a")
# Arrange plots side by side
grid.arrange(p0, p1, ncol = 2)

Distribution of Sale Price Error Testing Set

Seattle.test_join <- left_join(Seattle.test, housingData, join_by(id == id))
Seattle.test_withgeo <- 
  Seattle.test_join %>% 
  st_as_sf(coords = c("long", "lat"), crs = 4326, agr = "constant") %>%
  st_transform(crs = 2926)

Seattle.test_withgeo$SalePrice.Error <- round(Seattle.test_withgeo$SalePrice.Error, digits = 2)

ggplot()+
  geom_sf(data = tracts, fill = "#e9e9e9", col = "white") +
  geom_sf(data = Seattle.test_withgeo, aes(colour = SalePrice.Error), size = 0.5) +
  scale_color_continuous(high = "#6F927F", low = "#b0522a", name= "Sale Price Error ") +
  labs(title = "Distribution of Sale Price Error Testing Set") +
  theme_void()

Mean Absolute Percentage Error By Neighborhood

Our prediction error, measured as the mean absolute percentage error (MAPE, the mean of |predicted - actual| / actual within each neighborhood), is distributed unevenly across Seattle's neighborhoods.

to_plot <- st_intersection(Seattle.test_withgeo, neighborhood %>% dplyr::select("S_HOOD")) %>% 
  st_drop_geometry() %>% 
  group_by(S_HOOD) %>%
  summarise(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>% 
  left_join(neighborhood) %>% 
  st_sf()
to_plot %>%
  ggplot() + 
      geom_sf(aes(fill = mean.MAPE), col = "white") +
  scale_fill_continuous(low = "#6F927F", high = "#b0522a", na.value = "#e9e9e9", name= "MAPE") +
  labs(title = "Mean Absolute Percentage Error By Neighborhood",
       subtitle = "Seattle, WA") +
    theme_void()

Cross validation

k-fold cross-validation

# Define the control parameters for k-fold cross-validation
control <- trainControl(method = "cv",    # Use k-fold cross-validation
                        number = 10,      # Number of folds
                        verboseIter = TRUE,  # Show verbose output
                        returnData = FALSE,  # Don't return resampled data
                        savePredictions = TRUE,  # Save predictions
                        classProbs = FALSE,  # Don't compute class probabilities
                        summaryFunction = defaultSummary)  # Use default summary function

# Train the linear regression model using k-fold cross-validation
lm_cv <- train(price ~ .,  # Formula for the linear regression model
               data = Seattle_final,  # Dataset
               method = "lm",           # Specify "lm" for linear regression
               trControl = control)     # Use the defined control parameters
# View the cross-validation results
lm_cv_results <- lm_cv['results']
lm_cv_statistic_summary <- do.call(rbind, lm_cv_results) %>% select(-contains("intercept"))


kbl(lm_cv_statistic_summary) %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left") %>%
  footnote(general_title = "\n", general = "Table 6")
| | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| results | 135579.4 | 0.757575 | 96878.09 | 5577.113 | 0.0171431 | 2686.396 |

Table 6
# Plot observed versus predicted values

ggplot(lm_cv$pred, aes(x = obs, y = pred)) +
  geom_point(size = 0.5, color = "#5c6e6c") +
  geom_smooth(method = "lm", se = FALSE, color = "#b0522a", size = 0.5) +
  labs(title = "Observed vs Predicted Housing Values",
       x = "Observed",
       y = "Predicted") +
  theme_minimal()

Leave-one-out Cross-validation

# Set up leave-one-out cross-validation
control <- trainControl(method = "LOOCV",     # Use leave-one-out cross-validation
                        number = nrow(Seattle_final),  # Number of folds = number of observations
                        verboseIter = TRUE,  # Show verbose output
                        returnData = FALSE,  # Don't return resampled data
                        savePredictions = TRUE,  # Save predictions
                        classProbs = FALSE,  # Don't compute class probabilities
                        summaryFunction = defaultSummary)  # Use default summary function

# Train the linear regression model using leave-one-out cross-validation
lm_loocv <- train(price ~ .,  # Formula for the linear regression model
                  data = Seattle_final,  # Dataset
                  method = "lm",           # Specify "lm" for linear regression
                  trControl = control)     # Use the defined control parameters


# 6669 folds in total (one per observation)
# View the cross-validation results
loocv_results <- lm_loocv['results']
statistic_summary <- do.call(rbind, loocv_results) %>% select(-contains("intercept"))


kbl(statistic_summary) %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left") %>%
  footnote(general_title = "\n", general = "Table 7")
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| results | 135662.4 | 0.7571996 | 96882.43 |

Table 7

Summary

The R-squared values from the evaluations above indicate that around 75.8% of the variability in sale price is explained by the model's predictor variables, which is reasonably strong performance. Comparing methods, the model fit on the full, unsplit data has the highest R-squared, while leave-one-out cross-validation has the lowest.
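
As a small sketch (reusing the summary_reg1, lm_cv, and lm_loocv objects created above), the three R-squared values can be gathered into one data frame for a side-by-side look:

# Gather the R-squared values from the three evaluations above
data.frame(
  Method   = c("Full data (reg1)", "10-fold CV", "LOOCV"),
  Rsquared = c(summary_reg1$r.squared,
               lm_cv$results$Rsquared,
               lm_loocv$results$Rsquared)
)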

However, there are still ways this model could be improved. For example, some of the independent variables, such as the crime data, do not come from the same year as the housing price data. In addition, the mean absolute percentage error is not distributed evenly across Seattle's neighborhoods.