One of the big operational challenges of bike share systems is “re-balancing”: getting bikes to stations that are anticipated to have demand but currently lack bikes. Figuring out how to do this is one of the keys to operating a successful system. This project aims to use predictive modeling to operationalize intelligence for this use case.
Here we predict only demand (ignoring the supply of bikes, the network routing of rebalancing trucks, and so on). If we knew the bike station capacities, we could see when demand is likely to drive stations to run out of bikes, and then move excess bikes in from elsewhere. A program manager for a bike share system could reasonably anticipate demand and allocate bikes ahead of time.
The demand for parking spaces, Uber trips, bike share, road access, and a whole host of other urban transportation phenomena is time and space dependent, and modeling it frequently involves simply controlling for the day, hour, location, weather, and other temporal phenomena. Quite simply, the demand for bike share trips at my location at 5 PM today is probably highly correlated with the demand at the same time last week.
palette7 <- c("#fedc78","#bbcf9b","#abba72","#498b6d","#496a81","#33576e","#f4a24c")
palette5 <- c("#f4a24c","#fedc78","#bbcf9b","#498b6d","#33576e")
palette4 <- c("#f4a24c","#bbcf9b","#498b6d","#fedc78")
palette2 <- c("#33576e","#f4a24c")
Let’s read in Indego trip data for the second quarter of 2023 (April through June), downloaded from Indego’s site.
dat <- read.csv("data/indego-trips-2023-q2.csv")
Let’s use some date parsing to “bin” the data into 15- and 60-minute intervals by rounding. Notice we use the time format mdy_hm to denote month, day, year, hour, and minute. We also extract the week of each observation (ranging from 1-52 throughout the year) and dotw for the day of the week.
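For example, floor_date rounds a parsed timestamp down to the start of its interval. The timestamp below is illustrative only (it is not taken from the data), but it uses the same month/day/year hour:minute format that mdy_hm expects:
# Illustrative only: a made-up timestamp in the start_time format
example_time <- mdy_hm("4/15/2023 17:47")
floor_date(example_time, unit = "hour")     # 2023-04-15 17:00:00 UTC
floor_date(example_time, unit = "15 mins")  # 2023-04-15 17:45:00 UTC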
dat2 <- dat %>%
mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour"),
interval15 = floor_date(mdy_hm(start_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
Using the tidycensus package, we can download census geography and variables. These are used to test generalizability later, but we don’t use them as independent variables because they end up being perfectly collinear with the station fixed effects. We extract the tracts for mapping and joining purposes, creating an sf object that consists only of GEOIDs and geometries.
phlCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2022,
state = "42",
geometry = TRUE,
county= "101",
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
# extract_geometries
phlTracts <-
phlCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf %>%
st_transform(crs = 4326)
We add the spatial information to our bike share data as origin and destination data, first joining the origin station and then the destination station to our census data. We don’t use the destination data in this project.
dat_census <- st_join(dat2 %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lon) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),phlTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lon = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., phlTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end_lon = unlist(map(geometry, 1)),
end_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
We import weather data from Philadelphia International Airport (PHL) using riem_measures. We can mutate the data to get temperature, wind speed, and precipitation on an hourly basis and plot the temperature and precipitation trends over our study period.
weather.Panel <-
riem_measures(station = "PHL", date_start = "2023-04-01", date_end = "2023-06-30") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) # Replace zeros (missing temperatures coded as 0 above) with a plausible value
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) +
geom_line(size = 0.2, color = '#33576e') +
labs(title="Precipitation", x="Month", y="Precipitation") +
theme_minimal(),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) +
geom_line(size = 0.05, color = '#33576e') +
labs(title="Wind Speed", x="Month", y="Wind Speed") +
theme_minimal(),
ggplot(weather.Panel, aes(interval60,Temperature)) +
geom_line(size = 0.2, color = '#33576e') +
labs(title="Temperature", x="Month", y="Temperature") +
theme_minimal(),
top="PHL Airport Weather Data: April - June, 2023")
We begin by examining the time and frequency components of our data.
First, we look at the overall time pattern - there is clearly a daily periodicity and there are lull periods on weekends.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n), size = 0.2, color = '#33576e')+
labs(title="Bike share trips per hr in Philadelphia, April - June, 2023",
x="Date",
y="Number of trips")+
theme_minimal()
Let’s examine the distribution of trip volume by station for different times of the day. Our data consists of a lot of low demand station hours and a few high demand station hours.
# all station plot
ggplot(dat_census %>%
group_by(interval60, start_station) %>%
tally())+
geom_histogram(aes(n), binwidth = 1, fill="#33576e")+
labs(title="Bike share trips per hr by station in Philadelphia, April - June, 2023",
x="Trip Counts",
y="Station hour Count")+
theme_minimal()
We can also separate the distribution above by different time periods of the day. In this project, we define the time periods as follows:
* Overnight: 7 PM - 7 AM
* AM Rush: 7 AM - 9 AM
* Mid-Day: 9 AM - 4 PM
* PM Rush: 4 PM - 7 PM
# time period plot
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 9 ~ "AM Rush",
hour(interval60) >= 9 & hour(interval60) < 16 ~ "Mid-Day",
hour(interval60) >= 16 & hour(interval60) <= 19 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally() %>%
group_by(start_station, time_of_day) %>%
summarize(mean_trips = mean(n)) %>%
ggplot()+
geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1) +
scale_fill_manual(values = palette4) +
labs(title="Mean Number of Hourly Trips Per Station in Philadelphia, April - June, 2023",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
theme_minimal() + theme(legend.position = "none")
We can also track the daily trends in ridership by day of the week and weekend versus weekday, to see what temporal patterns we’d like to control for.
ggplot(dat_census %>% mutate(hour = hour(mdy_hm(start_time))))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1, lwd = 0.8)+
scale_color_manual(values = palette7,
name = "Day of the week") +
labs(title="Bike share trips in Philadelphia, by day of the week, April - June, 2023",
x="Hour",
y="Trip Counts")+
theme_minimal()
ggplot(dat_census %>%
mutate(hour = hour(mdy_hm(start_time)),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1, lwd = 1)+
scale_color_manual(values = palette2) +
labs(title="Bike share trips in Philadelphia: weekend vs weekday, April - June, 2023",
x="Hour",
y="Trip Counts")+
theme_minimal()
The maps below visualize the trips per hour in the geospatial context, where each point stands for the location of an Indego station.
ggplot()+
geom_sf(data = phlTracts %>%
st_transform(crs=4326), fill = "#e9e9e9", col = "grey")+
geom_point(data = dat_census %>%
mutate(hour = hour(mdy_hm(start_time)),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 9 ~ "AM Rush",
hour(interval60) >= 9 & hour(interval60) < 16 ~ "Mid-Day",
hour(interval60) >= 16 & hour(interval60) <= 19 ~ "PM Rush"))%>%
group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
tally(),
aes(x=start_lon, y = start_lat, color = n),
size = 1, alpha=0.8)+
scale_color_gradientn(colors = rev(palette5),
name = "Number of trips") +
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station in Philadelphia, April - June, 2023")+
theme_minimal() +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
First we have to make sure each unique station and hour/day combo exists in our data set. This is done in order to create a “panel” (e.g. a time-series) data set where each time period in the study is represented by a row. So if a station didn’t have any trips originating from it at a given hour, we still need a zero in that spot in the panel.
We start by determining the maximum number of combinations. Then we compare that to the actual number of combinations. The result shows that these two values are the same.
# Calculate the maximum number of station-hour combinations (i.e., rows in the full panel dataset)
length(unique(dat_census$interval60)) * length(unique(dat_census$start_station))
# This block of code creates the panel dataset 'study.panel'.
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
start_station = unique(dat_census$start_station)) %>%
left_join(., dat_census %>%
select(start_station, Origin.Tract, start_lon, start_lat )%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
# Calculate the number of rows in the resulting panel dataset
nrow(study.panel)
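As a sanity check (a small sketch using the objects defined above, not part of the original workflow), we can compare the two counts directly:
# TRUE if every station-hour combination appears exactly once in the panel
length(unique(dat_census$interval60)) * length(unique(dat_census$start_station)) == nrow(study.panel)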
We create the full panel by summarizing counts by station for each time interval, keeping the census and lat/lon information along for joining later to other data. We also drop rows with missing station IDs or origin tracts.
# Create a panel dataset for bike ride data
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>% # Add a Trip_Counter variable, indicating one trip for each row
right_join(study.panel) %>% # Right join with the study.panel dataset to ensure all spatial-time combinations are included
group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>% # Group by spatial-time and station variables
summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>% # Summarize the Trip_Counter variable to calculate total trip count for each group
left_join(weather.Panel) %>% # Left join with the weather.Panel dataset to include weather information
ungroup() %>% # Remove grouping
filter(!is.na(start_station)) %>% # Filter out rows with missing start_station values
mutate(week = week(interval60), # Create a week variable indicating the week number of the year
dotw = wday(interval60, label = TRUE)) %>% # Create a dotw variable indicating the day of the week (label format)
filter(!is.na(Origin.Tract)) # Filter out rows with missing Origin.Tract values
ride.panel <-
left_join(ride.panel, phlCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
Creating time lag variables will add additional nuance about the demand during a given time period: the hours before and during that day.
We can also try to control for the effects of holidays that disrupt the expected demand during a given weekend or weekday. The holidays in our April-June study period are Easter (April 1st), Memorial Day (May 28th), and Juneteenth (June 19th), which correspond to the day-of-year values used in the code below.
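The holiday dummy in the code below is keyed on day-of-year values; as a quick reference (a small illustrative check, not part of the original workflow), lubridate's yday() shows how those numbers map to the dates above:
# Day-of-year values for the 2023 holidays referenced above
yday(ymd("2023-04-01"))  # 91
yday(ymd("2023-05-28"))  # 148
yday(ymd("2023-06-19"))  # 170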
We can evaluate the correlations in these lags. They are pretty strong: there’s a Pearson’s R of 0.89 for lagHour. This makes sense because the demand right now should be relatively similar to the demand an hour from now and to the demand at this time tomorrow, but twelve hours from now we likely expect the opposite in terms of demand.
# Arrange the ride.panel dataset by 'start_station' and 'interval60'
# This step ensures that the dataset is ordered properly for the lag calculations below
ride.panel <-
ride.panel %>%
arrange(start_station, interval60) %>%
# Compute lag variables for the 'Trip_Count' variable at different time intervals
mutate(
lagHour = dplyr::lag(Trip_Count, 1), # Lag of 1 hour
lag2Hours = dplyr::lag(Trip_Count, 2), # Lag of 2 hours
lag3Hours = dplyr::lag(Trip_Count, 3), # Lag of 3 hours
lag4Hours = dplyr::lag(Trip_Count, 4), # Lag of 4 hours
lag12Hours = dplyr::lag(Trip_Count, 12), # Lag of 12 hours
lag1day = dplyr::lag(Trip_Count, 24) # Lag of 1 day (24 hours)
) %>%
# Identify holidays and create a binary variable indicating whether it's a holiday
mutate(
holiday = ifelse(yday(interval60) == 91 | yday(interval60) == 148 | yday(interval60) == 170, 1, 0) # Check if it's a holiday
) %>%
# Create a variable representing the day of the year
mutate(
day = yday(interval60) # Extract the day of the year from 'interval60'
) %>%
# Create additional variables related to holiday lag
mutate(
holidayLag = case_when(
dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay", # If the previous day was a holiday
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays", # If two days ago was a holiday
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays", # If three days ago was a holiday
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay", # If the next day will be a holiday
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays", # If two days ahead will be a holiday
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays" # If three days ahead will be a holiday
),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag) # Convert NA to 0 for non-holiday cases
)
panel_result <-as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
t()
colnames(panel_result) <- as.character(panel_result[1, ]) # Set column names to values in the first row
panel_result <- panel_result[-1, , drop = FALSE] # Remove the first row
panel_result %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
| | lagHour | lag2Hours | lag3Hours | lag4Hours | lag12Hours | lag1day |
|---|---|---|---|---|---|---|
| correlation | 0.89 | 0.71 | 0.50 | 0.31 | -0.55 | 0.83 |
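The intuition from the introduction, that demand mirrors the same hour one week earlier, can be checked the same way. This is an optional aside (a sketch using the ride.panel object above, not part of the models that follow), using a 168-hour lag computed within each station:
# Correlation between current demand and demand one week (168 hours) earlier
ride.panel %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%
  mutate(lag1week = dplyr::lag(Trip_Count, 168)) %>%
  ungroup() %>%
  summarize(correlation = round(cor(lag1week, Trip_Count, use = "complete.obs"), 2))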
We split our data into a training and a test set. We split by time instead of randomly because we want both the training and test sets to contain the full range of values (days of the week, times of day). We create five linear models using the lm function, fitting each on our training data ride.Train. The first models include only temporal controls, but the later ones contain all of our lag information.
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)
# 5 models
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station + dotw + Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.Train)
We create a nested data frame of test data by week and a function called model_pred, which we can then map onto each data frame in our nested structure.
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
Then we can generate model predictions, reshape the data, calculate error metrics, and organize the results into a single data frame for further analysis and evaluation of each regression model’s performance in predicting bike share demand.
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
The chart below shows the MAE for the five models by week: model 4 has the smallest MAE on the test set, while the holiday lags in model 5 actually make the model slightly less accurate.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
theme_minimal()
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 0.7) +
scale_color_manual(values=palette2)+
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
theme_minimal()
We select reg4 to move forward, since it generally has the best goodness of fit.
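To attach a single number to that choice (a quick summary sketch using the week_predictions object built above; exact values depend on the fitted models), we can average the weekly MAE for each specification:
# Average test-set MAE across weeks for each model specification
week_predictions %>%
  group_by(Regression) %>%
  summarize(mean_MAE = round(mean(MAE), 3)) %>%
  arrange(mean_MAE) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")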
We can look at our mean absolute errors by station - stations in Center City have higher MAE than stations in other areas, but we need to go a bit further to get at the temporal element of the error.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon)) %>%
select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags") %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phlCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE))+
scale_color_gradientn(colors = rev(palette5),
name = "MAE") +
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
labs(title="Mean Abs Error, Test Set, Model 4") +
theme_void()
If we plot observed vs. predicted trips for different times of day on weekdays and weekends, we can see that we under-predict trips in general and that the predictions are not especially accurate.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 9 ~ "AM Rush",
hour(interval60) >= 9 & hour(interval60) < 16 ~ "Mid-Day",
hour(interval60) >= 16 & hour(interval60) <= 19 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction), color = "#33576e",size = 1, alpha = 0.4)+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#498b6d", size = 0.8)+
geom_abline(slope = 1, intercept = 0, color = "#f4a24c", size = 0.8)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
theme_minimal()
When looking at our errors on a map by weekend/weekday and time of day, we can see that most of the large errors occur during the AM Rush and PM Rush on weekdays, and during Mid-Day and the PM Rush on weekends. In short, the error pattern follows the trend in trip volume by time of day.
The errors are concentrated in Center City and University City on weekdays, and near the Philadelphia Museum of Art on weekends.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phlCensus, fill = "#e9e9e9", col = "grey")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE), size = 1, alpha=0.8)+
scale_color_gradientn(colors = rev(palette5),
name = "MAE") +
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
theme_minimal() +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
Let’s focus on the morning commute, when station locations probably relate to likely users, who appear to be commuting downtown into Center City. A select few stations are proving very resistant to our model: they sit in tracts with higher incomes, lower transit usage, longer commute times, and a larger share of white residents.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White),
Mean_Commute_Time = map(data, pull, Mean_Commute_Time)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White, Mean_Commute_Time) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 9 ~ "AM Rush",
hour(interval60) >= 9 & hour(interval60) < 16 ~ "Mid-Day",
hour(interval60) >= 16 & hour(interval60) <= 19 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White, Mean_Commute_Time) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.4, color = "#33576e")+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "#f4a24c", size = 0.8)+
facet_wrap(~variable, scales = "free", ncol = 4)+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
theme_minimal()
Based on our time-series plots, we can see that we are able to track the time components of demand, but we miss the peaks and under-predict in general. Based on the subsequent maps of our errors, the peaks seem to have some spatial or demographic pattern to them.
In summary, the model is helpful in predicting overall demand across all stations, but it does not do well at predicting high demand at particular stations, which limits its usefulness as a guide for rebalancing bikes. Additional independent variables are needed to produce a more accurate model.