I live next to the Dolores River. It’s an often overlooked gem of the southwest. It runs from just outside Rico, Colorado at its headwaters to the Colorado River near Moab, Utah. It’s an experience.

As of late the river has become something of a destination for the rafting community. Most sections of the river are mellow and scenic. But two sections have class 4 rapids that keep things interesting: Snaggletooth Rapid and State Line Rapid (also known as the Chicken Raper Rapid).

In 1985 the McPhee Dam was completed, forever changing the flow of the lower Dolores River. Today, farmers and ranchers rely on water from the McPhee Dam to to grow crops and raise livestock. As a result, the primary goal of reservoir managers is to keep as much water in the reservoir as possible. Raftable releases are secondary.

This has resulted in several years where the reservoir has not filled and there have been no raftable releases. Other years reservoir managers say that the ther will be no release and we get a release.

Because the Dolores is so loved in our community, there are a few people who try and predict when the Dolores. I thought it would be fun to do myself. Here’s how I built my model to predict the number of days the Dolores would run.

Here I load packages and set a custom theme. You can skip this part and just add in set_theme(theme_light) if you want to skip this part.

library(tidyverse) ## For all the data cleaning work.
library(lubridate) ## For working with dates.
library(extrafont) ## For Font Graphics
options(scipen = 999)

t<-theme_light()+
theme(
text=element_text(family = "Poppins", color = "#FFFFFF"),
plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
plot.title = element_text(face = "bold", size = rel(2.3), color = "#FFFFFF"),
plot.subtitle = element_text(color = "#7A7A7A", size = rel(1.3), margin = margin(t = 0, r = 0, b = 15, l = 0)),
plot.background = element_rect(fill = "#222222", color = "#FFFFFF"),
panel.grid.major = element_line(color = "#555555"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#222222", color = "#222222"),
panel.border = element_blank(),
axis.text = element_text(size = rel(0.8), colour = "#CCCCCC"),
axis.title.x = element_text(margin = margin(t = 5, r = 0, b = 10, l = 0)),
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 20)),
legend.title = element_text(face = "bold"),
legend.text = element_text(color = "#222222"),
legend.direction = "horizontal",
legend.position="bottom",
legend.key.width = unit(5, "cm"),
legend.key.height = unit(0.3, "cm"),
legend.box = "vertical",
plot.caption = element_text(color = "#7A7A7A", size = rel(0.9))
)
theme_set(t)

# Getting a variable to predict

### Getting Stream Guage Data Below McPhee Dam from the USGS

url<-paste0("https://waterservices.usgs.gov/nwis/dv/?format=rdb&sites=09169500,%2009166500&startDT=1985-02-01&endDT=", Sys.Date(), "&statCd=00003&siteType=ST&siteStatus=all")

select(2:5)%>%
rename(site_id = 1, date = 2, flow=3, code = 4)%>%
mutate(site_id = ifelse(site_id == "09166500", "Dolores", "Bedrock"))%>%
drop_na()

bedrock_flow<-flow_data%>%
filter(site_id == "Bedrock"& year(date)>1986)%>%
select(-code)

bedrock_flow
## # A tibble: 12,098 x 3
##    site_id date        flow
##    <chr>   <date>     <dbl>
##  1 Bedrock 1987-01-01   120
##  2 Bedrock 1987-01-02   130
##  3 Bedrock 1987-01-03   120
##  4 Bedrock 1987-01-04   120
##  5 Bedrock 1987-01-05   140
##  6 Bedrock 1987-01-06   140
##  7 Bedrock 1987-01-07   130
##  8 Bedrock 1987-01-08   130
##  9 Bedrock 1987-01-09   110
## 10 Bedrock 1987-01-10    95
## # … with 12,088 more rows

I want to predict the raftable days released below McPhee Dam. To do this I need flow data from a gauge below McPhee. There are a few to choose from, but the oldest is at Bedrock, CO. To get the data I used the USGS Water Services REST API. There’s an R package that you can used to get the data, but since I’d used the API a few times before and I knew that the data could be returned in tab-separated format, I just used the readr::read_tsv() function.

• The first step is to build the URL for the API. I used the USGS Rest API builder tool to get the right data. I concatenated in todays data using Sys.Date() function.
• Then used the read_tsv() function to pull data from the API skpping the first 35 lines of comments.
• I then renamed the columns to something I could understand.
• And then convert the site_ids to Characters instead of numbers. I pulled data from above and below the dams because at the time I thought I might use the data from above the Dam later for another predictive variable.
• Then I drop all the missing values and subset the data to just below the dam.

### Visualizing the flow

bedrock_flow%>%
filter(year(date)>2007)%>%
ggplot(aes(date, flow))+
geom_line(color = "#1AB8FF", size =1.2, alpha = 0.6)+
labs(title = "Dolores Flows from 2008 to Present",
subtitle = "USGS Flow Gauge Site 09169500 Near Bedrock, Colorado",
caption = "Data aquired from the Bureau of Reclemation using the RNRCS package.",
y = "Flow Rate (cfs)",
x = "Year")

I know from living here that the Dolores has flash flood events in the monsoon season around the end of August giving the river raftable flows that are not produced by a Dam release. I want to remove these events from the prediction because they could alter our predictive variable.

### Subsetting the release

predicted_variable<-bedrock_flow%>%
filter(flow>800 & month(date) %in% c(3:7))%>%
count(year(date))%>%
rename(year = 1, raftable_releases = 2)

predicted_variable
## # A tibble: 21 x 2
##     year raftable_releases
##    <dbl>             <int>
##  1  1987                87
##  2  1988                13
##  3  1989                17
##  4  1991                 9
##  5  1992                61
##  6  1993                99
##  7  1994                35
##  8  1995                89
##  9  1997                82
## 10  1998                63
## # … with 11 more rows

To get the number of raftable release we subset the flow rates to flows above 800 (cfs - cubic feet per second).

• First we filter the data for flows above 1000 cfs and dated from March to July (most of the monsoons occur from august to October).
• Then we simply count the number of dates that are left by year.

We now have the variable we want to predict. Now we need predicters.

# Getting the predictive variables.

There are are many variables that affect reservoir volume. Current Reservoir volume and snowpack are probably two of the more important variables that are reliably measured. Snowpack tells us how much water will eventually be coming into the reservoir in the spring and reservoir volume tells us how much water we will need to cause a spill. Spring rain would be another variable that would be great to include, but I don’t want to rely on weather forecasts that far out. So we will stick with snowpack and current reservoir volume.

## Snow Depth

The SNOTEL (Snow Telemetry) network provides snow data for 730 sites accross the country. There are several in the Dolores watershed that can give us a bunch of data on snowpack. For this we will use the handy dandy package library(RNRCS) (the NRCS is the government agency in the Department of Agriculture that manages the SNOTEL network) ### Finding All Snotel Sites in the Dolores Watershed First we need to find which SNOTEL sites are within the Dolores River watershed to inform how much water could end up in McPhee reservoir.

library(RNRCS)
meta_data<-grabNRCS.meta(ntwrks = c("SNTL", "SNTLT", "SCAN"))

meta_data_unlist<-meta_data[[1]]%>%
as_tibble()

meta_data_unlist
## # A tibble: 876 x 13
##    ntwk  wyear state site_name ts    start enddate latitude longitude
##    <chr> <fct> <fct> <chr>     <fct> <fct> <fct>   <fct>    <fct>
##  1 SNTL  2019  AK    Frostbit… ""    2019… 2100-J… "  61.7… -149.27
##  2 SNTL  2019  AK    McGrath   ""    2019… 2100-J… "  62.9… -155.61
##  3 SNTL  2019  UT    Parleys … ""    2019… 2100-J… "  40.7… -111.61
##  4 SNTL  2018  AK    East Pal… ""    2018… 2100-J… "  61.6… -149.10
##  5 SNTL  2018  AK    Galena AK ""    2018… 2100-J… "  64.7… -156.71
##  6 SNTL  2018  MT    JL Meadow ""    2017… 2100-J… "  44.7… -113.12
##  7 SNTL  2018  NV    ONeil Cr… ""    2018… 2100-J… "  41.8… -115.08
##  8 SNTL  2017  AK    Flower M… ""    2017… 2100-J… "  59.4… -136.28
##  9 SNTL  2017  CA    Fredonye… ""    2016… 2100-J… "  40.6… -120.61
## 10 SNTL  2017  UT    Bobs Hol… ""    2016… 2100-J… "  38.9… -112.15
## # … with 866 more rows, and 4 more variables: elev_ft <fct>, county <fct>,
## #   huc <fct>, site_id <chr>

grabNRCS.meta - grabs all sites within the Snotel (“SNTL”), Snotel Lite, (“SNTLT”), and Scan (“SCAN”) systems. I’m really just interested in the snotel sites, but I wasn’t really sure what Snotel Lite or Scan was so I loaded them just in case. grabNRCS returns a single list, se we subset just the first item of the list.

dolores_sites<-meta_data_unlist%>%
filter(state =="CO", str_detect(huc, "140300"))%>%
mutate(site_id_num = as.numeric(str_match_all(site_id, "[0-9]+")))

dolores_sites
## # A tibble: 6 x 14
##   ntwk  wyear state site_name ts    start enddate latitude longitude
##   <chr> <fct> <fct> <chr>     <fct> <fct> <fct>   <fct>    <fct>
## 1 SNTL  2011  CO    Black Me… ""    2011… 2100-J… "  37.7… -108.18
## 2 SNTL  2005  CO    Sharksto… ""    2004… 2100-J… "  37.5… -108.11
## 3 SNTL  1986  CO    El Dient… ""    1985… 2100-J… "  37.7… -108.02
## 4 SNTL  1986  CO    Scotch C… ""    1985… 2100-J… "  37.6… -108.01
## 5 SNTL  1980  CO    Lizard H… ""    1979… 2100-J… "  37.8… -107.92
## 6 SNTL  1980  CO    Lone Cone ""    1979… 2100-J… "  37.8… -108.20
## # … with 5 more variables: elev_ft <fct>, county <fct>, huc <fct>,
## #   site_id <chr>, site_id_num <dbl>

dolores_sites - We then filter all sites in Colorado (“CO”) and detect all of the huc values with 140300. To pull data from the sites we will need a site ID number. The Meta data has the site IDs in a string that contains characters and numbers.

dolores_site_ids<-dolores_sites%>%
pull(site_id_num)%>%
unlist()%>%
as.numeric()

dolores_site_ids
## [1] 1185 1060  465  739  586  589

dolores_site_ids - The last step is to convert the site_id_num variable into a vector so that we can pull data from each site in the next step.

## Pulling the data from the snotel sites

### Function to Pull All Sites

First we need a function that we can loop over each of the snotel site IDs and pull data for each of one. grabNRCS.data does not allow you to pull data from more than one site at a time.

get_snotl_data<-function(site_id){
grabNRCS.data(network = "SNTL",
site_id = site_id,
timescale = "daily",
DayBgn = '1985-01-01',
DayEnd = Sys.Date()
)%>%
as_tibble()%>%
mutate(site_id_num = site_id)
}

To get data from each site we need to use grabNRCS.data() function. grabNRCS.data takes: * network - the network we want to pull from. * the site_id which we produced a list of in the last step. * DayBgn - The first day you want to pull data from. * DayEnd - the Last date you want data from. Here we use the system data (Today).

We then convert the data to a tibble for better output in Rstudio and add a column with mutate with the site_id number for working with the data later.

### Getting ALL the data with lapply

all_sntl_data<-lapply(dolores_site_ids, get_snotl_data)%>%
bind_rows()

all_sntl_data
## # A tibble: 58,531 x 22
##    Date  Air.Temperature… Air.Temperature… Air.Temperature…
##    <chr>            <int>            <int>            <int>
##  1 2012…               46               52               42
##  2 2012…               46               53               42
##  3 2012…               52               62               44
##  4 2012…               52               61               46
##  5 2012…               52               64               46
##  6 2012…               54               64               47
##  7 2012…               53               64               49
##  8 2012…               52               61               47
##  9 2012…               54               63               49
## 10 2012…               50               61               42
## # … with 58,521 more rows, and 18 more variables:
## #   Air.Temperature.Observed..degF..Start.of.Day.Values <int>,
## #   Precipitation.Accumulation..in..Start.of.Day.Values <dbl>,
## #   Snow.Depth..in..Start.of.Day.Values <int>,
## #   Snow.Water.Equivalent..in..Start.of.Day.Values <dbl>,
## #   Soil.Moisture.Percent..4in..pct..Start.of.Day.Values <dbl>,
## #   Soil.Moisture.Percent..8in..pct..Start.of.Day.Values <dbl>,
## #   Soil.Moisture.Percent..20in..pct..Start.of.Day.Values <dbl>,
## #   Soil.Moisture.Percent..40in..pct..Start.of.Day.Values <dbl>,
## #   Soil.Temperature.Observed..4in..degF..Start.of.Day.Values <int>,
## #   Soil.Temperature.Observed..8in..degF..Start.of.Day.Values <int>,
## #   Soil.Temperature.Observed..20in..degF..Start.of.Day.Values <int>,
## #   Soil.Temperature.Observed..40in..degF..Start.of.Day.Values <int>,
## #   site_id_num <dbl>,
## #   Soil.Moisture.Percent..2in..pct..Start.of.Day.Values <dbl>,
## #   Soil.Temperature.Observed..2in..degF..Start.of.Day.Values <int>,
## #   Wind.Direction.Average..degree. <int>, Wind.Speed.Average..mph. <dbl>,
## #   Wind.Speed.Maximum..mph. <dbl>

lapply takes: * dolores_site_ids a vector of values to loop over. * get_snotl_data a function to inject each of the vector values into one at a time.

lapply returns a list so the final step is to bind_rows which merges all of the columns.

### Cleaning the Data

I’m going to predict number of days the Dolores Runs in a year, so I need to convert the snow data from daily data from multiple sites to annual data from all sites. I’m assuming that max snow water equivalant correlates best with runoff in this dataset so we will use that to summarize each year.

se_site<-all_sntl_data%>%
select(Date, Snow.Depth..in..Start.of.Day.Values, Snow.Water.Equivalent..in..Start.of.Day.Values, site_id_num)%>%
mutate(
date = as.Date(Date)
)%>%
rename(snow_depth = 2, snow_water_eq=3)%>%
filter(site_id_num %in% c(465, 586, 589, 739))

se_site
## # A tibble: 50,182 x 5
##    Date       snow_depth snow_water_eq site_id_num date
##    <chr>           <int>         <dbl>       <dbl> <date>
##  1 1986-08-05         NA            NA         465 1986-08-05
##  2 1986-08-06         NA             0         465 1986-08-06
##  3 1986-08-07         NA             0         465 1986-08-07
##  4 1986-08-08         NA             0         465 1986-08-08
##  5 1986-08-09         NA             0         465 1986-08-09
##  6 1986-08-10         NA             0         465 1986-08-10
##  7 1986-08-11         NA             0         465 1986-08-11
##  8 1986-08-12         NA             0         465 1986-08-12
##  9 1986-08-13         NA             0         465 1986-08-13
## 10 1986-08-14         NA             0         465 1986-08-14
## # … with 50,172 more rows
• First we select the columns that contain the Date, Snow Depth and Snow water Equivalant. I don’t end up using Snow depth because it only goes back to around 2000 for all sites while snow water equivalant goes back to 1986.
• mutate add a date field that is formatted as a date.
• rename snow depth and snow water equivalant to something more manageable.
• The last thing we do is subset the sites to only those that have been around since 1986 using filter.

Let’s take a look at these sites really quick to see how they look.

se_site%>%
filter(year(date)>2007)%>%
ggplot(aes(date, snow_water_eq, color = as.factor(site_id_num)))+
geom_line(size = 1.5, alpha = 0.5)+
theme_light()

se_site_year<-se_site%>%
group_by(year(date), site_id_num)%>%
summarize(max_se = max(snow_water_eq, na.rm = T))%>%
ungroup()%>%
rename(year=1)

se_site_year
## # A tibble: 142 x 3
##     year site_id_num max_se
##    <dbl>       <dbl>  <dbl>
##  1  1985         586   21.6
##  2  1985         589   23.5
##  3  1986         465    6.3
##  4  1986         586   19.8
##  5  1986         589   20.8
##  6  1986         739    5.1
##  7  1987         465   19.1
##  8  1987         586   19.1
##  9  1987         589   22
## 10  1987         739   20.1
## # … with 132 more rows
• The next several steps are where the sausage is made. First I want the max value for snow water equivalant for each site per year. To do this, we group_by year(date) which uses the date field to group the data and then. We also want to group by site_id_num so we add that to the group_by as well.
• We can then summarize by our groups. Here we want to make a variable caleed max_se which is the max of the snow_water_eq field within the group we described above.
• Lastly we ungroup the data (group_by can cause nasty problems if not ungrouped) and rename the year(date) column to year.

### Combining a subset of sites

To simplify things we will average the sites. I think if I did this again I would not average the sites but have each site be a predictor.

avg_snwater_eq<-se_site_year%>%
filter(year>1986)%>%
group_by(year)%>%
summarize(avg_snow_water_e = mean(max_se))%>%
ungroup()

avg_snwater_eq
## # A tibble: 34 x 2
##     year avg_snow_water_e
##    <dbl>            <dbl>
##  1  1987            20.1
##  2  1988            11.4
##  3  1989            16.3
##  4  1990             9.28
##  5  1991            16.5
##  6  1992            15.5
##  7  1993            25.4
##  8  1994            12.3
##  9  1995            21.8
## 10  1996            13.6
## # … with 24 more rows
• Only 4 of the 6 sites in the basin have values all the way back to 1986, the year that the McPhee Dam was put in.
• We then group_by year because we want one value per year.
• Then summarize the max snow water equivalent (max_se) - created in the last step - as a mean of the four site maxes per year.
• And then, as always, ungroup.

### Summary of Snotel Snow depth

Now we have our first predictive variable: average max snow water equivalent four three snotel sites in the Dolores River watershed.

The steps we took to get that data: 1. We used the RNRCS package to… 2. Get all sites in the Snotel Network with grabNRCS.meta() 3. We then subset those sites to just those in the Dolores watershed. 4. We then used the site_ids to get daily snow data from each site. 5. We combined each sites data into one dataframe. 6. We then grouped the data to get the max snow water equivalent for each site for each year. 7. We then averaged the maxes from four sites to get one average snow water equivalent per year.

## Prediction of Future Snowpack

We’ll use the {fable} package for timeseries forecasting.

library(fable) #For prediction

For forecasting we need an average of all four sites per day. We want to predict what the average will be.

avg_sn<-se_site%>%
filter(year(Date)>1987)%>%
group_by(Date)%>%
summarize(avg_sn_eq =  mean(snow_water_eq, rm.na = T))%>%
ungroup()%>%
mutate(Date = as.Date(Date))

avg_sn
## # A tibble: 11,741 x 2
##    Date       avg_sn_eq
##    <date>         <dbl>
##  1 1988-01-01      4.1
##  2 1988-01-02      4.18
##  3 1988-01-03      4.38
##  4 1988-01-04      4.53
##  5 1988-01-05      4.82
##  6 1988-01-06      5.4
##  7 1988-01-07      5.6
##  8 1988-01-08      5.65
##  9 1988-01-09      5.8
## 10 1988-01-10      5.9
## # … with 11,731 more rows
• We filter years after 1987 because we don’t want the year that they were filling the reservior to influence the prediction.
• Next we group by Date which is daily and then summarize, averaging all four sites for that day. We ungroup to prevent weird stuff from happening.
• And lastly we convert the Date column, which for some reaon was converted to a character, to an actual date with as.Date.

Let see what it looks like:

avg_sn%>%
ggplot(aes(Date, avg_sn_eq))+
geom_line(color = "#1AB8FF", size = 2, alpha = 0.5)+
labs(title = "Average Snow Water Equivalant",
subtitle = "From four Snotel Sites in the Dolores River watershed",
y = "Avg. Snow Water Equivalant")

To do a timeseries forcast we need to convert the data into tsibble format. I don’t know a ton about tsibbles. I think they are basically just a tibble with an time index.

avg_sn_ts<-avg_sn%>%
as_tsibble(index = Date)

Next we will make our model. I by no means am an expert at forecasting (this took me an embarassingly long time). Many of my initial tries failed using a variety of other models. I finally got an auto ARIMA forecast to work using fourier. If you want to learn more about forecasting, check out Rob Hyndman’s Blog.

fit<-avg_sn_ts%>%
model(
arima = ARIMA(avg_sn_eq~fourier("year", K = 15))
)

fit%>%
forecast(h = "3 months")%>%
autoplot(avg_sn_ts%>%filter(year(Date)>2018))+
geom_line(color = "#ffffff", size = 2)+
labs(title = "Snowpack Prediction",
subtitle = "Dolores River watershed Snotel sites",
y = "Avg. Snow Whater Equivalant")

Looking at the forecast in the autoplot() which is just an easy ggplot for forecasting, it looks to me that the forecast looks pretty good, but might be a bit optimistic for the average. But it’s hard to tell. We could get a lot of snow in the end of February or March. It gives us a good range to work with and use in our predictions.

To use in our prediction we need a max, min and average value for the max snowpack.

pred_sn_pk<-fit%>%
forecast(h = "6 months")%>%
filter(avg_sn_eq == max(avg_sn_eq))%>%
mutate(ci_90 = hilo(.distribution, 90))%>%
mutate(pred_lower = ci_90$.lower, pred_upper = ci_90$.upper)%>%
as_tibble()%>%
rename(pred_avg_sn_eq = avg_sn_eq)%>%
select(Date, pred_avg_sn_eq, pred_lower, pred_upper)%>%
pivot_longer(-Date, names_to = "estimated_eq", values_to = "avg_snow_water_e")%>%
mutate(year = year(Date))%>%
select(-Date)

pred_sn_pk
## # A tibble: 3 x 3
##   estimated_eq   avg_snow_water_e  year
##   <chr>                     <dbl> <dbl>
## 1 pred_avg_sn_eq            12.6   2020
## 2 pred_lower                 9.70  2020
## 3 pred_upper                15.5   2020
• User our model to forecast() six months out.
• filter out the max value for the average (this isn’t perfect but it will work for our purposes).
• Add a column with higher and lower values based on a 90% confidence interval using the hilo function.
• convert to a tibble.
• Select only the values we want.
• pivot_longer so that we have a year column and a snow_water_equivalant column.
• Make a year column.
• drop the Date column.

## Reservoir Volume

The next variable we need is reservoir volume. ### Get Daily McPhee Reservoir Volumes

bor_data<-grabBOR.data(site_id = "MPHC2000",
timescale = 'daily',
DayBgn = '1985-01-01',
DayEnd = Sys.Date())%>%
as_tibble()%>%
mutate(date = as.Date(Date),
res_volume = as.numeric(Reservoir Storage Volume (ac_ft) Start of Day Values))%>%
select(date, res_volume)

bor_data
## # A tibble: 12,826 x 2
##    date       res_volume
##    <date>          <dbl>
##  1 1985-01-01      44997
##  2 1985-01-02      45079
##  3 1985-01-03      45161
##  4 1985-01-04      45189
##  5 1985-01-05      45207
##  6 1985-01-06      45262
##  7 1985-01-07      45372
##  8 1985-01-08      45463
##  9 1985-01-09      45601
## 10 1985-01-10      45721
## # … with 12,816 more rows

We use the {RNRCS} package again to get Bureau of Reclemation (BOR - the agency that managed the McPhee dam)data. * The {RNRCS} package has a function to grab Bureau of Reclemation data grabBOR.data(). It works much the same way as the grapNRCS.data() function that we wrapped our own function around above. However this time, we only needed one site so I just looked it up and put it in instead of using a meta function to get sites like we did with the snotel data. * We convert the data to a tibble (df) convert the date to date format and convert the volume into numeric. * We then subset the columns to just date and reservoir volume.

### Summarize data to yearly

res_vol<-bor_data%>%
filter(month(date) %in% c(01, 02))%>%
group_by(year(date))%>%
summarize(min_vol = min(res_volume, na.rm = T))%>%
ungroup()%>%
rename(year = 1)%>%
filter(year>1986)

res_vol
## # A tibble: 34 x 2
##     year min_vol
##    <dbl>   <dbl>
##  1  1987  293309
##  2  1988  324234
##  3  1989  319239
##  4  1990  274597
##  5  1991  241536
##  6  1992  297471
##  7  1993  304632
##  8  1994  310609
##  9  1995  268308
## 10  1996  312568
## # … with 24 more rows

Here we subset the data to just the winter months and then summarize by year by getting the minimum value for all three months (I originally averaged the months, but getting the min gives a better prediction). * First we filter January and February to get that years lowest reservoir values prior to runoff. * We then group_by() year and then get the minimul volume from each of the two months. * then we rename the first column to year instead of year(date) and filter out all years before 1987.

# Building the Model

### Combining the variables and the datasets.

var_df<-predicted_variable%>%
full_join(avg_snwater_eq)%>%
full_join(res_vol)%>%
arrange(year)%>%
mutate(raftable_releases = ifelse(is.na(raftable_releases), 0, raftable_releases))

var_df_train<-var_df%>%
filter(year!=2020)

var_df_test<-var_df%>%
filter(year==2020)

var_df_train
## # A tibble: 33 x 4
##     year raftable_releases avg_snow_water_e min_vol
##    <dbl>             <dbl>            <dbl>   <dbl>
##  1  1987                87            20.1   293309
##  2  1988                13            11.4   324234
##  3  1989                17            16.3   319239
##  4  1990                 0             9.28  274597
##  5  1991                 9            16.5   241536
##  6  1992                61            15.5   297471
##  7  1993                99            25.4   304632
##  8  1994                35            12.3   310609
##  9  1995                89            21.8   268308
## 10  1996                 0            13.6   312568
## # … with 23 more rows
• To join the datasets we use a full join because not every year has a raftable release and therefore has some year missing. Both of the other datasets have data for each year and a full join fills any missing variables with NA but keeps all observations.
• The years with NA values for raftable release should be 0, because those are the years without a raftable release and we want to th model to predict when those years are as well.
• Lastly we make two datasets one to train the model, which includes all years except this year 2020, and test model this year.

Note: typically you want to split the data into training and testing datasets so you can train your model with the training dataset and then evaluate it with the testing dataset that it hasn’t seen. But here we don’t really care that much because this is just fun and we can predict this upcoming release and see if it is right.

### Random Forest Using caret

library(caret)
set.seed(1234)

control <- trainControl(method="cv", number=10)

rf_model <- train(raftable_releases~avg_snow_water_e+min_vol,
data=var_df_train,
method="rf",
tuneLength=3,
trControl=control)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
rf_model
## Random Forest
##
## 33 samples
##  2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 29, 31, 29, 29, 29, 30, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   15.69357  0.6885501  13.05262
##
## Tuning parameter 'mtry' was held constant at a value of 2

Here we build a cross validated random forest regression model to go predict number of raftable days. * meothod = "repeatedcv" is repeated cross validation where we do 10 cross validations 3 times with a random search. * We use the train() functio from caret to fit the model using both predictive variables avg_snow_water_e and min_vol. * We use Random Forest algorythm "rf" with 2000 trees. * we try 3 mtry depths (thgere are only two because we have only two variables, but tuneLength of two was not tuning at all for some reason).

### Evaluating the model

WARNING: this is not how you should evaluate a model typically like I said above.

var_df_train%>%
mutate(prediction = stats::predict(rf_model, .))%>%
ggplot(aes(raftable_releases, prediction))+
geom_point(color = "#1AB8FF", size =4, alpha = 0.6)+
geom_text(aes(label=year),hjust=-0.3, vjust=-0.3, color = "#FFFFFF")+
labs(title = "Random Forest: Predicted vs Actual Raftable Release Days",
x = "Actual Release Days",
y = "Predicted Release Days",
subtitle = "Sudo-testing accuracy of RF model")

Pretty good. Again, I keep saying this becuase it is really important, we would want to make sure that the model works in out of training sample, but with this dataset, we don’t have enough data to do that.

var_df_test%>%
mutate(prediction = predict(rf_model, .))
## # A tibble: 1 x 5
##    year raftable_releases avg_snow_water_e min_vol prediction
##   <dbl>             <dbl>            <dbl>   <dbl>      <dbl>
## 1  2020                 0             10.4  287807       4.71

So far the model predicts we will have 4.7078333 days of raftable flows. I’d say that is within the margin of error of 0 days looking at past predictions. But what if we get more snow? Given that it is February and Peak snow is typically around mid March would be a good assumption.

Let’s add the snotel forcast from the ARIMA model above to the predicted dataset.

pred_sn_pk_1<-pred_sn_pk%>%
mutate(min_vol = var_df%>%
filter(year == 2020)%>%
pull(min_vol))

test_data<-var_df%>%
filter(year==2020)%>%
select(-raftable_releases)%>%
mutate(estimated_eq = "current")%>%
bind_rows(pred_sn_pk_1)

test_data
## # A tibble: 4 x 4
##    year avg_snow_water_e min_vol estimated_eq
##   <dbl>            <dbl>   <dbl> <chr>
## 1  2020            10.4   287807 current
## 2  2020            12.6   287807 pred_avg_sn_eq
## 3  2020             9.70  287807 pred_lower
## 4  2020            15.5   287807 pred_upper

Now let’s use the prediction for current, pred_avg_sn_eq, pred_lower, and pred_upper

test_data%>%
mutate(prediction = stats::predict(rf_model, .))
## # A tibble: 4 x 5
##    year avg_snow_water_e min_vol estimated_eq   prediction
##   <dbl>            <dbl>   <dbl> <chr>               <dbl>
## 1  2020            10.4   287807 current              4.71
## 2  2020            12.6   287807 pred_avg_sn_eq       6.60
## 3  2020             9.70  287807 pred_lower           3.09
## 4  2020            15.5   287807 pred_upper          23.0

That looks pretty good.

• Current Prediction is 4.71 days.
• Predicted is 6.6 at the average forecast snowdepth.
• 3.09 is the low end. And 23 days if we get dumped on from now until April 1.

Let’s try something a bit simpler, like a linear model.

model_lm<-glm(raftable_releases~avg_snow_water_e+min_vol, family = "poisson", data = var_df_train)

##saveRDS(model_lm, "output/models/lm_model.rds")

summary(model_lm)
##
## Call:
## glm(formula = raftable_releases ~ avg_snow_water_e + min_vol,
##     family = "poisson", data = var_df_train)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -6.033  -3.211  -2.308   1.997   6.488
##
## Coefficients:
##                       Estimate    Std. Error z value             Pr(>|z|)
## (Intercept)      -2.0553926585  0.2693758160   -7.63   0.0000000000000234
## avg_snow_water_e  0.1900049675  0.0072692190   26.14 < 0.0000000000000002
## min_vol           0.0000076058  0.0000008358    9.10 < 0.0000000000000002
##
## (Intercept)      ***
## avg_snow_water_e ***
## min_vol          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 1292.88  on 32  degrees of freedom
## Residual deviance:  446.98  on 30  degrees of freedom
## AIC: 563.07
##
## Number of Fisher Scoring iterations: 6

This model looks pretty good.

model_lm%>%
augment(data = var_df_train, type.predict = "response")%>%
ggplot(aes(raftable_releases, .fitted))+
geom_point(color = "#1AB8FF", size =4, alpha = 0.6)+
geom_text(aes(label=year),hjust=-0.3, vjust=-0.3, color = "#FFFFFF")+
labs(title = "Linear Model: Predicted vs Actual Raftable Release Days",
subtitle = "Actual Values vs. Predicted Values",
y = "Predicted Values",
x = "Actual Raftable Release Days")

This model doesn’t fit the data quite as well. Lets see how it predicts the predictions.

model_lm%>%
augment(newdata = test_data, type.predict = "response")
## # A tibble: 4 x 6
##    year avg_snow_water_e min_vol estimated_eq   .fitted .se.fit
##   <dbl>            <dbl>   <dbl> <chr>            <dbl>   <dbl>
## 1  2020            10.4   287807 current           8.25   0.620
## 2  2020            12.6   287807 pred_avg_sn_eq   12.5    0.767
## 3  2020             9.70  287807 pred_lower        7.21   0.575
## 4  2020            15.5   287807 pred_upper       21.6    0.986

That looks more rational than the random forest model. The results:

• Current snowpack = 8.25 days
• predicted peak snowpack = 12.5 days.
• predicted lower snowpack = 7.21 (this really isn’t rational considering current is already greater than it).
• predicted upper snowpack = 21.6 days (let it snow let it snow).

I’ll continue to run this as the winter progresses. We now have a model that we can push into production to predict future data sets.