1 Introduction

This document provides an introduction to the weathergen R package. This package provides functions for generating synthetic weather data using the method described by Steinschneider and Brown (2013):

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49(11), 7205–7220. doi:10.1002/wrcr.20528

The code in this package is based on the original scripts written by Scott Steinschneider, PhD and modified by Jeffrey D Walker, PhD.

In order to run the code below, a few packages must be loaded first.

library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(zoo)
library(moments)
library(weathergen)
theme_set(theme_bw())

# set random seed for reproducibility
set.seed(1)

2 Load Historical Data

The weathergen package includes a dataset of historical climate data for three cities in the Northeast US. These datasets are provided in the climate_cities dataset, which is a list containing dataframes for 3 cities: Boston, Providence, and Worcester:

data(climate_cities)
str(climate_cities)

## List of 3
##  $ boston    :'data.frame':  22645 obs. of  5 variables:
##   ..$ DATE: Date[1:22645], format: "1949-01-01" ...
##   ..$ PRCP: num [1:22645] 0.35 0 0 0.17 16.67 ...
##   ..$ TMAX: num [1:22645] 5.99 2.14 3.75 5.46 7.07 14.3 9.21 12.9 9 7.33 ...
##   ..$ TMIN: num [1:22645] -0.84 -1.51 -3.15 -1.49 0.61 3.26 1.5 2.21 2.12 4.26 ...
##   ..$ WIND: num [1:22645] 10.93 13.11 9.57 7.68 3.31 ...
##  $ providence:'data.frame':  22645 obs. of  5 variables:
##   ..$ DATE: Date[1:22645], format: "1949-01-01" ...
##   ..$ PRCP: num [1:22645] 0 0 0 0.65 12.9 18.4 0 0 0 0 ...
##   ..$ TMAX: num [1:22645] 3.68 0.93 4.06 5.44 9.95 ...
##   ..$ TMIN: num [1:22645] -1.96 -2.19 -4.68 -3.78 -1.26 2.65 0.66 2.25 0.18 2.87 ...
##   ..$ WIND: num [1:22645] 12.32 13.82 10.05 7.52 3.2 ...
##  $ worcester :'data.frame':  22645 obs. of  5 variables:
##   ..$ DATE: Date[1:22645], format: "1949-01-01" ...
##   ..$ PRCP: num [1:22645] 0.47 0 0 0 9.23 ...
##   ..$ TMAX: num [1:22645] 6.6 -0.06 3.26 3.82 3.82 ...
##   ..$ TMIN: num [1:22645] -3.39 -3.39 -5.62 -8.39 -5.62 -1.73 -3.95 -2.28 -2.28 2.71 ...
##   ..$ WIND: num [1:22645] 11.21 13.18 9.58 7.06 3.83 ...

For this vignette, we will use the dataset for Boston, MA.

# extract dataset for boston
obs_day <- climate_cities[['boston']]

# subset complete water years using complete_water_years() function
obs_day <- obs_day[complete_water_years(obs_day$DATE),]

obs_day <- mutate(obs_day,
                  WYEAR=wyear(DATE, start_month=10), # extract water year
                  MONTH=month(DATE),                 # extract month
                  TEMP=(TMIN+TMAX)/2) %>%            # compute mean temperature
  select(WYEAR, MONTH, DATE, PRCP, TEMP, TMIN, TMAX, WIND)

# compute annual timeseries by water year
obs_wyr <- group_by(obs_day, WYEAR) %>%
  summarise(PRCP=sum(PRCP),
            TEMP=mean(TEMP),
            TMIN=mean(TMIN),
            TMAX=mean(TMAX),
            WIND=mean(WIND))

# save the daily and annual timeseries to a list
obs <- list(day=obs_day, wyr=obs_wyr)

# summary of the daily timeseries
summary(obs[['day']])

##      WYEAR          MONTH             DATE                 PRCP        
##  Min.   :1950   Min.   : 1.000   Min.   :1949-10-01   Min.   :  0.000  
##  1st Qu.:1965   1st Qu.: 4.000   1st Qu.:1964-12-30   1st Qu.:  0.000  
##  Median :1980   Median : 7.000   Median :1980-03-31   Median :  0.000  
##  Mean   :1980   Mean   : 6.523   Mean   :1980-03-31   Mean   :  3.259  
##  3rd Qu.:1995   3rd Qu.:10.000   3rd Qu.:1995-07-01   3rd Qu.:  2.120  
##  Max.   :2010   Max.   :12.000   Max.   :2010-09-30   Max.   :185.850  
##       TEMP             TMIN              TMAX             WIND       
##  Min.   :-18.68   Min.   :-25.130   Min.   :-14.05   Min.   : 0.020  
##  1st Qu.:  3.09   1st Qu.: -0.850   1st Qu.:  6.98   1st Qu.: 4.090  
##  Median : 10.86   Median :  6.070   Median : 15.64   Median : 6.330  
##  Mean   : 10.62   Mean   :  5.903   Mean   : 15.33   Mean   : 6.724  
##  3rd Qu.: 18.86   3rd Qu.: 13.810   3rd Qu.: 23.98   3rd Qu.: 8.920  
##  Max.   : 31.99   Max.   : 26.240   Max.   : 38.15   Max.   :23.980

This figure shows the daily timeseries of each climate variable.

gather(obs[['day']], VAR, VALUE, PRCP:WIND) %>%
  ggplot(aes(DATE, VALUE)) +
  geom_line() +
  facet_wrap(~VAR, scales='free_y', ncol=1) +
  labs(x='', y='', title='Historical Daily Weather Data for Boston, MA')

This figure shows the annual sum of precipitation, and means of minimum temperature, maximum temperature, and wind speed by water year.

gather(obs[['wyr']], VAR, VALUE, PRCP:WIND) %>%
  ggplot(aes(WYEAR, VALUE)) +
    geom_line() +
    facet_wrap(~VAR, scales='free_y', ncol=1) +
    labs(x='', y='', title='Historical Annual Weather Data for Boston, MA')

3 Daily Weather Generator

The primary function for running the daily weather generator on a single site is wgen_daily(). The function then depends on a number of other functions, which are described in the next section.

The first argument to wgen_daily() is the daily timeseries of historical climate variables. This argument must be provided as a zoo object which is designed to store regular and irregular timeseries (see the zoo package homepage for more details).

This code converts the daily timeseries (currently in obs[['day']]) to a zoo object. The first argument provides the values of each variable, the second argument provides the associated dates.

zoo_day <- zoo(x = obs[['day']][, c('PRCP', 'TEMP', 'TMIN', 'TMAX', 'WIND')], 
             order.by = obs[['day']][['DATE']])
class(zoo_day)

## [1] "zoo"

summary(zoo_day)

##      Index                 PRCP              TEMP             TMIN        
##  Min.   :1949-10-01   Min.   :  0.000   Min.   :-18.68   Min.   :-25.130  
##  1st Qu.:1964-12-30   1st Qu.:  0.000   1st Qu.:  3.09   1st Qu.: -0.850  
##  Median :1980-03-31   Median :  0.000   Median : 10.86   Median :  6.070  
##  Mean   :1980-03-31   Mean   :  3.259   Mean   : 10.62   Mean   :  5.903  
##  3rd Qu.:1995-07-01   3rd Qu.:  2.120   3rd Qu.: 18.86   3rd Qu.: 13.810  
##  Max.   :2010-09-30   Max.   :185.850   Max.   : 31.99   Max.   : 26.240  
##       TMAX             WIND       
##  Min.   :-14.05   Min.   : 0.020  
##  1st Qu.:  6.98   1st Qu.: 4.090  
##  Median : 15.64   Median : 6.330  
##  Mean   : 15.33   Mean   : 6.724  
##  3rd Qu.: 23.98   3rd Qu.: 8.920  
##  Max.   : 38.15   Max.   :23.980

The daily historical timeseries is then passed as the first argument to the wgen_daily() function. The other arguments specify a number of parameters for the simulation. These parameters include:

n_year - number of simulation years
start_month - first month of the water year (e.g. 10=October)
start_water_year - first water year to use for the simulated timeseries
include_leap_days - boolean flag to include or exclude leap days in the simulation timeseries (currently only allows FALSE)
n_knn_annual - number of years to use in the annual k-nearest neighbors (knn) sampling algorithm
dry_wet_threshold - daily precipitation threshold between the dry and wet Markov states
wet_extreme_quantile_threshold - daily precipitation quantile used as threshold between the wet and extreme Markov states
adjust_annual_precip - boolean flag to adjust simulated daily precipitation amounts to match the simulated annual precipitation
annual_precip_adjust_limits - lower and upper limits for the daily precipitation adjustment factor to match simulated annual precipitation (e.g. c(0.9, 1.1) will only adjust daily values between 90-110%)
dry_spell_changes - adjustment factor for changing the transition probabilities of dry spells
wet_spell_changes - adjustment factor for changing the transition probabilities of wet spells
prcp_mean_changes - multiplicative adjustment factor for changing the annual mean precipitation
prcp_cv_changes - multiplicative adjustment factor for changing the coefficient of variation for precipitation
temp_mean_changes - additive adjustment factor for changing the annual mean temperature

The last five parameters for assigning adjustment factors can each be either a single number, which is applied to all months, or a vector of length 12 where each value is the change factor for a specific month (starting with January).

sim <- wgen_daily(zoo_day, 
                  n_year=2, 
                  start_month=10, 
                  start_water_year=2000, 
                  include_leap_days=FALSE,
                  n_knn_annual=100,
                  dry_wet_threshold=0.3, 
                  wet_extreme_quantile_threshold=0.8,
                  adjust_annual_precip=TRUE,
                  annual_precip_adjust_limits=c(0.9, 1.1),
                  dry_spell_changes=1,
                  wet_spell_changes=1,
                  prcp_mean_changes=1,
                  prcp_cv_changes=1,
                  temp_mean_changes=0)

The wgen_daily() function returns a list object with the following elements:

str(sim, max=1)

## List of 6
##  $ obs                :'data.frame': 22280 obs. of  9 variables:
##  $ state_thresholds   :'data.frame': 12 obs. of  3 variables:
##  $ transition_matrices:List of 12
##  $ state_equilibria   :List of 12
##  $ change_factors     :List of 6
##  $ out                :'data.frame': 730 obs. of  11 variables:

The first element, sim[['obs']], is the historical daily timeseries, with additional columns for the water year (WYEAR), month (MONTH), and Markov state (d=Dry, w=Wet, e=Extreme).

head(sim[['obs']])

##         DATE WYEAR MONTH STATE PRCP   TEMP  TMIN  TMAX WIND
## 1 1949-10-01  1950    10     d 0.00 13.765  8.89 18.64 8.53
## 2 1949-10-02  1950    10     d 0.00 10.225  4.94 15.51 7.31
## 3 1949-10-03  1950    10     d 0.00 12.665  6.06 19.27 3.50
## 4 1949-10-04  1950    10     d 0.00 16.555 11.06 22.05 2.76
## 5 1949-10-05  1950    10     d 0.28 18.970 14.02 23.92 5.59
## 6 1949-10-06  1950    10     d 0.00 14.135  7.84 20.43 6.76

The second element, sim[['state_thresholds']], is a dataframe with the monthly precipitation thresholds for each Markov state. For example, in June the Markov state on a given day is dry if the precipitation on that day is less than 0.3 mm/day, wet if it is between 0.3 and 3.33 mm/day, and extreme if it is greater than 3.33 mm/day.

sim[['state_thresholds']]

##    MONTH DRY_WET WET_EXTREME
## 1      1     0.3        4.18
## 2      2     0.3        3.80
## 3      3     0.3        4.93
## 4      4     0.3        4.03
## 5      5     0.3        3.60
## 6      6     0.3        3.33
## 7      7     0.3        2.70
## 8      8     0.3        3.12
## 9      9     0.3        2.60
## 10    10     0.3        2.88
## 11    11     0.3        4.75
## 12    12     0.3        4.60

The third element, sim[['transition_matrices']], is a list of length 12 where each element is the Markov state transition matrix for each month. For example, the transition matrix for June is:

sim[['transition_matrices']][[6]]

##    
##             d         w         e
##   d 0.6977169 0.1634703 0.1388128
##   w 0.5189189 0.2621622 0.2189189
##   e 0.3863014 0.2520548 0.3616438

The fourth element, sim[['state_equilibria]]`, is a list of length 12 where each element is the state equilibrium of each monthly transition matrix. For example, the equilibrium for June is:

sim[['state_equilibria']][[6]]

##         d         w         e 
## 0.5997119 0.2009608 0.1993273

The fifth element, sim[['change_factors']], is a list containing the monthly values for each adjustment factor as well as an additional set of adjustments that are computed from the simulated state transitions (ratio_probability_wet). Each of these elements will each be a vector of length 12 (even if the corresponding parameter to wgen_daily() was a single value, in which case the same value will be used for all months).

str(sim[['change_factors']])

## List of 6
##  $ ratio_probability_wet: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dry_spell_changes    : num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##  $ wet_spell_changes    : num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##  $ prcp_mean            : num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##  $ prcp_cv              : num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##  $ temp_mean            : num [1:12] 0 0 0 0 0 0 0 0 0 0 ...

The last element, sim[['out']], contains the simulated daily timeseries as a dataframe with the following columns:

SIM_YEAR - simulation year starting at 1
DATE - simulation date (starting on the first day of the start_month and start_water_year parameters, e.g. 1999-10-01 for water year 2000 starting in month 10 (October))
MONTH - simulation month
WDAY - simulation “water day” which is the julian day starting at the beginning of the water year (e.g. water day 1 = Oct 1)
SAMPLE_DATE - the historical date sampled from the KNN sampling algorithm and used for the each simulation date
STATE - simulated Markov state
PRCP, TEMP, TMIN, TMAX, WIND - simulated climate variables

str(sim[['out']])

## 'data.frame':    730 obs. of  11 variables:
##  $ SIM_YEAR   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DATE       : Date, format: "1999-10-01" "1999-10-02" ...
##  $ MONTH      : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ WDAY       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SAMPLE_DATE: Date, format: "2004-10-01" "1985-09-30" ...
##  $ STATE      : Ord.factor w/ 3 levels "d"<"w"<"e": 1 1 1 1 2 3 3 1 1 1 ...
##  $ PRCP       : num  0.284 0 0 0 2.25 ...
##  $ TEMP       : num  15.8 17.3 15.2 12.7 12.5 ...
##  $ TMIN       : num  12.07 10.11 10.19 7.48 7.84 ...
##  $ TMAX       : num  19.5 24.6 20.3 17.9 17.1 ...
##  $ WIND       : num  1.8 3.56 5.44 5.95 7.41 6.29 9.69 7.3 6.37 9.68 ...

This figure shows the simulated timeseries for each variable:

select(sim[['out']], DATE, PRCP, TEMP, TMIN, TMAX, WIND) %>%
  gather(VAR, VALUE, -DATE) %>%
  ggplot(aes(DATE, VALUE)) +
  geom_line() +
  facet_wrap(~VAR, ncol=1, scales='free_y') +
  labs(x='Simulation Date', y='Simulated Value', title='Simulated Timeseries') +
  theme_bw()

4 Algorithm Description

The daily weather generator uses the following algorithm:

Simulate annual precipitation using an ARMA model (wavelet model coming soon)
For each simulation year:
Use KNN sampling to create a historical daily timeseries that only includes only years where annual precipitation is similar to the current simulated annual precipitation
compute the monthly Markov state transition thresholds and assign states to daily timeseries
fit the monthly Markov chain transition probability matrices
adjust the transition matrices based on the dry and wet spell change factors
for each simulation day in the current year, use KNN sampling to sample the synthetic historical daily values based on the state, precipitation and mean temperature of the current and previous days
Combine the daily simulation for each year
Adjust daily simulated precipitation values for each year to equal the simulated annual values (if adjust_annual_precip=TRUE)
Adjust daily simulated precipitation values based on the precipitation adjustment factors
Adjust daily simulated minimum, maximum and mean temperature based on the temperature adjustment factor

4.1 Annual Precipitation Simulation

the sim_annual_arima() function is used to simulate annual precipitation using an ARIMA model. The arguments to this function include:

x - historical annual precipitation as a zoo object

function first fits the ARIMA model to the historical annual precipitation. The ARIMA model is then used to generate a simulation of annual precipitation for length n_year. The sim_annual_arma() function returns a list of length 3 containing the ARIMA model object (model), the historical annual precipitation (observed), and the simulated annual precipitation (values).

zoo_precip_wyr <- zoo(obs[['wyr']][['PRCP']], order.by = obs[['wyr']][['WYEAR']])
sim_annual_prcp <- sim_annual_arima(x = zoo_precip_wyr, start_year = 2000, n_year = 40)

This function returns a list of length 3 containing elements:

model - the ARIMA model object fitted to the historical data
x - the historical annual precipitation used to fit the model as a zoo object
out - the simulated annual precipitation as a zoo object

str(sim_annual_prcp, max.level=1)

## List of 3
##  $ model:List of 17
##   ..- attr(*, "class")= chr [1:2] "ARIMA" "Arima"
##  $ x    :'zoo' series from 1950 to 2010
##   Data: num [1:61] 798 1235 1248 1186 1686 ...
##   Index:  num [1:61] 1950 1951 1952 1953 1954 ...
##  $ out  :'zoo' series from 2000 to 2039
##   Data: num [1:40] 1571 1404 1071 1234 1437 ...
##   Index:  num [1:40] 2000 2001 2002 2003 2004 ...

The model element contains an ARIMA model object, which contains the parameters and error statistics of the fitted model:

summary(sim_annual_prcp[['model']])

## Series: x 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##       intercept
##       1190.3175
## s.e.    27.0648
## 
## sigma^2 estimated as 44683:  log likelihood=-413.13
## AIC=830.26   AICc=830.47   BIC=834.48
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 8.013399e-14 211.3826 162.3144 -3.399893 14.55972 0.6760569
##                    ACF1
## Training set -0.0452353

The out element is an annual precipitation timeseries as a zoo object:

sim_annual_prcp[['out']]

##      2000      2001      2002      2003      2004      2005      2006 
## 1570.8299 1404.1184 1071.1577 1233.7399 1436.6759 1663.0372 1254.2111 
##      2007      2008      2009      2010      2011      2012      2013 
##  969.9498  982.4138 1614.2915  752.6348 1836.2482 1135.0726 1094.2667 
##      2014      2015      2016      2017      2018      2019      2020 
## 1223.6231 1387.6197 1254.3302  776.8215 1265.0490 1285.5294 1329.7351 
##      2021      2022      2023      2024      2025      2026      2027 
##  972.2932  689.1246 1121.7078  990.7092 1028.4217  988.7052 1106.1863 
##      2028      2029      2030      2031      2032      2033      2034 
## 1124.5317 1358.5977 1398.8313 1022.3674 1125.0388 1266.7207 1485.9976 
##      2035      2036      2037      2038      2039 
## 1178.4652  831.2053 1239.3271 1165.1438 1564.9916

This figure plots the simulated annual precipitation and shows the mean annual precipitation based on the historical dataset for reference.

data.frame(WYEAR=time(sim_annual_prcp[['out']]),
           PRCP=zoo::coredata(sim_annual_prcp[['out']])) %>%
  ggplot(aes(WYEAR, PRCP)) +
  geom_line() +
  geom_hline(yintercept=mean(obs[['wyr']][['PRCP']]), color='red') +
  geom_text(aes(label=TEXT), data=data.frame(WYEAR=2000, PRCP=mean(obs[['wyr']][['PRCP']]), TEXT="Historical Mean"),
            hjust=0, vjust=-1, color='red') +
  labs(x="Simulation Water Year", y="Annual Precip (mm/yr)", title="Simulated Annual Precipitation")

To determine whether the annual precipitation generator sufficiently reproduces the statistics of the historical timeseries, a Monte Carlo simulation is performed for 50 trials using the same historical timeseries.

# batch simulation to compare mean/sd/skew
sim_wyr_batch <- lapply(1:50, function(i) { 
    sim_annual_arima(x = obs[['wyr']][['PRCP']], start_year = 2000, n_year = 20)[['out']]
  }) 
names(sim_wyr_batch) <- paste0(seq(1, length(sim_wyr_batch)))
sim_wyr_batch <- do.call(merge, sim_wyr_batch) %>% 
  apply(MARGIN=2, FUN=function(x) { c(mean=mean(x), sd=sd(x), skew=skewness(x))}) %>% 
  t
sim_wyr_stats <- rbind(t(colMeans(sim_wyr_batch)),
                       c(mean=mean(obs[['wyr']][['PRCP']]), 
                         sd=sd(obs[['wyr']][['PRCP']]), 
                         skew=skewness(obs[['wyr']][['PRCP']])))
sim_wyr_stats <- as.data.frame(sim_wyr_stats)
sim_wyr_stats$dataset <- c('obs', 'sim')
sim_wyr_stats <- gather(sim_wyr_stats, stat, value, mean, sd, skew)

The following figure shows the distributions of the mean, standard deviation, and skewness statistics over the simulation trials. The overall mean value of each statistic (blue point) is also compared to the statistic value of the historical timeseries (red point). This figure shows that the annual precipitation generator creates simulated timeseries with similar statistics as the historical timeseries.

as.data.frame(sim_wyr_batch) %>% 
  mutate(trial=row_number()) %>% 
  gather(stat, value, mean:skew) %>% 
  ggplot(aes(x=stat, value)) + 
  geom_boxplot() + 
  geom_point(aes(x=stat, y=value, color=dataset), data=sim_wyr_stats, size=3) +
  scale_color_manual('', values=c('sim'='deepskyblue', 'obs'='orangered'),
                     labels=c('sim'='Mean of Simulated Trials', 'obs'='Mean of Historical')) +
  labs(x='', y='') +
  facet_wrap(~stat, scales='free')

4.2 Daily Weather Simulation

The annual precipitation generated in the previous step is then used to drive a daily simulation for all weather variables. For each year, a k-nearest neighbors algorithm samples n_knn_annual years from the historical record with replacement. The daily historical timeseries for each year is then extracted and combined into a daily timeseries, where the sequence of years is based on the KNN sampling. A Markov Chain model is then fitted to the sampled historical daily and used to simulate a sequence of states for each simulation year. Finally, the simulated states are used to sample the historical daily values using a daily KNN algorithm.

4.2.1 KNN Annual Sampling

For each simulation year, a set of n_knn_annual years are sampled with replacement from the historical record using inverse distance weighted sampling. Historical years with annual precipitation similar to the current simulated annual precipitation are given a higher weight and thus more likely to be sampled than historical years where the annual precipitation was much higher or lower than the simulated precipitation.

n_knn_annual <- 100

# generate vector of sampled years
sampled_years <- knn_annual(prcp=coredata(sim_annual_prcp[['out']])[1],
                        obs_prcp=zoo_precip_wyr,
                        n=n_knn_annual)
sampled_years

##   [1] 1984 1998 1998 1990 2006 1998 1954 2006 1984 2006 1998 1958 1998 1958
##  [15] 1984 1960 1951 1984 1984 1967 1984 1970 1974 1956 1998 1998 1955 1998
##  [29] 2004 1955 1984 1998 1960 1998 1994 1952 1958 1998 1998 1998 2010 1996
##  [43] 1984 2006 1998 1998 1955 1961 1996 1954 1958 1998 1970 1984 1998 1983
##  [57] 2008 1967 1984 2004 1954 1958 2003 1998 1955 2009 1955 1998 1998 1987
##  [71] 1998 1998 2008 1996 1998 1998 2006 1976 2010 1998 1984 2004 1998 1998
##  [85] 1998 1954 1954 1967 1998 2003 1998 1987 2006 1984 1998 2006 2006 1974
##  [99] 1970 1984

This figure shows the historical annual precipitation for each water year. The red line shows the annual precipitation for the first simulated year. The red points show which historical years were included in the KNN sampling. Note that these points tend to be close to the target simulated precipitation.

ggplot() +
  geom_point(aes(WYEAR, PRCP, color=SAMPLED), data=obs[['wyr']] %>% mutate(SAMPLED=WYEAR %in% sampled_years)) +
  geom_hline(yintercept=sim_annual_prcp[['out']][1], linetype=2, color='red') +
  geom_text(aes(x=x, y=y, label=label),
            data=data.frame(x=min(obs[['wyr']][['WYEAR']]),
                            y=sim_annual_prcp[['out']][1],
                            label='Target Simulated Annual Precip'),
            vjust=-0.5, hjust=0) +
  scale_color_manual('Sampled in KNN', values=c('TRUE'='orangered', 'FALSE'='grey30')) +
  labs(x='Water Year', y='Annual Precipitation (mm/yr)')

For each of the sampled years in sampled_years, the corresponding daily values are extracted from the historical time series and then combined into a single timeseries. The following figure shows the resulting daily timeseries for the 100 years sampled for the first simulated water year. The points are colored by their historical water year, and thus show years where the historical annual precip is closest to the simulated precip are repeated.

# loop through population years and extract historical daily values
sampled_days <- lapply(sampled_years, function(yr) {
  obs[['day']][which(obs[['day']][['WYEAR']]==yr), ]
}) %>%
  rbind_all() %>%
  mutate(SAMPLE_INDEX=row_number())

# plot combined daily timeseries based on the KNN sampled years
ggplot(sampled_days, aes(SAMPLE_INDEX, PRCP, color=WYEAR)) +
  geom_point(size=1) +
  labs(x='Index Day', y='Daily Precip (mm/day)', title='Synthetic Daily Timeseries from KNN Sampled Years')

4.2.2 Daily Simulation

The resulting timeseries of daily values for the sampled years is then used to run the sim_daily() function. In this function, a Markov Chain simulation is fitted to the synthetic daily timeseries.

4.2.2.1 State Thresholds

The first step is to compute the state thresholds using the mc_state_threshold() function which takes a vector of daily precipitation, a vector of corresponding months, and the dry_wet_threshold and wet_extreme_quantile_threshold.

thresh <- mc_state_threshold(sampled_days[['PRCP']], sampled_days[['MONTH']],
                             dry_wet_threshold=0.3, wet_extreme_quantile_threshold=0.8)

thresh_df <- do.call(rbind, thresh) %>%
  as.data.frame() %>%
  mutate(month=row_number()) %>%
  select(month, dry_wet, wet_extreme)
thresh_df

##    month dry_wet wet_extreme
## 1      1     0.3       4.600
## 2      2     0.3       5.000
## 3      3     0.3       4.970
## 4      4     0.3       5.356
## 5      5     0.3       8.100
## 6      6     0.3       5.970
## 7      7     0.3       3.280
## 8      8     0.3       5.150
## 9      9     0.3       1.800
## 10    10     0.3       3.150
## 11    11     0.3       7.350
## 12    12     0.3       4.694

This figure shows the threshold amounts between the Dry/Wet and Wet/Extreme states for each month.

gather(thresh_df, var, value, -month) %>%
  ggplot(aes(factor(month), value, fill=var)) +
  geom_bar(stat='identity', position='dodge') +
  scale_fill_manual('Threshold', values=c('dry_wet'='orangered', 'wet_extreme'='deepskyblue'),
                    labels=c('dry_wet'='Dry-Wet', 'wet_extreme'='Wet-Extreme')) +
  labs(x='Month', y='Threshold Amount (mm/day)', title='Markov State Thresholds by Month')

4.2.2.2 Assign States

After the state thresholds are determined, they are assigned to the synthetic daily timeseries, sampled_days.

sampled_days$STATE <- mc_assign_states(sampled_days$PRCP, sampled_days$MONTH, c('d', 'w', 'e'), thresh)
head(sampled_days)

## Source: local data frame [6 x 10]
## 
##   WYEAR MONTH       DATE  PRCP   TEMP  TMIN  TMAX  WIND SAMPLE_INDEX
##   (dbl) (dbl)     (date) (dbl)  (dbl) (dbl) (dbl) (dbl)        (int)
## 1  1984    10 1983-10-01  3.15 15.420 11.41 19.43  4.91            1
## 2  1984    10 1983-10-02  5.35 16.630 12.47 20.79  1.33            2
## 3  1984    10 1983-10-03  0.00 19.825 14.59 25.06  5.89            3
## 4  1984    10 1983-10-04  0.00 20.760 13.24 28.28  7.72            4
## 5  1984    10 1983-10-05  0.55 15.110 10.38 19.84  8.08            5
## 6  1984    10 1983-10-06  0.10 16.965 13.49 20.44  6.39            6
## Variables not shown: STATE (fctr)

This figure shows the Markov state for each day in the first 5 simulated years.

filter(sampled_days, SAMPLE_INDEX <= 365*5) %>%
ggplot(aes(SAMPLE_INDEX, PRCP, color=STATE)) +
  geom_point(size=2) +
  scale_color_manual('Markov State',
                     values=c('d'='orangered', 'w'='chartreuse3', 'e'='deepskyblue'),
                     labels=c('d'='Dry', 'w'='Wet', 'e'='Extreme')) +
  ylim(0, 50) +
  labs(x='Sample Date Index', y='Daily Precipitation (mm/day)', title='Markov State of Synthetic Daily Timeseries')

## Warning: Removed 17 rows containing missing values (geom_point).

4.2.2.3 Fit Markov Chain Transition Matrix

After the states have been assigned ot the sampled daily timeseries, the mc_fit() function can be used to fit a Markov Chain model for each month.

transition_matrices <- mc_fit(states=sampled_days[['STATE']], months=sampled_days[['MONTH']])

For example, the transition matrix for June is shown below. The probability of having a wet day follow a dry day is 0.26, and the probability of two dry days in a row is 0.66.

transition_matrices[[6]]

##    
##              d          w          e
##   d 0.65725542 0.25607354 0.08667104
##   w 0.44909502 0.28167421 0.26923077
##   e 0.24957841 0.39797639 0.35244519

The equilibrium state can be determined for each month using the mc_state_equilibrium() function. This function finds the eigenvector of a transition matrix. The equilibrium state for June is thus:

mc_state_equilibrium(transition_matrices[[6]])

##         d         w         e 
## 0.5192105 0.2905118 0.1902777

Once the transition matrices have been fit to the sampled historical daily values, the daily generator can be run.

4.2.2.4 KNN and Markov Chain Simulation

Given the fitted monthly transition matrices, the daily simulation for each simulation year can be performed using the sim_mc_knn_day() function. This function simulates a chain of Markov states using the transition matrices, and uses a KNN sampling algorithm to choose the values of each weather variable.

sampled_days <- dplyr::mutate(sampled_days,
                              WDAY=waterday(DATE, start_month=10),
                              STATE_PREV=lag(STATE),
                              PRCP_PREV=lag(PRCP),
                              TEMP_PREV=lag(TEMP),
                              TMAX_PREV=lag(TMAX),
                              TMIN_PREV=lag(TMIN),
                              WIND_PREV=lag(WIND))

sim_days <- sim_mc_knn_day(x=sampled_days, n_year=1, states=c('d', 'w', 'e'), transitions=transition_matrices,
                           start_month=10, start_water_year=2000, include_leap_days=FALSE)

str(sim_days, max=1)

## 'data.frame':    365 obs. of  11 variables:
##  $ SIM_YEAR   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ DATE       : Date, format: "1999-10-01" "1999-10-02" ...
##  $ MONTH      : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ WDAY       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SAMPLE_DATE: Date, format: "1957-10-01" "1957-10-02" ...
##  $ STATE      : Ord.factor w/ 3 levels "d"<"w"<"e": 2 1 1 1 2 1 1 1 1 1 ...
##  $ PRCP       : num  0.35 0 0 0 1.77 0 0 0 0 0 ...
##  $ TEMP       : num  19 14.7 12 12.9 12.2 ...
##  $ TMIN       : num  12.39 9.66 5.89 6.46 6.63 ...
##  $ TMAX       : num  25.7 19.6 18.1 19.4 17.7 ...
##  $ WIND       : num  0.4 8.55 9.28 7.65 2.9 ...

This figure shows the simulated weather variables for the current simulation year.

select(sim_days, DATE, PRCP:WIND) %>%
  gather(VAR, VALUE, -DATE) %>%
  ggplot(aes(DATE, VALUE)) +
  geom_line() +
  facet_wrap(~VAR, scales='free_y', ncol=1) +
  labs(x='Simulate Date', y='Value', title='Daily Simulation for One Year')

4.2.2.5 Annual Precipitation Adjustment

Because the daily precipitation values are sampled directly from historical values, the total precipitation for each simulation year will likely differ from the corresponding simulated annual precipitation.

The adjust_daily_to_annual() function is used to adjust the simulated daily precipitation values so that the sum of these values equals the simulated annual total precipitation.

For example, the current simulated annual precipitation is 1571 mm/yr, but the sum of the simulated daily precipitation is 970.6 mm/yr. To adjust the daily values such that their sum equals the simmulated annual value, the daily values can be all multiplied by a factor of 1.62. The following code block uses the adjust_daily_to_annual() function to perform this adjustment for the first simulation year. Note that this function can be applied to the entire simulated timeseries containing multiple years, and adjust each individual accordingly.

orig_prcp <- sim_days[['PRCP']]
adj_prcp <- adjust_daily_to_annual(values_day=sim_days[['PRCP']], 
                                   years_day=wyear(sim_days[['DATE']]), 
                                   values_yr=coredata(sim_annual_prcp[['out']])[1], 
                                   years_yr=time(sim_annual_prcp[['out']])[1], 
                                   min_ratio=0.5, max_ratio=1.5)
c(Original=sum(orig_prcp), Adjusted=sum(adj_prcp$adjusted), Annual=coredata(sim_annual_prcp[['out']])[1])

## Original Adjusted   Annual 
##   970.58  1455.87  1570.83

This figure shows the original and adjusted daily precipitation values. Note how the adjusted values (dashed red line) are linearly scaled by a factor of 1.5.

data.frame(DATE=sim_days[['DATE']],
           ORIGINAL=orig_prcp,
           ADJUSTED=adj_prcp$adjusted) %>%
  gather(VAR, VALUE, -DATE) %>%
  ggplot(aes(DATE, VALUE, color=VAR, linetype=VAR)) +
  geom_line() +
  scale_color_manual('', values=c('ORIGINAL'='black', 'ADJUSTED'='orangered')) +
  scale_linetype_manual('', values=c('ORIGINAL'=1, 'ADJUSTED'=2)) +
  labs(x='Simulation Date', y='Daily Precipitation (mm/day)')

4.3 Adjustment Factors

After generating the simulated daily timeseries, the precipitation and temperature variables can be adjusted using the change factor parameters.

For this example, a 5-year simulated timeseries is first generated without using any change factors

sim <- wgen_daily(zoo_day, 
                  n_year=5, 
                  start_month=10, 
                  start_water_year=2000, 
                  include_leap_days=FALSE,
                  n_knn_annual=100,
                  dry_wet_threshold=0.3, 
                  wet_extreme_quantile_threshold=0.8,
                  adjust_annual_precip=TRUE,
                  annual_precip_adjust_limits=c(0.5, 1.5),
                  dry_spell_changes=1,
                  wet_spell_changes=1,
                  prcp_mean_changes=1,
                  prcp_cv_changes=1,
                  temp_mean_changes=0)

4.3.1 Precipitation Adjustment

The simulated precipitation timeseries can then be adjusted using the adjust_daily_gamma() function. This function performs a quantile mapping by first fitting the non-zero precipitation values to a gamma distribution, adjusting the scale and shape parameters of the distribution, and replacing the precipitation values by mapping the original values through the adjusted distribution.

The following code block computes three sets of adjusted timeseries by adjusting the mean, the coefficient of variation, or both parameters.

prcp.unadj <- sim[['out']][, c('DATE','MONTH','PRCP')]
prcp.adj.mean <- adjust_daily_gamma(prcp.unadj[['PRCP']], prcp.unadj[['MONTH']],
                                    mean_change=1.2, cv_change=1)
prcp.adj.cv <- adjust_daily_gamma(prcp.unadj[['PRCP']], prcp.unadj[['MONTH']],
                                  mean_change=1, cv_change=1.2)
prcp.adj.both <- adjust_daily_gamma(prcp.unadj[['PRCP']], prcp.unadj[['MONTH']],
                                    mean_change=1.2, cv_change=1.2)
prcp.df <- data.frame(DATE=prcp.unadj[['DATE']], 
                      UNADJESTED=prcp.unadj[['PRCP']],
                      ADJUST_MEAN=prcp.adj.mean[['adjusted']], 
                      ADJUST_CV=prcp.adj.cv[['adjusted']], 
                      ADJUST_BOTH=prcp.adj.both[['adjusted']]) %>%
  gather(GROUP, VALUE, -DATE)

This figure shows the unadjusted and three adjusted daily precipitation timeseries.

ggplot(prcp.df, aes(DATE, VALUE)) +
  geom_line() +
  facet_wrap(~GROUP) +
  labs(x='', y='Daily Precipitation (mm/day)', title='Unadjusted and Adjusted Daily Precipitation')

This figure shows the annual precipitation for the unadjusted and adjusted precipitation timeseries by water year.

mutate(prcp.df, WYEAR=wyear(DATE, start_month=10)) %>%
  group_by(WYEAR, GROUP) %>%
  summarise(VALUE=sum(VALUE)) %>%
  ggplot(aes(factor(WYEAR), VALUE, fill=GROUP)) +
  geom_bar(stat='identity', position='dodge') +
  scale_fill_discrete('') +
  labs(x='Water Year', y='Annual Precipitation (mm/yr)', title='Annual Precipitation With Adjustments')

This figure shows the cumulative distribution frequency of each timeseries. Note that the y-axis has been truncated to a maximum of 60 mm/yr in order to show the lower part of the distribution.

prcp.df %>%
  filter(VALUE>0) %>%
  group_by(GROUP) %>%
  arrange(VALUE) %>%
  mutate(ROW=row_number(),
         PROB=(ROW-0.5)/n()) %>%
  ggplot(aes(PROB, VALUE, color=GROUP)) +
  geom_line() +
  ylim(0, 60) +
  scale_color_discrete('') +
  labs(y='Daily Precipitation (mm/day)', x='Non-Exceedence Probability', title='Distributions of Adjusted Precipitation Timeseries')

## Warning: Removed 31 rows containing missing values (geom_path).

4.3.2 Temperature Adjustment

The temperature timeseries can also be adjusted using a linear additive factor through the adjust_daily_additive() function. This function simply adds the adjustment factor to each daily value. In this example, the mean temperature is adjusted by a factor of 5 meaning each daily mean temperature is increased by 5 degrees.

temp.unadj <- sim[['out']][, c('DATE','MONTH','TEMP')]
temp.adj <- adjust_daily_additive(temp.unadj[['TEMP']], temp.unadj[['MONTH']],
                                  mean_change=5)
temp.df <- data.frame(DATE=temp.unadj[['DATE']], 
                      UNADJESTED=temp.unadj[['TEMP']],
                      ADJUSTED=temp.adj[['adjusted']]) %>%
  gather(GROUP, VALUE, -DATE)

This figure shows the unadjusted and adjusted daily mean temperature.

ggplot(temp.df, aes(DATE, VALUE, color=GROUP)) +
  geom_line() +
  scale_color_discrete('') +
  labs(x='', y='Daily Mean Temperature (degC)', title='Unadjusted and Adjusted Daily Mean Temperature')

This figure shows the unadjusted and adjusted annual mean temperature. Note how the adjusted annual mean temperatures are 5 degrees higher than the unadjustd temperatures.

mutate(temp.df, WYEAR=wyear(DATE, start_month = 10)) %>%
  group_by(WYEAR, GROUP) %>%
  summarise(VALUE=mean(VALUE)) %>%
  ggplot(aes(factor(WYEAR), VALUE, fill=GROUP)) +
  geom_bar(stat='identity', position='dodge') +
  scale_fill_discrete('') +
  labs(x='', y='Annual Mean Temperature (degC)', title='Unadjusted and Adjusted Annual Mean Temperature')

Introduction to the weathergen Package

Jeffrey D Walker, PhD

March 31, 2016