I am testing LSTM models that predict infection counts over time, varying the input length ("lookback"), the output length ("pred_length"), and the delay between the last available data point and the first forecast point ("lead"). I am currently working with data simulated under a multistrain epidemiological model, which generates yearly fluctuations in cases of two viruses (e.g. flu H1 and H3). I am happy to edit and share code + data, but so as not to drown readers, I will share only what I think is important.
I am carrying out rolling forecasts: at each forecasting timepoint (represented by lines on the graph), I re-train/test/forecast. Forecasts improve over time as the model sees more and more past data.
My question: I would expect forecasts to be most accurate for the time points closest to the available data, i.e. accuracy should be highest for the first predicted time point in each forecast and then decline over the later predicted time points. I do not see that pattern. Why not? I suspect it is related to the long memory of the LSTM, but I would love to understand it better. Or is it a coding/model problem, in which case would changing the model help?
# model: encoder-decoder (seq2seq) LSTM
library(keras)

model8 <- keras_model_sequential() %>%
  # encoder: read the lookback window and return a single summary vector
  layer_lstm(units = 50, input_shape = c(lookback, length(features)),
             return_sequences = FALSE) %>%
  # repeat that summary once per step to be predicted
  layer_repeat_vector(pred_length) %>%
  # decoder: unroll the summary into pred_length hidden states
  layer_lstm(units = 50, return_sequences = TRUE) %>%
  # one dense output per predicted step, one unit per target series
  time_distributed(layer_dense(units = length(target_features)))

model8 %>% compile(loss = 'mse', optimizer = 'adam')
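To show the shapes involved, here is a minimal check with dummy data (not my real pipeline, sample count made up): the encoder-decoder maps inputs of shape (samples, lookback, n_features) to outputs of shape (samples, pred_length, n_targets).

# shape check with dummy data (not the real training code)
n_samples <- 10
x_dummy <- array(rnorm(n_samples * lookback * length(features)),
                 dim = c(n_samples, lookback, length(features)))
y_dummy <- array(rnorm(n_samples * pred_length * length(target_features)),
                 dim = c(n_samples, pred_length, length(target_features)))
model8 %>% fit(x_dummy, y_dummy, epochs = 1, batch_size = 5, verbose = 0)
dim(predict(model8, x_dummy))  # n_samples x pred_length x length(target_features)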
# forecast: rolling forecasts, re-fitting the model at each forecasting timepoint
forecast_output <- lapply(forecast_timepoints, function(x) {
  train_and_forecast(out2, model8, data_cutoff = x, lookback = 52, lead = 0,
                     pred_length = 0, target_features,
                     bs = 80, epoch = 80, vs = 0.1, original_data = out)
})
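To make the pattern explicit, one could average the error by forecast horizon across all rolling timepoints. A sketch, assuming (for illustration only) that each element of forecast_output is a list holding matrices pred and obs of dimension pred_length x length(target_features); adapt the accessors to whatever train_and_forecast() actually returns.

# mean absolute error by forecast horizon h, averaged over rolling timepoints
# (pred/obs are assumed field names, not the real return structure)
horizon_mae <- sapply(seq_len(pred_length), function(h) {
  mean(sapply(forecast_output, function(fc) {
    mean(abs(fc$pred[h, ] - fc$obs[h, ]))
  }), na.rm = TRUE)
})
horizon_mae  # would be expected to increase with h if nearer time points were easier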