I have a dataset with values of several explanatory variables and a target variable. All are historical daily values. The X’s are different economic indicators, and Y is a forward-looking change in a bond’s yield. So for day N, the X’s are, for example, the current unemployment rate and inflation, and Y is (yield_(N+3) / yield_N) - 1, which is a 3-day change.
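To make the target definition concrete, here is a minimal sketch, assuming pandas and a daily yield series. The helper name `forward_change` and the numbers are made up for illustration and are not my real data.

```python
import pandas as pd

def forward_change(yields: pd.Series, horizon: int = 3) -> pd.Series:
    """Return yield[N + horizon] / yield[N] - 1 for each day N."""
    # shift(-horizon) aligns the value from `horizon` rows ahead with day N
    return yields.shift(-horizon) / yields - 1

# Synthetic daily yields (illustrative values only)
dates = pd.date_range("2024-01-01", periods=6, freq="D")
yields = pd.Series([4.0, 4.1, 4.2, 4.4, 4.1, 4.0], index=dates)

y = forward_change(yields, horizon=3)
print(y)
```

The last `horizon` rows have no target (NaN) because the future yield is not available yet, so they would be dropped before fitting.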
My question is: if I later use train_test_split from sklearn, can I set shuffle = True?
I understand that for typical time series regressions this would lead to data leakage, but here I don’t use past values of Y and I don’t use any lags.
In principle I’d like to shuffle the data because, from what I can see, the relationships between the X’s and Y change over time, so if I split the data purely by earlier and later dates, I fear I would train the model on slightly outdated relationships.
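The two splits being compared can be sketched like this, assuming scikit-learn and pandas. The column names and the random data are placeholders, not my actual indicators.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the daily dataset (hypothetical columns and values)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=200, freq="D")
X = pd.DataFrame(
    {
        "unemployment": rng.normal(5.0, 0.5, size=200),
        "inflation": rng.normal(2.0, 0.3, size=200),
    },
    index=dates,
)
y = pd.Series(rng.normal(0.0, 0.01, size=200), index=dates)

# Option A: shuffled split, train and test days are interleaved in time
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Option B: chronological split, the test set is the most recent 25% of days
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Check how the two splits relate to the time axis
chronological_ok = X_tr_c.index.max() < X_te_c.index.min()
shuffled_interleaved = X_tr_s.index.max() > X_te_s.index.min()
print(chronological_ok, shuffled_interleaved)
```

With `shuffle=False` every training day precedes every test day. With `shuffle=True` the test days are scattered among the training days, which is the case I am asking about.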
By the way, I use Gradient Boosting as my model.
So, can I use shuffle = True in my situation? If yes, what additional features could lead to leakage: lags, seasonal effects or something else?