I ran into a problem while building a predictive model, so I am posting a question.
I am trying to build the model with machine learning methods such as random forest and XGBoost.
The y value is a differenced monthly time series, and the x value is a differenced daily time series.
For reference, the time index t counts trading days of the U.S. stock market.
My model has the following form.
Target (predicted value) = y(t+21) - y(t)
Explanatory values = y(t) - y(t-1), y(t-1) - y(t-2), …, y(t-p) - y(t-p-1)
Here, p is the last trading day of the month.
The problem is that each month has a different number of trading days.
For example, there are 23 trading days in January 1980 but only 20 in February 1981, and some months may have even fewer because of holidays.
As a result, when I build the dataset of explanatory variables with one row per month, some columns contain NaN values in the rows for shorter months.
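To make the problem concrete, here is a minimal sketch with synthetic data. It assumes plain weekdays stand in for U.S. trading days (no holiday calendar), and the names `wide` and `lag` are only illustrative, not part of my actual pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic daily series; weekdays are an assumed stand-in for trading days.
days = pd.bdate_range("2023-01-02", "2023-03-31")
rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=len(days)).cumsum(), index=days)

# Daily differences y(t) - y(t-1); the first value is undefined and dropped.
diff = y.diff().dropna()

# Position of each day counted back from the month end:
# lag 0 is the last trading day of the month.
month = diff.index.to_period("M")
lag = diff.groupby(month).cumcount(ascending=False)

# One row per month, one column per lag.
# Months with fewer trading days leave NaN in the highest lag columns.
wide = pd.DataFrame(
    {"month": month, "lag": lag.values, "d": diff.values}
).pivot(index="month", columns="lag", values="d")

print(wide.isna().sum(axis=1))  # number of missing lags per month
```

The widest month sets the number of columns, so every shorter month ends up with NaN in its oldest lags.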
How is this situation usually handled? Is there an established term for this issue, or a paper that discusses it?
The target y(t+21) - y(t) can be defined in two ways: as the difference of end-of-month values or as the difference of monthly averages. Because of this, I have not applied any treatment to the data yet.
There are a few ways to handle missing data. The simplest is to replace the NaN values with the mean of the column or with a constant value (such as 0 or 1).
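As a minimal sketch of these two simple options, assuming the features sit in a pandas DataFrame (the column names `lag_20` and `lag_21` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: rows are months, columns are daily-difference lags.
X = pd.DataFrame({
    "lag_20": [0.1, np.nan, 0.3],
    "lag_21": [np.nan, np.nan, 0.5],
})

# Option 1: replace every NaN with a constant.
X_const = X.fillna(0.0)

# Option 2: replace each NaN with the mean of its own column
# (X.mean() skips NaN by default).
X_mean = X.fillna(X.mean())

print(X_mean)
```

If the model is evaluated on held-out data, it is probably safer to compute the column means on the training rows only, so that no information from the test period leaks into the features.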
However, if you expect data points that are close in time to be more similar (which is common for time series such as your market data), it is a good idea to estimate the missing values from the surrounding data points, using techniques like:
- Rolling Statistics Imputation: slide a rolling window over your data and replace each missing value with the mean or median of the window.
- K-Nearest Neighbors (KNN) Imputation: fill each missing value based on its ‘k’ nearest values.
- etc.
These techniques and more are summarized in Data Imputation Demystified | Time Series Data.
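Rolling statistics imputation can be sketched with pandas alone. This is only an illustration on a toy series; the window size of 3 and the centered window are assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Toy series with two gaps.
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Centered 3-point rolling mean. min_periods=1 lets the window ignore
# the NaN and average whatever neighbors are present.
roll_mean = s.rolling(window=3, center=True, min_periods=1).mean()

# Keep observed values, fill only the gaps with the rolling mean.
filled = s.fillna(roll_mean)

print(filled)
```

A centered window uses values that come after the gap, which may leak future information in a forecasting setup; dropping `center=True` gives a trailing window that only looks backward. For the KNN approach, scikit-learn's `KNNImputer` is one available implementation.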