I have ten dataframes of event level data from ten different hockey seasons that each have a couple columns that I’d like to one hot encode with the intention of training a model on the historical data to make predictions about this years data . My current issue is that that columns that need to be encoded don’t contain all possible values in each dataframe. Just as an example there have been some expansion teams added, so dataframes for earlier years would not have these teams and thus they would not have an encoded column.
The first thing I tried was concatenating all ten dataframes and doing my one hot encoding that way, but my dataframe became too big and my jupyter notebook kernel kept dying. So now I’ve decided to try to approach each dataframe separately, but this creates the problem for me that I explained above and I’ve had a tough time finding a solution. Here’s an example of what I’m looking for. I would like Table 1 and Table 2 to be one hot encoded as follows.
Table 1
Team | year |
---|---|
A | 1 |
B | 1 |
Table 2
Team | year |
---|---|
B | 2 |
C | 2 |
Table 1 one hot encoded
Team | year | A | B | C |
---|---|---|---|---|
A | 1 | 1 | 0 | 0 |
B | 1 | 0 | 1 | 0 |
Table 2 one hot encoded
Team | year | A | B | C |
---|---|---|---|---|
B | 2 | 0 | 1 | 0 |
C | 2 | 0 | 0 | 1 |
So essentially the question is, how can I get that “C” column in Table 1 and that “A” column in Table 2, since those values don’t exist in those tables.