I have a table of ‘grouped’ observations, where each group represents a deal and each row represents a product, but the prediction is to be made at the deal level. Below is a sample dataset.
Sample Dataset:
import pandas as pd

df = pd.DataFrame({'deal': ['deal1', 'deal1', 'deal2', 'deal2', 'deal3', 'deal3'],
                   'product': ['prd_1', 'prd_2', 'prd_1', 'prd_2', 'prd_1', 'prd_2'],
                   'Quantity': [2, 1, 5, 3, 6, 7],
                   'Total Price': [10, 7, 25, 24, 30, 56],
                   'Result': ['Won', 'Won', 'Lost', 'Lost', 'Won', 'Won']})
My Approach:
Flatten the data with pivot_table so that we get one observation per row, i.e. one row per deal, and then proceed with the classification modelling, probably logistic regression or gradient boosting. A sketch of the flattening step is below.
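This is roughly what I mean, assuming the df defined above (the flattened column names such as Quantity_prd_1 are my own naming choice, not a pandas convention):

# One row per deal; the default aggfunc='mean' is a no-op here,
# since each (deal, product) pair occurs exactly once.
wide = df.pivot_table(index='deal',
                      columns='product',
                      values=['Quantity', 'Total Price'])

# Flatten the MultiIndex columns, e.g. ('Quantity', 'prd_1') -> 'Quantity_prd_1'
wide.columns = ['_'.join(col) for col in wide.columns]

# Re-attach the deal-level label; Result is constant within each deal
wide['Result'] = df.groupby('deal')['Result'].first()
wide = wide.reset_index()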
But in the above case we had:

1 column to be pivoted (product, with 2 unique values), and
2 measures (Quantity and Total Price) as the values,

resulting in 2 × 2 = 4 feature columns. The wide-format table is shown below:
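deal   Quantity_prd_1  Quantity_prd_2  Total Price_prd_1  Total Price_prd_2  Result
deal1               2               1                 10                  7     Won
deal2               5               3                 25                 24    Lost
deal3               6               7                 30                 56     Won

And from there, the modelling step I have in mind would look something like this (just a sketch with scikit-learn, assuming the wide frame built above):

from sklearn.linear_model import LogisticRegression

X = wide.drop(columns=['deal', 'Result'])  # the 4 pivoted feature columns
y = (wide['Result'] == 'Won').astype(int)  # binary target at the deal level

model = LogisticRegression().fit(X, y)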
Question/Problem/Thought:
Is this always the best way in cases like these? The problem I see (or maybe it isn't one?) is that when there is more than one column to be pivoted, and those columns have many unique values, the table may get very, very wide: with, say, 2 measures and two pivoted columns of 50 and 20 unique values, we would already end up with 2 × 50 × 20 = 2,000 feature columns.
I would be grateful to hear about alternative, more efficient ways to prepare such a dataset for training, if any!