I am building a sales prediction model which consists of “Year”, “Month”, “Economy Indicator”, “Customer_Id”, “Product_Id”, “Quantity”, “Sales”, “Margin”.
The cleaned dataset contains about 1.5 millions of rows with the above 8 columns, which was the sales per month per customer per product for the past 6 years. My end goal is being able to predict the sales for the upcoming months for the entire up coming year, but more precisely, the prediction will be on product per customer level, which is a very detailed level.
However, Since my Customer_Id and Product_Id are TEXT, such as “A77BC”, and there are over 100000 unique product_id and 6000 unique customer_id, if I use one hot encoding to label them, the dimentionality will be too high for my device to handle, (for example, my laptop has 16G ram but label the customer_id already requires 24G ram) And I believe there must be a better way of handling such situation, but I am very new to machine learning.