Apologies if this is a daft question – I’m fairly new to machine learning.
Using scikit-learn, I’ve set up a regression model to predict customers’ maximum spend per transaction. The dataset I’m using looks a bit like this; the target column is maximum spend per transaction during the previous year:
customer_number | metric_1 | metric_2 | target
----------------|----------|----------|-------
111 | A | X | 15
222 | A | Y | 20
333 | B | Y | 30
I split the dataset into training & testing sets, one-hot encode the features, train the model, and make some test predictions:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Keeping the target as a Series avoids a shape warning from fit() later on
target = dataset["target"]
features = dataset.drop("target", axis = 1)
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size = 0.25)

# One-hot encode; reindex the test columns against the training columns so
# both frames agree even if a category only appears in one split
train_features = pd.get_dummies(train_features)
test_features = pd.get_dummies(test_features).reindex(columns = train_features.columns, fill_value = 0)

model = RandomForestRegressor()
model.fit(X = train_features, y = train_target)
test_prediction = model.predict(X = test_features)
I can output various error metrics for the model (mean absolute error, mean squared error, etc.) using the relevant scikit-learn functions. However, I'd like to be able to tell which customers' predictions are the most inaccurate. So I want to be able to create a dataframe which looks like this:
customer_number | target | prediction | error
----------------|--------|----------- |------
111 | 15 | 17 | 2
222 | 20 | 19 | 1
333 | 30 | 50 | 20
I can use this to investigate whether there is any correlation between the features and inaccurate predictions. In this example, customer 333 has by far the biggest error, so I could tentatively infer that customers with metric_1 = B end up with less accurate predictions.
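To make that concrete, assuming I already had a frame like the one above merged back with the features (the `merged` frame below is made up for illustration), I imagine I could compare the mean error per category like this:

```python
import pandas as pd

# Hypothetical merged frame: test-set features plus per-customer error
merged = pd.DataFrame({
    "metric_1": ["A", "A", "B", "B"],
    "error":    [2, 1, 20, 15],
})

# Mean absolute error per category of metric_1
means = merged.groupby("metric_1")["error"].mean()
print(means)
# In this toy example: A -> 1.5, B -> 17.5
```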
I think I can calculate the errors like this (please correct me if I'm wrong), but I don't know how to tie them back to the customer numbers.
error = abs(test_target - test_prediction)
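For reference, here is a runnable toy version of my attempt, with made-up data. Since train_test_split seems to preserve the pandas index, I've been wondering whether I can use that index to look the customer numbers back up in the original dataset – I'm not sure this is the idiomatic way:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Toy dataset mirroring the structure above (all values made up)
dataset = pd.DataFrame({
    "customer_number": [111, 222, 333, 444, 555, 666, 777, 888],
    "metric_1": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "metric_2": ["X", "Y", "Y", "X", "X", "Y", "X", "Y"],
    "target":   [15, 20, 30, 25, 18, 35, 16, 28],
})

target = dataset["target"]
# Keep customer_number out of the features so the model can't train on the ID
features = dataset.drop(columns = ["target", "customer_number"])

train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size = 0.25, random_state = 0
)

# One-hot encode, aligning test columns to the training columns
train_features = pd.get_dummies(train_features)
test_features = pd.get_dummies(test_features).reindex(
    columns = train_features.columns, fill_value = 0
)

model = RandomForestRegressor(random_state = 0)
model.fit(train_features, train_target)
test_prediction = model.predict(test_features)

# train_test_split preserves the original index, so test_target.index
# points back into `dataset` – use it to recover customer_number
results = pd.DataFrame({
    "customer_number": dataset.loc[test_target.index, "customer_number"],
    "target": test_target,
    "prediction": test_prediction,
})
results["error"] = (results["target"] - results["prediction"]).abs()
print(results.sort_values("error", ascending = False))
```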
Does anyone know how I can get the desired result please?