I’m working with a dataset for a project. Unfortunately, I don’t have access to detailed information about the variables (apart from the descriptions I’ll provide below).
After cleaning the data a bit (it was a real mess), I’m left with this dataset, which I’ll link here (along with a screenshot).
One of the variables is called “margin,” defined as “cumulative customer margin.” This makes me think the variable is expressed in absolute terms (a monetary amount rather than a percentage). I also have two other variables: “price” and “number of transactions.”
When filtering for number of transactions = 1, I’d expect the values of price to always be higher than margin (assuming margin = price – cost). However, I’ve found many anomalous values. I’m attaching a few examples in the screenshot.
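The check described above can be sketched roughly as follows. This is a minimal illustration in plain Python: the rows are made up, the column names are taken from the variable table in this post, and the comparison only makes sense if margin = price – cost really holds.

```python
# Hypothetical sample rows (not real data); column names follow the
# variable table in this post.
rows = [
    {"EVENT_ID": 1, "PRICE": 20.0, "MARGIN": 5.0, "N_TRANSACTIONS": 1},
    {"EVENT_ID": 2, "PRICE": 15.0, "MARGIN": 40.0, "N_TRANSACTIONS": 1},
    {"EVENT_ID": 3, "PRICE": 10.0, "MARGIN": 60.0, "N_TRANSACTIONS": 4},
    {"EVENT_ID": 4, "PRICE": 30.0, "MARGIN": -2.0, "N_TRANSACTIONS": 1},
]


def find_margin_anomalies(rows):
    """Return single-transaction rows whose margin exceeds the price.

    With only one transaction, the cumulative margin should equal the
    margin of that transaction, so MARGIN > PRICE would imply a negative
    cost under the assumption margin = price - cost.
    """
    return [
        r for r in rows
        if r["N_TRANSACTIONS"] == 1 and r["MARGIN"] > r["PRICE"]
    ]


single = [r for r in rows if r["N_TRANSACTIONS"] == 1]
anomalies = find_margin_anomalies(rows)

# Share of single-transaction rows that violate the expectation.
anomaly_share = len(anomalies) / len(single)
print([r["EVENT_ID"] for r in anomalies], round(anomaly_share, 3))
```

Rows with more than one transaction are left out on purpose, since a cumulative margin can legitimately exceed the price of a single transaction.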
Any insights would be greatly appreciated!
Here is the translation of the table:
Variable Data Challenge | Description |
---|---|
EVENT_ID | Transaction ID |
N_ITEMS | Total number of items purchased in the transaction |
PROP_CONBINI | Proportion of “conbini” articles in the transaction |
FAV_GENRE | Preferred manga genre |
PHONE_NUMBER | Customer’s phone number available |
Customer’s e-mail address | |
YEAR | Transaction year |
MONTH | Transaction month |
PAYMENT_TYPE | Agreed payment method |
BOOKS_PAID | Number of manga paid for in previous transactions |
PRICE | Transaction price |
N_SUBSCRIPTIONS | Number of active manga series subscriptions |
SUBSCR_CANC | Number of canceled manga series subscriptions in the past |
POINT_OF_SALE | Point of sale |
AGE | Customer’s age |
DAYS_FROM_PROMO | Days since the last promo launch |
MARGIN | Cumulative customer margin |
N_TRANSACTIONS | Total number of transactions made by the customer |
CUSTOMER_SINCE | Date of the customer’s first transaction |
DATE_LAST_PURCHASE | Date of the customer’s last transaction |
PAID | Amount paid (target variable) |
[Screenshot: examples of anomalous rows]
Does anyone have any ideas? Am I missing something about the “margin” variable?
I’ve also considered the possibility that it represents a relative value, but when I check the values of margin, there are many instances greater than 100, which doesn’t seem possible.
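One way to probe the "relative value" hypothesis is to look at how often margin exceeds 100 and how the margin-to-price ratio behaves. The sketch below uses made-up rows and a hypothetical helper name, `margin_scale_summary`; it only summarizes the two shares and does not settle which interpretation is right.

```python
# Hypothetical sample rows (not real data).
rows = [
    {"PRICE": 20.0, "MARGIN": 5.0},
    {"PRICE": 15.0, "MARGIN": 40.0},
    {"PRICE": 30.0, "MARGIN": 120.0},
    {"PRICE": 50.0, "MARGIN": 10.0},
]


def margin_scale_summary(rows):
    """Summarize how plausible 'margin is a percentage' looks.

    - share_over_100: fraction of rows with MARGIN > 100, which would be
      impossible for a percentage margin.
    - share_ratio_over_1: fraction of rows (with positive price) where
      MARGIN / PRICE > 1, which would be impossible for an absolute
      per-transaction margin.
    """
    n = len(rows)
    over_100 = sum(1 for r in rows if r["MARGIN"] > 100)
    priced = [r for r in rows if r["PRICE"] > 0]
    ratio_over_1 = sum(1 for r in priced if r["MARGIN"] / r["PRICE"] > 1)
    return {
        "share_over_100": over_100 / n,
        "share_ratio_over_1": ratio_over_1 / len(priced),
    }


summary = margin_scale_summary(rows)
print(summary)
```

If both shares are clearly above zero on the real data, neither the percentage reading nor the per-transaction absolute reading fits every row.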
I didn’t find any significant pattern with the other variables.
This variable is crucial for me because I need to infer the average cost, which I plan to use as a weight for false negatives (I’m building a classification model for credit scoring, where 1 = pays, 0 = doesn’t pay).
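The plan above could look roughly like this. It is a sketch under strong assumptions: cost is taken as price – margin, only single-transaction rows with 0 <= margin <= price are used (dropping the anomalous rows is a choice for this example, not a recommendation), and the function names and sample values are made up.

```python
# Hypothetical sample rows (not real data).
rows = [
    {"PRICE": 20.0, "MARGIN": 5.0, "N_TRANSACTIONS": 1},
    {"PRICE": 15.0, "MARGIN": 40.0, "N_TRANSACTIONS": 1},  # anomalous, skipped
    {"PRICE": 30.0, "MARGIN": 6.0, "N_TRANSACTIONS": 1},
    {"PRICE": 10.0, "MARGIN": 60.0, "N_TRANSACTIONS": 4},  # cumulative, skipped
]


def average_cost(rows):
    """Mean of PRICE - MARGIN over consistent single-transaction rows."""
    costs = [
        r["PRICE"] - r["MARGIN"]
        for r in rows
        if r["N_TRANSACTIONS"] == 1 and 0 <= r["MARGIN"] <= r["PRICE"]
    ]
    return sum(costs) / len(costs)


def weighted_error(y_true, y_pred, fn_weight, fp_weight=1.0):
    """Total misclassification cost with separate FN and FP weights.

    Labels follow the post: 1 = pays, 0 = doesn't pay.
    """
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn_weight * fn + fp_weight * fp


avg_cost = average_cost(rows)

# Made-up labels and predictions, just to exercise the function.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0]
total_cost = weighted_error(y_true, y_pred, fn_weight=avg_cost)
print(avg_cost, total_cost)
```

The false-positive weight is left at 1.0 only as a placeholder; it would need its own estimate in a real cost-sensitive setup.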
Any suggestions or insights would be incredibly helpful!