I have been learning a bit about machine learning and have tried a few model types (xgboost, LogisticRegression) on some test data. The more I use these models, the more I realize they work with a specific type of data: columns that can be turned into numbers. Even things like the make/model of a car can be turned into numbers, because there is a finite set of values that repeat across the dataset.
The dataset that I really want to work with has things like First and Last Name, Company Name, Email Address, etc.: strings that are unique. Here is an example:
First and Last Name | Company Name | Email Address | Is Fraud |
---|---|---|---|
WHOLE FOODS CVS EVALUATION | WHOLE FOODS/CVS EVALUATION | [email protected] | True |
WHOLE FOODS STORE | WHOLE FOODS STORE | [email protected] | True |
Tina Rosen | Best Wares Shoes | [email protected] | False |
Joe John | WHOLE FOODS MARKET SURVEY | [email protected] | True |
Stacey Parket | S Parket Outlet | [email protected] | False |
Michael Phelan | KROGER | [email protected] | True |
This is a small sample of the data that I have, but you can see it doesn’t look like the datasets used with the models I’ve learned about and worked with. I’ve tried things like OneHotEncoder and LabelEncoder, but they turn each string into an arbitrary integer or indicator column that doesn’t really mean anything, since the values don’t repeat.
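To illustrate what I mean, here is a minimal sketch of the problem. It uses a hand-rolled `one_hot` helper as a stand-in for one-hot encoding (it is not scikit-learn's OneHotEncoder, just the same idea in plain Python). The company names come from the sample above; the car makes are made-up values I added for contrast.

```python
def one_hot(values):
    """Hand-rolled one-hot encoding: one column per distinct value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


# Hypothetical repeating categorical column (made-up car makes).
car_makes = ["Ford", "Toyota", "Ford", "Honda", "Toyota", "Ford"]

# Company names from the sample table; every value is distinct.
company_names = [
    "WHOLE FOODS/CVS EVALUATION",
    "WHOLE FOODS STORE",
    "Best Wares Shoes",
    "WHOLE FOODS MARKET SURVEY",
    "S Parket Outlet",
    "KROGER",
]

make_cats, make_rows = one_hot(car_makes)
name_cats, name_rows = one_hot(company_names)

# Repeating values share columns; unique values get one column per row.
print(len(make_cats), "columns for", len(car_makes), "car rows")
print(len(name_cats), "columns for", len(company_names), "company rows")
```

With the car makes, several rows share a column, so a model has something to generalize from. With the company names, every column is set in exactly one row, so the encoding is effectively just a row ID: "WHOLE FOODS STORE" and "WHOLE FOODS MARKET SURVEY" end up looking as unrelated as any other pair.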
I know it’s easy to look at that sample and think “oh, just write validators yourself that look for multiple periods in the email, specific words in the name, etc.” but there are thousands of variations of the fraud accounts that rules like that wouldn’t catch.
So my question is: is there a machine learning model that takes in things like those email addresses/names and learns what fraudulent email addresses/names look like?