I need a 'domain-specific' feature, with a Python implementation, to demonstrate the Feature Engineering step of the MLOps cycle on this data set.
I’m teaching Machine Learning to high school students. We are using the classic ML diabetes data set. I need to keep it simple using pandas, numpy and scikit-learn.
What I have done so far:
The practical feature engineering ideas I have are categorising gender, calculating age from DoB (I have added dates to the original data set) and calculating a risk percentage based on age * BMI.
However, these focus on deriving new variables from existing features and on feature interactions. I really need an idea and code for practically demonstrating 'Creating Domain-Specific Features'.
Code I have so far, which I need to build on with a domain-specific example:
import pandas as pd
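# Data is imported as CSV, which students have already done some basic wrangling on.
# parse_dates loads DoB and DoTest as datetimes (assumes a format pandas can parse) so they can be subtracted.
data_frame = pd.read_csv("2.2.1.wrangled_data.csv", parse_dates=["DoB", "DoTest"])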
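# Encode gender as -1 (male) or 1 (female); any other value becomes None
data_frame['SEX'] = data_frame['SEX'].apply(lambda gender: -1 if gender.lower() == 'male' else 1 if gender.lower() == 'female' else None)
# Age in whole years on the date of the test
data_frame['Age'] = ((data_frame['DoTest'] - data_frame['DoB']).dt.days / 365.25).round().astype(int)
# Age * BMI interaction, then scaled so the highest value in the data set is 100
data_frame['Risk'] = data_frame['BMI'] * data_frame['Age']
data_frame['RiskPercentage'] = ((data_frame['Risk'] / data_frame['Risk'].max()) * 100).round(2)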
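To show the kind of thing I mean by a domain-specific feature, here is a rough sketch that bins BMI into the standard WHO adult weight categories, which is knowledge from the medical domain rather than from the data itself. The small `sample` frame, the `BMI_Category` column name and the label text are made up for illustration, and it assumes the BMI column holds real BMI values rather than the normalised values in scikit-learn's built-in copy of the data set.

```python
import pandas as pd

# Hypothetical sample rows; in class this would be the wrangled data_frame.
sample = pd.DataFrame({"BMI": [17.9, 22.4, 27.3, 33.1]})

# WHO adult BMI cut-offs: under 18.5, 18.5 to under 25, 25 to under 30, 30 and over.
bmi_bins = [0, 18.5, 25, 30, float("inf")]
bmi_labels = ["Underweight", "Healthy", "Overweight", "Obese"]

# right=False makes each bin include its lower edge,
# so a BMI of exactly 25.0 is labelled Overweight.
sample["BMI_Category"] = pd.cut(
    sample["BMI"], bins=bmi_bins, labels=bmi_labels, right=False
)

# One-hot encode the category so a scikit-learn model can use it.
encoded = pd.get_dummies(sample, columns=["BMI_Category"])

print(sample)
print(encoded.columns.tolist())
```

Something along these lines would let students see medical knowledge being turned into a column, but I am not sure it is the best example for this data set.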
Any help or ideas appreciated.
I should add that my background is Web Dev, so ML is a new skill set for me.