I am building a classification model on the Home Credit Default Risk dataset from Kaggle.
Ideas
- Convert the tabular data into text.
- Tokenize the text using BertTokenizerFast.
- Fine-tune BertForSequenceClassification using LoRA.
- The dataset is imbalanced: 8.7% of the rows are labelled 1.
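The first step, turning a row into text, could look roughly like the sketch below. The column names (`SK_ID_CURR`, `FLAG_OWN_CAR`, and so on) and the helper `row_to_prompt` are assumptions for illustration, not the actual conversion code; only the first three sentences of the prompt are reproduced.

```python
def row_to_prompt(row: dict) -> str:
    """Render one tabular record as a few English sentences."""
    has_car = row["FLAG_OWN_CAR"] == "Y"        # assumed column name
    has_home = row["FLAG_OWN_REALTY"] == "Y"    # assumed column name
    car = "owns a car" if has_car else "does not own a car"
    home = "owns a house or flat" if has_home else "does not own a house or flat"
    # Use "but" when the two ownership facts disagree, "and" otherwise.
    joiner = "and" if has_car == has_home else "but"
    sentences = [
        f"Applicant ID {row['SK_ID_CURR']} has a {row['NAME_CONTRACT_TYPE']}.",
        f"The applicant's gender is {row['CODE_GENDER']} who {car} {joiner} {home}.",
        f"They have {row['CNT_CHILDREN']} children and an income of {row['AMT_INCOME_TOTAL']}.",
    ]
    return "\n".join(sentences)


# Hypothetical record holding the values that appear in the sample prompt.
example_row = {
    "SK_ID_CURR": 100002,
    "NAME_CONTRACT_TYPE": "Cash loans",
    "CODE_GENDER": "M",
    "FLAG_OWN_CAR": "N",
    "FLAG_OWN_REALTY": "Y",
    "CNT_CHILDREN": 0,
    "AMT_INCOME_TOTAL": 202500.0,
}
prompt = row_to_prompt(example_row)
print(prompt)
```

With pandas, a function like this could be applied row by row via `df.apply(lambda r: row_to_prompt(r.to_dict()), axis=1)` to build the `prompt` column.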
Here is the sample text:
Identify the loan default status using the records of loan attributes. Respond with either 'yes' indicating payment difficulties or 'no' for no payment difficulties.
Applicant ID 100002 has a Cash loans.
The applicant's gender is M who does not own a car but owns a house or flat.
They have 0 children and an income of 202500.0.
The credit amount is 406597.5 with an annuity of 24700.5.
The loan was applied for Unaccompanied and the applicant is working with a secondary / secondary special education level,
single / not married, and lives in a house / apartment.
The normalized population of the region where they live is 0.018801.
They are 26 years old and have been employed for approximately 2 years.
They changed their registration about 10.0 years before the application and changed their identity document around 6 years before the application.
They have owned their car for 0 years.
Their mobile phone is reachable and their work phone is reachable, and they have a landline phone and do not have an email address.
They work as Laborers and live with 1.0 family member(s).
Their region rating is 2, and they applied for the loan on a wednesday at 10 o'clock.
Their permanent address matches their contact address and matches their work address.
Their contact address matches their work address.
Their city of permanent address matches their contact address and matches their work address.
Their city of contact address matches their work address.
They work for a business entity type 3.
Their external scores are 0.0830369673913225, 0.2629485927471776, and 0.1393757800997895 respectively.
The average size of the applicant's apartments, basements, years of exploitation, years of building, common areas, elevators, entrances, maximum floors, minimum floors, land area, living apartments, living area, non-living apartments, and non-living area are 0.0247, 0.0369, 0.9722, 0.6192, 0.0143, 0.0, 0.069, 0.0833, 0.125, 0.0369, 0.0202, 0.019, 0.0, and 0.0, respectively.
The mode size of the applicant's apartments, basements, years of exploitation, years of building, common areas, elevators, entrances, maximum floors, minimum floors, land area, living apartments, living area, non-living apartments, and non-living area are 0.0252, 0.0383, 0.9722, 0.6341, 0.0144, 0.0, 0.069, 0.0833, 0.125, 0.0377, 0.022, 0.0198, 0.0, and 0.0, respectively.
The median size of the applicant's apartments, basements, years of exploitation, years of building, common areas, elevators, entrances, maximum floors, minimum floors, land area, living apartments, living area, non-living apartments, and non-living area are 0.025, 0.0369, 0.9722, 0.6243, 0.0144, 0.0, 0.069, 0.0833, 0.125, 0.0375, 0.0205, 0.0193, 0.0, and 0.0, respectively.
The housing repair fund type is reg oper account.
The house type is block of flats, with a total area mode of 0.0149, and the wall material is Stone, brick.
The emergency state mode is No.
The number of observation points of social circle 30 days before the application is 2.0 with 2.0 defects.
The number of observation points of social circle 60 days before the application is 2.0 with 2.0 defects.
The applicant changed their phone 1134.0 days before the application.
The client provided documents: Document 3.
The number of enquiries to the Credit Bureau about the client one hour, day, week, month, quarter, and year before application are 0.0, 0.0, 0.0, 0.0, 0.0, and 1.0, respectively.
The loan should be classified as Yes
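One thing worth checking on a prompt this long is whether it fits into the 128 tokens set as `max_length` in the tokenizer call further down. The sketch below does not run the real tokenizer. It counts word and punctuation pieces with a regular expression, which should be a lower bound on the WordPiece count, because WordPiece only ever splits such pieces further. The helper names and the excerpt are my own choices for illustration.

```python
import re

MAX_LENGTH = 128        # value passed as max_length in the tokenizer call
SPECIAL_TOKENS = 2      # [CLS] and [SEP] added by add_special_tokens=True


def min_token_count(text: str) -> int:
    """Count runs of word characters and single punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def surely_truncated(text: str, max_length: int = MAX_LENGTH) -> bool:
    """True if even the lower bound on the token count exceeds max_length."""
    return min_token_count(text) + SPECIAL_TOKENS > max_length


# Excerpt of the sample prompt: the instruction, two applicant sentences,
# and one of the three housing-statistics sentences.
excerpt = (
    "Identify the loan default status using the records of loan attributes. "
    "Respond with either 'yes' indicating payment difficulties or 'no' for no payment difficulties. "
    "Applicant ID 100002 has a Cash loans. "
    "The applicant's gender is M who does not own a car but owns a house or flat. "
    "The average size of the applicant's apartments, basements, years of exploitation, "
    "years of building, common areas, elevators, entrances, maximum floors, minimum floors, "
    "land area, living apartments, living area, non-living apartments, and non-living area are "
    "0.0247, 0.0369, 0.9722, 0.6192, 0.0143, 0.0, 0.069, 0.0833, 0.125, 0.0369, 0.0202, "
    "0.019, 0.0, and 0.0, respectively."
)
pieces = min_token_count(excerpt)
print(pieces, "pieces at minimum; truncated:", surely_truncated(excerpt))
```

If this excerpt alone already exceeds the limit, the full prompt does too, so with `truncation=True` the model would only see the beginning of each record. Decimal numbers are costly here because each one splits into at least three pieces (`0`, `.`, `0247`).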
Below is the code:
# Imports assumed by the snippet
import pandas as pd
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from sklearn.metrics import roc_auc_score
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load main datasets
train = pd.read_csv('data/home-credit-default-risk/train.csv')
train_oversampled = pd.read_csv('data/home-credit-default-risk/train_oversampled.csv')
train_dataset = Dataset.from_pandas(train)
train_oversampled_dataset = Dataset.from_pandas(train_oversampled)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def preprocess_function(examples):
    # Prompts longer than max_length tokens are cut off by truncation=True
    tokenized_inputs = tokenizer(examples['prompt'], add_special_tokens=True, max_length=128, padding='max_length', truncation=True)
    tokenized_inputs['labels'] = examples['TARGET']
    return tokenized_inputs

train_encodings = train_dataset.map(preprocess_function, batched=True)
train_oversampled_encodings = train_oversampled_dataset.map(preprocess_function, batched=True)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence Classification
    r=8,  # Low-rank dimension
    lora_alpha=16,
    lora_dropout=0.1
)
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='epoch',  # renamed to eval_strategy in recent transformers releases
)

def compute_metrics(pred):
    labels = pred.label_ids
    logits = pred.predictions  # raw logits of shape (n_samples, 2), not probabilities
    # The logit difference is the log-odds of the positive class,
    # so it ranks samples exactly like the softmax probability would.
    preds = logits[:, 1] - logits[:, 0]
    auc = roc_auc_score(labels, preds)
    return {
        'auc': auc,
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_oversampled_encodings,
    eval_dataset=train_oversampled_encodings,  # Normally, you would use a validation set
    compute_metrics=compute_metrics
)
trainer.train()
outputs_eval = trainer.predict(train_encodings)
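The code above evaluates on the same oversampled rows it trains on, so the reported AUC says little about unseen data. A minimal sketch of a held-out split follows, assuming the split is made first and only the training part is oversampled afterwards (otherwise duplicated rows end up on both sides). The function names and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with a similar class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate random minority rows (training rows only) until classes match."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra


# Toy labels with the same 8.7% positive rate as the dataset.
labels = [1] * 87 + [0] * 913
train_idx, val_idx = stratified_split(labels)
balanced_train_idx = oversample_minority(train_idx, labels)
print(len(balanced_train_idx), "training rows,", len(val_idx), "validation rows")
```

The resulting index lists could then be used with `Dataset.select` to build the `train_dataset` and `eval_dataset` passed to the `Trainer`, with the validation part left at its natural class ratio.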
I am new to deep learning. Here are the causes I suspect:
- Imbalanced sample: I tried oversampling, but the result was the same.
- Data processing: I could not identify any problem.
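As an alternative to oversampling, the imbalance could be handled in the loss by weighting each class inversely to its frequency. The sketch below only computes the weights, using the same heuristic as scikit-learn's `class_weight='balanced'`. The function name and the toy labels are assumptions, and whether this helps here is untested.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy labels with the same 8.7% positive rate as the dataset.
labels = [1] * 87 + [0] * 913
weights = balanced_class_weights(labels)
print(weights)
```

Weights like these could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss` inside a `Trainer` subclass that overrides `compute_loss`, so that errors on the rare positive class cost more than errors on the majority class.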