My model always reports an accuracy of 1.0000 and I don’t know why. Below is my entire fine-tuning code, in the hope that someone can point out where I am going wrong.
I am using Hugging Face’s TFDistilBertForSequenceClassification for a sequence classification task: predicting one of 2 labels for sentences of English text.
I use the “distilbert-base-uncased” model.
I get my input from a CSV file that I constructed from an annotated corpus I received. Here’s a sample of it:
                                                text class
0  There is a great deal of truth to the anti-vax...   lie
1  Jenny mccarthy is a learned doctor who deserve...   lie
2       Driving doesnt really require any practice.'   lie
3  Drinking and driving is a winning and safe com...   lie
4  Good hygiene isnt really important or attract...    lie
The distinct labels are: Truth and Lie
This is the code I am using to fine-tune my model:
# If running on Kaggle, make sure to have the necessary packages installed
!pip install transformers
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification
import tensorflow as tf
from sklearn.metrics import confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
dataset_path = '/kaggle/input/open-domain/open domain - Copy.csv'
df = pd.read_csv(dataset_path)
# Display the first few rows of the dataset
print(df.head())
# Rename columns to 'text' and 'label' for consistency
df = df.rename(columns={'class': 'label'})
# Check for any missing values
print("Missing values:\n", df.isnull().sum())
# Display the data types of the columns
print("Data types:\n", df.dtypes)
# Verify the unique values in the label column
print("Unique values in 'label' column:", df['label'].unique())
# Ensure there are no missing values in 'text' and 'label' columns
df = df.dropna(subset=['text', 'label'])
# Preprocess the dataset
df['label'] = df['label'].apply(lambda x: 0 if x == 'Lie' else 1)
texts = df['text'].tolist()
labels = df['label'].tolist()
# Split the dataset into train, validation, and test sets
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42, stratify=temp_labels)
# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
# Tokenize the datasets
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
# Convert to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels))
# Initialize the model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
# Compile the model with correct loss function and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# Train the model
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, validation_data=val_dataset.batch(16))
# Evaluate the model on the test set
result = model.evaluate(test_dataset.batch(16))
# The result will be a list where the first element is the loss and the second is the accuracy
test_loss, test_accuracy = result[0], result[1]
print(f"Test loss: {test_loss}, Test accuracy: {test_accuracy}")
# Make predictions on the test set
predictions = model.predict(test_dataset.batch(16)).logits
predicted_labels = np.argmax(predictions, axis=1)
# Compute the confusion matrix
cm = confusion_matrix(test_labels, predicted_labels)
print("Confusion Matrix:")
print(cm)
# Plot the confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Implement cross-validation to further validate model performance
def create_model():
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    return model
# Define cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
for train_index, val_index in kf.split(texts, labels):
    train_texts_cv, val_texts_cv = np.array(texts)[train_index], np.array(texts)[val_index]
    train_labels_cv, val_labels_cv = np.array(labels)[train_index], np.array(labels)[val_index]
    # Tokenize
    train_encodings_cv = tokenizer(train_texts_cv.tolist(), truncation=True, padding=True)
    val_encodings_cv = tokenizer(val_texts_cv.tolist(), truncation=True, padding=True)
    # Convert to TensorFlow datasets
    train_dataset_cv = tf.data.Dataset.from_tensor_slices((dict(train_encodings_cv), train_labels_cv))
    val_dataset_cv = tf.data.Dataset.from_tensor_slices((dict(val_encodings_cv), val_labels_cv))
    # Create and train the model
    model_cv = create_model()
    model_cv.fit(train_dataset_cv.shuffle(1000).batch(16), epochs=3, validation_data=val_dataset_cv.batch(16))
    # Evaluate the model
    result_cv = model_cv.evaluate(val_dataset_cv.batch(16))
    accuracy_scores.append(result_cv[1])
print("Cross-validation accuracy scores:", accuracy_scores)
print("Mean accuracy:", np.mean(accuracy_scores))
And here is the output from fine-tuning:
Epoch 1/3
314/314 [==============================] - 80s 155ms/step - loss: 0.0119 - accuracy: 0.9996 - val_loss: 2.9357e-05 - val_accuracy: 1.0000
Epoch 2/3
314/314 [==============================] - 41s 131ms/step - loss: 1.6639e-05 - accuracy: 1.0000 - val_loss: 4.4483e-06 - val_accuracy: 1.0000
Epoch 3/3
314/314 [==============================] - 41s 131ms/step - loss: 4.2477e-06 - accuracy: 1.0000 - val_loss: 1.4585e-06 - val_accuracy: 1.0000
68/68 [==============================] - 8s 37ms/step - loss: 1.4598e-06 - accuracy: 1.0000
Test loss: 1.4597594599763397e-06, Test accuracy: 1.0
68/68 [==============================] - 7s 35ms/step
Confusion Matrix:
[[1076]]
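A 1x1 confusion matrix like this is what scikit-learn returns when both the true and the predicted labels contain only a single class. As a minimal sketch (the toy arrays below are made up, not my data), passing `labels=[0, 1]` to `confusion_matrix` forces the full 2x2 layout, which makes a missing class easier to spot:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy stand-ins for test_labels / predicted_labels where only class 1 occurs.
y_true = np.array([1, 1, 1])
y_pred = np.array([1, 1, 1])

# Default behaviour: the matrix only covers the classes that actually appear.
cm_default = confusion_matrix(y_true, y_pred)

# Explicit label list: both classes get a row and a column, even if empty.
cm_full = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(cm_default)
print(cm_full)
```

If the row for class 0 is all zeros on the real test set, the split never contained a single class-0 example.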
I’ve tried everything I can think of and have run the model multiple times, but I always get the same results. I do know that the data I am working with isn’t great, and I am only training on about 7k labelled sentences.
I have posted everything I use to run the model in the hope that someone can point out where I am going wrong. Thank you very much in advance for your help!
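One thing that may be worth checking is whether both classes survive the preprocessing step. The sketch below is only an illustration under assumptions: `raw_labels` is a hypothetical stand-in for the values of the `class` column (the sample above shows lowercase `lie`), and the case-insensitive mapping is a guess at the intended behaviour, not something confirmed by the data.

```python
from collections import Counter

# Hypothetical stand-in for the raw values of the 'class' column;
# the real CSV may use different spellings or casing.
raw_labels = ["lie", "truth", "lie", "Truth", " lie "]

# Mapping as written in the script: only the exact string 'Lie' becomes 0.
strict = [0 if x == "Lie" else 1 for x in raw_labels]

# Case- and whitespace-insensitive variant (an assumption about the intent).
label_map = {"lie": 0, "truth": 1}
normalized = [label_map[x.strip().lower()] for x in raw_labels]

print("strict:    ", Counter(strict))
print("normalized:", Counter(normalized))
```

Running the same `Counter` on the real `labels` list right after the mapping step would show whether every example ended up with the same label, in which case an accuracy of 1.0 is trivial to reach.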