Question

My model always reports an accuracy of 1.0000 and I don’t know why. Below is my entire fine-tuning code, in the hope that someone can point out where I am going wrong.

I am using Hugging Face’s TFDistilBertForSequenceClassification for a sequence classification task: predicting one of 2 labels for sentences of English text.

I use the “distilbert-base-uncased” model.

I get my input from a csv file that I construct from an annotated corpus I received. Here’s a sample of that:

                                                text class  
0  There is a great deal of truth to the anti-vax...   lie  
1  Jenny mccarthy is a learned doctor who deserve...   lie  
2      Driving doesnt really require any practice.'   lie  
3  Drinking and driving is a winning and safe com...   lie  
4  Good hygiene isnt really important or attract...   lie

The distinct labels are: Truth and Lie
This is the code I am using to fine tune my model:

# If running on Kaggle, make sure to have the necessary packages installed
!pip install transformers

import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification
import tensorflow as tf
from sklearn.metrics import confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
dataset_path = '/kaggle/input/open-domain/open domain - Copy.csv'
df = pd.read_csv(dataset_path)

# Display the first few rows of the dataset
print(df.head())

# Rename the 'class' column to 'label' for consistency
df = df.rename(columns={'class': 'label'})

# Check for any missing values
print("Missing values:\n", df.isnull().sum())

# Display the data types of the columns
print("Data types:\n", df.dtypes)

# Verify the unique values in the label column
print("Unique values in 'label' column:", df['label'].unique())

# Ensure there are no missing values in 'text' and 'label' columns
df = df.dropna(subset=['text', 'label'])

# Preprocess the dataset
df['label'] = df['label'].apply(lambda x: 0 if x == 'Lie' else 1)
texts = df['text'].tolist()
labels = df['label'].tolist()

# Split the dataset into train, validation, and test sets
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42, stratify=temp_labels)



# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize the datasets
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

# Convert to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels))

# Initialize the model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

# Compile the model with correct loss function and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Train the model
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, validation_data=val_dataset.batch(16))

# Evaluate the model on the test set
result = model.evaluate(test_dataset.batch(16))

# The result will be a list where the first element is the loss and the second is the accuracy
test_loss, test_accuracy = result[0], result[1]
print(f"Test loss: {test_loss}, Test accuracy: {test_accuracy}")

# Make predictions on the test set
predictions = model.predict(test_dataset.batch(16)).logits
predicted_labels = np.argmax(predictions, axis=1)

# Compute the confusion matrix
cm = confusion_matrix(test_labels, predicted_labels)
print("Confusion Matrix:")
print(cm)

# Plot the confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Implement cross-validation to further validate model performance
def create_model():
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    return model

# Define cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []

for train_index, val_index in kf.split(texts, labels):
    train_texts_cv, val_texts_cv = np.array(texts)[train_index], np.array(texts)[val_index]
    train_labels_cv, val_labels_cv = np.array(labels)[train_index], np.array(labels)[val_index]

    # Tokenize
    train_encodings_cv = tokenizer(train_texts_cv.tolist(), truncation=True, padding=True)
    val_encodings_cv = tokenizer(val_texts_cv.tolist(), truncation=True, padding=True)
    
    # Convert to TensorFlow datasets
    train_dataset_cv = tf.data.Dataset.from_tensor_slices((dict(train_encodings_cv), train_labels_cv))
    val_dataset_cv = tf.data.Dataset.from_tensor_slices((dict(val_encodings_cv), val_labels_cv))
    
    # Create and train the model
    model_cv = create_model()
    model_cv.fit(train_dataset_cv.shuffle(1000).batch(16), epochs=3, validation_data=val_dataset_cv.batch(16))
    
    # Evaluate the model
    result_cv = model_cv.evaluate(val_dataset_cv.batch(16))
    accuracy_scores.append(result_cv[1])

print("Cross-validation accuracy scores:", accuracy_scores)
print("Mean accuracy:", np.mean(accuracy_scores))

And here is the output from fine-tuning:

Epoch 1/3
314/314 [==============================] - 80s 155ms/step - loss: 0.0119 - accuracy: 0.9996 - val_loss: 2.9357e-05 - val_accuracy: 1.0000
Epoch 2/3
314/314 [==============================] - 41s 131ms/step - loss: 1.6639e-05 - accuracy: 1.0000 - val_loss: 4.4483e-06 - val_accuracy: 1.0000
Epoch 3/3
314/314 [==============================] - 41s 131ms/step - loss: 4.2477e-06 - accuracy: 1.0000 - val_loss: 1.4585e-06 - val_accuracy: 1.0000
68/68 [==============================] - 8s 37ms/step - loss: 1.4598e-06 - accuracy: 1.0000
Test loss: 1.4597594599763397e-06, Test accuracy: 1.0
68/68 [==============================] - 7s 35ms/step
Confusion Matrix:
[[1076]]

I’ve tried everything and run the model multiple times, but I always get the same results. I do know that the data I am working with isn’t great, and I am only training on about 7k labelled sentences.

I posted everything I am using to run the model in the hope that someone can point me to where I am going wrong. Thank you very much in advance for your help!
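
One check that may be worth running, given that the confusion matrix above is 1x1 (only one class appears in the test labels and predictions): the sample rows show the label as lowercase `lie`, while the preprocessing step compares against the string `'Lie'`. If the CSV really stores lowercase values, every row would be mapped to 1. This is only a guess from the sample, not a confirmed diagnosis. The sketch below uses toy values and a hypothetical helper, `encode_labels`, with an assumed spelling of the two classes; neither is part of the original script.

```python
from collections import Counter

# Assumed spelling of the two classes after lowercasing (not confirmed
# against the real CSV).
LABEL_TO_ID = {"lie": 0, "truth": 1}


def encode_labels(raw_labels):
    """Hypothetical helper: map raw label strings to integer ids,
    ignoring case and surrounding whitespace."""
    encoded = []
    for raw in raw_labels:
        key = str(raw).strip().lower()
        if key not in LABEL_TO_ID:
            # Fail loudly instead of silently assigning a default class.
            raise ValueError(f"Unexpected label: {raw!r}")
        encoded.append(LABEL_TO_ID[key])
    return encoded


# Toy values imitating the sample rows (mostly lowercase 'lie').
sample = ["lie", "lie", "Truth", "truth ", "Lie"]

# Mapping as written in the script: only the exact string 'Lie' becomes 0.
original = [0 if x == "Lie" else 1 for x in sample]
print(Counter(original))    # Counter({1: 4, 0: 1})

# Case-insensitive mapping keeps both classes.
normalised = encode_labels(sample)
print(Counter(normalised))  # Counter({0: 3, 1: 2})
```

On the real data, printing `Counter(labels)` (or `df['label'].value_counts()`) right after the preprocessing step would show whether both classes survive the mapping before any training is done.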

Huggingface TFBertForSequenceClassification is overfitting