I am trying to evaluate the performance of a base model on a summarization task using the ROUGE score. My objective is to re-calculate the ROUGE score after fine-tuning.
from datasets import load_dataset
from huggingface_hub import login
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    pipeline,
)

hf_token = "XXXXX"
login(hf_token)

data_id = "samsum"
dataset = load_dataset(data_id, trust_remote_code=True)

model_id = "google-t5/t5-base"
llm = pipeline("summarization", model=model_id, device=0)
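To double-check what was loaded, I print the dataset and one sample (this is just an inspection snippet, not part of the evaluation itself):
# Quick look at the splits and their columns
print(dataset)
# One validation sample with its 'dialogue' and 'summary' fields
print(dataset['validation'][0])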
The dataset has train, test, and validation splits. Each split has a “dialogue” and a “summary” field. I want to calculate the ROUGE score for the un-tuned model by comparing the summaries predicted by the pipeline with the ground-truth summaries. So I create a basic prompt by adding the prefix “summarize: ” to each dialogue and run the first 20 samples of the validation set through the pipeline. This is followed by extracting the predictions into a list.
# Add a basic "summarize:" prompt to each dialogue to run through the pipeline
input_diags = [[f'summarize: {i}'] for i in dataset['validation']['dialogue']]

# Also extract the ground truths into a list
ground_truths = list(dataset['validation']['summary'])

# Run the first 20 input dialogues through the pipeline
outputs = llm(input_diags[0:20], max_length=60, clean_up_tokenization_spaces=True)

# Extract the predicted summaries into a flat list (each pipeline result here is a list with one dict)
outputs = [i[0]['summary_text'] for i in outputs]
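Before scoring, I do a quick sanity check that the i-th prediction lines up with the i-th reference (illustrative only, using the variables from above):
# Print a couple of prediction/reference pairs to confirm they are aligned
for pred, ref in zip(outputs[:2], ground_truths[:2]):
    print("PREDICTION:", pred)
    print("REFERENCE: ", ref)
    print("-" * 40)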
Now I have two lists that I can feed into the ROUGE evaluation.
import evaluate
rouge = evaluate.load('rouge')
foundation_model_results = rouge.compute(
    predictions=outputs,             # list of predicted summaries
    references=ground_truths[0:20],  # list of ground-truth summaries
    use_aggregator=True,
    use_stemmer=True,
)
print(foundation_model_results)
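Since I will need to repeat exactly the same steps after fine-tuning, I currently wrap everything in a small helper (a rough sketch; the name evaluate_rouge and the n_samples parameter are just my own):
def evaluate_rouge(summarizer, split, n_samples=20, max_length=60):
    # Prompt the first n_samples dialogues, summarize them, and score against the references
    prompts = [[f'summarize: {d}'] for d in split['dialogue'][:n_samples]]
    references = list(split['summary'][:n_samples])
    preds = summarizer(prompts, max_length=max_length, clean_up_tokenization_spaces=True)
    preds = [p[0]['summary_text'] for p in preds]
    return rouge.compute(predictions=preds, references=references,
                         use_aggregator=True, use_stemmer=True)

# Same numbers as above, but reusable once the model is fine-tuned
print(evaluate_rouge(llm, dataset['validation']))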
This works, but is there a better way to do this, so that I don’t have to explicitly create the outputs and ground_truths lists?