I’m working on an academic project where I need to fine-tune a text summarization model on a specific type of text. My dataset is a collection of articles, where the article body is the source text and the abstract is the target summary. I’m storing the dataset in JSON format.
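For concreteness, each record looks roughly like this (the key names here are illustrative, not necessarily the exact ones in my file):

```json
{
  "articles": [
    {
      "body": "Full text of the article ...",
      "abstract": "The reference summary ..."
    }
  ]
}
```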
I initially started with BART, but its input window is limited to 1,024 tokens and my documents are much longer, so I switched to BigBird instead.
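This is roughly how I’m loading and running the BigBird checkpoint (a minimal sketch; the input text and generation settings are placeholders):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/bigbird-pegasus-large-bigpatent"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder input; a real article body would go here
article_body = "Full text of one article ..."

# BigBird's block-sparse attention accepts up to 4,096 input tokens,
# versus BART's 1,024
inputs = tokenizer(article_body, truncation=True, max_length=4096,
                   return_tensors="pt")
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```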
I’ve got a few questions and could really use some advice:
- Does this approach sound right to you?
- What should I be doing for text preprocessing? Should I remove everything except English characters? And what about stop words: should I get rid of those?
- Should I be lemmatizing the words?
- Should I remove the abstract sentences from the body before fine-tuning?
- How should I evaluate the fine-tuned model, and what’s the best way to compare it with the original model to see whether it’s actually improving? (My current idea is ROUGE; see the sketch after this list.)
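To make the last question concrete, this is how I imagine comparing the two checkpoints with ROUGE (using the `evaluate` library; the prediction and reference lists are placeholders):

```python
import evaluate

rouge = evaluate.load("rouge")

# Generate summaries with both checkpoints on the same held-out test set,
# then score each against the reference abstracts.
references = ["reference abstract one ...", "reference abstract two ..."]
baseline_preds = ["summary from the original checkpoint ...", "..."]
finetuned_preds = ["summary from the fine-tuned checkpoint ...", "..."]

print("baseline:  ", rouge.compute(predictions=baseline_preds,
                                    references=references))
print("fine-tuned:", rouge.compute(predictions=finetuned_preds,
                                    references=references))
```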
Would love to hear your thoughts. Thanks!
I used these models:

- google/bigbird-pegasus-large-bigpatent
- facebook/bart-large-cnn

With BART, I didn’t get a better model after fine-tuning.
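In case it helps to see what I’m doing, here is roughly my fine-tuning setup (a minimal sketch; the file name, field names, and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/bigbird-pegasus-large-bigpatent"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# "articles.json" with "body"/"abstract" fields stands in for my dataset
dataset = load_dataset("json", data_files="articles.json",
                       field="articles", split="train")

def preprocess(batch):
    # Tokenize the article bodies as inputs and the abstracts as labels
    model_inputs = tokenizer(batch["body"], truncation=True, max_length=4096)
    labels = tokenizer(text_target=batch["abstract"],
                       truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bigbird-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=3e-5,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```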