Recently, I transitioned from computer vision to NLP tasks.
I have a collection of key-press history data recorded as ASCII text files, capturing every keystroke users made during their sessions. The goal is to classify this data to detect inappropriate content, such as attempts to download viruses, visit illegal websites, etc.
NB:

- High Noise Level: The key-press history files contain a significant amount of noise from frequent meaningless key presses (e.g., ‘wasd’ while the user is gaming), which makes it difficult to extract meaningful text.
- Multilingual Input: I want to support several languages. For instance, a user might type “crfxfnm dbhec”, a phrase in another language meaning “download virus”, typed while the keyboard was set to the classic English layout (see the sketch after this list).
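For illustration, here is a minimal sketch of the kind of layout remapping I have in mind, assuming the source language is Russian (ЙЦУКЕН layout); the mapping table would need to be extended for other languages:

```python
# Map characters typed on a US QWERTY layout back to the Russian (ЙЦУКЕН)
# layout, so that "crfxfnm dbhec" becomes "скачать вирус" ("download virus").
QWERTY  = "qwertyuiop[]asdfghjkl;'zxcvbnm,./"
RUSSIAN = "йцукенгшщзхъфывапролджэячсмитьбю."

LAYOUT_MAP = str.maketrans(QWERTY, RUSSIAN)

def remap_layout(text: str) -> str:
    """Convert text typed with an English layout into its Russian equivalent."""
    return text.lower().translate(LAYOUT_MAP)

print(remap_layout("crfxfnm dbhec"))  # -> "скачать вирус"
```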
I have tried text classification on the raw data using a pre-trained BERT model, predicting two classes (positive context and negative context), but this approach did not work.
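For context, this is roughly what the text-classification attempt looked like (the checkpoint name below is only a placeholder; in practice the model was fine-tuned on my two classes first):

```python
from transformers import pipeline

# Placeholder checkpoint; the real model was fine-tuned on the two classes
# (positive context / negative context) before being applied to the logs.
classifier = pipeline(
    "text-classification",
    model="bert-base-multilingual-cased",
)

print(classifier("wasdwasd crfxfnm dbhec wasd"))
# On raw keystroke logs like this, the predictions were not useful.
```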
Additionally, I have explored token classification on the raw data (with the same pre-trained BERT model); many words from the noisy data are being highlighted, so there is still hope that the relevant tokens can be identified accurately.
Can a transformer-based classifier, such as RoBERTa from Hugging Face, effectively tokenize and handle this type of noisy, multilingual input? Is a token-classification task feasible here, i.e., can token classification be run directly on the raw data?
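As a quick check of how such a tokenizer sees the raw input (xlm-roberta-base is just one multilingual example, not necessarily the model I would use):

```python
from transformers import AutoTokenizer

# Inspect how a multilingual RoBERTa-style tokenizer splits the noisy input.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tok.tokenize("wasdwasd crfxfnm dbhec"))
# The subword pieces are valid tokens, but they carry little linguistic
# meaning until the layout is remapped to the original language.
```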
Should the dataset be cleaned to remove these noisy characters, and if so, how? Should a text-classification model then be trained on the cleaned data? In other words, a two-step pipeline: filtration + text classification (a rough sketch of the first step is below).
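This is what I imagine the filtration step could look like (the regular expressions below are hypothetical and would need tuning on real logs):

```python
import re

# Hypothetical noise filters: drop long runs of gaming keys and long runs of
# a single repeated character before passing the text to the classifier.
NOISE_PATTERNS = [
    re.compile(r"\b[wasd]{4,}\b", re.IGNORECASE),  # e.g. "wasdwasd"
    re.compile(r"(.)\1{3,}"),                      # e.g. "hhhhhh"
]

def filter_noise(text: str) -> str:
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(filter_noise("wasdwasd hello hhhhhh world"))  # -> "hello world"
```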
Would it be necessary to train separate models for each language, possibly converting the characters back to their true language before training?