FastText's language_identification() returns multiple predictions per input text, and it does not indicate which predictions belong to which original document.
The number of predictions also differs from document to document. The project's GitHub forums are closed now, but does anyone know how to match the output back to the original texts?
Code:
DF = data.frame(doc_id = seq(1, 5),
                speechtext = c("Hello. Fake text entry 1.", "Fake text entry 2", "more text", "Text in a
different language", "Hola"))
library(fastText)
# download .ftz pretrained model from https://fasttext.cc/docs/en/language-identification.html
file_ftz = system.file("language_identification/lid.176.ftz", package = "fastText")
lang1 = language_identification(DF$speechtext,
                                pre_trained_language_model_path = file_ftz,
                                verbose = TRUE)
I was expecting one prediction per original text, or at least a consistent number per text, or some marker showing which document each prediction belongs to.
I could guess the alignment by taking the largest value within each run of a few output elements, but that doesn't seem reliable, and the behaviour looks like a bug.
(I tried adding intern = T as an argument, as suggested in "R – fasttext how to load output into a dataframe from command line", but it is not recognized as an argument.)
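One thing I have not ruled out: my fourth text spans two lines in the code above, so it contains an embedded newline. If predictions are produced per line of input rather than per element of the vector (an assumption on my part, not something I have confirmed in the package documentation), that alone could explain the extra rows. Below is a minimal base R sketch of how I could check for this and build an index back to doc_id. It does not call fastText at all, and `flatten_text` is a hypothetical helper name of my own.

```r
# Same toy texts as in the question; the fourth contains a line break.
texts <- c("Hello. Fake text entry 1.", "Fake text entry 2", "more text",
           "Text in a\ndifferent language", "Hola")

# Number of lines in each element: one more than its count of newlines.
newline_hits <- regmatches(texts, gregexpr("\n", texts, fixed = TRUE))
lines_per_doc <- lengths(newline_hits) + 1L
print(lines_per_doc)  # 1 1 1 2 1

# If output really is one row per line (assumption), this index would
# map each output row back to the position of its source document.
line_to_doc <- rep(seq_along(texts), times = lines_per_doc)
print(line_to_doc)    # 1 2 3 4 4 5

# Hypothetical helper: collapse line breaks to a single space so that
# every element is exactly one line before it is passed on.
flatten_text <- function(x) gsub("[\r\n]+", " ", x)
clean_texts <- flatten_text(texts)
```

If the assumption holds, passing `clean_texts` instead of the raw column should give one prediction per document. If it does not hold, the counts in `lines_per_doc` will not match the pattern of extra predictions, which would at least rule this explanation out.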