Morning all,
I am trying to run sentiment analysis on a large Swedish Twitter dataset using a custom lexicon built specifically for Swedish research. However, even though both the text I want to analyse and the lexicon are encoded as UTF-8, the syuzhet package completely ignores words containing the Swedish characters å, ä and ö. Is this an inherent issue with the package (i.e. do I need to recode the characters in the data and the lexicon), or am I missing something glaringly obvious?
I tried the following dummy example, and evidently the get_sentiment function does not accept the Swedish characters:
library(syuzhet)
test <- c("kärnvapen", "kärnkraftsolycka", "glad", "körd", "kåk", "ledsen")
SWE_lexicon <- read.delim("sensaldo-base-v02.txt", header=FALSE, encoding = "UTF-8")
colnames(SWE_lexicon)[colnames(SWE_lexicon) == 'V1'] <- 'word'
colnames(SWE_lexicon)[colnames(SWE_lexicon) == 'V2'] <- 'value'
score <- get_sentiment(test, method="custom", lexicon = SWE_lexicon)
head(score)
[1] 0 0 1 0 0 -1
When I recoded the characters (e.g. ae for ä), get_sentiment assigned the correct values to the words.
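For anyone who wants to reproduce the workaround, here is a minimal sketch of that kind of recoding in base R. The helper name recode_swedish and the aa/ae/oe mapping are illustrative assumptions of mine, not part of syuzhet, and the mapping could in principle collide with words that already contain "ae" or "oe".

```r
# Hypothetical helper (not part of syuzhet): replace Swedish letters
# with ASCII digraphs so that text and lexicon entries are plain ASCII.
recode_swedish <- function(x) {
  x <- enc2utf8(as.character(x))
  # Unicode escapes keep the script independent of the source file encoding
  from <- c("\u00e5", "\u00e4", "\u00f6", "\u00c5", "\u00c4", "\u00d6")
  to   <- c("aa",     "ae",     "oe",     "Aa",     "Ae",     "Oe")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Small check on a few of the dummy words
words <- c("k\u00e4rnvapen", "k\u00f6rd", "k\u00e5k", "glad")
print(recode_swedish(words))
```

The same function would need to be applied to both the tweets and the lexicon's word column before calling get_sentiment, otherwise the recoded text would no longer match the lexicon entries.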