I have a data frame with words and I want to extract the letter and bigram composition for each word.
Data:
df$text
[1] "table"
[2] "run"
[3] "mug"
In the end, I want output like this:
1 a b c d e..z aa ab bb...zz
table 1 1 0 0 0..0 0 1 0 0
First, I tried to extract all the letters using quanteda:
library(quanteda)

text <- c("table", "run", "mug")
dict <- dictionary(list(a = "a",
b = "b",
c = "c",
d = "d",
e = "e",
f = "f",
g = "g",
h = "h",
i = "i",
j = "j",
k = "k",
l = "l",
m = "m",
n = "n",
o = "o",
p = "p",
q = "q",
r = "r",
s = "s",
t = "t",
u = "u",
v = "v",
w = "w",
x = "x",
y = "y",
z = "z"))
corp <- corpus(text)
tokens(corp) |>
  tokens_lookup(dictionary = dict) |>
  dfm()
But it did not work; every count is zero:
Document-feature matrix of: 3 documents, 26 features (100.00% sparse) and 0 docvars.
features
docs a b c d e f g h i j
text1 0 0 0 0 0 0 0 0 0 0
text2 0 0 0 0 0 0 0 0 0 0
text3 0 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 16 more features ]
I am totally new to this, so any hint on how to do it would be much appreciated. Thank you!
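A note on why the attempt above returns only zeros: `tokens()` splits text into words by default, so each document holds a single token ("table", "run", "mug"), and a dictionary entry such as `"a"` only matches a token that is exactly "a". One possible direction, offered as a sketch rather than a definitive solution, is to tokenize into characters and then build 1- and 2-grams. The names `bigrams`, `toks`, and `counts` are illustrative, and the sketch assumes lowercase a-z words only.

```r
library(quanteda)

text <- c("table", "run", "mug")

# All columns we want: 26 letters, then the 676 two-letter pairs "aa", "ab", ..., "zz"
bigrams <- as.vector(t(outer(letters, letters, paste0)))

# Split each word into single characters; naming the vector makes the words the row names
toks <- tokens(setNames(text, text), what = "character")

# Build unigrams and bigrams of adjacent characters, joined without a separator
toks <- tokens_ngrams(toks, n = 1:2, concatenator = "")

# Count features, then force the full fixed set of columns (absent ones become 0)
counts <- dfm(toks) |>
  dfm_match(features = c(letters, bigrams))

# Inspect a few columns for one word
as.matrix(counts)["table", c("a", "b", "e", "ab", "zz")]
```

If a regular data frame is needed afterwards, `convert(counts, to = "data.frame")` should give one row per word with a `doc_id` column followed by the 702 feature columns.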