I have a large data.frame
with about 4 million rows and 2 columns.
The two columns contain long character strings: texts representing recipes. For each row, I am comparing the similarity of the recipes in column A and column B, using textSimilarity() from the text package in R.
Performance is very slow. Are there ways of speeding this up? Or am I coding this wrong?
Example data (with much shorter texts):
df <- data.frame(
  columnA = c("tomato sauce is very tasty to use", "without garlic, this dish is not chinese", "British food is as tasteless as it can get"),
  columnB = c("pizza is the source of life", "a nice xiaolongbao is steamed until it is soft", "braised pork can be very healthy if prepared well")
)
> df
                                     columnA                                           columnB
1          tomato sauce is very tasty to use                       pizza is the source of life
2   without garlic, this dish is not chinese    a nice xiaolongbao is steamed until it is soft
3 British food is as tasteless as it can get braised pork can be very healthy if prepared well
To get the similarity, I use:
library(text)
df$sim <- textSimilarity(textEmbed(df$columnA)$texts$texts, textEmbed(df$columnB)$texts$texts)
In the current set-up, this process takes days rather than hours. How can I speed this up? Or are there alternatives?
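One thing I have been wondering: would splitting the embedding into chunks help at all (for memory, or to allow running chunks in parallel), or is the per-text embedding cost simply fixed? A rough sketch of the chunked version I have in mind, where the chunk size of 10,000 is an arbitrary guess:

library(text)

# Split the row indices into chunks (chunk size is a guess)
chunk_size <- 10000
chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))

# Embed and compare each chunk separately, then stitch the scores back together
df$sim <- unlist(lapply(chunks, function(i) {
  emb_a <- textEmbed(df$columnA[i])$texts$texts
  emb_b <- textEmbed(df$columnB[i])$texts$texts
  textSimilarity(emb_a, emb_b)
}))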
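And on alternatives: would a cheaper bag-of-words measure such as TF-IDF cosine be acceptable here instead of transformer embeddings? A sketch of what I mean, using the text2vec package (the tokenizer and weighting choices are just my guess at a standard setup):

library(text2vec)
library(Matrix)

# Build one shared vocabulary over both columns so the vectors are comparable
all_texts <- c(df$columnA, df$columnB)
it <- itoken(all_texts, preprocessor = tolower, tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(create_vocabulary(it))

# TF-IDF-weighted document-term matrix for all texts at once
# (the iterator is re-created because create_vocabulary() exhausted it)
dtm <- create_dtm(itoken(all_texts, preprocessor = tolower, tokenizer = word_tokenizer), vectorizer)
dtm <- fit_transform(dtm, TfIdf$new())

# Split back into the paired halves and take the row-wise cosine similarity
n <- nrow(df)
A <- dtm[seq_len(n), , drop = FALSE]
B <- dtm[n + seq_len(n), , drop = FALSE]
df$sim_tfidf <- rowSums(A * B) / (sqrt(rowSums(A^2)) * sqrt(rowSums(B^2)))

I realise TF-IDF captures word overlap rather than meaning, so it may not be a fair substitute; but if it is acceptable, it should scale to 4 million rows far more easily.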