I have a basic question about the use of LightFM; apologies if this isn’t the right forum.
I’m building a recommender system that will recommend documents to users. There are no interactions yet, and all we know about each user is the set of keywords they’re interested in.
I’ve built a prototype where I transform each document using TF-IDF. I then transform the user’s keywords with the same transformer and use cosine similarity to find the most relevant documents. It works reasonably well.
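For reference, the prototype boils down to something like this (a minimal sketch; the documents and keyword string are placeholders):

```python
# Minimal sketch of the TF-IDF prototype; documents and keywords are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first document text", "second document text"]  # the real corpus
user_keywords = "solar storage"                               # the user’s keywords, joined

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)       # (n_docs, n_terms)
query_vec = vectorizer.transform([user_keywords])      # same vocabulary as the corpus

# Rank documents by cosine similarity to the keyword pseudo-document.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
top_docs = scores.argsort()[::-1]
```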
I’m now porting this to LightFM so that we can include interactions, but first I need the new system to perform at least as well as the TF-IDF solution, and I’m struggling to make that happen. Here’s the current approach:
First I build a LightFM Dataset object on all items in the corpus, using TF-IDF to build the item features. (So each item is represented by a sparse vector of about 3300 elements, where each entry corresponds to a stemmed word and holds that word’s TF-IDF weight.)
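In case it helps, here’s roughly how I build it (a sketch: I’m assuming `get_feature_names_out` from recent scikit-learn, and I pass the TF-IDF weights through the Dataset’s weighted-feature syntax; ids are illustrative):

```python
# Sketch of the Dataset construction; doc_matrix and vectorizer come from
# the TF-IDF step above.
from lightfm.data import Dataset

terms = vectorizer.get_feature_names_out()   # the ~3300 stemmed words
item_ids = [f"doc_{i}" for i in range(doc_matrix.shape[0])]

# item_identity_features=False so items are described purely by their TF-IDF terms.
dataset = Dataset(item_identity_features=False)
dataset.fit(users=["user_0"], items=item_ids, item_features=terms)

def tfidf_weights(i):
    """Return {term: tf-idf weight} for document i."""
    row = doc_matrix.getrow(i)
    return {terms[j]: w for j, w in zip(row.indices, row.data)}

# build_item_features accepts (item id, {feature: weight}) pairs, so the
# TF-IDF weights carry through; normalize=False keeps them as-is.
item_features = dataset.build_item_features(
    ((item_ids[i], tfidf_weights(i)) for i in range(len(item_ids))),
    normalize=False,
)
```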
When a request for recommendations for a new user comes in:
- get that user’s keywords and form a pseudo-document: a single string containing all the keywords
- transform that pseudo-document with the same TF-IDF vectorizer used to build the corpus features
- retrain the LightFM model with a single interaction between the user and the pseudo-document, with item_features formed by concatenating the corpus’s item features and the pseudo-document’s features
- call predict to get the recommendations (see the sketch after this list)
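In code, the per-request flow looks roughly like this (a sketch: I’ve elided the Dataset bookkeeping and pass raw scipy matrices straight to LightFM, which it also accepts; the loss and epoch settings are placeholders):

```python
# Sketch of the per-request retrain-and-predict flow described above.
import numpy as np
import scipy.sparse as sp
from lightfm import LightFM

# TF-IDF features of the keyword pseudo-document, via the corpus vectorizer.
pseudo_vec = vectorizer.transform(["solar storage"])   # user’s keywords, joined

# Concatenate corpus item features with the pseudo-document’s features,
# so the pseudo-document becomes the last "item".
all_features = sp.vstack([doc_matrix, pseudo_vec]).tocsr()
n_items = all_features.shape[0]

# A single interaction: user 0 x the pseudo-document.
interactions = sp.coo_matrix(([1.0], ([0], [n_items - 1])), shape=(1, n_items))

# Retrain from scratch on every request, then score the real documents.
model = LightFM(loss="warp")
model.fit(interactions, item_features=all_features, epochs=30)
scores = model.predict(0, np.arange(n_items - 1), item_features=all_features)
```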
In my unit tests I have 52 documents, each transformed to a TF-IDF vector of about 3300 columns. The user’s pseudo-document (a single keyword in this test) is transformed to a vector with a single 1.0 entry in that keyword’s column.
So I would expect the predictions to score highest those documents whose TF-IDF entry for that keyword is also high. Instead, the scores are all more or less the same, around -0.5.
Am I doing something wrong here?