I am using Cassandra for a data-intensive app. I have relatively little operations and deployment experience, so I am looking for someone who can read the example below and tell me whether I am overlooking simpler solutions, or whether the resources required make this problem expensive or intractable.
- ~A million entries in a `book` table: each entry has ~30 columns (name, array of themes, year, etc.).
- ~1–10 thousand book stores, each carrying some subset of the main table in (1), perhaps referencing its id field. So a `book_store` table for store metadata and a `book_store_inventory` table will be needed.
- ~A million users: a million entries in a `user` table.
A sequential recommender algorithm is designed to rank the best choice out of all possibilities for a user in a given store. First, it can easily score each book in the main book table with a 1 or 0 based on user tastes; it thus "filters out" what it knows the user won't like, and the 1s move on to the scoring round. Second, it takes real-time user data and ranks the remaining books for the store the user visits.
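A minimal sketch of that two-stage flow in plain Python. The function names and the taste model (`passes_filter` as theme overlap, `realtime_score` as session-theme overlap) are hypothetical stand-ins for the actual recommender, just to make the pipeline shape concrete:

```python
# Hypothetical sketch of the two-stage recommender described above.
# passes_filter and realtime_score are assumed stand-ins for real models.

def passes_filter(user_tastes, book):
    """Stage 1: binary 1/0 decision from static user tastes (assumed: theme overlap)."""
    return bool(user_tastes & set(book["themes"]))

def realtime_score(user_session, book):
    """Stage 2: rank survivors using real-time user data (assumed: recent-theme overlap)."""
    return len(set(book["themes"]) & user_session["recent_themes"])

def recommend(user_tastes, user_session, store_inventory):
    # Stage 1 filters; stage 2 ranks only the survivors.
    survivors = [b for b in store_inventory if passes_filter(user_tastes, b)]
    return sorted(survivors, key=lambda b: realtime_score(user_session, b), reverse=True)
```

The point of the split is that stage 1 depends only on slow-changing tastes, so its results can be precomputed and stored, while stage 2 runs at request time over a much smaller candidate set.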
The question is how to apply the first, binary recommendation step to the data.
a) Each of the 10,000 "book stores" has its own inventory subset of the main book list. In the worst case, if every store carried every book (just pretend), that's 10,000 stores × a million books. A batch operation (Spark, perhaps) can pull a single store's inventory to score for a user, and in the application logic each book is checked against a hash table, queried from the user table, for whether it passed the first binary recommender filter.
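The core of option (a) can be sketched in a few lines of plain Python. In practice the inventory would come from a Spark read of the `book_store_inventory` table and the filter results from the `user` table; here both are assumed to be plain in-memory collections, and `rank_fn` is a hypothetical stage-2 scorer:

```python
# Sketch of option (a): batch-score one store's inventory for one user.
# inventory_ids: book ids carried by the store (assumed loaded via Spark).
# user_passed_ids: set of book ids that passed the user's binary filter
#                  (assumed loaded from the user table into a hash set).

def score_store_for_user(inventory_ids, user_passed_ids, rank_fn):
    # O(1) membership test per book against the precomputed filter set
    candidates = [bid for bid in inventory_ids if bid in user_passed_ids]
    return sorted(candidates, key=rank_fn, reverse=True)
```

The per-request cost is one pass over the store's inventory plus a sort of the survivors, which is where the CPU and I/O concern below comes from.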
b) Create a user-store-book table (since a user only has one or two favorite stores) that stores the boolean result of the first recommender round for each book. This means a million users × a million books × 2 stores as entries in this table. The batch job then queries directly for the recommended books in order to rank them.
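A back-of-the-envelope size check for option (b), using the numbers above. The 30 bytes per row is an assumed rough figure for keys, the boolean, and per-row overhead, not a measured Cassandra cost:

```python
# Worst-case row count and rough storage for the user-store-book table.
users, books, stores_per_user = 1_000_000, 1_000_000, 2
rows = users * books * stores_per_user   # 2 trillion rows
bytes_per_row = 30                       # assumption: keys + boolean + overhead
total_tb = rows * bytes_per_row / 10**12
print(rows, total_tb)                    # 2_000_000_000_000 rows, ~60 TB (before replication)
```

Even at a few tens of bytes per row, this lands in the tens of terabytes before replication, which is what makes option (b) look intractable at full scale.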
To put my question more succinctly: I am worried that in solution (a) the CPU resources and extra I/O required would make for a low-performing solution, and that the sheer amount of data in solution (b) might make it intractable.
Can't you regroup the books by genre, or taste groups, to reduce the problem to a smaller scale? It seems to me the major issue is that you are trying to use a large data set that isn't refined enough to provide real-time insight. You should probably try to cluster the books in this situation. If you can't, then you are down to full iteration, and limited to the two solutions you listed.
Edit: I forgot, but clustering the users may make sense too.
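To see why clustering helps, here is the arithmetic. If books are grouped into K taste clusters and the stage-1 filter decides per (user, cluster) instead of per (user, book), the stored filter results shrink from users × books to users × K. K = 200 below is an assumed cluster count, purely for illustration:

```python
# Stored stage-1 results: per-book vs. per-cluster decisions.
users, books, clusters = 1_000_000, 1_000_000, 200  # K = 200 is an assumption
per_book = users * books        # 1e12 booleans to precompute and store
per_cluster = users * clusters  # 2e8 booleans
print(per_book // per_cluster)  # 5000x reduction in stored filter results
```

The trade-off is lost precision: every book in a cluster inherits the same stage-1 verdict, so cluster granularity has to match how sharply tastes actually divide the catalog.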