I’m working on a project where I have a large dataset containing billions of records. Each user can have one or more records, and each record can be associated with multiple users.
Given a specific user, I need to list users who share the highest number of the same records with this user.
Could you suggest an efficient way to achieve this, considering the size of the dataset and the fact that the dataset will continue to grow over time? When retrieving the list for a user, I want the results to be as fresh as possible, avoiding outdated data if possible
Here is a simplified example to illustrate the problem:
- User 1 has records: [A, B, C]
- User 2 has records: [A, B, D]
- User 3 has records: [B, D, E]
- User 4 has records: [A, B, C, D]
If the input user is User 1, the expected output would be a list of users with the counts of matching records, sorted by the highest count:
- User 4: 3 matching records (A, B, C)
- User 2: 2 matching records (A, B)
- User 3: 1 matching record (B)
What technology stack should I use (database, language, etc.) and what algorithm or approach should I follow to solve this problem efficiently?
I would prefer to use JavaScript as it’s my primary language, but I can pick up something like Python if necessary.
Please take into consideration that I am primarily a front-end developer, so detailed explanations or pointers to resources would be very helpful.
Thank you in advance for your help!