Our application requires an activity log of actions that a user has done. For example:
3/19/2013
----------
9:37 AM Chuck Norris scrapped 3 robots.
9:15 AM Chuck Norris fought 3 robots.
Currently this is done with a gigantic query that compiles all of a user’s activity for a given day. The query is so slow that the application forbids you from viewing more than a single day’s activity at a time. These results could be aggressively cached in Redis.
My question is: what is a good Redis key strategy for storing these activities?
I need to be able do a lookup primarily based on the user ID and a date range. I’ve played with the following possibilities, but nothing has really leaped out at me as the optimal solution.
-- pseudo-code, obviously --
HMSET user:5345:activity date "3/19/2013" time "9:37 AM" action "Some action"
LPUSH user:5345:activity (json)
LPUSH user:5345:activity:2013:03:19 (json)
All three of these have significant tradeoffs. Assume a user hasn’t logged in for four months – they might actually want to see their activity from four months ago at the top of the list. The second option gives me that, but there are trade-offs in that direction as well (like not being able to specifically query by date).
Any thoughts on a better way to approach this?
Interesting challenge (!) because dates are sparse.
Here’s a thought: if you can find a way to represent dates as numbers (minutes past the year 2000?), then you could try a Sorted Set. The set would be keyed on the user ID. The score would be the numeric date representation. The value would be the log entry.
Then you can use ZRANGEBYSCORE to get log entries within a date range. Also, it should be easy/efficient to “get the N most recent log entries”.
Example: ZADD user:23432:activity 1363800946 data
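A minimal sketch of that approach, assuming the redis-py client and Unix timestamps as scores (as in the ZADD example above); the key pattern and JSON payload shape are just illustrative, not prescribed:

import json
import time
import redis

r = redis.Redis()

def log_activity(user_id, action, when=None):
    # Score each entry by its Unix timestamp so range queries map to dates.
    when = when or int(time.time())
    entry = json.dumps({"ts": when, "action": action})
    r.zadd(f"user:{user_id}:activity", {entry: when})

def activities_between(user_id, start_ts, end_ts):
    # All entries whose timestamp falls in [start_ts, end_ts], oldest first.
    raw = r.zrangebyscore(f"user:{user_id}:activity", start_ts, end_ts)
    return [json.loads(e) for e in raw]

def recent_activities(user_id, n=10):
    # The N most recent entries, newest first.
    raw = r.zrevrange(f"user:{user_id}:activity", 0, n - 1)
    return [json.loads(e) for e in raw]

One caveat: sorted-set members are unique, so two identical entries with the same timestamp would collapse into one; including a unique ID in the JSON avoids that.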
I’m one of the authors of the Feedly package which was mentioned.
We built a solution for the second-largest online fashion community and open-sourced our approach:
https://github.com/tschellenbach/Feedly
It’s currently the largest open source library aimed at solving this problem.
You can use both Redis and Cassandra as storage backends.
Our Redis storage backend uses sorted sets for data storage.
https://github.com/tschellenbach/Feedly/tree/master/feedly/storage/redis
In addition, we use Redis pub/sub for the realtime components.
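For illustration, a bare-bones version of that pattern with redis-py; the channel naming and helper functions here are my own sketch, not Feedly’s actual API:

import json
import redis

r = redis.Redis()

def publish_activity(user_id, ts, action):
    # Persist the entry in the user's sorted-set feed, then notify listeners.
    payload = json.dumps({"ts": ts, "action": action})
    r.zadd(f"user:{user_id}:activity", {payload: ts})
    r.publish(f"feed:{user_id}", payload)

def follow_feed(user_id):
    # Block on the user's channel and yield new entries as they arrive.
    pubsub = r.pubsub()
    pubsub.subscribe(f"feed:{user_id}")
    for message in pubsub.listen():
        if message["type"] == "message":
            yield json.loads(message["data"])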
The same team that built Feedly also offers a hosted API, which handles the complexity for you. Have a look at getstream.io. There are client libraries for Python, PHP, Node and Ruby. (It is based on Cassandra and therefore substantially cheaper to operate than running your own system on Redis.)
In addition, have a look at this High Scalability post where we explain some of the design decisions involved:
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html
This tutorial will help you set up a system like Pinterest’s feed using Redis. It’s quite easy to get started with.
To learn more about feed design I highly recommend reading some of the articles which we based Feedly on:
- Yahoo Research Paper
- Twitter 2013 Redis based, with fallback
- Cassandra at Instagram
- Etsy feed scaling
- Facebook history
- Django project, with good naming conventions. (But database only)
- http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
- Quora post on best practises
- Quora scaling a social network feed
- Redis ruby example
- FriendFeed approach
- Thoonk setup
- Twitter’s Approach