Hypothetical scenario: let’s say we are downloading JSON from Facebook with details of a user’s friends’ check-ins, posts, etc. These come in as one document per friend per activity, so with 8 activities, a user with 300 friends will cause our system to make 2400 requests to Facebook, downloading 2400 JSON documents.
Let’s say we want to merge these 2400 documents together, sort the activities by date_created descending, and then page through them as a sort of pseudo-newsfeed. (Please do not comment on the wisdom of recreating a Facebook newsfeed in this way.)
Let’s also suppose that we want to re-download all of this data whenever Facebook notifies us that it has changed. (Facebook has an update service you can subscribe to for users of your app.) For argument’s sake, let’s assume all of the data has to be refreshed every 5 minutes, that we want to support 1000 simultaneous users, and that the average JSON document is 25 KB.
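To make the load concrete, here is a back-of-envelope calculation using the figures above (8 activities, 300 friends, 1000 users, 25 KB per document, 5-minute refresh):

```python
# Back-of-envelope load estimate for the scenario above.
FRIENDS = 300
ACTIVITIES = 8
USERS = 1000
DOC_SIZE_KB = 25
REFRESH_SEC = 5 * 60

docs_per_user = FRIENDS * ACTIVITIES            # 2400 documents per user
total_docs = docs_per_user * USERS              # 2,400,000 per refresh cycle
requests_per_sec = total_docs / REFRESH_SEC     # 8000 requests/second
bandwidth_mb_per_sec = total_docs * DOC_SIZE_KB / 1024 / REFRESH_SEC

print(docs_per_user)                    # 2400
print(requests_per_sec)                 # 8000.0
print(round(bandwidth_mb_per_sec, 1))   # 195.3 (MB/s of JSON, sustained)
```

That sustained request rate and bandwidth, not the choice of database, is likely to be the first bottleneck.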
I am curious how NoSQL techniques would be better than parsing the JSON on ingestion into a relational database. To me it seems that map/reduce is just a synonym for parse/aggregate, and that both approaches require the same work to occur. What advantages would I get from using NoSQL?
What advantages would I get from using NoSQL?
NoSQL will scale better as the number of users grows.
Traditional RDBMSs don’t really scale well horizontally; all you can do is throw bigger machines at the problem. They aren’t really suited for distributed systems (e.g., the cloud).
NoSQL is (under given circumstances) better at handling hierarchical structures like documents/JSON.
The key point to understand is that these storage mechanisms are key-value based and thus can retrieve data that is stored together very fast, as opposed to data that is merely related (which is what RDBMSs were built for).
In your case, that would mean you can easily retrieve all records for a certain user very fast. In a traditional relational database you would either have to denormalize your schema for performance, or keep the schema clean but potentially suffer performance penalties from joins or heavy aggregations.
Look at it this way: why is a hash map (key-value store) fast? You can retrieve items from a hash map in almost O(1), as the hash translates (simplified) directly to a memory address. Looking up a binary index, in contrast, costs O(log n).
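A minimal illustration of the difference in Python, with hypothetical user records: a dict lookup is one hash computation, while a sorted-index lookup takes O(log n) comparisons, roughly what walking a B-tree index in an RDBMS involves:

```python
import bisect

# Key-value access: one hash computation, ~O(1).
store = {f"user:{i}": {"id": i} for i in range(100_000)}
record = store["user:54321"]           # direct jump to the bucket

# Index access: binary search over sorted keys, O(log n) comparisons --
# a (simplified) stand-in for a B-tree index lookup in an RDBMS.
keys = sorted(store)
pos = bisect.bisect_left(keys, "user:54321")
assert keys[pos] == "user:54321"
```

Both are fast in absolute terms; the point is that the key-value path does no comparisons at all, and that advantage compounds when every page view triggers many lookups.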
For your case, MongoDB or CouchDB might be a good fit, as your data is already JSON.
In my opinion, using a NoSQL solution here is a good choice. You want to retrieve all the activities of a user as a feed. If the activities are written to your data store appropriately, then NoSQL should, in theory, excel at this, without the need to join anything or worry about proper indexes. @Earlz also mentioned that you get no ACID guarantee from NoSQL databases; this is part of what makes NoSQL fast, and you probably don’t need ACID properties for your application. Give it a try!
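The feed itself, merging the per-friend documents, sorting by date_created descending, and paging, is straightforward in any store. A sketch with made-up documents:

```python
from itertools import chain

# Hypothetical per-friend activity documents as they might come back
# from the API (one list of activities per request).
friend_docs = [
    [{"friend": "alice", "type": "post",    "date_created": "2012-06-03T10:00:00"}],
    [{"friend": "bob",   "type": "checkin", "date_created": "2012-06-04T09:30:00"}],
    [{"friend": "alice", "type": "checkin", "date_created": "2012-06-01T18:15:00"}],
]

# Merge all documents and sort newest-first. ISO-8601 timestamps sort
# correctly as plain strings, so no date parsing is needed here.
feed = sorted(chain.from_iterable(friend_docs),
              key=lambda a: a["date_created"], reverse=True)

def page(feed, number, size=25):
    """Return one page of the pseudo-newsfeed (pages start at 0)."""
    return feed[number * size : (number + 1) * size]

print([a["friend"] for a in page(feed, 0)])  # ['bob', 'alice', 'alice']
```

Whether this sort happens in application code, in a map/reduce view, or in a database query is exactly the design decision the question is about.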
Moreover, there’s a good article from Martin Fowler on the subject. He’s made a nice diagram that I really like:
Go check out his pages to read some deep thoughts about NoSQL.
First of all, a NoSQL database is a database that does not use an SQL interface. What all NoSQL databases have in common is that they do not use an SQL interface. Did I just repeat myself? Yes, but there is nothing else I can say about NoSQL databases as a group. Anything else being said about NoSQL databases around the internet is either wrong for some members of the group, or likely to become so at some point in the future with the release of a new database or a feature upgrade of an existing one.
All this to say that asking if a NoSQL database is a good choice for a particular job is really not an answerable question since different NoSQL databases have hugely different characteristics.
In the scenario that you describe the biggest issue would definitely be that you are pounding Facebook with 8000 HTTP requests per second, but let’s ignore that and focus on the quite common issue of having a large amount of tiny data pieces.
Data handling
All other things being equal, what is the performance difference between fetching an 8-byte string and a 16-byte string from a database? It is insignificant, and barring some obscure counterexample, that is true for any database, SQL or not: the overhead of everything else that goes on in a request dwarfs the time it takes to copy 8 more bytes. If you want to shift data through a database fast, then sorting it into big blocks that fit your use case is one of the most significant things you can do, often far more important than which database software you use.
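One way to apply that here (a sketch, not a prescription): instead of storing 2400 tiny rows per user, pre-merge each user’s feed into one blob keyed by user id, so serving a feed is a single read:

```python
import json

# Hypothetical storage: one pre-merged blob per user instead of 2400 rows.
blob_store = {}

def write_user_feed(user_id, activities):
    # Sort once at write time and store one big value.
    merged = sorted(activities, key=lambda a: a["date_created"], reverse=True)
    blob_store[user_id] = json.dumps(merged)

def read_user_feed(user_id):
    # Serving the feed is one key-value read plus one parse.
    return json.loads(blob_store[user_id])

write_user_feed(42, [
    {"type": "post",    "date_created": "2012-06-02"},
    {"type": "checkin", "date_created": "2012-06-05"},
])
print(read_user_feed(42)[0]["type"])  # checkin
```

The same blocking strategy works whether `blob_store` is a Python dict, a key-value store, or a single-column SQL table; the win comes from the data layout, not the engine.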
Of course there are cases where your use does not fit having big chunks of data. In some cases a caching strategy, where data is kept both in the original split form and in chunks, may work well; in other cases there is not much to do but keep the small pieces separated.
Data manipulation
Databases are slow. That is to say, if you implement a data-manipulation function in an ordinary program, for instance taking a lot of small strings and joining them into one, and you implement similar functionality through a database request, the database version will typically take 100 to 1000 times as long. The exact figure of course depends on the database; some databases will not be able to do it at all, so you’d have to write a program that fetches all the data, performs the operation, and then writes the result back to the database, which is also a pretty slow approach.
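As an illustration of the application-side version of that example (the 100–1000× factor above is the author’s estimate; actual ratios vary by machine and engine): joining many small strings in the application is one tight in-memory pass, with no round trips or query parsing per operation:

```python
# Application-side version of "concatenate many small strings":
# one pass over in-memory data, no network round trips, no query parsing.
pieces = [f"chunk{i};" for i in range(10_000)]
combined = "".join(pieces)   # a single tight loop inside the interpreter

# The database version of the same job would be something along the lines
# of an aggregate query (e.g. string concatenation in SQL), paying for a
# round trip, parsing, and row materialization on every call.
print(combined[:14])  # chunk0;chunk1;
```

This is the concrete form of the advice in the next sentence: do the manipulation before the data reaches the database.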
In general, don’t do on the database what you could reasonably do to the data before it’s written to the database.
What database to choose
Once you have taken all these considerations into account, what requirements for a database are you left with? Did you manage to come up with a structure that doesn’t need any of the fancy/slow features offered by some databases? If you did, then an SQL database may be like a Swiss Army knife with a dull blade: lots of cool features, but not particularly good at what you need it for. Some NoSQL databases are simply faster and better when you only need the simple features; others fit the job just as badly as an SQL database.
The big question
Despite being written last in this post, it is the question you should ask before all the others I mentioned: do you actually need a database?
It is a pretty common assumption that when you handle a significant amount of data, you should use a database. But with a modern computer you can store several gigabytes of data in application memory. This gives you fast and easy access, and good tools for manipulation are right at hand. The one thing it does not give you is persistence: if the program crashes or there is a power loss, the data is lost. In a lot of cases that is perfectly acceptable, however. Your example has data with a life span of ~5 minutes; it doesn’t need persistence, so it doesn’t need a database.
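With a ~5-minute data lifetime, a plain in-process cache with a timestamp check may be all that is needed. A minimal sketch, assuming the 300-second refresh interval from the scenario (the `now` parameter exists only to make the example deterministic):

```python
import time

TTL_SECONDS = 300  # data older than the 5-minute refresh window is stale anyway

_cache = {}  # user_id -> (stored_at, feed)

def put_feed(user_id, feed, now=None):
    """Store a freshly downloaded feed with its timestamp."""
    _cache[user_id] = (now if now is not None else time.time(), feed)

def get_feed(user_id, now=None):
    """Return the cached feed, or None if missing or stale."""
    entry = _cache.get(user_id)
    if entry is None:
        return None
    stored_at, feed = entry
    current = now if now is not None else time.time()
    if current - stored_at > TTL_SECONDS:
        return None  # stale; the caller re-downloads and calls put_feed again
    return feed

put_feed(1, ["activity-a", "activity-b"], now=0)
print(get_feed(1, now=100))   # ['activity-a', 'activity-b']
print(get_feed(1, now=1000))  # None -- expired, so re-download
```

At 1000 users × 2400 documents × 25 KB, the whole working set is around 60 GB, which is large for one machine’s RAM; but per-user, or with the blocking strategy discussed earlier, an in-memory approach is well within reach.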