As a relational database programmer (most of the time), I read articles about how relational databases don’t scale, and NoSQL solutions such as MongoDB do.
As most of the databases I have developed so far have been small to mid scale, I have never had a problem that hasn’t been solved by some indexing, query optimization or schema redesign.
What sort of size would I expect to see MySQL struggling with? How many rows?
(I know this is going to depend on the application and the type of data stored. The one that got me thinking was basically a genetics database, so it would have one main table with 3 or 4 lookup tables. The main table will contain, amongst other things, a chromosome reference and a position coordinate. It will likely get queried for the entries between two positions on a chromosome, to see what is stored there.)
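Schematically, something like this (table and column names invented for illustration):

```sql
-- Hypothetical schema for the genetics use case described above.
CREATE TABLE variants (
    id         BIGINT PRIMARY KEY,
    chromosome VARCHAR(8) NOT NULL,   -- e.g. 'chr7'
    position   BIGINT     NOT NULL,   -- coordinate on the chromosome
    annotation TEXT
);

-- A composite index lets the range query below run as an index scan.
CREATE INDEX idx_variants_chrom_pos ON variants (chromosome, position);

-- "What is stored between two positions on a chromosome?"
SELECT id, position, annotation
FROM variants
WHERE chromosome = 'chr7'
  AND position BETWEEN 55000000 AND 55200000
ORDER BY position;
```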
How big is the data?
There are two significant thresholds:
- the whole data set fits in RAM
- the whole index fits in RAM
With fast SSDs, the first threshold has become a bit less of an issue, unless you have crazy high traffic.
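To see where you stand relative to these thresholds, you can compare data and index sizes against your buffer pool; a sketch for MySQL, using the standard `information_schema` views (the schema name is a placeholder):

```sql
-- Compare on-disk data size vs. index size per table (MySQL).
-- If the total of index_mb exceeds innodb_buffer_pool_size, the
-- second threshold above is already crossed.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.tables
WHERE table_schema = 'your_database'   -- placeholder schema name
ORDER BY data_length DESC;
```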
ACIDity
One of the problems with scaling RDBMSes is that by design they are ACID, which means transactions and row-level locks (or even table-level locks in some older/simpler RDBMSes). This can be a limiting factor if you have a lot of queries modifying a lot of data running at the same time. NoSQL solutions usually go for an eventual consistency model instead.
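To illustrate the locking cost: two concurrent transactions touching the same rows serialize on row locks. A minimal sketch, assuming a hypothetical `accounts` table:

```sql
-- Session A: takes a row lock that concurrent writers must wait on.
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;  -- row lock acquired
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;  -- lock released here; a concurrent UPDATE of id = 42 blocks until now
```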
How do RDBMSes scale with data size?
It’s not entirely true that RDBMSes cannot scale with data size; there are two approaches: vertical partitioning and horizontal partitioning (aka sharding).
Vertical partitioning is basically keeping unrelated tables on separate DB servers, thus keeping the size of each one below the thresholds mentioned above. This makes joining those tables with plain SQL less straightforward and less efficient.
Sharding means distributing the data from one table among various servers based on a specific key. This means that for lookups you know which server to query based on that key. However, it complicates queries that are not lookups on the sharding key. (A sketch of the routing idea follows below.)
With both kinds of partitioning, if you go to extremes, you basically end up in the same situation as with NoSQL databases.
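The flavor of sharding can be seen in miniature in declarative partitioning; a sketch using PostgreSQL 11+ hash partitioning (a single-server analogue of the same routing idea; tools like Citus apply it across physical servers), with made-up table names:

```sql
-- Rows are routed to a partition by a hash of the key, just as a
-- shard key routes rows to a server.
CREATE TABLE measurements (
    device_id   BIGINT NOT NULL,
    recorded_at TIMESTAMPTZ NOT NULL,
    value       DOUBLE PRECISION
) PARTITION BY HASH (device_id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ...and so on for remainders 2 and 3.

-- A lookup on device_id touches one partition; a query without it hits all four.
SELECT * FROM measurements WHERE device_id = 123;
```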
I don’t think the size of the data is the only factor. The “data model” is also a very important part.
E-commerce catalog pages (Solr, ElasticSearch), web analytics data (Riak, Cassandra), stock prices (Redis), and relationship connections in social networks (Neo4J, FleetDB) are just some examples where a NoSQL solution really shines.
IMHO, the data model plays a more important role than the size of the data when considering a NoSQL solution versus an RDBMS.
If relational databases don’t scale, nothing does. Don’t worry about scaling problems.
SQL has problems with some sorts of analysis, but it doesn’t take much data to trigger the problem. For example, consider a single table with a column that references other rows based on a unique key. Typically, this might be used to create a tree structure. You can write fast SQL statements that reference the related row. Or the related row’s related row. In fact you can make any specific number of jumps. But if, for each row, you want to select a field on the first related row in the chain that meets some criterion, then it gets complicated.
Consider a table of office locations at nation, province/state, county, town, and village levels, with each office referencing the office it reports to. There is no guarantee that each office’s reporting office is only one level up. For a selected set of offices, not all on one level, you want to list each one’s associated national office. This requires loops of SQL statements and will take a long time even today. (I used to get 30 seconds on a selection of 30 offices, but that was a long time ago, and switching to stored procedures helped a bit.)
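For illustration, here is roughly what that walk-up-the-tree query looks like as a recursive CTE (supported by PostgreSQL and MySQL 8+; table and column names are made up). Note the engine still performs one join pass per level internally, which is the looping cost described above:

```sql
-- For each selected office, walk up the reporting chain to the national level.
WITH RECURSIVE chain (node_id, office_id, reports_to, level) AS (
    SELECT id, id, reports_to, level
    FROM offices
    WHERE id IN (101, 102, 103)          -- the selected offices
    UNION ALL
    SELECT o.id, c.office_id, o.reports_to, o.level
    FROM offices o
    JOIN chain c ON o.id = c.reports_to  -- one join pass per level
)
SELECT office_id, node_id AS national_office_id
FROM chain
WHERE level = 'nation';
```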
So the alternative is to put the whole structure into one big block of data, label it, and store it. When you want to analyze the data, read all of it into memory at one go, setting up pointers to track the structure, and you can process a couple million offices in the blink of an eye.
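A minimal sketch of that “one big block” idea, staying inside the database by assuming PostgreSQL’s JSONB (all names made up):

```sql
-- Store the whole office tree as one labelled document.
CREATE TABLE org_snapshots (
    label TEXT PRIMARY KEY,
    tree  JSONB NOT NULL
);

INSERT INTO org_snapshots (label, tree) VALUES (
    'offices-2024',
    '{"name": "National HQ", "level": "nation", "reports": [
        {"name": "Province A", "level": "province", "reports": []}
     ]}'
);

-- One read brings the whole structure into application memory,
-- where pointer-chasing replaces per-level SQL round trips.
SELECT tree FROM org_snapshots WHERE label = 'offices-2024';
```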
None of this has much to do with the amount of data. The key is the nature of the data’s organization. If a relational layout helps, then a RDBMS is what you want. If not, some kind of bulk storage is going to be anything from slightly to a quadrillion times faster.
Note that if one of these sets of data becomes too big to fit into memory, your non-SQL database doesn’t work any more. Another problem is when you need data from more than one block at a time; you can do this if, and only if, all the blocks fit in memory at once. And the user has to wait while you load them up.
If your relational database is going to cause you problems, it will do so before you’ve put much data into it. The only scaling problem you might have is with your program, when the block of data you are assembling for a NoSQL DB (if you have to use one) becomes too big for it. (Do read up on out-of-memory errors. The newer languages sometimes do strange things with memory.)
Mongo can be installed on a number of computers/nodes. PostgreSQL does not provide a built-in tool for sharding, but the Citus extension is available.
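With the Citus extension, distributing an ordinary PostgreSQL table across worker nodes looks roughly like this (the table name is made up):

```sql
-- Citus: shard a table across worker nodes by a distribution key.
CREATE EXTENSION citus;
SELECT create_distributed_table('events', 'device_id');
```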
Regarding raw size limits:
- MongoDB supports databases up to 64 terabytes, with a maximum document size of 16 megabytes.
- MySQL has a database size limit of 256 terabytes, a maximum table size of 64 terabytes, and a record limit of 4 gigabytes.
- PostgreSQL has no limit on database size (a 4-terabyte one does exist somewhere for testing); it has a limit of 1 gigabyte for the size of any one field, and again 64 terabytes is the maximum size for a table.
I think the first reason to go to a NoSQL or distributed solution isn’t so much the size of all the data, but the size of the tables. What distributed solutions do well is split tables across different nodes, so that when you need to query a table, each node processes its own piece of it.
RDBMSes can do this, but the new wave of NoSQL databases has been built to do it. Oracle, MSSQL, and MySQL took their centralized model and tweaked it to make it work in a distributed environment. However, they still adhere to strict ACID rules, while some of the new databases relax those rules, for example by using eventual consistency.
There isn’t a set amount of data at which you should choose one over the other. What needs to be taken into account are the needs of the database and the amount of use it receives. NoSQL databases can process larger data sets more quickly, while relational databases give you the confidence that your data is correct via the ACID principles.
It might also be worth mentioning that your data model has a big influence on things. If you find yourself needing to create some form of tree structure (i.e. you have a self-referencing foreign key on a table, perhaps as part of a compound primary key), you should probably look at doing that in some form of database that handles those types of data really well (such as MongoDB or CouchDB).
Like other people have said, you should also take into consideration what is happening in your application. If you really need ACID across multiple tables, then you really do need to stick with an RDBMS. But if you can tolerate slightly stale data and you need the flexibility of a NoSQL schema (call it schemaless if you like, but it still has some form of implicit schema), then you might consider grabbing a NoSQL store. See http://www.10gen.com/customers/craigslist for an example of why Craigslist switched over; admittedly they are archiving ~10TB of data, which I know doesn’t fit your small-to-mid-sized database scale at all, but the use case might be helpful.
Keep in mind that NoSQL systems are not necessarily there to replace RDBMSes. In many instances you can supplement your RDBMS through the idea of polyglot persistence: store most of your data in an RDBMS, and in specific niche instances offload some of it to some form of NoSQL store.