Whenever I’m doing web development and a page takes longer than half a second to generate, I know that somewhere my code is hitting the DB too many times. The normal fix is to ask the DB for all the information at once, by doing JOINs and the like.
My question is: Why do many database queries make a page slow? There must be considerable overhead to each query, but what is it?
EDIT: Alright, let’s take an example (it’s a bit silly and small, but it’ll do)
people table:
| name | football_team_id |
+------+------------------+
| jim | 1 |
| mike | 3 |
| carl | 2 |
football_team table:
| id | color |
+----+-------+
| 1 | red |
| 2 | blue |
| 3 | green |
We all know that this is slow:
SELECT name,football_team_id FROM people;
# start rendering the page, realise we need colors
SELECT color FROM football_team WHERE id=1
# oops, need mike's color
SELECT color FROM football_team WHERE id=3
# oh, and carl's
SELECT color FROM football_team WHERE id=2
This is a bit better:
SELECT name,football_team_id FROM people;
SELECT id,color FROM football_team WHERE id IN (1,3,2)
This is best:
SELECT name,football_team_id,color FROM people JOIN football_team ON people.football_team_id=football_team.id
In each example we’re getting the same amount of data, but the last one is easily the fastest.
You wouldn’t expect the same behaviour if you were reading from a file descriptor, for example.
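To make the three approaches concrete, here is a minimal sketch using Python's built-in sqlite3 module with the example tables above. The helper names (run, count_queries and the three strategy functions) are made up for illustration. An in-memory SQLite database has no network hop, so this only counts how many statements each approach issues; it does not reproduce the timing difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE football_team (id INTEGER PRIMARY KEY, color TEXT);
    CREATE TABLE people (name TEXT, football_team_id INTEGER);
    INSERT INTO football_team VALUES (1, 'red'), (2, 'blue'), (3, 'green');
    INSERT INTO people VALUES ('jim', 1), ('mike', 3), ('carl', 2);
""")

query_count = 0  # number of statements sent to the database


def run(sql, params=()):
    """Execute one statement and count it as one trip to the database."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def one_query_per_person():
    # 1 query for the people, then 1 more per person for the color
    people = run("SELECT name, football_team_id FROM people")
    return sorted(
        (name, run("SELECT color FROM football_team WHERE id = ?", (team_id,))[0][0])
        for name, team_id in people
    )


def two_queries_with_in():
    # 1 query for the people, 1 query for all the colors they need
    people = run("SELECT name, football_team_id FROM people")
    ids = sorted({team_id for _, team_id in people})
    placeholders = ",".join("?" * len(ids))
    colors = dict(run(
        f"SELECT id, color FROM football_team WHERE id IN ({placeholders})", ids))
    return sorted((name, colors[team_id]) for name, team_id in people)


def single_join():
    # 1 query that lets the database combine both tables
    return sorted(run(
        "SELECT name, color FROM people "
        "JOIN football_team ON people.football_team_id = football_team.id"))


def count_queries(strategy):
    """Run a strategy and return (rows, number of statements it issued)."""
    global query_count
    query_count = 0
    rows = strategy()
    return rows, query_count
```

All three return the same rows, but with N people the first approach issues N + 1 statements, the second always issues 2, and the JOIN always issues 1.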
I’ve profiled a number of applications and I have found that:
- Creating a database connection is usually the most expensive operation (roughly 700 to 1500+ ms on many major databases)
- On the database server, most simple queries like the ones listed in your question take very little time to execute (1 to 20 ms, measured on the server)
- A good portion of the time is spent transferring data from the database to the web page (about 100-300ms per simple query).
Armed with this information, if you aren’t currently caching connections then now is a great time to start. You can see that the actual time to execute a query really is negligible. The problem is the time to actually get the data back to your web app.
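As a rough illustration of what caching connections means, here is a toy pool in Python. The ConnectionPool class and handle_request function are hypothetical names for this sketch, and sqlite3 stands in for a real networked database. In practice you would normally use the pooling that your driver or framework already provides rather than rolling your own.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and hands them out for reuse."""

    def __init__(self, connect, size=2):
        self._idle = queue.Queue()
        self.opened = 0  # how many connections were ever created
        for _ in range(size):
            self._idle.put(connect())
            self.opened += 1

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it to the pool instead of closing it


# The expensive connect step happens twice here, at startup, and never again.
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)


def handle_request():
    """Simulates one page request that borrows a pooled connection."""
    with pool.connection() as conn:
        return conn.execute("SELECT 1").fetchone()[0]
```

However many requests call handle_request(), the connection cost is paid only when the pool is created.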
So what’s going on?
You’ll find that most database protocols are very “chatty”. Basically, they send bytes back and forth so that the database and the client know they are both still present, that the client has proper permissions, etc. In some cases there is some overhead when cursors are shared between server and client.
Your database server returns results in chunks, and the driver may have to send acknowledgements to let the server know that each chunk was received properly. The driver then needs to take these chunks and represent them in a way your application can use. All this processing takes time.
All communications have a couple of properties that affect transmission time:
- Latency: the delay between the time a packet is sent and the time it is received.
- Transmission Speed: the number of bits/bytes per second that the wire can support.
The more firewalls, routers, and other infrastructure devices you have between your app and database, the higher the latency. Transmission speed is something we are more familiar with, because we know our servers are connected with 10BASE-T, 100BASE-T, or 1000BASE-T Ethernet (10, 100, and 1000 million bits per second respectively).
With high bandwidth, once the data starts moving it moves very quickly. High latency, however, can make communication with the database much slower than it should be, because many small packets move back and forth between the database and the application, and each one pays the latency cost.
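A back-of-the-envelope model shows why latency dominates when there are many small queries. The function name and the numbers in the usage note are assumptions for illustration, not measurements, and the model ignores protocol chatter and driver processing.

```python
def estimated_fetch_ms(queries, round_trip_ms, server_ms_per_query,
                       payload_bytes, bandwidth_bits_per_s):
    """Rough total time to fetch a payload split across a number of queries."""
    latency_cost = queries * round_trip_ms          # paid once per query
    execution_cost = queries * server_ms_per_query  # paid once per query
    # Transfer time depends only on the total size, not on the query count.
    transfer_cost = payload_bytes * 8 * 1000 / bandwidth_bits_per_s
    return latency_cost + execution_cost + transfer_cost
```

For example, assume 50,000 bytes over a 100 Mbit/s link, a 1 ms round trip, and 1 ms of server time per query. Fetching it with 100 queries gives estimated_fetch_ms(100, 1, 1, 50_000, 100_000_000), about 204 ms, of which only 4 ms is actual data transfer. Fetching it with one query gives about 6 ms, assuming the single query is no more expensive on the server.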
How do you deal with it?
One of the best ways to minimize the cost of dealing with the database is to minimize the number of times you call the database. Additionally, you’ll want to make sure you are only getting the data you actually need to display.
In some cases you can use some intelligent caching so you just don’t have to hit the database at all for some parts of the pages you have to render.
Why do many database queries make a page slow?
Why does a large number of anything make a page slow?
Do something once and it takes “some amount” of time.
Do the same thing a thousand times and yes; it’s going to take [roughly] a thousand times as long. There’s no magic here. Unless you start parallelising and multi-threading your programs, everything’s going to get done “one thing after another”.
Yes; getting a database connection and using it does have an overhead, and things like Connection Pooling serve to diminish the impact. Even so, the more times you go to the database, the longer things are going to take.
Also, watch out for the amount of data you’re pulling back. “select *” seems to be making something of a comeback in the “Newbie” coding communities at the moment. Great if your table has three columns and you want all three of them. Not so good if you only want three columns but your table has “somehow” acquired another twelve of them; all of them massive text fields!
(Remember; you’re not the only user of “your” database).
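To put a number on the “select *” point, here is a small sketch with sqlite3. The biography column and the payload_chars helper are invented for this example, and counting characters is only a rough proxy for bytes on the wire.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, football_team_id INTEGER, biography TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("jim", 1, "x" * 10_000),   # a large text field nobody asked for
        ("mike", 3, "x" * 10_000),
        ("carl", 2, "x" * 10_000),
    ],
)


def payload_chars(sql):
    """Total characters across every value the query returns."""
    return sum(len(str(value)) for row in conn.execute(sql) for value in row)


everything = payload_chars("SELECT * FROM people")
needed = payload_chars("SELECT name, football_team_id FROM people")
```

Here the named-column query returns 14 characters, while “select *” returns 30,014 for the same three rows.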
Generally it’s what we call “on-the-wire overhead”. In a lot of servers, the database is located on a different machine than the server app. (This has scalability benefits, among other things.) That means that any database call has to go over a network connection, and all results have to be pushed back over the network. The cost of that overhead, even if the machine is only sitting a few feet away from the one hosting the server app, can add up very quickly.