A colleague of mine suggested today that we go through all of the queries in our application and add indices accordingly.
I feel this is premature optimisation, because our application is not even released yet. I suggested that we monitor for slow queries once we go live and then add indices accordingly.
What is the general consensus when designing your database: should you add a matching index every time you write a new query, or is it better to just monitor and see how it goes?
14
Premature optimization is “optimizing” something because of a vague, intuitive sense that, y’know, this will probably be slow, especially to the detriment of code readability and maintainability. It doesn’t mean willfully not following well-established good practices regarding performance.
Sometimes that’s a difficult line to draw, but I’d definitely say that not adding any indices before you go live is too-late optimization: it will punish early adopters (your most eager and most important users) and give them a negative view of your product, which they will then spread around in reviews, discussions, etc. Monitoring queries to find pain points that need indexing is a good idea, but I’d make sure to do that no later than the beta.
7
monitor for slow queries once we go live
because nothing says quality like making your users suffer for a lack of design!
You should know which queries need indexes when you design the tables: you already know which columns are being queried on in WHERE clauses and joins. Those columns should be indexed from the start, because what is not apparent in a live environment now may quickly become apparent as the load or the amount of stored data increases. What you don’t want to be doing when that happens is slapping an index on every ‘slow’ query; you’ll end up with an index on everything.
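To make that concrete, here is a minimal sketch using SQLite through Python’s sqlite3 module. The schema and index names are invented for illustration: the foreign-key column used in the join gets an index at design time, and EXPLAIN QUERY PLAN confirms the join actually uses it.

```python
import sqlite3

# Hypothetical schema: orders joined to customers through a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        placed_at TEXT
    );
    -- Index the column the join condition uses, decided at design time
    -- rather than discovered in production.
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
""")

# The join can now use the index instead of scanning all of orders.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT c.name, o.placed_at
    FROM customers c JOIN orders o ON o.customer_id = c.id
    WHERE c.id = ?
""", (1,)).fetchall()
print([row[3] for row in plan])
```

The plan detail strings should mention a SEARCH of `orders` using `idx_orders_customer_id` rather than a full scan.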
8
“Premature optimization”, in its derogatory sense, means costly optimization that might not be needed. It doesn’t mean all optimization implemented before the latest possible point to prevent bankruptcy!
In particular, it’s legitimate to optimize based on performance tests before going live, to ensure you can meet some sensible (albeit approximate) requirements for your app to not completely suck.
At an absolute minimum, you should load up your database with a plausible amount of test data and check the responsiveness of your app. This isn’t premature, since you know it’s going to happen, and it will catch any queries that trigger absurdly slow scans. As A E says in a comment:
Use indexes to avoid a full table scan for any query which the end-user will commonly be doing in real-time
At least, for tables that are planned to grow in use.
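The scan-versus-index difference is easy to see directly. Here is a sketch with SQLite through Python’s sqlite3 module (the table and column names are invented): EXPLAIN QUERY PLAN reports a full scan before the index exists and an index search afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)")

def plan_for(query):
    # Return the plan detail strings SQLite reports for a query.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

query = "SELECT * FROM events WHERE user_id = 42"
before = plan_for(query)  # reports a full scan of events
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
after = plan_for(query)   # reports a search using the new index
print(before, after)
```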
Then, as a shortcut: if you have significant experience with the database engine and you’ve already planned the tests when writing the first cut of the code, you’ll often know without even running it that the query you’re writing will be too slow without an index. Of course, you’re free to pretend you don’t know and watch the test fail before adding the index to make it pass, but there’s no reason for code known to be faulty (because it’s unresponsive) to go live.
I feel this is premature optimisation because our application is not even released yet. I suggested to monitor for slow queries once we go live and then add indices accordingly.
You can’t treat your end-users and production environment like quality assurance. In other words, you’re saying that you’ll figure it out in production. I don’t think that’s the right way, and I see that approach go horribly wrong every day.
You need to keep one thing in mind, as you can’t paint this with a broad brush.
What is your common workload?
That might sound obvious or dull, but it’s significant in practice. If you have 10 queries that make up 98% of your workload (quite common, believe it or not), my recommendation would be a hard analysis before production. With realistic and representative data, make sure those 10 queries are as good as they can possibly be (perfect is a waste of valuable time, and almost never achievable).
For the other 200 queries that make up the remaining 2% of the workload: those most likely aren’t worth a ton of effort, and they will make up the corner-case performance-troubleshooting oddities in production. That’s also a reality, and not a terribly bad thing. But that doesn’t mean you should ignore indexing best practices or guess at how data will be retrieved.
It is common and good practice to figure out database performance prior to production. In fact, there is a relatively common position for this type of thing called a development DBA.
But…
Some take that too far and go crazy adding indexes “just in case”. Somebody reports a missing index? Add it, plus four other variations. Also a bad idea. You need to think not only about data retrieval but also about data modification: generally speaking, the more indexes you have on a table, the more overhead you incur when you modify data.
Like most things, there is a healthy balance.
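That modification overhead is easy to demonstrate. Below is a sketch with SQLite through Python’s sqlite3 module; the five extra indexes are deliberately gratuitous, and the point is simply that every index adds work to each INSERT.

```python
import sqlite3
import time

def bulk_insert_seconds(index_count):
    # Time the same 100k-row bulk insert, with a varying number of
    # single-column indexes on the table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER, e INTEGER)")
    for col in "abcde"[:index_count]:
        conn.execute(f"CREATE INDEX idx_{col} ON t ({col})")
    rows = [(i, i, i, i, i) for i in range(100_000)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return time.perf_counter() - start

lean = bulk_insert_seconds(0)
heavy = bulk_insert_seconds(5)
print(f"no indexes: {lean:.3f}s, five indexes: {heavy:.3f}s")
```

The insert with five indexes should take noticeably longer than the one with none, which is the balance the answer above is pointing at.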
As a fun little side note… The pluralization of “Index”
“Indices” are for financial people
“Indexes” are for us
6
No, it is not premature optimization, but it must be done correctly as any optimization should be.
Here is what I would do:
- Load the database with enough test data to mimic a production load. You cannot get this 100% accurate but that is fine: just put enough data in. Does one table have a fixed amount of data? Load it up. Do you have one table that holds a lot of data, e.g. whatever table holds questions on this site? Load a few million records even if just dummy data.
- Turn on profiling in your database server.
- Bang away at the application using a combination of automated scripts (provides volume) and real users (they know how to break things).
- Review the profiling data. Are specific queries slow? Check the explain plans and see if the database server is telling you it wants an index but it does not exist.
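The steps above can be sketched end-to-end with SQLite standing in for the real database server (table name and sizes are arbitrary): load a plausible volume of rows, time a hot query, add the index, and measure again.

```python
import sqlite3
import time

# Load a table with enough dummy data to mimic a production volume,
# e.g. the table that holds questions on a site like this one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO questions (author_id, title) VALUES (?, ?)",
    ((i % 10_000, f"question {i}") for i in range(500_000)),
)
conn.commit()

query = "SELECT COUNT(*) FROM questions WHERE author_id = ?"

# Measure before: this forces a scan over all 500k rows.
start = time.perf_counter()
conn.execute(query, (1234,)).fetchone()
unindexed = time.perf_counter() - start

conn.execute("CREATE INDEX idx_questions_author_id ON questions (author_id)")

# Measure after: the same query now searches the index.
start = time.perf_counter()
conn.execute(query, (1234,)).fetchone()
indexed = time.perf_counter() - start

print(f"unindexed: {unindexed:.4f}s, indexed: {indexed:.4f}s")
```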
Database servers are complex and intelligent pieces of software. They can tell you how to optimize them if you know how to listen.
The keys are to measure performance before and after optimizing, and to let the database tell you what it needs.
Following proven patterns for known problems (like finding a record by its ID) isn’t premature anything. It’s just sensible.
That said, indexes aren’t always a straightforward business. It’s often tough to know during the design phase which indexes your traffic will depend on and which will bottleneck write operations. So I’d argue for applying some “obvious” schema-design best practices (use PKs appropriate for the designed read/write patterns, and index FKs), but don’t put an index on anything else until your stress testing demands it.
6
When your application is released, it is too late.
But any proper development process should include performance testing.
Use the results of your performance tests to decide which indexes to add, and verify their effectiveness by repeating the performance tests.
4
Although I don’t think every query should be optimized, indexes are so much a part of an RDBMS that they need to be given consideration before release. When you execute a query, unlike in other forms of programming, you’re not telling the system how to execute it. The server develops its own plan, and it almost always bases that plan on the availability of an index. The makeup and volume of the data will be considered as well.
Here are some things I would consider:
- There are some queries that you should identify in your early development that you just know will be used frequently. Focus on them.
- There will be slow queries. By indexing them first, you can then determine if the performance still isn’t fast enough and then consider a redesign (Denormalizing may be premature). I’d rather do this before a release. Nobody wants a system where it takes 10 minutes to find something in the inventory.
Indexes may improve query performance, but they can also hinder data modification.
- Many systems have tools to analyze your queries, so don’t be afraid to use them.
After your initial review, you should follow up with some consideration of when you should review this again and how you’re going to collect the information needed to do so (monitor usage, get copies of client data, etc.).
I realize you don’t want to prematurely optimize, but it’s almost certain you will have poor performance without indexing your database. By getting this out of the way, you can determine if there are other areas causing performance issues.
It also depends on how many users you expect. You should definitely do some load testing and make sure your database can keep up with 10s to 100s to 1000s of simultaneous requests. Again, it depends on how much traffic you expect, and what areas you expect to be used more than others.
In general, I would fine-tune the areas I expect users to hit the most first. Then I’d fine-tune anything that’s slow from the user-experience standpoint. Whenever a user has to wait for something, they get a bad experience and might be turned off. Not good!
It is a good practice to identify which columns definitely need an index through some upfront analysis. There is a real risk of gradual or unexpected performance degradation in production as database size increases if you have absolutely no indices. The situation you want to avoid is where a commonly run query requires scanning a large number of table rows. It is not premature optimization to add indices to critical columns, since you have much of the necessary information available and the potential performance differences are significant (orders of magnitude). There are also situations where the benefit of indices is less clear or more dependent on the data; you can probably defer the decision in those cases.
Some questions you need to ask are:
- What are design limits for the size of each table going to be?
If tables are always going to be small (say < 100 rows), it isn’t a disaster if the database has to scan the entire table. It may be beneficial to add an index, but this requires a little more expertise or measurement to determine.
- How often will each query be run, and what is the required response time?
If the query is run infrequently and doesn’t have strict response time requirements (e.g. report generation) and the number of rows is not huge, then it is probably fairly safe to defer adding indices. Again, expertise or measurement can help tell if it’s going to be beneficial.
- Does the query require looking up the table by something aside from the primary key? E.g. filtering by date range, joining on a foreign key?
If these queries are run frequently and touch tables with many rows, then you should seriously consider preemptively adding an index. If you’re not sure whether this is the case for a query, you can populate the database with a realistic amount of data, then look at the query plan.
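For instance, the date-range case can be checked exactly that way. Here is a sketch with SQLite through Python’s sqlite3 module (the invoices table and its columns are invented): populate the table with a realistic number of rows, add the index, and read the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, issued_on TEXT, total REAL)")
# Populate with 50k rows of dummy data spread across twelve months.
conn.executemany(
    "INSERT INTO invoices (issued_on, total) VALUES (?, ?)",
    ((f"2023-{1 + i % 12:02d}-01", i * 0.5) for i in range(50_000)),
)
conn.execute("CREATE INDEX idx_invoices_issued_on ON invoices (issued_on)")

# The plan should report a search over the index for the date range,
# not a scan of all 50,000 rows.
plan = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM invoices WHERE issued_on BETWEEN ? AND ?",
    ("2023-03-01", "2023-04-01"),
)]
print(plan)
```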