Over the last 30 days I have been developing a Python application that uses a MySQL database of Norwegian address data to perform address validation and correction. The database contains approximately 2.1 million rows (43 columns) and occupies 640 MB of disk space.
I’m thinking about speed optimizations, and I have to assume that when validating 10,000+ addresses, with each validation running up to 20 queries against the database, networking is a speed bottleneck.
I haven’t done any measuring or timing yet, and I’m sure there are simpler ways of speed optimizing the application at the moment, but I just want to get the experts’ opinions on how realistic it is to load this amount of data into a row-of-rows structure in Python. Also, would it even be any faster? Surely MySQL is optimized for looking up records among vast amounts of data, so how much help would it even be to remove the networking step? Can you imagine any other viable methods of removing the networking step?
The location of the MySQL server will vary, as the application might well be run from a laptop at home or at the office, where the server would be local.
I would avoid loading all 2.1 million rows into memory (at least, it would not be my first choice). Even if you can do it with one query, and even if you get around the memory issues, it will still take a while to get all that data to the client. Lookups in that data will probably be faster than a lot of SQL queries over the network, but whether that really pays off compared to the initial loading is questionable (of course, you will have to measure to be sure).
If I understand you correctly, you don’t need all of that data to perform the validation – you only need the data for the ~10,000 addresses, plus a number of records from the database of comparable size. What I would try first is to reduce the number of queries needed to get exactly that data. Perhaps you don’t need 20 queries per address and can reduce it to 10? That would be a start. Perhaps you can get the data for multiple addresses in one query instead of many, using the IN or OR operators in the WHERE clauses of your statements (see the sketch below)? That would probably help, too, since for small amounts of data the number of queries is often the key factor, not whether the result set is 10, 100 or 1,000 bytes.
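To make the batching idea concrete, here is a minimal sketch. The `addresses` table and its `street_name`, `post_code`, `min_number` and `max_number` columns are hypothetical, as is the choice of the mysql-connector driver; adapt the names and connection details to your actual schema:

```python
import mysql.connector

# Hypothetical connection details and schema; adjust to your setup.
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="addresses_db")
cursor = conn.cursor()

street_names = ["Storgata", "Kirkegata", "Strandveien"]

# One round trip for many streets instead of one query per street.
placeholders = ", ".join(["%s"] * len(street_names))
query = ("SELECT street_name, post_code, min_number, max_number "
         "FROM addresses WHERE street_name IN (%s)" % placeholders)
cursor.execute(query, street_names)

# Build a lookup keyed by (street, postcode) for the validation step.
lookup = {}
for street_name, post_code, min_number, max_number in cursor.fetchall():
    lookup[(street_name, post_code)] = (min_number, max_number)

cursor.close()
conn.close()
```

The point is simply that one batched query replaces many small ones; the network latency is paid once per batch instead of once per address.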
If you want more specific advice, you have to provide more details about the data model and the queries you are using.
EDIT: assuming your address database is frozen for several months, that would justify using a local copy of the whole database as a cache. Something like SQLite would be the logical choice here: a very fast, server-less, in-process (and optionally in-memory) database, where all the data can be stored in one file. You can still use SQL for all the queries you need, and the transition from MySQL tables to SQLite tables should be straightforward (as long as you have not used anything too MySQL-specific in your current database so far).
Consider that if you try to keep the local copy of your db tables completely in a Python row-of-rows structure, then depending on the complexity of the data model you would probably end up building your own “poor man’s” in-memory database. So why reinvent the wheel?
Information about the SQLite Python interface: http://docs.python.org/2/library/sqlite3.html. Another advantage: if the address data is updated on the server, this approach makes it relatively easy to create the database copy once on the server and send the data as a single file over the network to the client.
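As a rough sketch of what the local cache could look like (the `addresses` table and its columns are assumptions, not your actual schema; in practice the table would be filled from a dump of the MySQL data rather than created empty):

```python
import sqlite3

# A local file-based cache (use ":memory:" to keep it entirely in RAM).
conn = sqlite3.connect("addresses.sqlite")
cursor = conn.cursor()

# One-off setup step: in the real application this table would be
# populated from an export of the MySQL data.
cursor.execute("CREATE TABLE IF NOT EXISTS addresses ("
               "street_name TEXT, post_code TEXT, "
               "min_number INTEGER, max_number INTEGER)")
cursor.execute("CREATE INDEX IF NOT EXISTS idx_street "
               "ON addresses (street_name, post_code)")
conn.commit()

# Lookups then use the same kind of SQL as before, but without any
# network round trip.
cursor.execute("SELECT min_number, max_number FROM addresses "
               "WHERE street_name = ? AND post_code = ?",
               ("Storgata", "0155"))
row = cursor.fetchone()

conn.close()
```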
Doc Brown’s answer and Mason Wheeler’s comment are probably the best, most practical ways to go here. That said, it’s an interesting conceptual question, so I’m inclined to comment.
Also, would it even be any faster? Surely MySQL is optimized for looking up records among vast amounts of data, so how much help would it even be to remove the networking step?
We’re essentially talking about two things: having the data client-side, and having “custom code” to do the lookups in your data set. The advantages I see for this arrangement:
- Decreased round-trip time; with performance you’d have to measure to be sure of anything, but if it turns out that your scenario is dominated by waiting for 200,000 queries to go over the network (and back), having the data on the client would definitely help. You’d also be offloading work to the client, which might be useful in a scenario where you’re expecting many clients (say thousands or more) at the same time.
- Deep knowledge about the data and fine-tuning your data structures; MySQL’s engine is certainly very optimised, but it doesn’t really know what data it is storing (it doesn’t care whether it’s addresses, phone numbers or patient records). You can take advantage of that and come up with clever ways to encode the information for quick lookup. E.g. if we know that most Norwegian streets start at some house number and then run mostly contiguously up to the last number in that street, you can create a pretty efficient lookup mechanism: is the house number between the min and max for that street, and not on the exception list? Then it’s a valid address for that street (a rough sketch follows this list). Admittedly, this sort of thing could be done on the DB side as well; the main difference is the ability to choose and tune the data structures yourself (e.g. having a custom hash algorithm, etc.).
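A rough sketch of the min/max-plus-exceptions idea. The layout (street name, postcode, number ranges, exception sets) is purely illustrative and not taken from the actual database:

```python
# Hypothetical in-memory index: one entry per (street, postcode) pair,
# storing the valid house-number range and any gaps in it.
street_index = {
    ("Storgata", "0155"): {"min": 1, "max": 120, "exceptions": {13, 99}},
    ("Kirkegata", "4611"): {"min": 2, "max": 48, "exceptions": set()},
}

def is_valid_house_number(street_name, post_code, number):
    entry = street_index.get((street_name, post_code))
    if entry is None:
        return False  # unknown street/postcode combination
    if number in entry["exceptions"]:
        return False  # gap in an otherwise contiguous range
    return entry["min"] <= number <= entry["max"]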
The second point could probably be used to come up with a smaller representation of your DB. I’m not sure how normalised your data is at the moment, but address DBs generally store much of the address as text, which is then duplicated many times over. This is not a very efficient way of storing things, and if it is true of your DB, it could probably be represented in a much more compact file (a quick test might be to ZIP or RAR the DB to see how much duplication there is, although the distribution of the data might mean these algorithms don’t help as much as they could). Aside – if street numbers are mixed with street names as text, I’d guess a lot could be done to optimise the DB for indexing.
TL;DR: Yes, theoretically, you could get some benefits out of moving the data to the client side. Practically, however, you’re probably far better off just batching your queries, or using any of the other great ideas already posted.
You will run into trouble if you try to use Python lists for this kind of job. You will have a big memory overhead, which is likely to prevent you from doing anything useful with your data.
You should look at arrays (http://docs.python.org/2/library/array.html) or, if you can, NumPy (http://numpy.scipy.org/).
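A quick sketch of the memory point, using an arbitrary column of integers for illustration: a plain list keeps one pointer per element plus a separate Python int object for each value, while a typed array stores the values in one contiguous block.

```python
import sys
from array import array

values = list(range(2100000))   # list of Python int objects
typed = array("i", values)      # contiguous block of C ints, 4 bytes each

# getsizeof reports the container itself: the list stores one pointer per
# element (and each referenced int object costs extra memory on top),
# while the typed array stores roughly 4 bytes per element in total.
list_size = sys.getsizeof(values)
array_size = sys.getsizeof(typed)
```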
If you still can’t load all your data into memory, you can try processing it in chunks to limit the number of queries.
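One possible shape for that chunked processing; the batch size and the idea of issuing one batched query per chunk (as in the IN (...) sketch earlier) are assumptions for illustration:

```python
def chunks(items, size):
    """Yield successive slices of `items` of length `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical input; in the real application this would be the 10,000+
# addresses to validate, and each batch would trigger one batched query
# plus one validation pass.
addresses = ["Storgata 1", "Kirkegata 5", "Strandveien 12"] * 2000

batches = list(chunks(addresses, 500))
# len(batches) == 12 here: a dozen round trips instead of thousands.
```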