I’m not sure how to approach this problem.
I need a large chunk of data records from the SQL server. Which records end up in that chunk depends on runtime variables, so I don’t know in advance which records I’ll need.
I then need to run a long series of calculations, and each calculation requires one or more records from this chunk of data. Again: I do not know in advance which records will be required.
Should I:
A. Load this data into the application memory all at once
- This creates a single connection to the DB, loads ALL the required data with one query (and a forward-only DataReader), and then doesn’t touch the SQL server again.
- The initial data fetch is slow, since it reads hundreds of thousands of rows into memory.
B. Whenever the calculation needs data, retrieve it from the database
- This would open and close a connection to the SQL DB many times per second.
- The initial data fetch drops to a few milliseconds, but it puts a massive load on the SQL server during the calculation.
Firstly, if you can turn your whole calculation into an SQL query (or a series of queries, or a stored procedure), then do so. Databases are good at this stuff, and you or a DBA may be able to do a lot to improve the query if it’s still too slow.
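To make the first point concrete, here is a minimal sketch using Python and `sqlite3` as a self-contained stand-in for the real SQL Server setup; the `measurements` table and its columns are hypothetical, purely for illustration. The idea is to let the database aggregate instead of shipping every row to the application:

```python
import sqlite3

# Hypothetical schema for illustration: per-item measurements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (item_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i % 10, float(i)) for i in range(1000)],
)

# Instead of fetching every row and summing in application code,
# push the aggregation into the database:
rows = conn.execute(
    "SELECT item_id, SUM(value), AVG(value) "
    "FROM measurements GROUP BY item_id ORDER BY item_id"
).fetchall()
conn.close()
```

The same principle applies on SQL Server with T-SQL or a stored procedure: one round trip returning ten summary rows instead of a thousand raw ones.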
If not:
- Use a connection pool. Not doing so is usually crazy, unless you’re writing a script that only connects once or twice.
- If you’re testing this in a development environment with a local DB, beware that there can be a big difference in performance characteristics compared to a production one and don’t over-optimize based on what you measure. Network delays in particular could catch you out. Fetching one row at a time may be fine with a low network latency, and awful with a high one.
- Database sizes are usually bigger in production, and they grow over time. If you fetch all the data in advance you could get caught out and run out of memory (unless you know more about your data than we do…).
- As Pieter B suggests, you’re probably better off fetching data in batches if you really need a large number of rows. That way you’ll neither blow everything else out of your server’s memory, nor pay network latency and query overhead on every row. It’ll also help if you want to report progress to the user.
- If you’re really serious about making it go as fast as possible and not using SQL to do it, then you could try parallelizing your code. Then you can be calculating with one set of data whilst fetching the next, and if your production DB has multiple cores and disks you can parallelize in the DB, too. You could also look at caching, if that’s appropriate (memcached and similar, or directly in your server if you know your data sizes well enough).
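On the connection-pool point: production drivers (ADO.NET, JDBC, etc.) pool for you, so this is only a sketch of the idea, again in Python with `sqlite3` as a stand-in. The `ConnectionPool` class and its methods are hypothetical names, not a real library API:

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal pool sketch: open N connections once, hand them out on demand."""

    def __init__(self, size, factory):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()      # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(
    3, lambda: sqlite3.connect(":memory:", check_same_thread=False)
)
conn = pool.acquire()
try:
    result = conn.execute("SELECT 1 + 1").fetchone()[0]
finally:
    pool.release(conn)               # connection is reused, not closed
```

The point is that "open a connection" in option B becomes "borrow an already-open connection", which costs almost nothing.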
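The batching suggestion can be sketched with keyset pagination: resume each batch after the last primary key seen, so every batch is one cheap indexed query. This is an illustrative sketch (Python with `sqlite3` standing in for SQL Server; the `records` table and batch size are assumptions):

```python
import sqlite3

BATCH_SIZE = 100  # tune to your row size and memory budget

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload REAL)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1, 1001)])

def fetch_in_batches(conn, batch_size=BATCH_SIZE):
    """Keyset pagination: continue after the last seen primary key."""
    last_id = 0
    while True:
        batch = conn.execute(
            "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            break
        yield batch
        last_id = batch[-1][0]

total = 0.0
batches = 0
for batch in fetch_in_batches(conn):
    batches += 1
    total += sum(p for _, p in batch)   # stand-in for the real calculation
conn.close()
```

Between batches you also have a natural place to report progress to the user.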
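And the overlap of fetching and calculating can be sketched as a producer/consumer pair: a background thread fetches the next batch while the main thread computes on the current one. Again Python/`sqlite3` as a stand-in, with hypothetical table and batch names:

```python
import queue
import sqlite3
import threading

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload REAL)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 501)])

batch_queue = queue.Queue(maxsize=2)   # small buffer: fetch stays just ahead
SENTINEL = None

def producer(batch_size=50):
    """Fetch batches (keyset pagination) and hand them to the consumer."""
    last_id = 0
    while True:
        batch = conn.execute(
            "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not batch:
            batch_queue.put(SENTINEL)
            return
        batch_queue.put(batch)          # blocks if the consumer falls behind
        last_id = batch[-1][0]

threading.Thread(target=producer, daemon=True).start()

total = 0.0
while (batch := batch_queue.get()) is not SENTINEL:
    total += sum(p for _, p in batch)   # compute while the next batch loads
```

The bounded queue keeps memory flat: the fetcher never runs more than two batches ahead of the calculation.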