I am scanning a directory of files and want to add them to a database. I have two variants:
files = []
for file in walk(basedir):
    files.append(file)

for file in files:
    add_to_database(file)
versus
for file in walk(basedir):
    add_to_database(file)
The former has the advantage that if both the database and the scanned directory reside on the same physical disk, jumping back and forth between two locations on the disk (assuming spinning disks) is avoided and thus should be faster, at the cost of additional memory consumption.
The latter is much shorter. I am leaning towards it, with a note that the database should not reside on the same disk as the scanned directory.
Any thoughts on this?
Premature optimization is the root of all evil. — Donald Knuth
Any decent operating system is going to solve the jumping-around problem for you, using memory caches of disk blocks. Solving that problem by accumulating an unnecessary list of filenames creates a new problem of memory consumption – how well will the first version work with 50,000,000 files? Forget the note, unless your performance tests actually demonstrate a significant impact.
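If you do want numbers, a rough measurement settles it faster than speculation. A minimal sketch, treating walk(), add_to_database() and basedir as the placeholders from the question:

import time

def timed(label, fn):
    # crude wall-clock timing; good enough to spot an order-of-magnitude difference
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

def accumulate_then_insert():
    files = []
    for file in walk(basedir):
        files.append(file)
    for file in files:
        add_to_database(file)

def interleaved():
    for file in walk(basedir):
        add_to_database(file)

timed("accumulate then insert", accumulate_then_insert)
timed("interleaved", interleaved)

Run the variants on a cold cache, or at least in both orders, since the second run will otherwise benefit from the block cache mentioned above.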
The core concern here should probably not be disk costs, but correctness and intent.
If you intend to do more than just insert the files, you may want to prefer the accumulate-then-insert solution, as it gives you a consistent list of files to do further work with. It also lets you use the bulk-insertion functionality of your database instead of inserting one file at a time, and it limits the duration of transactions and locks, since the insertion phase is separated from the gather phase.
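As a rough sketch of what that could look like with SQLite (the files table and database name are assumptions for illustration, and walk() is the question's placeholder for the directory scan):

import sqlite3

# Gather phase: build the full list before touching the database.
files = [(path,) for path in walk(basedir)]

conn = sqlite3.connect("files.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT)")
with conn:  # one short transaction around the whole bulk insert
    conn.executemany("INSERT INTO files (path) VALUES (?)", files)
conn.close()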
If, on the other hand, you interleave the calls, the database starts working earlier. A highly efficient version of that approach is to run two tasks, one generating a sequence of entries for the other to consume as it goes, since slack in the generation can be absorbed by buffering in the sequence.
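A minimal threaded sketch of that producer/consumer split, again treating walk() and add_to_database() as the question's placeholders:

import queue
import threading

entries = queue.Queue(maxsize=1000)  # bounded buffer that absorbs slack
DONE = object()                      # sentinel marking the end of the scan

def producer():
    # scans the directory and feeds paths into the queue
    for file in walk(basedir):
        entries.put(file)
    entries.put(DONE)

def consumer():
    # inserts paths as they arrive, until the sentinel shows up
    while True:
        file = entries.get()
        if file is DONE:
            break
        add_to_database(file)

scanner = threading.Thread(target=producer)
scanner.start()
consumer()
scanner.join()

The bounded queue keeps memory use flat even for tens of millions of files, while still letting the database start working as soon as the first entries arrive.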
In the end, just misquote Knuth and do what works. It’s probably not a bottleneck.