Assuming there is some bit of code that reads files for multiple consumers, and the files are of any arbitrary size: At what size does it become more efficient to read the file asynchronously? Or to put it another way, how small must a file be for it to be faster just to read it synchronously?
I’ve noticed (and perhaps I’m incorrect) that reading very small files asynchronously takes longer than reading them synchronously (in .NET in particular). I assume this has to do with the setup cost of things like I/O completion ports, threads, etc.
Is there any rule of thumb to help out here? Or is it dependent on the system and the environment?
Unfortunately, the answer is, “it depends.” It would be easy for you to write a small program to empirically determine the times of both async and sync reads.
It will depend on lots of factors. Are the files stored on spinning disks, SSDs, or a network drive? What kind of CPU are you using? How many sockets/cores? Are you running in a VM or on bare metal? Are you running an ancient OS or a modern one?
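In the spirit of "write a small program and measure": here is a minimal benchmark sketch. Python is used purely for illustration (the same idea applies to .NET's `File.ReadAllBytes` vs. `File.ReadAllBytesAsync`); the file sizes and repeat count are arbitrary choices, and Python's async read is offloaded to a thread, which also pays the kind of setup cost the question describes.

```python
import asyncio
import os
import tempfile
import time

def time_sync(path, repeats=50):
    # Average wall-clock time of a plain blocking read.
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "rb") as f:
            f.read()
    return (time.perf_counter() - start) / repeats

async def time_async(path, repeats=50):
    # Python has no native async file API, so offload the blocking
    # read to a worker thread; this models the async overhead.
    def read():
        with open(path, "rb") as f:
            f.read()
    start = time.perf_counter()
    for _ in range(repeats):
        await asyncio.to_thread(read)
    return (time.perf_counter() - start) / repeats

for size in (1_000, 100_000, 10_000_000):  # 1 KB, 100 KB, 10 MB
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    sync_t = time_sync(path)
    async_t = asyncio.run(time_async(path))
    print(f"{size:>10} bytes: sync {sync_t * 1e6:8.1f} us, "
          f"async {async_t * 1e6:8.1f} us")
    os.remove(path)
```

Run it on the actual hardware and OS you care about; the crossover point (if any) is exactly what will vary between systems.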
Async has three main advantages:
- It lowers CPU utilization. This can be useful if you are also doing CPU-heavy operations on the data you just read.
- Using some kind of async infrastructure makes the code easy to parallelize, especially if you are reading lots of files.
- By sending multiple read/write requests to the OS, the OS and hardware can reorder those operations so they complete faster. SATA2’s Native Command Queuing (NCQ) is one such feature.
I believe the main advantage of asynchronous reads shows up when you are working with lots of files or when you need lots of CPU power.
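The second advantage above (many files in flight at once) can be sketched like this. Python stands in for illustration; the file names and contents are made up for the demo, and the blocking reads are offloaded to threads so several can be pending simultaneously, letting the OS and drive service them in whatever order is fastest.

```python
import asyncio
import os
import tempfile

async def read_file(path):
    # Offload the blocking read to a worker thread so many
    # reads can be in flight at the same time.
    def read():
        with open(path, "rb") as f:
            return f.read()
    return await asyncio.to_thread(read)

async def read_all(paths):
    # gather() issues all reads at once; results come back
    # in the order of `paths`, regardless of completion order.
    return await asyncio.gather(*(read_file(p) for p in paths))

# Demo: write a few small files, then read them back concurrently.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(5):
    p = os.path.join(tmpdir, f"file{i}.bin")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 10)
    paths.append(p)

contents = asyncio.run(read_all(paths))
print([len(c) for c in contents])
```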
It depends
One thing to keep in mind is how expensive a context switch between processes is. Node.js is designed the way it is because it assumes a context switch is very expensive, and that you would otherwise have a lot of processes waiting on I/O, which would bog down the machine.
On the other hand, Erlang makes a process context switch very cheap, so everything can be synchronous and the Erlang runtime can keep track of the whole thing.
So the factors to consider:
- the cost of a context-switch operation
- the speed of the disk for seek operations
- the speed of the disk for read operations
- whether the files are in the cache
And I am sure I am leaving out half a dozen other factors.
I’m not sure there’s a particular “point”, but it makes the most sense when you’ve got a lot of threads working, as it allows you to overlap your I/O with other work. If you’ve got spare threads going idle, then reading asynchronously isn’t going to give you any advantage. It’s only when you’ve got work queues filling up and your thread could be usefully doing other work instead of waiting for I/O that async file access gives any advantage.
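The overlap described above can be sketched as follows, assuming an asyncio-style design. The delays and the name `crunch_numbers` are hypothetical stand-ins: one task simulates a slow read, the other simulates useful work, and together they finish in roughly the time of one, not the sum of both.

```python
import asyncio
import time

async def slow_read():
    # Simulate a 0.2 s disk/network read without blocking the event loop.
    await asyncio.sleep(0.2)
    return b"data"

async def crunch_numbers():
    # Stand-in for useful work done while the read is pending.
    await asyncio.sleep(0.2)
    return 42

async def main():
    start = time.perf_counter()
    # Both tasks run concurrently; total time is ~0.2 s, not ~0.4 s.
    data, result = await asyncio.gather(slow_read(), crunch_numbers())
    elapsed = time.perf_counter() - start
    return data, result, elapsed

data, result, elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s")
```

If there is no other work queued, the same structure gains you nothing, which is exactly the point of the answer above.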
I think the problem here is not so much read speed as latency.
If you’re reading from a network drive, or from a slow mechanical hard disk with long queues, read performance will take a nosedive. And if your app is doing that reading on the GUI thread (which makes it a badly behaved application), the experience will be awful for the user.