While testing thread creation/parallel processing, I ran into this strange situation where the multi-threaded code is much slower than the sequential version.
The code simply counts words in files, then sums the results.
Sequential code :
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class WordCount {

    // Count the words in one file and report the wall-clock time it took.
    public static int countWords(String filename) throws IOException {
        long startTime = System.currentTimeMillis();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            int total = 0;
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                total += line.split("\\s+").length;
            }
            System.out.println("Time for file " + filename + " : "
                    + (System.currentTimeMillis() - startTime) + " ms for " + total + " words");
            return total;
        }
    }

    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        int[] wordCount = new int[args.length];
        for (int i = 0; i < args.length; i++) {
            try {
                wordCount[i] = countWords(args[i]);
            } catch (IOException e) {
                System.err.println("Error reading file: " + args[i]);
                e.printStackTrace();
            }
        }
        System.out.println("Word count:" + Arrays.toString(wordCount));
        int total = 0;
        for (int count : wordCount) {
            total += count;
        }
        System.out.println("Total word count:" + total);
        System.out.println("Total time " + (System.currentTimeMillis() - startTime) + " ms");
    }
}
This is the version that runs one thread per file.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordCountMT {

    public static int countWords(String filename) throws IOException {
        // same code as in the sequential version
    }

    // One worker per file: writes its result into its own slot of the shared array.
    private static class CounterWorker implements Runnable {
        private String filename;
        private int index;
        private int[] wordCount;

        public CounterWorker(String filename, int index, int[] wordCount) {
            this.filename = filename;
            this.index = index;
            this.wordCount = wordCount;
        }

        @Override
        public void run() {
            try {
                wordCount[index] = countWords(filename);
            } catch (IOException e) {
                System.err.println("Error reading file: " + filename);
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        int[] wordCount = new int[args.length];
        List<Thread> th = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            th.add(new Thread(new CounterWorker(args[i], i, wordCount)));
            th.get(i).start();
        }
        try {
            for (Thread t : th) {
                t.join();
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Word count:" + Arrays.toString(wordCount));
        int total = 0;
        for (int count : wordCount) {
            total += count;
        }
        System.out.println("Total word count:" + total);
        System.out.println("Total time " + (System.currentTimeMillis() - startTime) + " ms");
    }
}
Ok, so it’s a very basic example used for teaching. Now I could understand that creating a lot of threads is sometimes counterproductive, but these traces are strange :
Non MT
Time for file /tmp/data/data6-5.txt : 14 ms for 1468 words
Time for file /tmp/data/data5-6.txt : 3 ms for 1468 words
Time for file /tmp/data/data5.txt : 1 ms for 390 words
Time for file /tmp/data/data5-5.txt : 1 ms for 780 words
Time for file /tmp/data/data7.txt : 1 ms for 1078 words
Time for file /tmp/data/data6.txt : 2 ms for 1078 words
Time for file /tmp/data/data6-6.txt : 2 ms for 2156 words
Time for file /tmp/data/data8.txt : 1 ms for 1078 words
Time for file /tmp/data/data3.txt : 0 ms for 390 words
Time for file /tmp/data/data6-7.txt : 1 ms for 2156 words
Time for file /tmp/data/data4.txt : 1 ms for 390 words
Time for file /tmp/data/data5-4.txt : 1 ms for 780 words
Time for file /tmp/data/data1.txt : 0 ms for 390 words
Time for file /tmp/data/data2.txt : 0 ms for 390 words
Time for file /tmp/data/data9.txt : 1 ms for 1078 words
Time for file /tmp/data/data5-7.txt : 0 ms for 1468 words
Time for file /tmp/data/data6-4.txt : 1 ms for 1468 words
Word count:[1468, 1468, 390, 780, 1078, 1078, 2156, 1078, 390, 2156, 390, 780, 390, 390, 1078, 1468, 1468]
Total word count:18006
Total time 54 ms
With MT
Time for file /tmp/data/data3.txt : 46 ms for 390 words
Time for file /tmp/data/data5-4.txt : 54 ms for 780 words
Time for file /tmp/data/data5-5.txt : 58 ms for 780 words
Time for file /tmp/data/data1.txt : 45 ms for 390 words
Time for file /tmp/data/data6-5.txt : 70 ms for 1468 words
Time for file /tmp/data/data5-7.txt : 68 ms for 1468 words
Time for file /tmp/data/data5.txt : 45 ms for 390 words
Time for file /tmp/data/data7.txt : 65 ms for 1078 words
Time for file /tmp/data/data2.txt : 43 ms for 390 words
Time for file /tmp/data/data6.txt : 64 ms for 1078 words
Time for file /tmp/data/data5-6.txt : 70 ms for 1468 words
Time for file /tmp/data/data9.txt : 58 ms for 1078 words
Time for file /tmp/data/data6-7.txt : 71 ms for 2156 words
Time for file /tmp/data/data8.txt : 66 ms for 1078 words
Time for file /tmp/data/data6-6.txt : 72 ms for 2156 words
Time for file /tmp/data/data4.txt : 46 ms for 390 words
Time for file /tmp/data/data6-4.txt : 68 ms for 1468 words
Word count:[1468, 1468, 390, 780, 1078, 1078, 2156, 1078, 390, 2156, 390, 780, 390, 390, 1078, 1468, 1468]
Total word count:18006
Total time 98 ms
So total time is worse, but the time per file has also completely exploded, e.g. data8 was taking 1ms and now takes 66ms !
How can I better diagnose why ? Can I measure CPU time consumption instead of wall time ?
Is it because parallel IO requests actually are slower ?
Thank you for an explanation or any additional tests/instrumentation I could add to better pinpoint the issue.
So total time is worse, but the time per file has also completely exploded, e.g. data8 was taking 1ms and now takes 66ms !
Yes, but the total time for all 17 files has only doubled, so much of that explosion is incurred in parallel.
Thread startup cost, such as you speculate in your own answer, is one conceivable explanation, but that’s testable: thread startup time is independent of the thread’s workload, so we can get a better bound on it by running the same number of threads with little or no workload. As I described in a comment, I performed such a test by modifying the MT version of your code such that countWords() does nothing but return 0. That ran 17 threads on my machine in about 2ms according to its own measurement (and it ran 50 threads in about 3ms), so that puts a cap of about 2ms on the per-thread startup time. I expect it actually to be only a fraction of that, and I don’t expect the cost to be too far different on your machine.
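For reference, the stub used for that test amounts to nothing more than this (a sketch of the modification just described, keeping the original signature):

// Stub used for the startup-time test: same signature, but no I/O and no counting.
public static int countWords(String filename) throws IOException {
    return 0;
}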
You’re more likely to be seeing a combination of other factors, but especially:
- you have about four times as many threads as physical cores in the MT case, so if you’re getting much concurrency then that program contends with itself for CPU cache space and memory bus bandwidth. The ST version has much less contention for those resources, especially when the system is lightly loaded. (See the thread-pool sketch below for one way to test this.)
Other, likely less significant factors include:
- context switching costs
- general thread-management overhead
- delays related to contention for System.out
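To test the first point, one simple experiment is to cap the number of concurrent workers at the number of processors instead of starting one thread per file. A minimal sketch, reusing the CounterWorker class from your code and an arbitrary one-minute wait cap, which would replace the per-file Thread loop and the join loop in WordCountMT.main():

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Run the same workers, but on at most one thread per logical processor.
int nCores = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(nCores);
for (int i = 0; i < args.length; i++) {
    pool.submit(new CounterWorker(args[i], i, wordCount));
}
pool.shutdown();                                 // no new tasks; queued ones still run
try {
    pool.awaitTermination(1, TimeUnit.MINUTES);  // wait for all counts to finish
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

If the per-file times drop back toward the sequential numbers when the worker count is capped this way, that would suggest over-subscription and scheduling, rather than the I/O itself, account for most of the difference.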
How can I better diagnose why ? Can I measure CPU time consumption instead of wall time ?
Yes. Provided that your system supports JMX, you can use the VM’s ThreadMXBean to measure the CPU time consumed by your threads.
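For example, a small helper (CpuTimer is just an illustrative name) that prints both wall-clock time and the calling thread’s CPU time around a task:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimer {
    // Runs the task on the calling thread and prints wall-clock vs. per-thread CPU time.
    public static void timed(String label, Runnable task) {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
        boolean cpuOk = tmx.isCurrentThreadCpuTimeSupported();
        long wallStart = System.currentTimeMillis();
        long cpuStart = cpuOk ? tmx.getCurrentThreadCpuTime() : 0L;  // nanoseconds

        task.run();

        long wallMs = System.currentTimeMillis() - wallStart;
        long cpuMs = cpuOk ? (tmx.getCurrentThreadCpuTime() - cpuStart) / 1_000_000 : -1;
        System.out.println(label + " : wall " + wallMs + " ms, cpu " + cpuMs + " ms");
    }
}

You could wrap the body of CounterWorker.run() in CpuTimer.timed(filename, () -> { ... }) and compare the two numbers: a wall-clock time much larger than the CPU time means the thread spent most of its time waiting (for the scheduler, the disk, or System.out) rather than counting.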
Is it because parallel IO requests actually are slower ?
Most disk devices are single-channel. Some can queue multiple read requests at a time, but a single-channel device can only deliver data for a single request at a time. The only way that parallelizing requests to such a device could produce a performance gain is if threads were unable individually to saturate the net bandwidth of device, interconnect, and memory bus, but that’s not typically the case.
For some kinds of devices (e.g. spinning disks), there are additional considerations for sequential (faster) versus random (much slower) access patterns. With such a device, parallel I/O really might be expected to be slower, but that does not apply to SSDs. Also, it probably wouldn’t be a large effect for your particular test workload, given that it looks like few of the files occupy more than two disk blocks, and some only one.
The I/O in your particular MT program is unlikely to be faster than the I/O in your ST program, given that the latter probably saturates the available I/O bandwidth already. But the disk -> memory data transfer is probably not itself slower in your MT case, either.
Parallel I/O (or parallel anything) is not automatically faster than serial. You stand to gain from parallel I/O to the extent that additional threads can make use of system resources that otherwise would go unused. That’s not usually the case for concurrent disk I/O involving a single disk. Network I/O is the canonical use case, but not the only one.
So what we expect (hope for)
Suppose IO for a file takes 10 ms, counting the words takes 5 ms, and thread creation is free.
Because IO happens outside the CPU, using the memory bus, the sequential version for 10 files takes 150 ms.
In the parallel version the IO bus is occupied continuously for 100 ms, and all counts except the last can run in parallel with IO, so we take roughly 105 ms total (good).
The issue is that thread creation is not free; here it seems to cost roughly 70 ms.
Each thread starts its run with an IO, hence it is immediately switched out (descheduled).
So most likely the scheduling goes like this: the threads sample the wall-clock time, then are switched out, then main finishes creating the threads, and only then do the threads get to count. They therefore report a time that includes the time main spent creating threads as well as their own work.
This is visible in the fact that the sum of the runtimes reported by the threads is much greater than the total wall time: we are overlapping (more or less OK), but each thread is also counting in its time the actions of other threads (most probably main creating the threads).
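One way to check this directly is to timestamp the worker when it is created in main and again when run() actually begins; the gap is the startup/scheduling delay that gets folded into the per-file times. A sketch, with a hypothetical createdAt field added to CounterWorker:

// Hypothetical instrumentation: measure how long the thread waited between
// its worker being created in main() and actually entering run().
private final long createdAt = System.nanoTime();

@Override
public void run() {
    long startDelayMs = (System.nanoTime() - createdAt) / 1_000_000;
    System.out.println(filename + " : started " + startDelayMs
            + " ms after its worker was created");
    try {
        wordCount[index] = countWords(filename);
    } catch (IOException e) {
        System.err.println("Error reading file: " + filename);
        e.printStackTrace();
    }
}

If the reported delays are large and grow with the thread index, that supports the idea that the per-file times mostly measure waiting for main and the scheduler rather than the counting itself.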
In conclusion, while we do speed up slightly thanks to the parallel IO requests (we end up with a 50 ms penalty instead of 70 ms), the time to create so many threads makes it not worth it.