Why is multi-threaded, chunked writing of a large file slower when the threads write from many cores rather than from the same core?
Note: this question has been edited to incorporate numerous suggestions and findings from the comments, so some of those comments may now appear outdated. It initially focused on the number of threads, whereas the actual problem seems to be the threads' core affinities.