In a multithreaded programming environment, lock contention on the heap is often a performance bottleneck.
Theoretically at least, the cream-of-the-crop solution for this problem is to make the scalable/parallel[izing] allocator entirely lock-free. However, it seems to me that besides a few research papers (e.g. [1][2][5]), which do contain promising experimental results, entirely lock-free allocators haven’t trickled down to production environments. I’d be glad if you could prove me wrong with counter-examples though. So what are the practical reasons for this slow (or non-existent) adoption of lock-free allocators? Note that more widely used scalable allocators like the one from Intel’s TBB are not lock-free, although they do use fine-grained locks (cf. p. 315 in [3]).
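To make “lock-free” concrete, here is a minimal sketch of the kind of fast path such allocators aim for: a free list popped and pushed with compare-and-swap instead of a mutex. This is only my own illustration (the class and member names are made up, it handles a single size class, and it ignores memory reclamation and the ABA problem), not code from any of the cited allocators:

```cpp
#include <atomic>

struct FreeBlock {
    FreeBlock* next;
};

// Illustrative CAS-based free list (a Treiber stack). A lock-based allocator
// would take a mutex around the same operations; a TBB-style design sits in
// between by splitting the heap and using many fine-grained locks.
class LockFreeFreeList {
    std::atomic<FreeBlock*> head_{nullptr};
public:
    // Return a freed block to the list.
    void push(FreeBlock* block) {
        FreeBlock* old_head = head_.load(std::memory_order_relaxed);
        do {
            block->next = old_head;
        } while (!head_.compare_exchange_weak(old_head, block,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    // Take a block for allocation; returns nullptr if the list is empty.
    // A real allocator must also guard against the ABA problem and against
    // dereferencing a block that another thread has just reused.
    FreeBlock* pop() {
        FreeBlock* old_head = head_.load(std::memory_order_acquire);
        while (old_head &&
               !head_.compare_exchange_weak(old_head, old_head->next,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            // On failure old_head is refreshed with the current head; retry.
        }
        return old_head;
    }
};
```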
For what it’s worth, I also found a CMU student project/paper [4] claiming to have implemented a lock-free allocator that is “slightly better than [Google’s] tcmalloc” on up to 64 cores. Another interesting point in that paper is that “llalloc in these tests is Lockless Inc.’s LockLess allocator, which isn’t 100% lockfree (it has a lock around the global heap)”. jemalloc and ptmalloc are also benchmarked there.
References:
[1] Michael, Maged M. “Scalable lock-free dynamic memory allocation.” ACM Sigplan Notices 39.6 (2004): 35-46. I found an independent [re]implementation of Michael’s algorithm at http://people.cs.vt.edu/~scschnei/streamflow/
[2] Huang, Xiaohuang, et al. “Xmalloc: A scalable lock-free dynamic memory allocator for many-core machines.” Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on. IEEE, 2010. A free version of the paper is available as an MS thesis.
[3] Kukanov, Alexey, and Michael J. Voss. “The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks.” Intel Technology Journal 11.4 (2007).
[4] Podolsky, Alex. “Nah Lock: A Lock-Free Memory Allocator.” CMU student course paper; apparently written in 2013 based on parent-directory timestamps and the “S13” suffix in the course name.
[5] Gidenstam, Anders, Marina Papatriantafilou, and Philippas Tsigas. “NBmalloc: Allocating memory in a lock-free manner.” Algorithmica 58.2 (2010): 304-338. Source code available for this one.
As a footnote, I see there are 4 pending close votes for this question, but I see none for “Why have hardware-accelerated vector graphics not taken off?”. It would be interesting if someone could explain why a question that could potentially be answered by somewhat objective performance numbers is more opinion-based than one where the main factor is the market orientation of 2-3 big companies.
Here are some possible answers:
- IBM has patented their lock-free allocator… But then they also support Linux etc., so they might have an incentive to at least provide an implementation. And the XMalloc guys also filed for a patent on theirs. But I haven’t [yet] found any patents (or patent applications) for NBmalloc.
- A more localized issue is that M. Michael appears to have ceased publishing at IBM right after that paper. I’m not sure if he’s still with IBM, but his leaving or switching focus might be a reason why IBM didn’t try to productize his allocator, as it did with the prior Watson/Watson2 allocators that made it into AIX. The other lock-free allocators don’t seem to have connections with a big company that could push them into a product.
- Finally, a more general reason would be the ratio of code complexity to performance benefits. Quoting from “Proving That Non-Blocking Algorithms Don’t Block” by A. Gotsman et al.: “Non-blocking data structures are generally much more complex than their lock-based counterparts, but can provide better performance in the presence of high contention between threads.” So the code complexity would have to be justified by significant payoffs in performance. While some lock-free allocator papers have claimed such payoffs, other papers have come up with contrary or inconclusive results. In particular, the Streamflow paper, whose authors reimplemented Michael’s algorithm, claims to have beaten it soundly with an allocator that isn’t lock-free, and Podolsky’s course paper reports only relatively marginal improvements. So I’m guessing the jury is still out on whether lock-free allocators are worth the extra effort (a sketch of the kind of extra machinery involved follows below).
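To give an idea of what that extra complexity looks like in practice, here is an illustrative sketch (not taken from any of the cited papers; the struct and field names are mine) of one classic piece of machinery a CAS-based free list needs: pairing the head pointer with a version tag to defend against the ABA problem.

```cpp
#include <atomic>
#include <cstdint>

struct FreeBlock { FreeBlock* next; };

// Head pointer plus a version tag that changes on every successful pop, so a
// "same pointer, different list" state (ABA) no longer passes the CAS.
struct TaggedHead {
    FreeBlock*     ptr;
    std::uintptr_t tag;
};

class AbaSafeFreeList {
    // 16 bytes on 64-bit targets: this is only lock-free if the platform has a
    // double-width CAS (e.g. cmpxchg16b, -mcx16 with GCC/Clang); otherwise
    // std::atomic may silently fall back to a lock -- exactly the kind of
    // portability wrinkle that adds engineering cost.
    std::atomic<TaggedHead> head_{TaggedHead{nullptr, 0}};
public:
    FreeBlock* pop() {
        TaggedHead old_head = head_.load(std::memory_order_acquire);
        while (old_head.ptr) {
            TaggedHead new_head{old_head.ptr->next, old_head.tag + 1};
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return old_head.ptr;
            // A failed CAS refreshed old_head; build a new candidate and retry.
        }
        return nullptr;
    }

    void push(FreeBlock* block) {
        TaggedHead old_head = head_.load(std::memory_order_relaxed);
        TaggedHead new_head{block, old_head.tag};
        do {
            block->next = old_head.ptr;
            new_head.tag = old_head.tag;   // the tag only needs to change on pop
        } while (!head_.compare_exchange_weak(old_head, new_head,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }
};
```

And this still leaves out safe memory reclamation (hazard pointers or similar), which lock-free designs also have to solve.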
How do you know that lock-free allocators/deallocators are not used? For example, take malloc on Mac OS X and iOS. Do you know whether it’s lock-free or not? (Their documentation says that calling malloc and free on the same thread in short sequence is very fast, and that’s the case you worry about.)
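To illustrate the point in that comment: the same-thread malloc/free case can be fast no matter how the shared heap is synchronized, because tcmalloc-style designs (and, as far as I understand, many modern malloc implementations) serve it from a per-thread cache that touches no lock and no atomics at all. Rough sketch only (the names are mine, it uses one size class, never trims the cache, and is not Apple’s or any real allocator’s code):

```cpp
#include <cstddef>
#include <cstdlib>

struct FreeBlock { FreeBlock* next; };

constexpr std::size_t kBlockSize = 64;     // single size class for brevity

// One cache per thread: no sharing, hence no locks and no atomics.
thread_local FreeBlock* tl_cache = nullptr;

void* fast_alloc() {
    if (FreeBlock* b = tl_cache) {         // hot path: pop from the local cache
        tl_cache = b->next;
        return b;
    }
    // Cold path: fall back to the shared heap. Whether that heap is guarded by
    // one lock, fine-grained locks, or lock-free lists only matters here.
    return std::malloc(kBlockSize);
}

void fast_free(void* p) {
    auto* b = static_cast<FreeBlock*>(p);  // hot path: push onto the local cache
    b->next = tl_cache;
    tl_cache = b;
    // A real allocator would bound the cache and return surplus blocks
    // (and blocks freed by other threads) to the shared heap.
}
```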