I’ve read in many places (heck, I’ve even written so myself) that garbage collection could (theoretically) be faster than manual memory management.
However, a demonstration is a lot harder to come by than a claim.
I have never actually seen any piece of code that demonstrates this effect in action.
Does anyone have (or know where I can find) code that demonstrates this performance advantage?
34
See Link and follow all of the links to see Rico Mariani vs Raymond Chen (both very competent programmers at Microsoft) dueling it out. Raymond would improve the unmanaged version; Rico would respond by optimizing the same thing in the managed version.
With essentially zero optimization effort, the managed versions started off many times faster than the manual. Eventually the manual beat the managed, but only by optimizing to a level that most programmers would not want to go to. In all versions, the memory usage of the manual was significantly better than the managed.
7
The rule of thumb is that there are no free lunches.
GC takes away the headache of manual memory management and reduces the probability of making mistakes. There are some situations where a particular GC strategy is the optimal solution for the problem, in which case you’ll pay no penalty for using it. But there are others where other solutions will be faster. Since you can always simulate higher abstractions from a lower level, but not the other way around, you can effectively prove that there is no way higher abstractions can be faster than the lower ones in the general case.
GC is a special case of manual memory management
It may be a lot of work, or more error prone, to get better performance manually, but that’s a different story.
3
It’s easy to construct an artificial situation where GC is infinitely more efficient than manual methods – just arrange that there is only one “root” for the garbage collector, and that everything is garbage, so the GC step is instantly completed.
If you think about it, that is the model used when garbage collecting the memory allocated to a process. The process dies, all its memory is garbage, and we’re done. Even in practical terms, a process that starts, runs, and dies leaving no trace might be more efficient than one that starts and runs forever.
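As a toy illustration of that degenerate case (my own sketch, not from the answer), consider a run that allocates a big linked structure and then terminates: freeing every node individually costs a full walk of the structure at shutdown, while treating everything as garbage with no live roots lets process exit reclaim it all in one step.

```cpp
// Toy sketch: allocate a large linked list, then either free each node
// ("manual" mode) or let process teardown reclaim everything at once,
// which is the instant-collection case described above.
#include <cstdlib>

struct Node { Node* next; };

int main(int argc, char**) {
    Node* head = nullptr;
    for (int i = 0; i < 10000000; ++i) {
        Node* n = static_cast<Node*>(std::malloc(sizeof(Node)));
        if (!n) break;                       // keep the sketch safe on OOM
        n->next = head;
        head = n;
    }

    if (argc > 1) {                          // "manual" mode: pay for every free
        while (head) {
            Node* next = head->next;
            std::free(head);
            head = next;
        }
    }
    // Otherwise: there are no live roots we care about, everything is garbage,
    // and the OS reclaims the whole address space at exit in one step.
    return 0;
}
```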
For practical programs written in languages with garbage collection, the advantage of garbage collection is not speed but correctness and simplicity.
17
I have done quite a bit of work on this and described some of it here. I benchmarked the Boehm GC in C++, allocating with malloc but never freeing, allocating and freeing with free, and a custom-built mark-region GC written in C++, all against OCaml’s stock GC running a list-based n-queens solver. OCaml’s GC was faster in all cases. The C++ and OCaml programs were deliberately written to perform the same allocations in the same order.
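A rough sketch of the shape of that comparison (my own reconstruction, not the original benchmark code, which was a list-based n-queens solver): the same cons-cell allocation done three ways, selected by hypothetical compile-time flags.

```cpp
// Sketch only: the same cons-cell allocation done with (a) malloc and no free,
// (b) malloc plus an explicit free of the whole list, or (c) the Boehm
// collector's GC_MALLOC with no explicit frees. USE_BOEHM and USE_FREE are
// illustrative flags, not from the original code.
#include <cstdlib>
#ifdef USE_BOEHM
#include <gc.h>                      // Boehm GC; link with -lgc
#endif

struct Cons { int head; Cons* tail; };

Cons* cons(int head, Cons* tail) {
#ifdef USE_BOEHM
    Cons* c = static_cast<Cons*>(GC_MALLOC(sizeof(Cons)));   // collected later
#else
    Cons* c = static_cast<Cons*>(std::malloc(sizeof(Cons)));
#endif
    c->head = head;
    c->tail = tail;
    return c;
}

void release(Cons* list) {
#if !defined(USE_BOEHM) && defined(USE_FREE)
    while (list) { Cons* t = list->tail; std::free(list); list = t; }
#else
    (void)list;                      // no-free and Boehm variants do nothing here
#endif
}
```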
You can, of course, rewrite the programs to solve the problem using only 64-bit integers and no allocations. Although faster, that would defeat the point of the exercise (which was to predict the performance of a new GC algorithm I was working on using a prototype built in C++).
I have spent many years in industry porting real C++ code to managed languages. In almost every single case I observed substantial performance improvements, many of which were probably due to GC trumping manual memory management. The practical limitation is not what can be accomplished in a microbenchmark but what can be accomplished before a deadline, and GC-based languages offer such huge productivity improvements that I never looked back. I still use C and C++ on embedded devices (microcontrollers), but even that is changing now.
15
It should be considered that GC is not just a memory management strategy; it also makes demands on the entire design of the language and runtime environment, that impose costs (and benefits). For example, a language that supports GC has to be compiled into a form where pointers can’t be hidden from the garbage collector, and generally where they can’t be constructed except by carefully managed system primitives. Another consideration is the difficulty of maintaining response time guarantees, as GC imposes some steps that have to be allowed to run to completion.
Consequently, if you have a language that is garbage collected, and compare the speed with manually managed memory in the same system, you still have to pay the overhead to support garbage collection even if you’re not using it.
Faster is dubious. However, it can be ultra-fast, imperceptible, or faster if it’s hardware supported. There were designs like that for LISP machines a long time ago. One built the GC into the memory subsystem of the hardware, such that the main CPU didn’t know it was there. Like many later designs, the GC ran concurrently with the main processor with little or no need for pauses. A more modern design is Azul Systems’ Vega 3 machines, which run Java code way faster than conventional JVMs using purpose-built processors and a pauseless GC. Google them if you want to know how fast GC (or Java) can be.
Perhaps the first academic paper that asserts this was “Garbage collection can be faster than stack allocation” (Appel, 1987)†
The assertion was true for certain use cases – in particular from the abstract:
An old and simple algorithm for garbage collection gives very good results when the physical memory is much larger than the number of reachable cells. In fact, the overhead associated with allocating and collecting cells from the heap can be reduced to less than one instruction per cell by increasing the size of physical memory. Special hardware, intricate garbage-collection algorithms, and fancy compiler analysis become unnecessary.
And from the paper’s introduction:
This paper shows that, with enough memory on the computer, it is more expensive to explicitly free a cell than it is to leave it for the garbage collector — even if the cost of freeing a cell is only a single machine instruction.
In many practical use cases, the cells/objects still live at GC time are a small fraction of the total amount allocated in that GC period. (This of course is the observation that started all the R&D into generational garbage collectors.) And some language systems take this to an extreme: the functional languages. Functional languages are extremely heavy in their use of heap memory, and thus fast GC is critically important.
The algorithm discussed in the paper is a stop-and-copy collector, in the context of one such functional programming system: an ML compiler.
A careful analysis shows that if the cost of popping a record from the stack, or of freeing heap-allocated memory, is (some) constant, then there’s a cross-over point: once there’s sufficient free memory that you don’t need to GC very often, it’s cheaper to reclaim by copying the live cells than by explicitly freeing each no-longer-needed cell/object. The paper then shows that the allocation overhead can be reduced to zero if you arrange for an inaccessible page – and thus a page fault – to signal that it’s time to GC. (And most OSes let you do that.)
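A back-of-the-envelope version of that cross-over argument (my own notation, not quoted from the paper): with a two-space copying collector, a heap of H words gives a semi-space of H/2; if L words are live on average, each collection copies L words at cost proportional to L and recovers H/2 − L words for fresh allocation, so the amortized collection cost per allocated word is roughly

```latex
\[
  \frac{cL}{\,H/2 - L\,} \;\to\; 0 \qquad (H \to \infty)
\]
```

which drops below one instruction per cell once H is large enough, whereas an explicit free costs at least one instruction per cell no matter how much memory is available.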
Experiments were run on a special machine that allowed them to test performance as available memory was increased. The project was the “Massive Memory Machine” (at Princeton). The special “massive memory” machine was a VAX-11 with 128MB of memory. (Yes, that’s right: Not terabytes, or even gigabytes. “Massive” was 128 megabytes. Well, it was 1986.)
† Official full-text copies are easily found on the web.
Such an example necessarily has a bad manual memory allocation scheme.
Assume the best garbage collector, GC. Internally it has methods to allocate memory, to determine what memory can be freed, and to finally free it. Together these take less time than all of GC; some time is spent in the other methods of the GC.

Now consider a manual allocator that uses the same allocation mechanism as GC, and whose free() call just sets aside the memory to be freed by the same method as GC. It doesn’t have a scan phase, nor does it have any of the other methods. It necessarily takes less time.
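A minimal sketch of that construction (mine, following the argument above rather than any particular implementation): a “manual” allocator that allocates exactly the way the GC would, but whose free() merely queues memory for the same low-level release routine the GC uses, with no scan phase at all.

```cpp
// Deferred-free allocator sketch: everything the GC's reclamation does except
// the tracing/scanning, because the program states explicitly what is dead.
#include <cstdlib>
#include <vector>

class DeferredFreeAllocator {
public:
    void* allocate(std::size_t bytes) {
        return std::malloc(bytes);     // same allocation path a GC might use
    }

    void free(void* p) {
        pending_.push_back(p);         // just set the memory aside; no scanning
    }

    // Bulk release, analogous to the GC's sweep, but driven by explicit frees
    // rather than by tracing reachability.
    void release_pending() {
        for (void* p : pending_) std::free(p);
        pending_.clear();
    }

private:
    std::vector<void*> pending_;
};
```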
11
There are more than just two methods: The ones that I have used are garbage collection, manual memory management, manual reference counting, and automatic (compiler generated) reference counting.
One thing to keep in mind when you look at “faster”: something that is simpler to write effectively ends up faster. Say I write a program using manual memory management and it takes me ten days, and you write a program using garbage collection, and because that is easier, it takes you eight days. That means you have two days to make your code faster that I don’t have, so whether garbage collection itself is faster or slower doesn’t matter much: after ten days, your code will be faster. In the end, I don’t care about the time for manual memory management because it is a pain. (C++ has things to reduce the pain, though.)
Reference counting has the disadvantage that some work needs doing every time a reference to an object is created or removed. Automatic reference counting helps with this: the compiler will often be able to remove matching pairs of add/remove reference. Knowing how it works helps: if you pass five strings to a function, that’s five reference counts that need to be incremented/decremented. Pack the five strings into an object, and it’s only one reference. Five times faster, and better for other reasons as well. But reference counting still takes time. (Rumour is that Apple has made hardware changes to its ARM chips to make changing a reference count five times faster than on an x86; that may make a substantial difference.)
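A small illustration of the five-strings point above (my own, using std::shared_ptr as a stand-in for whatever reference counting the language provides): passing five counted strings by value costs five count updates per call, while packing them into one counted object costs one.

```cpp
#include <cstddef>
#include <memory>
#include <string>

struct FiveStrings { std::string a, b, c, d, e; };

// Five reference-count bumps per call: each shared_ptr copy increments a count.
std::size_t use_separate(std::shared_ptr<std::string> a, std::shared_ptr<std::string> b,
                         std::shared_ptr<std::string> c, std::shared_ptr<std::string> d,
                         std::shared_ptr<std::string> e) {
    return a->size() + b->size() + c->size() + d->size() + e->size();
}

// One reference-count bump per call: the five strings travel behind one count.
std::size_t use_packed(std::shared_ptr<FiveStrings> s) {
    return s->a.size() + s->b.size() + s->c.size() + s->d.size() + s->e.size();
}
```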
And then comes the time when 1000 objects have been allocated and need releasing. The actual releasing may be faster with garbage collection, because it knows all pointers involved are valid, and it may know that a whole area filled with objects can be released in what amounts to a single operation. On the other hand, the speed of the actual allocation/deallocation without reference counting can and should be optimised. It is quite conceivable that garbage collection beats a bad allocator, but not a good one.

And then there is the question of how much time is wasted on objects that actually remain allocated during garbage collection. With reference counting etc. they have zero cost.
So I would say this is implementation dependent, and it depends on exactly what you are doing. Often you can’t change it anyway: if you have a ton of Java or Objective-C code, you have no choice. And the difference is not enough to choose one language over another, so you should use the most convenient of the available languages.
Efficient concurrent data structures with heavy read access come to mind as a case where GC of some form (including ref counting) is still arguably quite relevant, even to me as a C++ programmer.
Read locks of any form (e.g. a shared mutex, but also others) tend to be very expensive in our use cases: many operations measured multiple times slower, even with relatively intensive processing per read iteration. That is because we frequently read over the same data in parallel across threads (although writes are relatively infrequent), which makes the read lock highly contended even when it’s per-element. And since acquiring a read lock mutates data (effectively an atomic write to shared memory), it leads to false sharing all over the place just to read the elements of the concurrent structure; what should conceptually be a read-only memory operation turns into one that writes to memory shared between threads.
Handling Removal of Elements Still Being Accessed, Without Read Locks
I cannot think of any solution besides some form of GC that avoids mutating shared read-lock state (although I’m all ears if people have better ideas): element removal/deallocation is deferred to a collector, since we have to be guaranteed that no other thread is reading an element before we can safely free it from memory.
Avoiding Shared Mutable State for Read-Only Access Using TLS
Some people might then point out that the GC state itself would require mutating shared data for reads, but it doesn’t. We store a separate ref count in thread-local state per thread to avoid false sharing, and the collector then tallies them up in a stop-the-world fashion and frees the memory when the summed ref count is zero. Keeping a whole separate set of thread-local ref counts per thread per element might sound explosive in memory use, but we don’t do it per element; we do it for entire blocks of elements, which can often contain as many as 4k elements each depending on the size of an element.
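Here is a minimal sketch of that scheme (names and structure are my own, not the production code): each reader thread keeps its own pin count per block in thread-local storage, so a read-only traversal never writes to memory shared with other readers; a collector later pauses readers, sums the per-thread counts, and frees retired blocks whose total is zero.

```cpp
// Sketch: per-thread pin counts per block instead of a shared atomic read lock.
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

struct Block {
    std::vector<int> elements;   // e.g. up to ~4k elements per block
    bool retired = false;        // removal requested, but readers may remain
};

// Per-thread pin counts: Block* -> how many times this thread is inside it.
struct ThreadPins {
    std::unordered_map<const Block*, long> pins;
};

std::mutex g_threads_mutex;            // taken only on registration / collection
std::vector<ThreadPins*> g_threads;    // all registered reader threads

ThreadPins& my_pins() {
    thread_local ThreadPins* tp = [] {
        auto* p = new ThreadPins();    // thread-exit cleanup omitted in this sketch
        std::lock_guard<std::mutex> lock(g_threads_mutex);
        g_threads.push_back(p);
        return p;
    }();
    return *tp;
}

// Readers call these around element access; note: no atomics, no shared writes.
void pin(const Block* b)   { ++my_pins().pins[b]; }
void unpin(const Block* b) { --my_pins().pins[b]; }

// Stop-the-world collection: with all reader threads paused (mechanism omitted),
// sum the per-thread counts and free retired blocks nobody is still reading.
// Blocks are assumed to have been allocated with new.
void collect(std::vector<Block*>& retired_blocks) {
    std::lock_guard<std::mutex> lock(g_threads_mutex);
    auto it = retired_blocks.begin();
    while (it != retired_blocks.end()) {
        long total = 0;
        for (ThreadPins* tp : g_threads) {
            auto found = tp->pins.find(*it);
            if (found != tp->pins.end()) total += found->second;
        }
        if (total == 0) { delete *it; it = retired_blocks.erase(it); }
        else            { ++it; }
    }
}
```

The only shared writes happen at thread registration and during collection, not on the read path, which is what removes the false sharing that the read locks were causing.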
A Real-World Use Case
This is at least the best-performing solution I’ve come up with so far for our read-heavy but write-light needs: it avoids requiring threads to write to shared data (read-lock state) when merely accessing the elements of the concurrent container in a strict read-only fashion, and it requires some form of GC. We use this solution in computer graphics with a highly parallelized entity-component system, and just applying this GC solution sped up an operation that loads a million-polygon mesh from disk in a parallel loop, building the resulting mesh, from 2 secs down to ~320ms; profiling showed the time dominated by thread contention on the read locks of our concurrent structures. This was also code that had been repeatedly measured and optimized by myself and others.
Loading a mesh from disk might sound like a write-heavy operation, but it tends to still be write-light and read-heavy, because meshes involve a whole lot of reading in order to be created complete with not only polygons but also edges, vertices, multiple texture UV coordinate maps, etc. For example, edge creation requires reading back the polygons of the mesh to determine which edges are shared between them. Even the process of creating such structures is more read-heavy than write-heavy (unless we’re just talking about triangle-soup mesh reps and not ones like half-edges, which share edges and vertices between polygons).
Anyway, this is at least an example of a real-world production use case sped up by the use of GC. At least I cannot think of how to speed things up without it, although I’m always interested in new ideas. The simplest way I found to optimize this, after mulling it over for days, was to use GC to eliminate the need for read locks in our iterators and random-access operators, while still allowing element removal to occur in parallel while other threads are still reading that element. In discussions of what’s “faster”, though, I think it’s often a balance between efficiency, practicality of implementation, and the generality of the solution. GC is at least the best solution I could come up with (it was far from my first solution, and I’ve measured all of them) to achieve a balance of all three.