This question got rather a freezing reception at SO, so I decided to delete it there and try here instead. If you think it does not fit here either, please at least leave a comment suggesting how to find the example I’m after…
Can you give an example, where using C99 VLAs offers a real advantage over something like current standard heap-using C++ RAII mechanisms?
The example I am after should:
- Achieve an easily measurable (10% maybe) performance advantage over using heap.
- Not have a good workaround, which would not need the whole array at all.
- Actually benefit from using dynamic size, instead of fixed maximum size.
- Be unlikely to cause stack overflow in normal use scenario.
- Be strong enough to tempt a developer needing the performance to include a C99 source file in a C++ project.
Adding some clarification on the context: I mean a VLA as defined by C99 and not included in standard C++: int array[n], where n is a variable. And I am after an example of a use case where it trumps the alternatives offered by other standards (C90, C++11):
int array[MAXSIZE]; // C stack array with compile time constant size
int *array = calloc(n, sizeof(int)); // C heap array with manual free
int *array = new int[n]; // C++ heap array with manual delete
std::unique_ptr<int[]> array(new int[n]); // C++ heap array with RAII
std::vector<int> array(n); // STL container with preallocated size
Some ideas:
- Functions taking varargs, which naturally limits the item count to something reasonable, yet has no useful API-level upper limit.
- Recursive functions, where wasted stack is undesirable
- Many small allocations and releases, where heap overhead would be bad.
- Handling multi-dimensional arrays (like arbitrarily sized matrices), where performance is critical, and small functions are expected to get inlined a lot (see the sketch after this list).
- From comment: concurrent algorithm, where heap allocation has synchronization overhead.
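To illustrate the multi-dimensional matrix idea, here is a minimal sketch assuming the GNU VLA extension in C++ (the same extension the benchmark answer below relies on); the function, its size bound, and the use of a scratch buffer to allow in-place transposition are my own illustrative choices, not an established example:
// Transpose a rows x cols matrix stored row-major in `in` into `out` (cols x rows).
// The VLA scratch is sized exactly to the input and lets `out` alias `in`;
// no heap allocation is involved.
static void transpose(const double *in, double *out, int rows, int cols)
{
    if (rows <= 0 || cols <= 0 || rows > 128 || cols > 128)  // keep stack usage bounded
        return;

    double tmp[rows][cols];            // stack scratch, GNU VLA extension in C++

    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            tmp[r][c] = in[r * cols + c];

    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            out[c * rows + r] = tmp[r][c];
}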
Wikipedia has an example which does not fulfill my criteria, because the practical difference compared to using the heap seems irrelevant, at least without context. It is also non-ideal because, without more context, it seems the item count could very well cause a stack overflow.
Note: I’m specifically after an example code, or suggestion of an algorithm which would benefit from this, for me to implement the example myself.
I just hacked up a little program that generates a set of random numbers, restarting at the same seed each time to ensure that it’s “fair” and “comparable”. As it goes along, it figures out the min and max of these values. And when it has generated the set of numbers, it counts how many are above the average of min and max.
For VERY small arrays, it shows a clear benefit with VLAs over std::vector<>.
It is not a real problem, but we can easily imagine something where we’d be reading the values from a small file instead of using random numbers, and doing some other, more meaningful counting/min/max calculations with the same sort of code.
For VERY small values of the “number of random numbers” (x) in the relevant functions, the VLA solution wins by a huge margin. As the size gets larger, the “win” gets smaller, and given sufficient size, the vector solution appears to be MORE efficient – I didn’t study that variant too much, as when we start having thousands of elements in a VLA, it’s not really what they were meant to do…
And I’m sure someone will tell me that there’s some way of writing all this code with a bunch of templates and get it to do this without running more than the RDTSC and cout bits at runtime… But I don’t think that’s really the point.
When running this particular variant, I get about a 10% difference between func1 (VLA) and func2 (std::vector).
count = 9884
func1 time in clocks per iteration 7048685
count = 9884
func2 time in clocks per iteration 7661067
count = 9884
func3 time in clocks per iteration 8971878
This is compiled with: g++ -O3 -Wall -Wextra -std=gnu++0x -o vla vla.cpp
Here’s the code:
#include <iostream>
#include <vector>
#include <cstdint>
#include <cstdlib>
using namespace std;
const int SIZE = 1000000;
uint64_t g_val[SIZE];
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int func1(int x)
{
int v[x];
int vmax = 0;
int vmin = x;
for(int i = 0; i < x; i++)
{
v[i] = rand() % x;
if (v[i] > vmax)
vmax = v[i];
if (v[i] < vmin)
vmin = v[i];
}
int avg = (vmax + vmin) / 2;
int count = 0;
for(int i = 0; i < x; i++)
{
if (v[i] > avg)
{
count++;
}
}
return count;
}
int func2(int x)
{
vector<int> v;
v.resize(x);
int vmax = 0;
int vmin = x;
for(int i = 0; i < x; i++)
{
v[i] = rand() % x;
if (v[i] > vmax)
vmax = v[i];
if (v[i] < vmin)
vmin = v[i];
}
int avg = (vmax + vmin) / 2;
int count = 0;
for(int i = 0; i < x; i++)
{
if (v[i] > avg)
{
count++;
}
}
return count;
}
int func3(int x)
{
vector<int> v;
int vmax = 0;
int vmin = x;
for(int i = 0; i < x; i++)
{
v.push_back(rand() % x);
if (v[i] > vmax)
vmax = v[i];
if (v[i] < vmin)
vmin = v[i];
}
int avg = (vmax + vmin) / 2;
int count = 0;
for(int i = 0; i < x; i++)
{
if (v[i] > avg)
{
count++;
}
}
return count;
}
void runbench(int (*f)(int), const char *name)
{
srand(41711211);
uint64_t t = rdtsc();
int count = 0;
for(int i = 20; i < 200; i++)
{
count += f(i);
}
t = rdtsc() - t;
cout << "count = " << count << endl;
cout << name << " time in clocks per iteration " << dec << t << endl;
}
struct function
{
int (*func)(int);
const char *name;
};
#define FUNC(f) { f, #f }
function funcs[] =
{
FUNC(func1),
FUNC(func2),
FUNC(func3),
};
int main()
{
for(size_t i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
{
runbench(funcs[i].func, funcs[i].name);
}
}
The reason to use a VLA is primarily performance. It is a mistake to disregard the wiki example as having only an “irrelevant” difference. I can easily see cases where exactly that code could make a huge difference, for instance if that function were called in a tight loop, where read_val was an IO function that returned very quickly, on some sort of system where speed was critical.
In fact, in most places where VLAs are used in this manner, they don’t replace heap calls but instead replace something like:
float vals[256]; /* I hope we never get more! */
The thing about any local declaration is that it is extremely quick. The line float vals[n] generally only requires a couple of processor instructions (maybe just one): it simply adjusts the stack pointer by the size of the array.
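As a hedged sketch of that pattern (the callback shape, the 4096 bound, and the function name are illustrative assumptions, not the wiki example’s actual interface):
// Count how many of the n values produced by read_val are above their mean.
// Two passes over the data, so the values must be stored; the VLA replaces a
// fixed "float vals[256]"-style buffer and costs only a stack-pointer adjustment.
static int count_above_mean(int n, float (*read_val)(int index))
{
    if (n <= 0 || n > 4096)      // keep the stack allocation bounded
        return 0;

    float vals[n];               // GNU VLA extension in C++, standard in C99
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
    {
        vals[i] = read_val(i);
        sum += vals[i];
    }

    float mean = sum / n;
    int count = 0;
    for (int i = 0; i < n; i++)
        if (vals[i] > mean)
            count++;
    return count;
}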
On the other hand, a heap allocation requires walking a data structure to find a free area. The time is probably an order of magnitude longer even in the luckiest case. (I.e. just the act of placing n on the stack and calling malloc is probably 5-10 instructions.) It is probably much worse if there’s any reasonable amount of data in the heap. It would not surprise me at all to see a case where malloc was 100x to 1000x slower in a real program.
Of course, then you also have some performance impact with the matching free, probably similar in magnitude to the malloc call.
In addition, there’s the issue of memory fragmentation. Lots of little allocations tend to fragment the heap. Fragmented heaps both waste memory and increase the time required to allocate memory.
Regarding VLAs versus a Vector
Did you consider that a Vector can take advantage of VLAs itself? Without VLAs, the Vector has to specify certain “scales” of arrays, e.g. 10, 100, 10000, for storage, so you end up allocating a 10000-item array to hold 101 items. With VLAs, if you resize to 200, the algorithm might assume that you’ll only need 200 and can allocate a 200-item array. Or it can allocate a buffer of, say, n*1.5.
Anyway, I’d argue that if you know how many items you’ll need at runtime, a VLA is more performant (as Mats’ benchmark demonstrated). What he demonstrated was a simple two-pass iteration. Think of Monte Carlo simulations where random samples are taken repeatedly, or image manipulation (like Photoshop filters), where computations are done on each element multiple times and quite possibly each computation on each element involves looking at neighbors.
That extra pointer jump from the vector to its internal array adds up.
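As a hedged sketch of the image-filter case mentioned above (the 1-D box blur, the width bound, and the clamp-at-edges behaviour are my own illustrative choices):
// Smooth one image row in place with a 3-tap box blur. The original values are
// still needed while the smoothed ones are written, so a scratch copy is
// required; a VLA sized to the row width avoids a heap allocation per row.
// (GNU VLA extension in C++, standard in C99.)
static void blur_row(float *row, int width)
{
    if (width <= 0 || width > 8192)   // keep the stack allocation bounded
        return;

    float orig[width];                // stack scratch copy of the row
    for (int i = 0; i < width; i++)
        orig[i] = row[i];

    for (int i = 0; i < width; i++)
    {
        float left  = orig[i > 0 ? i - 1 : i];          // clamp at the left edge
        float right = orig[i < width - 1 ? i + 1 : i];  // clamp at the right edge
        row[i] = (left + orig[i] + right) / 3.0f;
    }
}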
Answering the main question
But when you talk about using a dynamically allocated structure like a LinkedList, there is no comparison. An array provides direct access using pointer arithmetic to its elements. Using a linked list you have to walk the nodes to get to a specific element. So the VLA wins hands down in this scenario.
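A minimal illustration of that access-pattern difference (the element index is arbitrary, and std::list stands in for a generic linked list assumed to hold at least five elements):
#include <list>
#include <iterator>

// Direct access: one pointer-arithmetic computation, independent of the index.
int fifth_of_array(const int *a)
{
    return a[4];
}

// Linked-list access: four nodes must be walked before the element is reached.
int fifth_of_list(const std::list<int> &l)
{
    std::list<int>::const_iterator it = l.begin();
    std::advance(it, 4);
    return *it;
}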
According to this answer, it is architecture-dependent, but in some cases memory access on the stack will be faster because the stack is likely to already be in the cache. With a large number of elements this advantage may be negated (potentially the cause of the diminishing returns Mats saw in his benchmarks). However, it’s worth noting that cache sizes are growing significantly, so the number of elements for which this holds will potentially grow accordingly.