What was the reasoning behind not explicitly storing an array’s length with an array in C
?
The way I see it, there are overwhelming reasons to do so but not very many in support of the standard (C89). For instance:
- Having length available in a buffer can prevent buffer overrun.
- A Java-style
arr.length
is both clear and avoids the programmer from having to maintain manyint
s on the stack if dealing with several arrays - Function parameters become more cogent.
But perhaps the most motivating reason, in my opinion, is that usually, no space is saved without keeping the length. I would venture to say that most uses of arrays involve dynamic allocation. True, there may be some cases where people use an array allocated on the stack, but that’s just one function call* – the stack can handle 4 or 8 bytes extra.
Since the heap manager has to track the free block size used up by the dynamically allocated array anyway, why not make that information usable (and add the additional rule, checked at compile time, that one can’t manipulate the length explicitly unless one would like to shoot oneself in the foot).
The only thing I can think of on the other side is that no length tracking may have made compilers simpler, but not that much simpler.
*Technically, one could write some kind of recursive function with an array with automatic storage, and in this (very elaborate) case storing the length may indeed result in effectively more space usage.
12
C arrays do keep track of their length, as the array length is a static property:
int xs[42]; /* a 42-element array */
You can’t usually query this length, but you don’t need to because it’s static anyway – just declare a macro XS_LENGTH
for the length, and you’re done.
The more important issue is that C arrays implicitly degrade into pointers, e.g. when passed to a function. This does make some sense, and allows for some nice low-level tricks, but it loses the information about the length of the array. So a better question would be why C was designed with this implicit degradation to pointers.
Another matter is that pointers need no storage except the memory address itself. C allows us to cast integers to pointers, pointers to other pointers, and to treat pointers as if they were arrays. While doing this, C is not insane enough to fabricate some array length into existence, but seems to trust in the Spiderman motto: with great power the programmer will hopefully fulfill the great responsibility of keeping track of lengths and overflows.
16
A lot of this had to do with the computers available at the time. Not only did the compiled program have to run on a limited resource computer, but, perhaps more importantly, the compiler itself had to run on these machines. At the time Thompson developed C, he was using a PDP-7, with 8k of RAM. Complex language features that didn’t have an immediate analog on the actual machine code were simply not included in the language.
A careful read through the history of C yields more understanding into the above, but it wasn’t entirely a result of the machine limitations they had:
Moreover, the language (C) shows considerable power to describe important concepts, for example, vectors whose length varies at run time, with only a few basic rules and conventions. … It is interesting to compare C’s approach with that of two nearly contemporaneous languages, Algol 68 and Pascal [Jensen 74]. Arrays in Algol 68 either have fixed bounds, or are `flexible:’ considerable mechanism is required both in the language definition, and in compilers, to accommodate flexible arrays (and not all compilers fully implement them.) Original Pascal had only fixed-sized arrays and strings, and this proved confining [Kernighan 81].
C arrays are inherently more powerful. Adding bounds to them restricts what the programmer can use them for. Such restrictions may be useful for programmers, but necessarily are also limiting.
3
Back in the day when C was created, and extra 4 bytes of space for every string no matter how short would have been quite a waste!
There’s another issue – remember that C is not object-oriented, so if you do length-prefix all strings, it would have to be defined as a compiler intrinsic type, not a char*
. If it was a special type, then you would not be able to compare a string to a constant string, i.e.:
String x = "hello";
if (strcmp(x, "hello") == 0)
exit;
would have to have special compiler details to either convert that static string to a String, or have different string functions to take account of the length prefix.
I think ultimately though, they just didn’t choose the length-prefix way unlike say Pascal.
19
In C, any contiguous subset of an array is also an array and can be operated on as such. This applies both to read and write operations. This property would not hold if the size was stored explicitly.
6
The biggest problem with having arrays tagged with their length is not so much the space required to store that length, nor the question of how it should be stored (using one extra byte for short arrays generally wouldn’t be objectionable, nor would using four extra bytes for long arrays, but using four bytes even for short arrays might be). A much bigger problem is that given code like:
void ClearTwoElements(int *ptr)
{
ptr[-2] = 0;
ptr[2] = 0;
}
void blah(void)
{
static int foo[10] = {1,2,3,4,5,6,7,8,9,10};
ClearTwoElements(foo+2);
ClearTwoElements(foo+7);
ClearTwoElements(foo+1);
ClearTwoElements(foo+8);
}
the only way that code would be able to accept the first call to ClearTwoElements
but reject the second would be for the ClearTwoElements
method to receive information sufficient to know that in each case it was receiving a reference to part of the array foo
in addition to knowing which part. That would typically double the cost of passing pointer parameters. Further, if each array was preceded by a pointer to an address just past the end (the most efficient format for validation), optimized code for ClearTwoElements
would likely become something like:
void ClearTwoElements(int *ptr)
{
int* array_end = ARRAY_END(ptr);
if ((array_end - ARRAY_BASE(ptr)) < 10 ||
(ARRAY_BASE(ptr)+4) <= ADDRESS(ptr) ||
(array_end - 4) < ADDRESS(ptr)))
trap();
*(ADDRESS(ptr) - 4) = 0;
*(ADDRESS(ptr) + 4) = 0;
}
Note that a method caller could, in general, perfectly legitimately pass a pointer to the start of the array or the last element to a method; only if the method tries to access elements which go outside passed-in array would such pointers cause any trouble. Consequently, a called method would have to first ensure the array was large enough that the pointer arithmetic to validate its arguments won’t itself go out of bounds, and then do some pointer calculations to validate the arguments. The time spent in such validation would likely exceed the cost spent doing any real work. Further, the method could likely be more efficient if it was written and called:
void ClearTwoElements(int arr[], int index)
{
arr[index-2] = 0;
arr[index+2] = 0;
}
void blah(void)
{
static int foo[10] = {1,2,3,4,5,6,7,8,9,10};
ClearTwoElements(foo,2);
ClearTwoElements(foo,7);
ClearTwoElements(foo,1);
ClearTwoElements(foo,8);
}
The concept of a type which combines something to identify an object with something to identify a piece thereof is a good one. A C-style pointer is faster, however, if it’s not necessary to perform validation.
10
One of the fundemental differences between C and most other 3rd generation languages, and all more recent languages that I am aware of, is that C was not designed to make life easier or safer for the programmer. It was designed with the expectation that the programmer knew what they were doing and wanted to do exactly and only that. It does not do anything ‘behind the scenes’ so you do not get any surprises. Even compiler level optimisation is optional (unless you use a Microsoft compiler).
If a programmer wants to write bounds checking in their code, C makes it is simple enough to do it, but the programmer must choose to pay the corresponding price in terms of space, complexity and performance. Even though I haven’t used it in anger for many years, I still use it when teaching programming to get across the concept of constraint based decision making. Basically, that means you can choose to do anything you want, but every decision you make has a price that you need to be aware of. This becomes even more important when you starting telling others what you want their programs to do.
2
Short answer:
Because C is a low-level programming language, it expects you to take care of these issues yourself, but this adds greater flexibility in exactly how you implement it.
C has a compile-time concept of an array that is initialised with a length but at runtime the whole thing is simply stored as a single pointer to the start of the data. If you want to pass the array length to a function along with the array, you do it yourself:
retval = my_func(my_array, my_array_length);
Or you could use a struct with a pointer and length, or any other solution.
A higher level language would do this for you as part of its array type. In C you’re given the responsibility of doing this yourself, but also the flexibility to choose how to do it. And if all the code you’re writing already knows the length of the array, you don’t need to pass the length around as a variable at all.
The obvious drawback is that with no inherent bounds checking on arrays passed around as pointers you can create some dangerous code but that is the nature of low level/systems languages and the trade-off they give.
4
The problem of the extra storage is an issue, but in my opinion a minor one. After all, most of the time you are going to need to track the length anyway, although amon made a good point that it can often be tracked statically.
A bigger problem is where to store the length and how long to make it. There isn’t one place that works in all situations. You might say just store the length in the memory just before the data. What if the array isn’t pointing to memory, but something like a UART buffer?
Leaving the length out allows the programmer to create his own abstractions for the appropriate situation, and there are plenty of ready made libraries available for the general purpose case. The real question is why aren’t those abstractions being used in security-sensitive applications?
2
From The Development of the C Language:
Structures, it seemed, should map in an intuitive way onto memory in the machine, but in a structure containing an array, there was no good place to stash the pointer containing the base of the array, nor any convenient way to arrange that it be initialized. For example, the directory entries of early Unix systems might be described in C as
<code>struct {int inumber;char name[14];};</code><code>struct { int inumber; char name[14]; }; </code>struct { int inumber; char name[14]; };
I wanted the structure not merely to characterize an abstract object but also to describe a collection of bits that might be read from a directory. Where could the compiler hide the pointer to
name
that the semantics demanded? Even if structures were thought of more abstractly, and the space for pointers could be hidden somehow, how could I handle the technical problem of properly initializing these pointers when allocating a complicated object, perhaps one that specified structures containing arrays containing structures to arbitrary depth?The solution constituted the crucial jump in the evolutionary chain between typeless BCPL and typed C. It eliminated the materialization of the pointer in storage, and instead caused the creation of the pointer when the array name is mentioned in an expression. The rule, which survives in today’s C, is that values of array type are converted, when they appear in expressions, into pointers to the first of the objects making up the array.
That passage addresses why array expressions decay to pointers in most circumstances, but the same reasoning applies to why the array length isn’t stored with the array itself; if you want a one-to-one mapping between the type definition and its representation in memory (as Ritchie did), then there’s no good place to store that metadata.
Also, think about multidimensional arrays; where would you store the length metadata for each dimension such that you could still walk through the array with something like
T *p = &a[0][0];
for ( size_t i = 0; i < rows; i++ )
for ( size_t j = 0; j < cols; j++ )
do_something_with( *p++ );
The question assumes that there are arrays in C. There aren’t. Things that are called arrays are just a syntactic sugar for operations on continuous sequences of data and pointer arithmetics.
The following code copies some data from src to dst in int-sized chunks not knowing that it is actually character string.
char src[] = "Hello, world";
char dst[1024];
int *my_array = src; /* What? Compiler warning, but the code is valid. */
int *other_array = dst;
int i;
for (i = 0; i <= sizeof(src)/sizeof(int); i++)
other_array[i] = my_array[i]; /* Oh well, we've copied some extra bytes */
printf("%sn", dst);
Why C is so simplified it doesn’t have proper arrays? I don’t know correct answer to this new question. But some people often say that C is just (somewhat) more readable and portable assembler.
15