There used to be very good reasons for keeping instruction / register names short. Those reasons no longer apply, but short cryptic names are still very common in low-level programming.
Why is this? Is it just because old habits are hard to break, or are there better reasons?
For example:
- Atmel ATMEGA32U2 (2010?): `TIFR1` (instead of `TimerCounter1InterruptFlag`), `ICR1H` (instead of `InputCapture1High`), `DDRB` (instead of `DataDirectionPortB`), etc.
- .NET CLR instruction set (2002): `bge.s` (instead of `branch-if-greater-or-equal.short`), etc.
Aren’t the longer, non-cryptic names easier to work with?
When answering and voting, please consider the following. Many of the possible explanations suggested here apply equally to high-level programming, and yet the consensus, by and large, is to use non-cryptic names consisting of a word or two (commonly understood acronyms excluded).
Also, if your main argument is about physical space on a paper diagram, please consider that this absolutely does not apply to assembly language or CIL, plus I would appreciate if you show me a diagram where terse names fit but readable ones make the diagram worse. From personal experience at a fabless semiconductor company, readable names fit just fine, and result in more readable diagrams.
What is the core thing that is different about low-level programming as opposed to high-level languages that makes the terse cryptic names desirable in low-level but not high-level programming?
11
The software uses those names because the datasheets use them. Code at that level is very difficult to understand without the datasheet anyway, so inventing variable names that you can’t search for in the datasheet is extremely unhelpful.
That brings up the question of why datasheets use short names. That’s probably because you often need to present the names in tables like this where you don’t have room for 25-character identifiers:
Also, things like schematics, pin diagrams, and PCB silkscreens often are very cramped for space.
10
Zipf’s Law
You yourself can observe by looking at this very text that word length and frequency of usage are, in general, inversely related. Words that are used very frequently, like `it`, `a`, `but`, `you`, and `and`, are very short, while words that are used less often, like `observe`, `comprehension`, and `verbosity`, are longer. This observed relationship between frequency and length is usually called Zipf’s law of abbreviation.
The instruction set for a given microprocessor usually contains dozens or hundreds of instructions. For example, the Atmel AVR instruction set appears to contain about a hundred distinct instructions (I didn’t count), but many of those are variations on a common theme and have very similar mnemonics. For example, the multiplication instructions include MUL, MULS, MULSU, FMUL, FMULS, and FMULSU. You don’t have to look at the list of instructions for very long before you get the general idea that instructions that start with “BR” are branches, instructions that start with “LD” are loads, etc. The same applies to variables: even complex processors provide only a limited number of places to store values, such as condition registers and general purpose registers.
Because there are so few instructions, and because long names take longer to read, it makes sense to give them short names. By contrast, higher level languages allow programmers to create a huge number of functions, methods, classes, variables, and so on. Each of these will be used far less frequently than most assembly instructions, and longer, more descriptive names are increasingly important to give readers (and writers) enough information to understand what they are and what they do.
Additionally, instruction sets for different processors often use similar names for similar operations. Most instruction sets include operations for ADD, MUL, SUB, LD, ST, BR, NOP, and if they don’t use those exact names they usually use names that are very close. Once you’ve learned the mnemonics for one instruction set, it doesn’t take long to adapt to the instruction sets for other devices. So names that might seem “cryptic” to you are about as familiar as words like `and`, `or`, and `not` to programmers who are skilled in the art of low level programming. I think that most people who work at the assembly level would tell you that learning to read the code is not one of the greater challenges in low level programming.
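The length/frequency relationship described above can be checked on any text with a few lines of Python. This is only a minimal sketch: the function name `mean_length_by_frequency` and the sample sentence are made up for illustration, and a sample this short is far too small to prove anything.

```python
from collections import Counter
import re


def mean_length_by_frequency(text):
    """Map each occurrence count to the mean length of the words used that often."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    lengths = {}
    for word, n in counts.items():
        lengths.setdefault(n, []).append(len(word))
    # Most frequently used group first.
    return {n: sum(ls) / len(ls) for n, ls in sorted(lengths.items(), reverse=True)}


if __name__ == "__main__":
    sample = (
        "Words that are used very frequently are very short, while words "
        "that are used less often are longer."
    )
    for n, mean_len in mean_length_by_frequency(sample).items():
        print(f"used {n}x: mean length {mean_len:.1f}")
```

Running the same function over a disassembly listing instead of prose (one mnemonic per "word") would be one way to see how heavily a handful of short mnemonics dominate.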
2
In general
Quality of naming is not just about having descriptive names. It also has to consider other aspects, and that leads to recommendations like:
- the more global the scope, the more descriptive the name should be
- the more often it is used, the shorter the name should be
- the same name should be used in all contexts for the same thing
- different things should have different names even if the context is different
- variations should be easily detected
- …
Note that these recommendations conflict with one another.
Instruction mnemonics
As an assembly language programmer, seeing `short-branch-if-greater-or-equal` for `bge.s` gives me the same impression as seeing, as an Algol programmer doing computational geometry, `SUBTRACT THE-HORIZONTAL-COORDINATE-OF-THE-FIRST-POINT FROM THE-HORIZONTAL-COORDINATE-OF-THE-SECOND-POINT GIVING THE-DIFFERENCES-OF-THE-COORDINATE-OF-THE-TWO-POINTS` instead of `dx := p2.x - p1.x`. I just can’t agree that the longer forms are more readable in the contexts I care about.
Register names
You pick the official name from the documentation. The documentation picks the name from the design. The design uses a lot of graphical formats where long names aren’t adequate, and the design team will live with those names for months, if not years. For both reasons, they won’t use “Interrupt flag of the first timer counter”; they will abbreviate it in their schematics as well as when they speak. They know it, and they use systematic abbreviations like `TIFR1` so that there is less chance of confusion. One point here is that `TIFR1` isn’t a random abbreviation; it is the result of a naming scheme.
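To illustrate what "the result of a naming scheme" means, here is a rough sketch that expands a few AVR-style register names by pattern. The pattern table is a small hand-picked subset, the expansions follow the datasheet wording as I remember it, and `expand` is a hypothetical helper, not part of any vendor tool.

```python
import re

# Hand-picked subset of AVR-style register name patterns (illustrative only).
PATTERNS = [
    (r"TIFR(\d)", "Timer/Counter{0} Interrupt Flag Register"),
    (r"TIMSK(\d)", "Timer/Counter{0} Interrupt Mask Register"),
    (r"ICR(\d)H", "Input Capture Register {0} High Byte"),
    (r"ICR(\d)L", "Input Capture Register {0} Low Byte"),
    (r"DDR([A-Z])", "Port {0} Data Direction Register"),
    (r"PORT([A-Z])", "Port {0} Data Register"),
]


def expand(name):
    """Return the long form of a register name, or None if no pattern matches."""
    for pattern, template in PATTERNS:
        match = re.fullmatch(pattern, name)
        if match:
            return template.format(*match.groups())
    return None


if __name__ == "__main__":
    for register in ("TIFR1", "ICR1H", "DDRB", "XYZZY1"):
        print(register, "->", expand(register))
```

Because the scheme is regular, someone who has learned `TIFR1` can guess what `TIFR0` or `TIMSK1` means without looking it up, which is harder with ad hoc abbreviations.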
4
Apart from the “old habits” reasons, legacy code that was written 30 years ago and is still in use is very common. Despite what some less experienced people think, refactoring these systems so they look pretty comes at a very high cost for a small gain and is not commercially viable.
Embedded systems that are close to the hardware and access registers tend to use the same or similar labels to those used in the hardware data sheets, for very good reasons. If the register is called XYZZY1 in the hardware data sheets, it makes sense that the variable representing it is likely XYZZY1, or if the programmer was having a good day, RegXYZZY1.
As for `bge.s`, it is similar to assembler: to the few people who need to know it, longer names are less readable. If you cannot get your head around `bge.s` and think `branch-if-greater-or-equal.short` will make a difference, you are merely playing with the CLR and do not know it.
The other reason that you will see short variable names is the widespread use of abbreviations within the domain the software is targeting.
In summary: short abbreviated variable names that reflect an external influence, such as industry norms and hardware data sheets, are expected. Short abbreviated variable names that are internal to the software are normally less desirable.
3
There are so many different ideas here. I can’t accept any of the existing answers as the answer: firstly, there are likely many factors contributing to this, and secondly, I can’t possibly know which one is the most significant one.
So here’s a summary of answers posted by others here. I’m posting this as CW and my intention is to eventually mark it accepted. Please edit if I missed something out. I tried to rephrase each idea to express it concisely yet clearly.
So why are cryptic short identifiers so common in low-level programming?
- Because many of them are common enough in the respective domain to warrant a very short name. This worsens the learning curve, but is a worthwhile tradeoff given the frequency of use.
- Because there is usually a small set of possibilities that is fixed (the programmer can’t add to the set).
- Because readability is a matter of habit and practice. `branch-if-greater-than-or-equal.short` is initially more readable than `bge.s`, but with some practice the situation becomes reversed.
- Because they often have to be typed out in full, by hand, because low-level languages often don’t come with powerful IDEs that have good autocompletion, or autocompletion is not reliable.
- Because it’s sometimes desirable to pack a lot of information into the identifier, and a readable name would be unacceptably long even by high-level standards.
- Because that’s what low-level environments have looked like historically. Breaking habit requires conscious effort, runs the risk of annoying those who liked the old ways, and must be justified as worthwhile. Sticking with the established way is the “default”.
- Because many of them originate elsewhere, such as schematics and datasheets. Those, in turn, are affected by space constraints.
- Because the people in charge of naming things have never even considered readability, or don’t realize they are creating a problem, or are lazy.
- Because in some cases the names have become part of a protocol for interchange of data, such as the use of assembly language as an intermediate representation by some compilers.
- Because this style is instantly recognizable as low-level and thus looks cool to geeks.
I personally feel that some of these do not actually contribute to the reasons why a newly developed system would choose this naming style, but I felt it would be wrong to filter some ideas out in this type of answer.
I’m going to toss my hat into this mess.
High level coding conventions and standards are not the same as low level coding standards and practices. Unfortunately, most of those are holdovers from legacy code and old thought processes.
Some, however, do serve a purpose. Sure, BranchGreaterThan would be much more readable than BGT, but there is a convention now: it is an instruction, and as such it has gained traction over the last 30 years of use as a standard. Why did they start with it? Probably some arbitrary character width limit for instructions, variables and such. Why do they keep it? It’s a standard. This is the same as using int as an identifier: it would be more legible to use Integer in all cases, but is that necessary for anyone who has been programming for more than a few weeks? No. Why? Because it’s a standard practice.
Second, as I said in my comment, many of the interrupts are named INTG1 and other cryptic names, and these serve a purpose as well. In circuit diagrams it is NOT good convention to name your lines and such verbosely, because it clutters the diagram and hurts legibility. All verbosity is handled in the documentation. And since all of the wiring/circuit diagrams use these short names for interrupt lines, the interrupts themselves get the same names, to keep things consistent for the embedded designer from the circuit diagram all the way up to the code that programs it.
A designer has some control over this, but as in any field or new language, there are conventions that carry over from hardware to hardware, and as such they should stay similar across assembly languages. I can look at a snippet of assembly and get the gist of the code without ever having used that instruction set, because it sticks to a convention: LDA or something related to it is probably loading a register, and MV is probably moving something from somewhere to somewhere else. It isn’t about what you think is nice or what is a high-level practice. It’s a language unto itself, and as such it has its own standards that you as the designer should follow. These are often not nearly as arbitrary as they seem.
I’ll leave you with this: asking the embedded community to use verbose high-level practices is like asking chemists to always write out the names of chemical compounds in full. The chemist writes them short for themselves and anyone else in the field will understand it, but it may take a newcomer a little time to adjust.
3
One reason they use cryptic short identifiers is that the identifiers are not cryptic to the developers. You have to realize that they work with them every day, and those names are really domain names. So they know by heart exactly what TIFR1 means.
If a new developer joins the team, they’ll have to read the datasheets (as explained by @KarlBielefeldt), so they’ll get comfortable with those names.
I believe your question used a bad example, because in that kind of source code you usually do see a lot of unnecessarily cryptic identifiers for non-domain stuff.
I’d say they mostly do that because of bad habits formed when the tools did not auto-complete everything you type.
Summary
Initialism is a pervasive phenomenon in many technical and non-technical circles. As such it is not limited to low-level programming. For the general discussion, see the Wikipedia article on Acronym. My answer is specific to low-level programming.
Causes of cryptic names:
- Low-level instructions are strongly-typed
- Need to pack a lot of type information into the name of a low-level instruction
- Historically, single-character codes are favored for packing the type information.
Solutions and their drawbacks:
- There are modern low-level naming schemes that are more consistent than historical ones, for example LLVM.
- However, the need to pack a lot of type information still exists.
- Thus, cryptic abbreviations can still be found everywhere.
- Improved line-to-line readability will help a novice low-level programmer pick up the language faster, but will not help with comprehending large pieces of low-level code.
Full answer
(A) Longer names are possible. For example, the names of C++ SSE2
intrinsics average 12 characters compared to the 7 characters
in the assembly mnemonic.
http://msdn.microsoft.com/en-us/library/c8c5hx3b(v=vs.80).aspx
(B) The question then moves on to: How long / non-cryptic does one need
to get from low-level instructions?
(C) Now we analyze the composition of such naming schemes. The following are two naming schemes for the same low-level instruction:
- Naming scheme #1: `CVTSI2SD`
- Naming scheme #2: `__m128d _mm_cvtsi32_sd (__m128d a, int b);`
(C.1) Low-level instructions are always strongly typed. There cannot be
ambiguity, type inference, automatic type conversion, or
overloading (reuse of instruction name to mean similar but non-equivalent operations).
(C.2) Each low-level instruction must encode a lot of type information
into its name. Examples of information:
- Architecture Family
- Operation
- Arguments (Inputs) and Outputs
- Types (Signed Integer, Unsigned Integer, Float)
- Precision (Bit Width)
(C.3) If each piece of information is spelled out, the program will be
more verbose.
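As a rough illustration of (C.2) and (C.3), the sketch below splits `CVTSI2SD` into its packed fields and spells each one out. The field meanings are my reading of the mnemonic, the table only covers this one example, and `spell_out` is a hypothetical helper.

```python
# Fields packed into CVTSI2SD, in order (covers only this one mnemonic).
FIELDS = [
    ("CVT", "convert"),
    ("SI", "signed integer"),
    ("2", "to"),
    ("SD", "scalar double-precision float"),
]


def spell_out(mnemonic):
    """Expand a mnemonic field by field into a hyphenated long name."""
    words = []
    rest = mnemonic
    for code, meaning in FIELDS:
        if not rest.startswith(code):
            raise ValueError(f"unexpected field in {mnemonic!r} at {rest!r}")
        words.append(meaning.replace(" ", "-"))
        rest = rest[len(code):]
    if rest:
        raise ValueError(f"trailing characters {rest!r} in {mnemonic!r}")
    return "-".join(words)


if __name__ == "__main__":
    long_name = spell_out("CVTSI2SD")
    print(long_name, f"({len(long_name)} characters vs {len('CVTSI2SD')})")
```

The spelled-out form carries exactly the same information as the mnemonic; it just takes several times as many characters to say it.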
(C.4) The type-encoding schemes used by various vendors have long historical roots. As an example, in the x86 instruction set:
- B means byte (8-bit)
- W means word (16-bit)
- D means dword “double-word” (32-bit)
- Q means qword “quad-word” (64-bit)
- DQ means dqword “double-quad-word” (128-bit)
These historical references have no modern meaning whatsoever, but they still stick around. A more consistent scheme would have put the bit-width value (8, 16, 32, 64, 128) into the name.
By contrast, LLVM is a step in the right direction toward consistency in low-level instructions: http://llvm.org/docs/LangRef.html#functions
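The suffix table in (C.4) and the "put the bit width in the name" idea can be sketched as follows. The `PADD8`-style names are hypothetical and exist in no real assembler; `PADDB`, `PADDW`, `PADDD` and `PADDQ` are real x86 packed-add mnemonics, and the `i8` to `i128` spellings follow LLVM's integer type notation.

```python
# x86 width suffixes from the list above, mapped to their size in bits.
WIDTH_BITS = {"B": 8, "W": 16, "D": 32, "Q": 64, "DQ": 128}


def explicit_width_name(base, suffix):
    """Hypothetical renaming: replace the letter suffix with the bit width."""
    return f"{base}{WIDTH_BITS[suffix]}"


def llvm_integer_type(suffix):
    """Spell the same width the way LLVM IR names integer types (i8, i16, ...)."""
    return f"i{WIDTH_BITS[suffix]}"


if __name__ == "__main__":
    for suffix in ("B", "W", "D", "Q"):
        print(f"PADD{suffix}", "->", explicit_width_name("PADD", suffix),
              "/", llvm_integer_type(suffix))
```

Either spelling still packs the width into the name, which is the answer's point: the encoding can be made more regular, but it cannot be dropped.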
(D) Regardless of instruction naming scheme, low-level programs are
already verbose and hard to understand because they focus on the
minute details of execution. Changing the instruction naming scheme
will improve readability on a line-to-line level, but will not remove
the difficulty of comprehending the operations of a large piece of
code.
2
Humans read and write assembly only occasionally; most of the time it is just a communication protocol. That is, it is most often used as an intermediate, serialised, text-based representation between compiler and assembler. The more verbose this representation is, the more unnecessary overhead there is in this protocol.
In the case of opcodes and register names, long names actually harm readability. Short mnemonics are better for a communication protocol (between compiler and assembler), and assembly language is a communication protocol most of the time. Short mnemonics are also better for programmers, since compiler code is easier to read.
15
Mostly it’s idiomatic. As @TMN says elsewhere, just as you don’t write `import JavaScriptObjectNotation` or `import HypertextTransferProtocolLibrary` in Python, you don’t write `Timer1LowerHalf = 0xFFFF` in C. It looks equally ridiculous in context. Everyone who needs to know already knows.
Resistance to change might arise, in part, from the fact that some C compiler vendors for embedded systems deviate from the language standard and syntax in order to implement features more useful to embedded programming. This means that you can’t always use the autocomplete feature of your favourite IDE or text editor when writing low level code, because these customisations defeat their ability to analyse code. Hence the utility of short register names, macros and constants.
For example, HiTech’s C compiler included a special syntax for variables that needed to have a user-specified position in memory. You might declare:
volatile char MAGIC_REGISTER @ 0x7FFFABCD;
Now the only IDE in existence that will parse this is HiTech’s own IDE (HiTide). In any other editor, you’ll have to type it out manually, from memory, every time. This gets old very quickly.
Then there’s also the fact that when you’re using development tools to inspect registers, you’ll often have a table displayed with several columns (register name, value in hex, value in binary, last value in hex, etc). Long names mean you have to expand the name column to 13 characters to see the difference between two registers, and play “spot the difference” across dozens of lines of repeated words.
These might sound like silly little quibbles, but isn’t every coding convention designed to reduce eye strain, decrease superfluous typing or address any one of a million other little complaints?
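The "spot the difference" point about register tables can be made concrete with a small sketch that computes how wide a name column must be before every pair of names becomes distinguishable. `TimerCounter1InterruptFlag` comes from the question; the other long names are invented counterparts for illustration, and `min_distinguishing_width` is a hypothetical helper.

```python
from itertools import combinations
import os.path


def min_distinguishing_width(names):
    """Smallest column width at which no two names look identical.

    Expects at least two distinct names.
    """
    longest_shared = max(
        len(os.path.commonprefix([a, b])) for a, b in combinations(names, 2)
    )
    return longest_shared + 1


if __name__ == "__main__":
    short_names = ["TIFR1", "TIMSK1", "TCNT1"]
    long_names = [
        "TimerCounter1InterruptFlag",
        "TimerCounter1InterruptMask",  # invented long form of TIMSK1
        "TimerCounter1Value",          # invented long form of TCNT1
    ]
    print("short:", min_distinguishing_width(short_names))
    print("long: ", min_distinguishing_width(long_names))
```

Descriptive names built from shared words put their distinguishing characters at the end, so a narrow column truncates exactly the part that matters.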
6
I’m surprised that no one has mentioned laziness and that other sciences are not discussed.
My daily work as a programmer shows me that naming conventions for any kind of variable in a program are influenced by three different aspects:
- The scientific background of the programmer.
- The programming skills of the programmer.
- The environment of the programmer.
I think there is little use in distinguishing between low-level and high-level programming here. In the end it always comes down to these three aspects.
An explanation of the first aspect:
Many “programmers” are not programmers in the first place. They are mathematicians, physicists, biologists or even psychologists or economists but many of them are not computer scientists. Most of them have their own domain specific keywords and abbreviations which you can see in their naming “conventions”. They are often trapped in their domain and use those known abbreviations without thinking of readability or coding guides.
An explanation of the second aspect:
Because most programmers are not computer scientists, their programming skills are limited. That’s why they often care less about coding conventions than about domain-specific conventions, as stated under the first aspect. Also, if you lack the skills of a programmer, you lack an understanding of coding conventions. I think most of them don’t see any urgent need to write understandable code. It’s like fire and forget.
An explanation of the third aspect:
You are unlikely to break with the conventions of your environment, which can be old code you have to support, the coding standards of your company (run by economists who don’t care about coding), or the domain you belong to. If someone started to use cryptic names and you have to support them or their code, you are unlikely to change the cryptic names. If there are no coding standards at your company, I bet almost every programmer will write to their own standard. And lastly, if you are surrounded by domain users, you will not start writing in a different language from the one they use.
2