The Kafkaesque interaction of the C standard and the main open source C compilers was concisely outlined by one of the main LLVM authors back in 2011:
“knowing that INT_MAX+1 is undefined allows optimizing X+1 > X to “true”. Knowing the multiplication “cannot” overflow (because doing so would be undefined) allows optimizing X*2/2 to X” (Chris Lattner)
The C standard defines signed integer types to be fixed-length chunks of binary data in either 1s or 2s complement format, so x+1 > x is not a theorem and neither is x*2/2 = x. Those are elementary mathematical facts – facts that make some kinds of optimizations harder. Instead of trying to work with that reality, C compiler developers have chosen to use a loophole in the C Standard to insist that they can transform code using false theorems. Both the default versions of the GCC and clang compilers that come with current Ubuntu Linux compile the following program so that it works differently depending on optimization level.
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int i, flag = 0;
    for (i = 0; i < i + 1; i++) {
        if (i == INT_MAX) { flag = 1; }
        if (flag > 0) { printf("%d |", i); flag++; }
        if (flag > 5) { putchar('\n'); exit(0); }
    }
    exit(0);
}
When compiled without optimization, there is no output. When compiled with optimization, the output is:
2147483647 |-2147483648 |-2147483647 |-2147483646 |-2147483645 |
Deducing that i < i + 1 is always true, the “optimizer” omits the test and then produces code that calculates it was wrong. The compiler is deducing that it won’t produce the executable code that the compiler produces! Nice. According to the C standard, correct code would have used unsigned integers for the comparison, because signed integer overflow is “undefined behavior”, even though using signed integers in for loops is ubiquitous in C programming. The curious thing here is that the compilers don’t warn, complain, or reject the signed comparison; instead they generate a paradoxical executable that refutes itself. The program could run correctly for years until someone turns on the optimizer and finds paradoxical results. Similar compiler innovations have produced all sorts of failures in working code, including security failures.

This stupid technology is constructed from the interaction of two reasonable projects that went wrong and then amplified each other’s difficulties. One project is the C standards committee tightening up C’s loose semantics. The other is an effort to import new optimization methods into the C compilers. The flaws in each project have resonated with each other. In this case, different processor architectures (or processing modes) will produce different behavior when signed words overflow, and C is designed to allow use of native arithmetic close to the machine level for utmost efficiency. But instead of finding ways to make the signed integer overflow alternatives precise, the standards committee tossed signed overflow into a special category of things labeled “undefined behavior”, which differs from “indeterminate” or unspecified behavior in that undefined behavior makes C code “non-conforming” and “this International Standard imposes no requirements” for undefined behavior. Then, in a ghastly miscalculation, compiler developers interpreted “imposes no requirements” as a license to produce incorrect code and call it optimization. The standard imposes no requirements, and it is convenient to assume there will not be overflows, so QED, we can conclude that FALSE is TRUE.

In fact, the position seems to be that if the compiler produces code that encrypts all your files, mails the password to Moldavia, and prints a cryptic message about end times, it’s all good. No requirements means no requirements, and you should have declared those variables unsigned. This is no exaggeration. Discussing the undefined behavior of shift operations, Lattner writes: “shifting by 32-bits on PowerPC could format your hard drive, it is not guaranteed to produce zero.”
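As an aside (my illustration, not part of the original post): because the obvious after-the-fact test x + 1 > x is exactly what the optimizer deletes, one standard-conforming way to detect signed overflow is to test the ranges before doing the arithmetic, roughly like this hypothetical helper:

#include <limits.h>

/* Hypothetical helper: add two ints without ever evaluating an
   expression that overflows. The range check happens before the
   addition, so no undefined behavior can occur. */
static int checked_add(int a, int b, int *overflow)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        *overflow = 1;
        return 0;
    }
    *overflow = 0;
    return a + b;
}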
The unsound logic of the compilers has drawn complaints for decades: from the most prominent author of cryptography code (DJB), former members of the C standards committee (Derek Jones), groups of researchers and academics (Regehr), and many C programmers (this one reads like a horror movie about an engineer trapped in a department of motor vehicles office). As Derek Jones observed, the main C compiler groups don’t sell software to application developers, so they don’t have to care. And the compiler developers don’t have anyone who can do what Linus Torvalds did for Linux so many times: politely and humbly reject innovations that break applications without compelling justification. Even worse, the standards committee and compiler developers split responsibility, so they can and do blame each other. The example cited by DJB is truly horrendous and is based on the standard’s edict that dereferencing a null pointer is undefined behavior (even though it’s widely used in embedded code and in some UNIX distributions).
- The code erroneously, but harmlessly in this case, dereferences a pointer before checking whether the pointer is null.
- The compiler considers two possibilities: either the pointer is not null, in which case the null check that follows is redundant, or the pointer is null in which case undefined behavior has been detected.
- The compiler “optimizes” by removing the null check (silently). Note that the deleted code is not itself “undefined behavior” code; it just follows the offending null pointer dereference, so all rules are off (the pattern is sketched in code after this list).
- A security breach is discovered as someone exploited the failure of the code to check for null pointers.
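A minimal sketch of the pattern (modeled loosely on the widely discussed Linux tun driver incident; the names and types here are illustrative, not the original code):

struct device { int flags; };

int get_flags(struct device *dev)
{
    int flags = dev->flags;   /* dereference before the check: undefined
                                 behavior if dev happens to be NULL */
    if (dev == NULL)          /* the optimizer concludes dev cannot be NULL,
                                 because it was already dereferenced... */
        return -1;            /* ...and silently deletes this branch */
    return flags;
}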
All programming languages have some places where they run into machine or architecture or resource dependencies. If you take the most over-the-top, mathematically pretentious, highest-level programming language and write code to compute factorial, a fairly small input will make the program run out of time or memory and produce some error. Because C is so close to the machine, it bumps into hardware a lot. Pointers are machine addresses (just numbers); shift operations are specified to compile down to single machine instructions; and arithmetic is so close to the hardware that number representation schemes are programmer visible. As another example, C structures may contain “padding” bytes added by the compiler to accommodate processor alignment requirements. A structure

struct x { char c; int i; };

could have 3 bytes of padding inserted after the 1-byte character so that the integer is on a 4-byte boundary. For efficiency, padding bytes are generally uninitialized – not set to any particular value – so copying the structure may involve copying uninitialized data. The C90 standard declared that reading uninitialized bytes was undefined behavior. The unintended consequence was that it became impossible to write the C library in conforming C, because copying a structure in the memcpy routine might involve copying (reading and writing) uninitialized data. The standards committee “solved” the problem with a dismally giant hack: making unsigned characters specially immune from undefined behavior. If you copy indeterminate bytes via unsigned character pointers, that’s ok, but otherwise it’s undefined behavior. Here’s Chris Lattner explaining a small part of the resulting mess:
It is undefined behavior to cast an int* to a float* and dereference it (accessing the “int” as if it were a “float”). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results. The rules for this are quite nuanced and I don’t want to go into the details here (there is an exception for char*, vectors have special properties, unions change things, etc). – Chris Lattner
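To make the quoted rules concrete, here is a small sketch (mine, not Lattner’s) of the unsanctioned and sanctioned ways to reinterpret the bytes of an int as a float. It assumes the common case where int and float are both 4 bytes and the machine uses IEEE-754:

#include <stdio.h>
#include <string.h>

int main(void)
{
    int   i = 0x3fc00000;      /* bit pattern of 1.5f on IEEE-754 machines */
    float f;

    /* f = *(float *)&i;          undefined behavior under the aliasing rules */

    memcpy(&f, &i, sizeof f);  /* the sanctioned spelling: copy the bytes */
    printf("%f\n", f);         /* typically prints 1.500000 */
    return 0;
}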
To anyone who has ever managed a programming project, the signs of a fatally broken design decision are unmistakable (a cascade of special cases is never good). Nuance is nice in poetry, not in engineering. But the C standards committee was not piling whole categories of C code into the undefined behavior basket for no reason. It was trying to deal with two different types of pressure. The first was that the definition of C had too many holes in it, so that C code was often far too machine or compiler dependent, or just sloppy. The second was that C is resistant to a number of useful types of optimization, and compiler developers pushed for changes that would allow them to assume things about code that make certain optimizations more widely applicable. The idea seems to have been that by ruling certain types of construction “non-conforming”, the standard would discourage their use and make the language more consistent and more suited to modern optimization methods. As an example, certain widely used crypto code apparently uses uninitialized variables to inject randomness into low-level encryption code, relying on the variable to be allocated from the leftover garbage on the stack. That code will fail dismally if stack allocation changes, if automatic variables are not allocated from the stack, or if previously executed code filled the stack with something predictable. So warning about that code, or even refusing to compile it, might improve applications. But because of the nice split in responsibilities, the standards body can rule such code non-conforming and look away as the compiler writers then perform arbitrary transformations on it.
And all of this is so unnecessary. As Jones points out, some sort of undefined behavior is intrinsic to C, but implementations are not required to treat it as an excuse for sabotaging programs:
A willingness to accept whatever behavior happens to occur, in an error situation, is the price that has to be paid for efficient execution on a wide range of disparate processors. The C/C++ designers were willing to pay this price while the Java designers were not, with the Ada designers willing to tolerate less variability than C/C++.
Undefined behavior need not be nasty behavior, an implementation could choose to generate a helpful message or try to recover from it.
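Both GCC and Clang can already behave this way when asked: -fwrapv makes signed overflow simply wrap around, and -fsanitize=undefined reports the problem at run time instead of silently “optimizing” around it. A minimal sketch (the exact message wording varies by compiler version):

/* Build with, e.g.:  cc -O2 -fsanitize=undefined overflow.c
   At run time the sanitizer prints something like
   "runtime error: signed integer overflow: 2147483647 + 1 cannot be
   represented in type 'int'" instead of silently assuming it away.
   Building with -fwrapv instead makes the addition wrap around. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int x = INT_MAX;
    printf("%d\n", x + 1);   /* signed overflow: undefined by the standard */
    return 0;
}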
Changes to the C standard could be used to make it more optimizable. C would definitely benefit from increased use of “restricted pointers” (the definition needs to be clarified, but it’s a great step forward), and the gcc “pure” modifier on function types is interesting and should be considered for the standard. The standard could promote methods of writing verifiable, optimizable code as opt-in. An additional easy step would be to move things like reading uninitialized data from “undefined” to “indeterminate”, and to make pointer casting well-defined whenever alignment and object size are compatible. The committee could also modify the definition of undefined behavior by specifically limiting the freedom of conforming implementations: replace “this International Standard imposes no requirements” with “conforming implementations may reject non-conforming constructs or implement them in a consistent, implementation-dependent manner”. C already provides machine-dependent information to the programmer. The file limits.h tells the programmer about the native word sizes and number ranges on the target computer. There is no reason why that cannot be extended to specify signed integer overflow behavior (and in 99.999% of cases that behavior is rollover). The language has intrinsic access to nonportable, machine-dependent and/or implementation-dependent information. Pretending otherwise only creates too much – well – nuance.

Underlying all of this, it’s easy to get the feeling that the C standards process and C compiler development have lost the plot, and that the objective has become a doomed effort to recreate C as a kind of low-grade, patched-up Ada or (forgive me) Rust. Consider this note on uninitialized data from Lattner:
Use of an uninitialized variable: This is commonly known as source of problems in C programs and there are many tools to catch these: from compiler warnings to static and dynamic analyzers. This improves performance by not requiring that all variables be zero initialized when they come into scope (as Java does). For most scalar variables, this would cause little overhead, but stack arrays and malloc’d memory would incur a memset of the storage, which could be quite costly, particularly since the storage is usually completely overwritten.
But the alternatives are not just mandatory initialization of every variable on the one hand, or letting the compiler treat any use of uninitialized memory as permission to arbitrarily rewrite code on the other. Where compilers and tools can warn about use of uninitialized variables, that’s good. Where they cannot, it should be up to the programmer. The intent of C is to give the programmer enormous flexibility. C developers are not supposed to need to escape the language and invoke pragmas or “unsafe” keywords (or call out to libraries written in some scorned low-level language). The Linux networking code (lib/checksum) below shows off one of the compelling things about C: the ability to work directly with memory and control the mapping from bit patterns to denotations. The code is computing a checksum for a network packet. Starting with the packet data structure, it first looks at it as a sequence of unsigned characters and then does a type coercion to map each group of 4 unsigned characters into an “unsigned int” so it can run through and compute the checksum.
do {
    unsigned int w = *(unsigned int *) buff;   /* reinterpret 4 bytes as a word */
    buff += 4;
    result += carry;
    result += w;
    carry = (w > result);                      /* did the addition wrap around? */
} while (buff < end);
result += carry;                               /* fold in the final carry */
result = (result & 0xffff) + (result >> 16);   /* fold the 32-bit sum toward 16 bits */
This type of code makes it easy to screw up, and it can complicate optimization for compilers and everything for static analyzers (changes in type, aliased pointers, demonic possession …), but it’s the essence of systems programming, and of a lot of application programming too. It’s what C was designed to do, and it is what gives C programming its flexible and fluid character for expert C programmers. Is this code legal C under the current C standard? It appears not. But if the standard requires the mother of all hacks just to make a conforming memcpy implementation possible (and not a real one either; it apparently forces implementations to escape into assembler), the standard is going to keep tripping up C. And if the implementations are going to claim that unsound deduction is a legitimate optimization, then C is really doomed.
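For comparison, here is a sketch (mine, not kernel code) of what the standard-blessed version of that inner load looks like: the word is assembled through memcpy, and the programmer then hopes the optimizer turns it back into a single load instruction:

#include <string.h>

/* Read a 32-bit word from the buffer without violating the aliasing rules. */
static unsigned int load_word(const unsigned char *buff)
{
    unsigned int w;
    memcpy(&w, buff, sizeof w);   /* good compilers usually lower this to one load */
    return w;
}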
Bonus – from the C89 standard.
A pointer to an object or incomplete type may be converted to a pointer to a different object type or a different incomplete type. The resulting pointer might not be valid if it is improperly aligned for the type pointed to. It is guaranteed, however, that a pointer to an object of a given alignment may be converted to a pointer to an object of the same alignment or a less strict alignment and back again; the result shall compare equal to the original pointer.
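A tiny illustration (mine, not from the standard) of the round-trip guarantee quoted above:

#include <assert.h>

int main(void)
{
    int x = 42;
    char *p = (char *)&x;   /* convert to a less strictly aligned pointer type */
    int  *q = (int *)p;     /* ...and back again */
    assert(q == &x);        /* guaranteed to compare equal to the original */
    return 0;
}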
rough notes for this are at https://www.yodaiken.com/2017/06/16/the-c-standard-committee-effort-to-kill-c-continues/
You’re free to use a language that does not have undefined behavior for these calculations. C just happens to be a language for which those calculations are undefined. BTW, it would take longer than the age of the universe for a compiler to detect whether code has any undefined behavior under all possible inputs.
So your theory is that people like DJB who have produced a large volume of code that is key to the operation of the Internet or the Linux dev team which has had to insist on flags to disable UB pessimizations lack elementary understanding of the purpose and meaning of C? Fascinating.