Over the last few decades, the C Standards process has addressed both optimization and pointer type safety largely through a concept called “undefined behavior”. The idea is that instead of giving positive rules for what compilers and programmers can do, the Standard identifies some program behavior as something compilers can interpret however they want, with no guidance at all. These are not syntax errors that must cause the compiler to reject code, or even, necessarily, things the compiler must warn about; they are instances of behavior that are simply outside the standard. It is not even clear whether “undefined behavior” is erroneous at all, or even avoidable. I think this whole approach is due to a misreading of the C89 standard, but more importantly, it’s sloppy, makes it hard for programmers to have an intuition about the language, and leads to paradoxical results or nasty surprises. For example, advancing a pointer past the bounds of the object it was supposed to point to, and then using it, has been sort-of-kind-of designated as “undefined behavior”. Here’s an example: the output of a simple C program compiled and “optimized” by gcc 11:

pointer1 = 0x404039 pointer2=0x404039 they are not equal

They look equal, don’t they? Clang 12, on the same code, uses normal arithmetic to compare the numbers:

pointer1 = 0x404039 pointer2=0x404039 they are equal

An earlier version of gcc, gcc 4.8.3, also knew how to compare numbers:

pointer1 = 0x404039 pointer2=0x404039 they are equal

And gcc 11 again, but this time with the optimizer turned off (-O0):

pointer1 = 0x40403a pointer2=0x40403a they are equal

Here’s the code (see NOTE1 in the code):

#include <stdint.h>
#include <stdio.h>

typedef char T;
T p[1], q[1] = {0};
int main(){
        T *ip = (p-1); // NOTE1: change +1 to -1 if gcc -O0
        T *iq = q;
        printf("pointer1 = %p pointer2=%p ", ip, iq);
        if (iq == ip) {
                printf("they are equal\n");
        } else {
                printf("they are not equal\n");
        }
}

The explanation is that gcc notices that ip has been set to a fishy address:

:6:8: warning: array subscript -1 is outside array bounds of 'T[1]' {aka 'char[1]'} [-Warray-bounds]
6 | T *ip = (p-1);//add if gcc -O0
|  ^~

and gcc 11 helpfully “optimizes” on the basis of “ip” holding an “invalid” address, perhaps reasoning that the two pointers “cannot” be equal. Under the prevailing interpretation of the C Standard, compilers can assume “undefined behavior” cannot happen, even as they generate code that makes it happen. At least in this case there is a warning, but often these “optimizations” are silent. The rationale is that programmers should not be messing about looking at the order of variables in memory (even though that’s not unusual in C) and that compilers should be able to optimize under the assumption that pointer arithmetic never steers a pointer between objects. To be fair, most of the time that a pointer goes beyond the boundary of an array, it is an error.

Here’s the link on Compiler Explorer. The example is modified from one given in a post by Ralf Jung, who attributes it to Chung-Kil Hur (I’m sure both disagree with my reaction).

If C were a new programming language, you might decide that the %p output should provide the address and some additional data:

pointer1 = 0x404039 (int) pointer2=0x404039 (int, invalid)  they are not equal

That would make some sense. But in this case, you’d also have to think through all the implications of this design decision, which nobody has done (see below). For example, since the 1990s, the C Standard has made it impossible to write memory allocators in conforming C. Suppose we have a network application that is performance-constrained by “malloc” and we want to put a caching allocator for small and jumbo packets on top of malloc:

small_packet_t *newsmall(void);
large_packet_t *newjumbo(void);
void freepacket(void *);

The idea is that when either allocator has too little memory, it calls malloc (or something else) to get a big chunk of memory, puts it into a list of free storage, and slices it up when it is asked to allocate. Maybe something like:

free_block_t *p;
....
p = malloc(K);
p->size = K;
p->next = Head;
Head = p;

But according to the sloppy type rules in the Standard, after we write through a “free_block_t” pointer, that block now has an “effective type” of “free_block_t”, and if we allocate part of it to a caller that then writes to it as a small_packet_t, the behavior is undefined and all bets are off, because we have tried to store a value to an object with a different effective type. Even worse, imagine that “freepacket” tries to merge freed packets into blocks again! Suppose 4 small packets can be freed and then consolidated and allocated as a single jumbo packet. Nope. No can do. Once storage has an effective type, it is frozen to that type. How does malloc/free work then? Oh, easy: it is special-cased in the standard. Similarly, there is a special case for char pointers so that memcpy can be implemented, even if not very efficiently.

To get other allocators to work, C programmers rely on “opaque” interfaces. If the allocators are in a separate file, and we do not have link-time optimizers, then probably the compilers won’t be able to learn that the type rules are being violated. But as link-time optimization (LTO) comes in, the whole thing may silently be “optimized” to fail. One of the other ridiculous properties of the undefined behavior rules is that they are unstable, or ephemeral: they depend on the current ability of the compiler to see through opaque interfaces or otherwise figure out about some particular instance of undefined behavior. That is, the rule is something like: “this will work as expected until the compiler figures out that it is undefined behavior”. As you can see above, gcc learned to recognize the undefined behavior somewhere between versions 4.8 and 11, and still can’t figure it out at lower levels of optimization.
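To make the conflict concrete, here is a minimal sketch of the reuse that the effective type rules object to. The type and function names (free_block_t, small_packet_t, Head, refill, newsmall) are illustrative stand-ins of mine, not from any real allocator:

#include <stddef.h>
#include <stdlib.h>

typedef struct free_block {
        struct free_block *next;
        size_t size;
} free_block_t;

typedef struct { char payload[64]; } small_packet_t;

static free_block_t *Head;

void refill(size_t K){
        free_block_t *p = malloc(K); /* freshly allocated storage, no declared type */
        if (p == NULL) return;
        p->size = K;                 /* these stores give the storage an            */
        p->next = Head;              /* effective type of free_block_t              */
        Head = p;
}

small_packet_t *newsmall(void){
        free_block_t *b = Head;      /* assume refill() has already run */
        Head = b->next;
        /* Handing the same bytes out as a small_packet_t: when the caller
           writes through this pointer, it is touching storage whose effective
           type is free_block_t, exactly the conflict described above. */
        return (small_packet_t *)b;
}

Nothing about this code is exotic; it is roughly what any pool allocator does, and yet, under the effective type rules, it is not conforming C.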

Here’s another “optimization”, from Clang-C:

#include <stdlib.h>

int are_we_out_of_memory(void) {
        int *y = malloc(2);
        return y == (int*)NULL;
}

Under -O3 optimization, this always returns 0 in Clang-C. Why? Perhaps Clang is reasoning that the pointer becomes invalid when the function returns, because the storage, if any, returned by malloc is now inaccessible, so the comparison is invalid. Who knows? Optimization is easier if we don’t have to care about correctness.

Here’s a question I asked some compiler and standards developers: can the compiler assume “b” is false and delete the whole branch?

#include <unistd.h>

int getdata(int b, char *buf, int n){
        int *p; // forgot to initialize
        if(b){
                read(0, buf, n);
                *p = 0;
        }
        return 1;
}

Since p is uninitialized, the compiler may decide that the assignment, which is “undefined behavior”, can’t happen, so the whole branch can be assumed away; this reasoning would apply no matter what else we do inside the branch. So code that calls “getdata(1, buf, 1000)” over and over will get nothing done, and might even itself be “optimized” away. Imagine:

char *buf = malloc(1000);
getdata(1, buf, 1000);
if(buf[23] == 'x'){ .... }


On the basis of the first “optimization”, the compiler may see that the “if” depends on reading uninitialized memory and delete that case too. At a low level of optimization, the compiler may generate those reads, then silently delete them at a higher level of analysis; different C compilers can make different choices. Effectively, programmers have to guess.
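Under that chain of reasoning, getdata behaves as if it had been written like this. To be clear, this is a hypothetical reduction for illustration; I am not claiming any particular compiler emits exactly this:

int getdata(int b, char *buf, int n){
        /* the branch containing the write through the uninitialized pointer
           is assumed unreachable, so the read() and the store disappear */
        (void)b; (void)buf; (void)n;
        return 1;
}

And once getdata has been reduced to that, the caller’s test of buf[23] inspects memory that was never written, so it is a candidate for the same treatment.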

Here is a hypothetical case:

char *b = malloc(m); /* buffer of m bytes */
char good;
n = read(fd, b, m);
good = b[m-1]; // undefined if n < m
if(n == m) dosomething(good);
else badread(good); // can we delete this case?

Right now, code like this compiles as expected. But that’s only because compilers do not incorporate knowledge about “read”. Suppose the compiler understands that if n != m, then b[m-1] is uninitialized, and deletes the badread case, because otherwise “good” would be the result of reading uninitialized memory, which “can’t happen”. So this code may work, but only contingent on compilers not figuring out what “read” does.
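For what it’s worth, the only way to stay clearly inside the rules here is to check the length before touching b[m-1]. Here is a minimal sketch; dosomething and badread are the hypothetical handlers from the fragment above, with badread taking no argument since there is no valid byte to hand it:

#include <stdlib.h>
#include <unistd.h>

void dosomething(char c);
void badread(void);

void handle_input(int fd, size_t m){
        char *b = malloc(m);
        if (b == NULL) return;
        ssize_t n = read(fd, b, m);
        if (n == (ssize_t)m)
                dosomething(b[m-1]); /* b[m-1] was written by read() */
        else
                badread();           /* never look at bytes read() did not fill */
        free(b);
}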

There is, right now, an effort in the C Standards process to incorporate pointer “provenance”, so that compilers are permitted to consider pointers as pairs (provenance, abstract address); there is an 80-page proposal that appears to have a lot of support. You can read some of the ongoing discussion about this proposal in the message archive. One of the interesting things about this proposal, in its current state, is that it doesn’t provide any solution to the malloc issue, though it suggests some solution should be found. Similarly, it doesn’t discuss how mmap can work with the proposal. The proposal text is also ambivalent about how == should work for pointers. Why? Because it’s impossible for compilers to always correctly figure out the provenance of a pointer; nevertheless, the proposers want compilers to enforce provenance rules when they can. Programmers are supposed to just figure it out.
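To see why == is awkward under a provenance model, go back to a variant of the opening example: two pointers can hold the same numeric address while carrying different provenance, and the compiler cannot always tell which is which. A small sketch; nothing here reflects decided semantics, it only restates the question:

#include <stdio.h>

int x[1], y[1];

int main(void){
        int *px = x + 1; /* one past the end of x: legal to form */
        int *py = y;     /* start of y                           */

        /* If the linker happens to place y immediately after x, px and py
           hold the same address but different provenance (x vs. y). Should
           == compare addresses, or (provenance, address) pairs? The compiler
           cannot always know the provenance, which is why the proposal is
           ambivalent. */
        printf("px %s py\n", (px == py) ? "==" : "!=");
        return 0;
}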


See also the purpose of C.

