Linux Programming Internals: The Compilation Chain

Okay, this is the first technical post of this blog, so we'll start off with a thorough overview of the compilation chain in the C language using GNU tools.

Note: To run all the commands in this post, you'll need to have binutils and the GNU C compiler installed, or a compatible toolchain.

When most people say "compilation" they mean getting an executable file from a source file - and most of the time that's all we want to care about. There is, however, a lot more to it than that, especially with the GNU toolchain.

To illustrate, let's see the steps that occur when you compile a simple hello world program. Not that you need it, but just for reference:

$ cat > hello.c << 'EOF'

#include <stdio.h>

int main()
{
  puts("hello world");
}

EOF

Now, generating an executable straight from the source is fairly easy:

$ gcc hello.c -o hello
$ ./hello
hello world

The command gcc produces the executable hello (as specified by the -o hello option) from the source file hello.c. Surely there can't be much more to it? Well, there are four discrete steps that gcc uses to produce an executable file. They are:

Preprocessing the C source file using the cpp program,

Compiling the processed C source into assembler using the cc1 back-end,

Assembling the asm file into an object file using as, and finally

Linking the object file with other archives/libraries to produce an executable using the collect2 program, which for simple programs is essentially a front-end to ld.

From the above, it might seem that the gcc program doesn't actually do anything that could be described as "compiling" at all - and you'd be right. gcc itself simply acts as a front-end to the above four operations. And what with gcc being the flexible beast that it is, you can get it to stop at any of those stages if you want to.

Preprocessing

First, let's get gcc to show us our source code after it's been run through the C preprocessor cpp:

$ gcc -o hello.i hello.c -E

The -E option tells gcc to stop after it's finished running the preprocessor. Alternatively, you could have just run the cpp program directly, with the same options as above.

Take a look at hello.i - it's our original hello.c file, except all the preprocessor directives (i.e. everything that starts with a '#') like #include and #define have been resolved. Most of the code is from the #include <stdio.h> statement in our original file, since all this directive does is simply start reading from the specified file and put it into our source. If you want to see our contribution, you have to go right to the last few lines of the file.

This is, of course, incredibly helpful if you want to make sure your macros expand correctly, or if a definition you're sure should come from a certain header file seems to be missing - it could easily have been #undef'd in a file included from another file, or skipped because of some obscure #if condition that you can't tell is true or not.

Compilation Proper

By "compilation proper" I mean the translation from our source language (C) to our target language for this stage (assembly language). For those who aren't familiar with assembly language, also called assembler or asm, it's a very low-level language, only one step up from machine language. Each assembly language instruction corresponds directly to a single machine instruction, and deals directly with hardware registers, instruction pointers and so on. It also exposes the bare symbols in your program, as we'll see in a bit.

We'll take the preprocessed source and compile it to assembly by passing the -S option to gcc:

$ gcc -S hello.i

Now you'll have a file hello.s in the current directory, containing the generated assembler. There are many assembly languages for different machine architectures, so how the assembler looks will depend on the architecture you're compiling for, but here's the listing of the code generated for my x86 machine:

.LC0:
        .string "hello world"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $20, %esp
        movl    $.LC0, (%esp)
        call    puts
        addl    $20, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
        .section        .note.GNU-stack,"",@progbits

If you're not interested in knowing a bit about the assembler, skip to the next section.

Note that the output is in AT&T syntax - this might look strange if you're used to Intel syntax. One important difference between the two syntaxes is that operands go the other way: in AT&T syntax the source comes first and the destination second. For example, the instruction movl %esp, %ebp moves data from the esp register to the ebp register.

Anyway, let's have a look at some of the highlights of the code above that'll help solidify certain things later on. First off, the first five lines aren't instructions: they're a label for our string (.LC0:) and some assembler directives. Then we come to the line "main:" that looks like a C-style label. It looks like that because that's basically what it is - it simply marks the location of the next instruction. As it happens, it marks the start of our main function, and it will eventually become a symbol in the object file we generate. When any function is called, execution simply jumps to the location of the relevant symbol, and that's all there is to a function call - anything else (such as passing arguments or receiving a return value) has to be coded in assembler.

There will be more on how arguments are passed to functions and how return values are generated in some future post, but we'll just leave that there for now.

So, our main() function is called and a few instructions execute, until we get to the "money instruction":


        call    puts

This instruction moves the address of the symbol puts into a register called the instruction pointer (IP) register, which does exactly what it says on the tin - it points to the next instruction the processor should execute. It also pushes the address of the instruction after the call (the return address) onto the stack - more about that in a later post. Since the location of the puts function has been placed in the instruction pointer, execution jumps to that function, which obligingly prints our message. When puts returns, execution resumes at the instruction after the call (i.e. addl $20, %esp) and continues until we hit the ret instruction near the end of the listing. The last three lines are more directives.

So there we go - our assembler file, ready to be assembled into an object file.

Assembly and Linking

Well, I've written more than I suspected I would for the previous sections, and there's even more to write on assembling an object file and linking it to produce an executable, so I'll leave that for my next post. I'll also discuss how to generate, inspect and strip (ooh-er) objects and shared and static libraries.

I hope someone out there finds this at least mildly useful - any (constructive!) comments are appreciated.

2 comments:

Anonymous, 8 April 2011 at 08:00
Really appreciate the article. I am learning C on my own and I am one of those people that really need to know the "why's" behind the steps I take so I can think my way through instead of relying on my memory (which is not to good). Thanks again for taking your time to explain these things to those of us trying to get a grasp on our own.
Anonymous, 6 August 2019 at 08:05
Please make a post on assembly and linking.


Wednesday 14 October 2009

The Compilation Chain - Part 1
