Wednesday 14 October 2009

The Compilation Chain - Part 1

Okay, this is the first technical post of this blog, so we'll start off with a thorough overview of the compilation chain in the C language using GNU tools.

Note: To run all the commands in this post, you'll need to have binutils and the GNU C compiler installed, or a compatible toolchain.

When most people say "compilation" they mean getting an executable file from a source file - and most of the time that's all we want to care about. There is, however, a lot more to it than that, especially with the GNU toolchain.

To illustrate, let's see the steps that occur when you compile a simple hello world program. Not that you need it, but just for reference:

$ cat > hello.c << 'EOF'
#include <stdio.h>

int main()
{
    puts("hello world");
}
EOF


Now, generating an executable straight from the source is fairly easy:

$ gcc hello.c -o hello
$ ./hello
hello world


The command gcc produces the executable hello (as specified by the -o hello option) from the source file hello.c. Surely there can't be much more to it? Well, there are four discrete steps that gcc uses to produce an executable file. They are:

  • Preprocessing the C source file with the C preprocessor (available standalone as the cpp program, although recent versions of GCC run it inside cc1),

  • Compiling the preprocessed C source into assembly language using cc1, the compiler proper,

  • Assembling the asm file into an object file using as, and finally

  • Linking the object file with other archives/libraries to produce an executable using the collect2 program, which is essentially a wrapper around ld for simple programs


From the above, it might seem that the gcc program doesn't actually do anything that could be described as "compiling" at all - and you'd be right. gcc itself simply acts as a front-end to the above four operations. And what with gcc being the flexible beast that it is, you can get it to stop after any of those stages if you want to (using the -E, -S and -c options respectively for the first three).


Preprocessing



First, let's get gcc to show us our source code after it's been run through the C preprocessor cpp:

$ gcc -o hello.i hello.c -E

The -E option tells gcc to stop after it's finished running the preprocessor. Alternatively, you could have just run the cpp program directly, with the same options as above.

Take a look at hello.i - it's our original hello.c file, except that all the preprocessor directives (the lines starting with '#', like #include and #define) have been resolved. The lines in hello.i that still begin with '#' are linemarkers, which record the file and line number each chunk of text came from. Most of the code comes from the #include <stdio.h> directive in our original file, since all this directive does is read the specified file and paste its contents into our source. If you want to see our contribution, you have to go right to the last few lines of the file.

This is, of course, incredibly helpful if you want to check that your macros expand correctly, or if a definition you're sure should come from a certain header file turns out to be missing - it could easily have been #undef'd somewhere in a file included from another file, or never defined at all because of some obscure #if condition you can't be sure is true.


Compilation Proper



By "compilation proper" I mean the translation from our source language (C) to our target language for this stage (assembly language). For those who aren't familiar with it, assembly language (also called assembler or asm) is a very low-level language, only one step up from machine language. Each assembly language instruction generally corresponds to a single machine instruction, and deals directly with hardware registers, the instruction pointer and so on. It also exposes the bare symbols in your program, as we'll see in a bit.

We'll take the preprocessed source and compile it to assembly by passing the -S option to gcc:

$ gcc -S hello.i

Now you'll have a file hello.s in the current directory, containing the generated assembler. There are many assembly languages for different machine architectures, so how the assembler looks will depend on the architecture you're compiling for, but here's the listing of the code generated for my x86 machine:

.LC0:
    .string "hello world"
    .text
    .globl main
    .type main, @function
main:
    leal 4(%esp), %ecx
    andl $-16, %esp
    pushl -4(%ecx)
    pushl %ebp
    movl %esp, %ebp
    pushl %ecx
    subl $20, %esp
    movl $.LC0, (%esp)
    call puts
    addl $20, %esp
    popl %ecx
    popl %ebp
    leal -4(%ecx), %esp
    ret
    .size main, .-main
    .ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
    .section .note.GNU-stack,"",@progbits

If you're not interested in knowing a bit about the assembler, skip to the next section.

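Note that the output is in AT&T syntax - this might look strange if you're used to Intel syntax. One important difference between the two syntaxes is that the operands go the other way round: in AT&T syntax the source comes first and the destination second. For example, the instruction movl %esp, %ebp moves data from the esp register to the ebp register, whereas in Intel syntax the same instruction would be written mov ebp, esp.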

Anyway, let's have a look at some of the highlights of the code above that'll help solidify certain things later on. First off, the first five lines aren't instructions: .LC0: is a label marking the location of our string, and the other four are assembler directives. Then we come to the line "main:", which looks like a C-style label. It looks like that because that's basically what it is - it simply marks the location of the next instruction. As it happens, it marks the start of our main function, and it will eventually become a symbol in the object file we generate. When any function is called, execution simply jumps to the location of the relevant symbol (after saving the address to return to), and that's about all there is to a function call - anything else (such as passing arguments or receiving a return value) has to be coded in assembler.

There will be more on how arguments are passed to functions and how return values are generated in some future post, but we'll just leave that there for now.

So, our main() function is called and some instructions execute, until we get to the "money instruction":

call puts

This instruction does two things. First it pushes the return address (the address of the instruction that follows the call) onto the stack - more about that in a later post. Then it loads the address of the symbol puts into a register called the instruction pointer (EIP on 32-bit x86), which does exactly what it says on the tin - it points to the next instruction the processor should execute. Since the location of the puts function has been placed there, execution will jump to that function and obligingly print our message (the preceding movl $.LC0, (%esp) put the address of our string on the stack as its argument). When puts returns, execution resumes at the would-be next instruction (i.e. addl $20, %esp) and continues until we hit the ret instruction near the end of the listing. The last three lines are more directives.

So there we go - our assembler file, ready to be assembled into an object file.


Assembly and Linking


Well, I've written more than I expected to for the previous sections, and there's even more to write on assembling an object file and linking it to produce an executable, so I'll leave that for my next post. I'll also discuss how to generate, inspect and strip (ooh-er) objects and shared and static libraries.

I hope someone out there finds this at least mildly useful - any (constructive!) comments are appreciated.

2 comments:

  1. Really appreciate the article. I am learning C on my own and I am one of those people that really need to know the "why's" behind the steps I take so I can think my way through instead of relying on my memory (which is not too good). Thanks again for taking your time to explain these things to those of us trying to get a grasp on our own.

  2. Please make a post on assembly and linking.
