Thursday, 15 October 2009

The Compilation Chain - Part 2

Hi there. In this post we'll be taking a look at how object files are generated from the assembler that gcc outputs, and how they're linked to produce executables and libraries. We'll be taking a look at what an object file actually is, and what the linker has to do to produce an executable file or a library. If you've not read my last post, which was about the first two stages of turning a source file into an executable, now might be a good time, but it's not essential reading if you only want to know about object files.


Assembling



From the assembler file we generated in the last post, hello.s, we can use GNU as (from the binutils package) to assemble it:

$ as -o hello.o hello.s

Alternatively, you get gcc to produce an object file by passing it the -c option, which stops gcc from trying to link the object file to produce an executable. This is essential if none of the files you're compiling contain a main() function.

Okay, so now we have an object file, what can we do with it? Well, there are a couple of things you can do, but by far the most useful for "general" programming is to query the symbol table. This is incredibly useful for resolving linker errors involving symbols that aren't defined, or are defined more than once. nm is a useful program included in binutils for looking at symbol tables, and is fairly simple - objdump is more heavyweight if you want to do more, and can even disassemble object files back into assembler. For our purposes, nm will do just fine.

If you simply run nm with our object as an input, you'll get a list of non-debugging symbols:

$ nm hello.o
00000000 T main
U puts


Note that if you want to use nm to look at shared libraries, you have to pass it the --dynamic flag or it'll just complain that there are no symbols.

Symbols mark the position of certain things in code, so you'll generally have a symbol for every function you define or use, and any variables that are either defined outside of any function, or declared static within a function - in short, any variable that's not an automatic variable, since automatic variables exist on the stack. For more detail, I'll be explaining about the stack, function calls and return values in a later post.

The output of nm above has three columns, which show (1) the value of the symbol, (2) the type of symbol, and (3) the name of the symbol. As you can see, the main symbol, which marks the start of our main() function, has the value 0 and type T, meaning it is a symbol that is defined in hello.o at the start of the .text section. Object files can contain many sections, the most important are:

  • The read-only .text section, which contains our program code,

  • The .data section, which contains data that's initialised by the programmer

  • The .bss section, which contains data is zeroed-out before entry to our main() function, and

  • The .rodata section which also contains data and is (can you guess?) read-only.


The other symbol nm displays is puts which doesn't have a value, because it's an undefined symbol, as indicated by the type, U. This brings us nicely to the most important job of the linker - resolving undefined symbols.


What a Linker Does




Many of you will know what linking is at a purely functional level, but when I was learning to program, the books I picked up never went into much detail - so first we'll have a brief description of what linking actually is before going into more detail later.

Basically, the job of a linker is to produce an executable or library file, whilst allowing you to import pre-compiled executable code from other libraries. Library files are located in various places on a Linux system, but the most important places are /lib/ and /usr/lib/. All libraries contain some code and/or data associated with a particular symbol - for example, a library file for the C standard library will mark some code and data for use with the puts function.

There are two types of linking and two different kinds of linker. The two types of linking are (1) static linking and (2) dynamic linking. The difference between them is when your program actually gets to see the code and/or data you need. When you link statically, the linker grabs the code from a static library (suffixed with .a) and copies it directly into the executable file you get at the end of linking. This way, all undefined symbols in your object file have been resolved, and your executable always has all the code it needs to run, and always runs the same code, wherever it's run from.

When you link dynamically, no code gets imported into your executable. Instead, the linker program (i.e. the compile-time linker, the canonical one being ld on a Linux system) simply tags the executable with the name of the shared library (suffixed with .so) it wants to use when it comes to run-time. This brings us to the other type of linker, the dynamic linker - on Linux systems at the time of writing, this will be /lib/ld-linux.so.*. So, when you link dynamically, your program doesn't actually get the code it needs until run-time, and this code is loaded into memory each time your program starts, when the dynamic linker resolves all the undefined symbols in your executable. This way, your program can run code that wasn't even written when you linked your object files - if the local libc.so.* library is updated for, say, a bug fix, your program will use the new code. It also greatly reduces the size of your code - if every program had to have its very own copy of the printf() function, as well as many other functions, the executables on your system would be much larger and more bloated. On the down-side, the user must have a shared library with the correct name (shared library versioning is touched on later in this post, but more in-depth in the next one) and with the appropriate symbols defined, or the program can't run.


Actually Linking




So, for our present purposes, the linker has to produce an executable from our object file. To do this, we can use gcc to pass some necessary options to the linker, ld, including the -static flag if we want to produce a static executable:

$ gcc -o hello-dynamic hello.o
$ gcc -o hello-static hello.o -static
$ ./hello-dynamic
hello world
$ ./hello-static
hello world
$ du -h hello-*
12K hello-dynamic
576K hello-static


In producing our executable, the linker has had to resolve any undefined symbols in our object file(s) - as we saw before, the only undefined symbol is puts. The linker does this by looking in certain library files to see if those libraries define the symbol it's after. The symbol can then be resolved either statically or dynamically, depending on what files the linker finds in it's search path and the options you pass on the command line to gcc. Generally, the linker will prefer dynamic linking, unless (1) it only finds a static library file, or (2) you specify the -static option.

The linker will insist on being able to resolve all undefined symbols, either statically or dynamically, before producing your executable program, or you'll get an error.

In our statically linked program, the linker has copied all the code and/or data associated with the puts from the static library into our executable. As you can see, this makes our executable quite a bit larger than the dynamically linked one. In the dynamic version, the linker marks the symbol as undefined in the executable, and also stores the name of the shared library where the symbol was found into the executable. When the program is started, the dynamic linker ensures that these libraries are all present in it's search path, or it will refuse to let the program start.

Let's have a look at the symbol table for our dynamic executable:

$ nm hello-dynamic
08049f20 d _DYNAMIC
08049ff4 d _GLOBAL_OFFSET_TABLE_
080484ac R _IO_stdin_used
w _Jv_RegisterClasses
08049f10 d __CTOR_END__
08049f0c d __CTOR_LIST__
08049f18 D __DTOR_END__
08049f14 d __DTOR_LIST__
080484bc r __FRAME_END__
08049f1c d __JCR_END__
08049f1c d __JCR_LIST__
0804a014 A __bss_start
0804a00c D __data_start
08048460 t __do_global_ctors_aux
08048340 t __do_global_dtors_aux
0804a010 D __dso_handle
w __gmon_start__
0804845a T __i686.get_pc_thunk.bx
08049f0c d __init_array_end
08049f0c d __init_array_start
080483f0 T __libc_csu_fini
08048400 T __libc_csu_init
U __libc_start_main@@GLIBC_2.0
0804a014 A _edata
0804a01c A _end
0804848c T _fini
080484a8 R _fp_hw
08048294 T _init
08048310 T _start
0804a014 b completed.6635
0804a00c W data_start
0804a018 b dtor_idx.6637
080483a0 t frame_dummy
080483c4 T main
U puts@@GLIBC_2.0


As you can see, there are quite a few more symbols defined in our executable than in our object file. This is because our program has been linked with several other files that provide code that runs before and after our main() function, as we'll see in a little bit. You'll notice there's a symbol _start defined in our executable - that's the real entry point to our program (again, more on that later).

If you really want to look at the symbols in hello-static, feel free - but because the entire puts function has been linked in, there are 1995 symbols, so I'll spare you a print out of them all.

You'll also notice the symbol puts in our dynamic executable has been changed to puts@@GLIBC_2.0. This is an example of symbol versioning - when we run our program, the dynamic linker will insist on seeing a shared library with the appropriate version string (in this case GLIBC_2.0), or it will simply print an error. We can demonstrate this by using sed to change the version string in our executable:

$ ./hello-dynamic
hello world
$ sed 's|GLIBC_2\.0|__xyzzy__|g' hello-dynamic > hello-bad
$ chmod u+x hello-bad
$ ./hello-bad
./hello: /lib/tls/i686/cmov/libc.so.6: version `__xyzzy__' not found (required by ./hello)


(Note that you may need to replace GLIBC_2.0 with the appropriate version string if it's different, and the string you're replacing it with has to be the same length as the version string, or the hard-coded addresses in the executable will point to garbage and "bad" will happen.)

As mentioned before, the linker tags the produced executable with the names of any shared libraries it needs at run-time. Actually, that's not quite true - it tags it with the names of any libraries you specify, even if it doesn't need any symbols from them, unless you pass the -as-needed flag to the linker before specifying the library with the -l flag. Anyway, it can be useful to extract this information, which we can do with the ldd program, which outputs the name of the library and the absolute path to where the library was found on the current system at the time the ldd program is run, if appropriate:

$ ldd hello
linux-gate.so.1 => (0xb779b000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7622000)
/lib/ld-linux.so.2 (0xb779c000)


As you can see, the executable we've produced relies on linux-gate.so.1, libc.so.6 and /lib/ld-linux.so.2. linux-gate.so.1 is a "virtual" shared library, which is exposed by the Linux kernel as a gateway between kernel- and user-space. libc.so.6 is the standard C library, and /lib/ld-linux.so.2 is the dynamic linker responsible for loading the above libraries at run-time.

Dynamic linking is by far the most widely used form of linking, so that's what we'll concentrate on in the next secion.


Using ld Directly




It can be informative to see how to directly use ld itself to produce an executable - it won't use any files we don't tell it to, so we'll have to point it to everything we want to include in our executable.

First, we'll use the linker to produce a dynamic executable without any initialization. To do this, we'll need to change our basic "hello world" program a bit. As mentioned earlier, the _start symbol is the "real" entry point into a program - gcc supplies code that provides this symbol and eventually calls our main function, but because we're not using any startup files, we'll have to provide this symbol ourselves. Secondly, we have to explicitly call exit(), as the normal function return instruction ret is not suitable for exiting a program - we need the exit() function to tell the kernel to terminate the program, or the ret function will leave us trying to execute code we're not allowed to execute, and we'll get a segfault.

So, to generate our new program source and the object file we need:

$ cat > skel.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
int _start()
{
puts("hello world");
exit(0);
}
EOF
$ gcc -c -o skel.o skel.c


Note that on some architectures (Mac OS X for example), an underscore is automatically prepended to symbol names - in this case, the function above should be called start() instead of _start(). If in doubt, use nm to see if your compiler does this.

Now, when linking, we need to specify two more things than usual. Firstly, we need to explicitly link with libc.so by passing the option -lc. Secondly, we need to specify the dynamic linker our executable should use - on a Linux system at the time of writing, this will generally be /lib/ld-linux.so.2. If you're unsure, use the ldd program on our hello executable to see what it uses.

So, to link our skel.o object file:

$ ld -o skel -lc -dynamic-linker /lib/ld-linux.so.2 skel.o
$ ./skel
hello world


If we want to "hand-link" our original hello-dynamic executable, we need to tell the linker where to find the start files gcc provides. We need to provide a system-dependent -L option which contains the libgcc.a and libgcc_s.so libraries. This path should be in /usr/lib/gcc/<some>/<thing>/ - on my system, it's /usr/lib/gcc/i486-linux-gnu/4.3/.

The full command to link our hello-dynamic program ourselves is:

$ ld -dynamic-linker /lib/ld-linux.so.2                 \
-o hello-dynamic \
/usr/lib/crt{1,i,n}.o \
/usr/lib/gcc/i486-linux-gnu/4.3.3/crt{begin,end}.o \
-L/usr/lib/gcc/i486-linux-gnu/4.3.3 \
-lgcc -lgcc_s -lc hello.o

$ ./hello-dynamic
hello world




I'll leave things there for just now - next time I'll talk about creating static and shared libraries, shared library versioning, and so on.

Wednesday, 14 October 2009

The Compilation Chain - Part 1

Okay, this is the first technical post of this blog, so we'll start off with a thorough overview of the compilation chain in the C language using GNU tools.

Note: To run all the commands in this post, you'll need to have binutils and the GNU C compiler installed, or a compatible toolchain.

When most people say "compilation" they mean getting an executable file from a source file - and most of the time that's all we want to care about. There is, however, a lot more to it than that, especially with the GNU toolchain.

To illustrate, let's see the steps that occur when you compile a simple hello world program. Not that you need it, but just for reference:

$ cat > hello.c << 'EOF'

#include <stdio.h>

int main()
{
puts("hello world");
}

EOF


Now, generating an executable straight from the source is fairly easy:

$ gcc hello.c -o hello
$ ./hello
hello world


The command gcc produces the executable hello (as specified by the -o hello option) from the source file hello.c. Surely there can't be much more to it? Well, there are four discrete steps that gcc uses to produce an executable file. They are:

  • Preprocessing the C source file using the cpp program,

  • Compiling the processed C source into assembler using the cc1 back-end,

  • Assembling the asm file into an object file using as, and finally

  • Linking the object file with other archives/libraries to produce an executable using the collect2 program, which is essentially a front-end to ld for simple programs


From the above, it might seem that the gcc program doesn't actually do anything that could be described as "compiling" at all - and you'd be right. gcc itself simply acts as a front-end to the above four operations. And what with gcc being the flexible beast that it is, you can get it to stop at any of those stages if you want to.


Preprocessing



First, let's get gcc to show us our source code after it's been run though the C preprocessor cpp:

$ gcc -o hello.i hello.c -E

The -E option tells gcc to stop after it's finished running the preprocessor. Alternatively, you could have just run the cpp program directly, with the same options as above.

Take a look at hello.i - it's our original hello.c file, except all the preprocessor directives (i.e. everything that starts with a '#') like #include and #define have been resolved. Most of the code is from the #include <stdio.h> statement in our original file, since all this directive does is simply start reading from the specified file and put it into our source. If you want to see our contribution, you have to go right to the last few lines of the file.

This is, of course, incredibly helpful if you want to make sure your macros expand correctly, or if you have problems with missing definitions you're sure should be in a certain header file - the thing you see could easily have been #undef'd out in a file included from a file, or not included because of some obscure #if statement you're not sure's true or not.


Compilation Proper



By "compilation proper" I mean the translation from our source language (C) to our target language for this stage (assembly language). For those who aren't familiar with assembly language, also called assembler or asm, it's a very low-level language, only one step up from machine language. Each assembly language instruction corresponds directly to a single machine instruction, and deals directly with hardware registers, instruction pointers and so on. It also exposes the bare symbols in your program, as we'll see in a bit.

We'll take the preprocessed source and compile it to assembly by passing the -S directive to gcc:

$ gcc -S hello.i

Now you'll have a file hello.s in the current directory, containing the generated assember. There are many assembly languages for different machine architectures, so how the assembler looks will depend on the architecture you're compiling for, but here's the listing of the code generated for my x86 machine:
.LC0:
.string "hello world"
.text
.globl main
.type main, @function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $20, %esp
movl $.LC0, (%esp)
call puts
addl $20, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",@progbits

If you're not interested in knowing a bit about the assembler, skip to the next section.

Note that the output is in AT&T syntax - this might look strange if you're used to intel syntax. One important difference between the two syntaxes is that operands go the other way - for example, the instruction movl %esp, %ebp moves data from the esp register to the ebp register.

Anyway, let's have a look at some of the highlights of the code above that'll help solidify certain things later on - first off, the first five lines aren't instructions, they're assembler directives. Then we come to the line "main:" that looks like a C-style label. It looks like that because that's basically what it is - it simply marks the location of the next instruction. As it happens, it marks the start of our main function, and it will eventually become a symbol in the object file we generate. When any function is called, execution simply jumps to the location of the relevant symbol, and that's all there is to a function call - anything else (such as passing arguments or receiving a return value) has to be coded in assembler.

There will be more on how arguments are passed to functions and how return values are generated in some future post, but we'll just leave that there for now.

So, when our main() function is called, and some instructions execute, until we get to the "money instruction":

call puts

This instruction moves the address of the symbol puts into a register called the instruction pointer (IP) register (it also pushes the current value of the IP register on the stack - more about that in a later post), which does exactly what it says on the tin - it points to the next instruction the processor should execute. Since the location of the puts function has been placed there, execution will jump to that function and obligingly print our message. When it returns, execution starts at the would-be next instruction (i.e. addl $20, %esp) and continues until we hit the ret instruction near the end of the listing. The last three lines are more directives.

So there we go - our assembler file, ready to be assembled into an object file.


Assembly and Linking


Well, I've written more than I suspected I would for the previous sections, and there's even more to write on assembling an object file and linking it to produce an executable, so I'll leave that for my next post. I'll also discuss how to generate, inspect and strip (ooh-er) objects and shared and static libraries.

I hope someone out there finds this at least mildly useful - any (constructive!) comments are appreciated.

Welcome!

Hi there, and welcome to this weblog, which is all about the internals of programming on a GNU/Linux system. There's already a wealth of material on how to program in C/C++, and of course on Linux - so that's not what this blog is about.

When I started programming in C, there was lots of information on the syntax of the language, and so on, but information was stark on things like the compile-assemble-link process, let alone what a shared/static library actually is and how to make one, and so on. All programming books gave me was just enough to compile a program ("you need to include both example1.c and example2.c on the command-line"), but I wanted a much more in-depth understanding, and that's something I had to scrape together and work out for myself.

So this blog is about the compilation chain, linkage, object files, symbol tables, and how to really use tools like ld, ar, nm, objdump, strip and so forth to program in a GNU/Linux environment. I managed to get by for a while without really understanding any of this, but there's only so far you can get without knowing how these things work.

In short: This blog is about all the things I wish someone had told me about programming when I started.

Hopefully you'll find some use for this, or at least find it interesting.

Thanks for reading!