Thursday 15 October 2009

The Compilation Chain - Part 2

Hi there. In this post we'll be taking a look at how object files are generated from the assembler that gcc outputs, and how they're linked to produce executables and libraries. We'll be taking a look at what an object file actually is, and what the linker has to do to produce an executable file or a library. If you've not read my last post, which was about the first two stages of turning a source file into an executable, now might be a good time, but it's not essential reading if you only want to know about object files.


Assembling



From the assembler file we generated in the last post, hello.s, we can use GNU as (from the binutils package) to assemble it:

$ as -o hello.o hello.s

Alternatively, you get gcc to produce an object file by passing it the -c option, which stops gcc from trying to link the object file to produce an executable. This is essential if none of the files you're compiling contain a main() function.

Okay, so now we have an object file, what can we do with it? Well, there are a couple of things you can do, but by far the most useful for "general" programming is to query the symbol table. This is incredibly useful for resolving linker errors involving symbols that aren't defined, or are defined more than once. nm is a useful program included in binutils for looking at symbol tables, and is fairly simple - objdump is more heavyweight if you want to do more, and can even disassemble object files back into assembler. For our purposes, nm will do just fine.

If you simply run nm with our object as an input, you'll get a list of non-debugging symbols:

$ nm hello.o
00000000 T main
U puts


Note that if you want to use nm to look at shared libraries, you have to pass it the --dynamic flag or it'll just complain that there are no symbols.

Symbols mark the position of certain things in code, so you'll generally have a symbol for every function you define or use, and any variables that are either defined outside of any function, or declared static within a function - in short, any variable that's not an automatic variable, since automatic variables exist on the stack. For more detail, I'll be explaining about the stack, function calls and return values in a later post.

The output of nm above has three columns, which show (1) the value of the symbol, (2) the type of symbol, and (3) the name of the symbol. As you can see, the main symbol, which marks the start of our main() function, has the value 0 and type T, meaning it is a symbol that is defined in hello.o at the start of the .text section. Object files can contain many sections, the most important are:

  • The read-only .text section, which contains our program code,

  • The .data section, which contains data that's initialised by the programmer

  • The .bss section, which contains data is zeroed-out before entry to our main() function, and

  • The .rodata section which also contains data and is (can you guess?) read-only.


The other symbol nm displays is puts which doesn't have a value, because it's an undefined symbol, as indicated by the type, U. This brings us nicely to the most important job of the linker - resolving undefined symbols.


What a Linker Does




Many of you will know what linking is at a purely functional level, but when I was learning to program, the books I picked up never went into much detail - so first we'll have a brief description of what linking actually is before going into more detail later.

Basically, the job of a linker is to produce an executable or library file, whilst allowing you to import pre-compiled executable code from other libraries. Library files are located in various places on a Linux system, but the most important places are /lib/ and /usr/lib/. All libraries contain some code and/or data associated with a particular symbol - for example, a library file for the C standard library will mark some code and data for use with the puts function.

There are two types of linking and two different kinds of linker. The two types of linking are (1) static linking and (2) dynamic linking. The difference between them is when your program actually gets to see the code and/or data you need. When you link statically, the linker grabs the code from a static library (suffixed with .a) and copies it directly into the executable file you get at the end of linking. This way, all undefined symbols in your object file have been resolved, and your executable always has all the code it needs to run, and always runs the same code, wherever it's run from.

When you link dynamically, no code gets imported into your executable. Instead, the linker program (i.e. the compile-time linker, the canonical one being ld on a Linux system) simply tags the executable with the name of the shared library (suffixed with .so) it wants to use when it comes to run-time. This brings us to the other type of linker, the dynamic linker - on Linux systems at the time of writing, this will be /lib/ld-linux.so.*. So, when you link dynamically, your program doesn't actually get the code it needs until run-time, and this code is loaded into memory each time your program starts, when the dynamic linker resolves all the undefined symbols in your executable. This way, your program can run code that wasn't even written when you linked your object files - if the local libc.so.* library is updated for, say, a bug fix, your program will use the new code. It also greatly reduces the size of your code - if every program had to have its very own copy of the printf() function, as well as many other functions, the executables on your system would be much larger and more bloated. On the down-side, the user must have a shared library with the correct name (shared library versioning is touched on later in this post, but more in-depth in the next one) and with the appropriate symbols defined, or the program can't run.


Actually Linking




So, for our present purposes, the linker has to produce an executable from our object file. To do this, we can use gcc to pass some necessary options to the linker, ld, including the -static flag if we want to produce a static executable:

$ gcc -o hello-dynamic hello.o
$ gcc -o hello-static hello.o -static
$ ./hello-dynamic
hello world
$ ./hello-static
hello world
$ du -h hello-*
12K hello-dynamic
576K hello-static


In producing our executable, the linker has had to resolve any undefined symbols in our object file(s) - as we saw before, the only undefined symbol is puts. The linker does this by looking in certain library files to see if those libraries define the symbol it's after. The symbol can then be resolved either statically or dynamically, depending on what files the linker finds in it's search path and the options you pass on the command line to gcc. Generally, the linker will prefer dynamic linking, unless (1) it only finds a static library file, or (2) you specify the -static option.

The linker will insist on being able to resolve all undefined symbols, either statically or dynamically, before producing your executable program, or you'll get an error.

In our statically linked program, the linker has copied all the code and/or data associated with the puts from the static library into our executable. As you can see, this makes our executable quite a bit larger than the dynamically linked one. In the dynamic version, the linker marks the symbol as undefined in the executable, and also stores the name of the shared library where the symbol was found into the executable. When the program is started, the dynamic linker ensures that these libraries are all present in it's search path, or it will refuse to let the program start.

Let's have a look at the symbol table for our dynamic executable:

$ nm hello-dynamic
08049f20 d _DYNAMIC
08049ff4 d _GLOBAL_OFFSET_TABLE_
080484ac R _IO_stdin_used
w _Jv_RegisterClasses
08049f10 d __CTOR_END__
08049f0c d __CTOR_LIST__
08049f18 D __DTOR_END__
08049f14 d __DTOR_LIST__
080484bc r __FRAME_END__
08049f1c d __JCR_END__
08049f1c d __JCR_LIST__
0804a014 A __bss_start
0804a00c D __data_start
08048460 t __do_global_ctors_aux
08048340 t __do_global_dtors_aux
0804a010 D __dso_handle
w __gmon_start__
0804845a T __i686.get_pc_thunk.bx
08049f0c d __init_array_end
08049f0c d __init_array_start
080483f0 T __libc_csu_fini
08048400 T __libc_csu_init
U __libc_start_main@@GLIBC_2.0
0804a014 A _edata
0804a01c A _end
0804848c T _fini
080484a8 R _fp_hw
08048294 T _init
08048310 T _start
0804a014 b completed.6635
0804a00c W data_start
0804a018 b dtor_idx.6637
080483a0 t frame_dummy
080483c4 T main
U puts@@GLIBC_2.0


As you can see, there are quite a few more symbols defined in our executable than in our object file. This is because our program has been linked with several other files that provide code that runs before and after our main() function, as we'll see in a little bit. You'll notice there's a symbol _start defined in our executable - that's the real entry point to our program (again, more on that later).

If you really want to look at the symbols in hello-static, feel free - but because the entire puts function has been linked in, there are 1995 symbols, so I'll spare you a print out of them all.

You'll also notice the symbol puts in our dynamic executable has been changed to puts@@GLIBC_2.0. This is an example of symbol versioning - when we run our program, the dynamic linker will insist on seeing a shared library with the appropriate version string (in this case GLIBC_2.0), or it will simply print an error. We can demonstrate this by using sed to change the version string in our executable:

$ ./hello-dynamic
hello world
$ sed 's|GLIBC_2\.0|__xyzzy__|g' hello-dynamic > hello-bad
$ chmod u+x hello-bad
$ ./hello-bad
./hello: /lib/tls/i686/cmov/libc.so.6: version `__xyzzy__' not found (required by ./hello)


(Note that you may need to replace GLIBC_2.0 with the appropriate version string if it's different, and the string you're replacing it with has to be the same length as the version string, or the hard-coded addresses in the executable will point to garbage and "bad" will happen.)

As mentioned before, the linker tags the produced executable with the names of any shared libraries it needs at run-time. Actually, that's not quite true - it tags it with the names of any libraries you specify, even if it doesn't need any symbols from them, unless you pass the -as-needed flag to the linker before specifying the library with the -l flag. Anyway, it can be useful to extract this information, which we can do with the ldd program, which outputs the name of the library and the absolute path to where the library was found on the current system at the time the ldd program is run, if appropriate:

$ ldd hello
linux-gate.so.1 => (0xb779b000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7622000)
/lib/ld-linux.so.2 (0xb779c000)


As you can see, the executable we've produced relies on linux-gate.so.1, libc.so.6 and /lib/ld-linux.so.2. linux-gate.so.1 is a "virtual" shared library, which is exposed by the Linux kernel as a gateway between kernel- and user-space. libc.so.6 is the standard C library, and /lib/ld-linux.so.2 is the dynamic linker responsible for loading the above libraries at run-time.

Dynamic linking is by far the most widely used form of linking, so that's what we'll concentrate on in the next secion.


Using ld Directly




It can be informative to see how to directly use ld itself to produce an executable - it won't use any files we don't tell it to, so we'll have to point it to everything we want to include in our executable.

First, we'll use the linker to produce a dynamic executable without any initialization. To do this, we'll need to change our basic "hello world" program a bit. As mentioned earlier, the _start symbol is the "real" entry point into a program - gcc supplies code that provides this symbol and eventually calls our main function, but because we're not using any startup files, we'll have to provide this symbol ourselves. Secondly, we have to explicitly call exit(), as the normal function return instruction ret is not suitable for exiting a program - we need the exit() function to tell the kernel to terminate the program, or the ret function will leave us trying to execute code we're not allowed to execute, and we'll get a segfault.

So, to generate our new program source and the object file we need:

$ cat > skel.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
int _start()
{
puts("hello world");
exit(0);
}
EOF
$ gcc -c -o skel.o skel.c


Note that on some architectures (Mac OS X for example), an underscore is automatically prepended to symbol names - in this case, the function above should be called start() instead of _start(). If in doubt, use nm to see if your compiler does this.

Now, when linking, we need to specify two more things than usual. Firstly, we need to explicitly link with libc.so by passing the option -lc. Secondly, we need to specify the dynamic linker our executable should use - on a Linux system at the time of writing, this will generally be /lib/ld-linux.so.2. If you're unsure, use the ldd program on our hello executable to see what it uses.

So, to link our skel.o object file:

$ ld -o skel -lc -dynamic-linker /lib/ld-linux.so.2 skel.o
$ ./skel
hello world


If we want to "hand-link" our original hello-dynamic executable, we need to tell the linker where to find the start files gcc provides. We need to provide a system-dependent -L option which contains the libgcc.a and libgcc_s.so libraries. This path should be in /usr/lib/gcc/<some>/<thing>/ - on my system, it's /usr/lib/gcc/i486-linux-gnu/4.3/.

The full command to link our hello-dynamic program ourselves is:

$ ld -dynamic-linker /lib/ld-linux.so.2                 \
-o hello-dynamic \
/usr/lib/crt{1,i,n}.o \
/usr/lib/gcc/i486-linux-gnu/4.3.3/crt{begin,end}.o \
-L/usr/lib/gcc/i486-linux-gnu/4.3.3 \
-lgcc -lgcc_s -lc hello.o

$ ./hello-dynamic
hello world




I'll leave things there for just now - next time I'll talk about creating static and shared libraries, shared library versioning, and so on.

No comments:

Post a Comment