Overview

In this part of the series we will go over some basics, including:

  • How to install symbols on your OS (Ubuntu 16.04)
  • How to compile a C program with symbols
  • How to start gdb
  • Common gdb command line options

 

Prerequisites

Some prerequisites that I will not be covering in this lesson are:

  • Have Ubuntu 16.04 64-bit running; VirtualBox works fine
  • Understand how to get to the terminal
  • Have root access to your system
  • Have gcc installed
  • Know how to use a basic editor on Ubuntu (gedit, nano, vi, vim, etc.)

 

What's the point?

Most programmers only care whether their programs work, and when there is an issue they look for a fix by examining the source code. You can diagnose many problems this way, but in complex situations such as race conditions, disk locking, etc., it may be easier to analyze the program using GDB. The GNU Debugger (GDB) is a portable debugger that runs on many Unix-like systems and works for many programming languages, including Ada, C, C++, Objective-C, Free Pascal, Fortran, Java and partially others (Source: wiki). A solid understanding of how to reverse engineer a compiled binary can be very important in your career, and if you would like to become a security professional, reverse engineering malware could be your entire working life.

 

But wait, what is a symbol?

A symbol in computer programming is a primitive datatype whose instances have a unique human-readable form. Symbols can be used as identifiers. In some programming languages, they are called atoms. Uniqueness is enforced by holding them in a symbol table. The most common use of symbols by programmers is for performing language reflection (particularly for callbacks), and most common indirectly is their use to create object linkages. (Source: wiki)

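In gdb, debug symbols play a similar role: they map memory addresses back to function names, variables, and source lines, so we get readable names rather than raw memory addresses, which is extremely helpful. To add Ubuntu's debug symbol package repositories, run: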

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 428D7C01 C8CAB6595FDFF622
sudo apt-get update

 

C source code

An example program is below. C code is meant to be compiled: the code can't interact with the system until it has been transformed into an executable binary file. When the program is compiled, a binary is created comprised of machine language instructions, the language the CPU can actually understand. Compilers are designed to compile to specific CPU architectures (not just Intel!), and each of these architectures has its own machine language.

#include <stdio.h>

int foo();

int main(int argc, char **argv) {
    int count = 5;
    int newCount = foo(count);
    foo(newCount);

    return 0;
}

int foo(int count) {
    printf("Count is: %d \n", count);
    return count + 1;
}

The program above is quite simple: it defines main(), the entry point into our program. Stepping through main, a variable count is initialized and passed to a function called foo(), which prints it and returns count + 1. Let's compile this code; I put the C code in a file named reverseEngineer.c for subsequent commands. We use the -gdwarf-5 flag, which produces debug information in the DWARF format (version 5). Normally, debug symbols aren't shipped with a release binary, since leaving them out keeps the binary smaller and reveals less about the source. In our case, for debugging, we will bake them in.

gcc -gdwarf-5 reverseEngineer.c

 

Compile Source

Now open the compiled binary with gdb:

gdb a.out

Assembly Syntax

There can be many flavors of assembly syntax; hell, you could create your own if you wanted. Most industry professionals stick to either the Intel or the AT&T flavor. I have been told nerd wars have been fought over the differences. AT&T syntax, the default in gdb, is easy to recognize because it prefixes registers with % and immediate values with $. AT&T syntax puts the source first and the destination second, i.e. mov %eax, %esp moves the eax register into the esp register. Intel syntax drops those prefix characters and reads right to left: mov esp,eax also moves the eax register into the esp register. You can easily configure which syntax gdb uses with the following command:

set disassembly-flavor <type>
set disassembly-flavor att

Apropos

If you are having issues with any of the commands in this tutorial, you can use apropos, which searches gdb's commands and their documentation for a regular expression. Example usage below:

apropos set

Now that we are in gdb and haven't executed any of the program yet, we can do whatever we want. The first thing we will do, for learning purposes, is allow backtraces to go past our main() function. But wait, how is there code above main()? This will be the topic of a future article. To allow backtraces past main we will use the following:

(gdb) set backtrace past-entry
(gdb) set backtrace past-main

Now that backtraces can go past main(), let's run some code. First, I'm going to set a breakpoint at main() so we can see what code was executed before the breakpoint is triggered. To do this:

(gdb) break main
Breakpoint 1 at 0x400535: file reverseEngineer.c, line 6.

Now let's run to our breakpoint with the run command. Note that run restarts the program from the beginning every time, so this is the only time we will use it today.

(gdb) run
Starting program: /home/mike/code/c/a.out 

Breakpoint 1, main (argc=1, argv=0x7fffffffe468) at reverseEngineer.c:6
6	    int count = 5;

Now let's print our backtrace past main, since we enabled those settings earlier:

(gdb) bt
#0  main (argc=1, argv=0x7fffffffe468) at reverseEngineer.c:6
#1  0x00007ffff7a2d830 in __libc_start_main (main=0x400526 <main>, argc=1, argv=0x7fffffffe468, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe458) at ../csu/libc-start.c:291
#2  0x0000000000400459 in _start ()

This is very interesting and will hopefully trigger some kind of aha moment. As we can see from the backtrace above, to execute our binary the _start() and __libc_start_main() functions get called first. This was all set up by the compiler and the C library and had nothing to do with our actual code. We will refrain from diving into this today, but as previously mentioned a future article will be written about it. Okay, let's continue and step through our main function. First, let's disassemble the main function into assembly language so it is easier to read. This is accomplished using the disas command with the /m modifier, which interleaves the source lines, and the /r modifier, which shows the raw instruction bytes in hex:

(gdb) disas /mr main
Dump of assembler code for function main:
5	int main(int argc, char**argv) {
   0x0000000000400526 <+0>:	55	push   %rbp
   0x0000000000400527 <+1>:	48 89 e5	mov    %rsp,%rbp
   0x000000000040052a <+4>:	48 83 ec 20	sub    $0x20,%rsp
   0x000000000040052e <+8>:	89 7d ec	mov    %edi,-0x14(%rbp)
   0x0000000000400531 <+11>:	48 89 75 e0	mov    %rsi,-0x20(%rbp)

6	    int count = 5;
   0x0000000000400535 <+15>:	c7 45 f8 05 00 00 00	movl   $0x5,-0x8(%rbp)

7	    int newCount = foo(count);
   0x000000000040053c <+22>:	8b 45 f8	mov    -0x8(%rbp),%eax
   0x000000000040053f <+25>:	89 c7	mov    %eax,%edi
   0x0000000000400541 <+27>:	b8 00 00 00 00	mov    $0x0,%eax
   0x0000000000400546 <+32>:	e8 19 00 00 00	callq  0x400564 <foo>
   0x000000000040054b <+37>:	89 45 fc	mov    %eax,-0x4(%rbp)

8	    foo(newCount);
   0x000000000040054e <+40>:	8b 45 fc	mov    -0x4(%rbp),%eax
   0x0000000000400551 <+43>:	89 c7	mov    %eax,%edi
   0x0000000000400553 <+45>:	b8 00 00 00 00	mov    $0x0,%eax
   0x0000000000400558 <+50>:	e8 07 00 00 00	callq  0x400564 <foo>

9	
10	    return 0;
   0x000000000040055d <+55>:	b8 00 00 00 00	mov    $0x0,%eax

11	}
   0x0000000000400562 <+60>:	c9	leaveq 
   0x0000000000400563 <+61>:	c3	retq   

End of assembler dump.

Whoa, hold up, what is all that garbage? Okay, from left to right. Each source line appears with its line number followed by the actual source code. Underneath it, each instruction line starts with the memory address of the machine instruction in hexadecimal and its offset from the start of the function. The middle column is a hexadecimal representation of the binary machine language instruction, and the last column is the same instruction in assembly syntax. You can look up these hexadecimal machine language instructions using the specific architecture documentation. For example, let's take the line above at 0x0000000000400531 with the hexadecimal bytes 48 89 75 e0 and examine its machine code below. I used this as a reference: OpCode Reference

   0x0000000000400531 <+11>:	48 89 75 e0	mov    %rsi,-0x20(%rbp)

  • Hex 48 represents: 01001000. This is a prefix to the primary opcode (the REX.W prefix), which in our case means the operand size is 64 bits, see: 64 Bit Prefix
  • Hex 89 represents: 10001001. This is the primary opcode, which matches up to mov, see: opCode 89
  • Hex 75 represents: 01110101. This is the ModR/M byte, which splits into three fields: 01 means an 8-bit displacement follows, 110 selects the rsi register, and 101 selects rbp as the base of the memory operand. This byte can be complex to read because the meaning of its fields differs between 16-bit and 32-bit addressing modes.
  • Hex e0 represents: 11100000. This is the 8-bit displacement, the value -0x20, which is equivalent to -32 in decimal.

Let's break down how e0 represents -0x20 (-32 decimal). This requires some explaining. To get there we need the one's complement and the two's complement. Here is a detailed example courtesy of Cornell CS: Cornell, but in reverse. Start with the value 32.

Binary Number:             0010        0000
One's Complement:          1101        1111
Two's Complement:          1110        0000
Hex Values:                e           0

This starts with the binary representation of 32 and takes the one's complement, which flips all the bits. The two's complement then adds one, working from right to left: the five rightmost bits are all 1, so each 1 + 1 gives 0 and carries into the next bit, until the sixth bit, which is 0 + 1 and becomes 1. The carry stops there, leaving us with our final value.

Note from the above that the hexadecimal can be useful but it's very hard to read; it is much easier to use the assembly syntax it corresponds to, which is the last column above. The hex we just broke down was already decoded for us: mov %rsi,-0x20(%rbp), which effectively stores rsi (which holds argv) in a local slot on main's stack frame. Now that we have a better understanding of the output of disas, let's move on to the next call. This happens at 0x0000000000400558, so let's set our breakpoint there so we can watch what happens when foo gets executed. Afterwards, let's list our breakpoints to remind us where we have them set:

(gdb) break *0x0000000000400558
Breakpoint 2 at 0x400558: file reverseEngineer.c, line 8.
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000400535 in main at reverseEngineer.c:6
	breakpoint already hit 1 time
2       breakpoint     keep y   0x0000000000400558 in main at reverseEngineer.c:8

You can see from above that the debugger actually lets us know how many times a breakpoint was hit. This is very helpful when debugging more complicated applications. Continuing, let's step one machine instruction forward using stepi, and then disassemble foo:

(gdb) stepi
7	    int newCount = foo(count);
(gdb) disas foo
Dump of assembler code for function foo:
   0x0000000000400564 <+0>:	push   %rbp
   0x0000000000400565 <+1>:	mov    %rsp,%rbp
   0x0000000000400568 <+4>:	sub    $0x10,%rsp
   0x000000000040056c <+8>:	mov    %edi,-0x4(%rbp)
   0x000000000040056f <+11>:	mov    -0x4(%rbp),%eax
   0x0000000000400572 <+14>:	mov    %eax,%esi
   0x0000000000400574 <+16>:	mov    $0x400614,%edi
   0x0000000000400579 <+21>:	mov    $0x0,%eax
   0x000000000040057e <+26>:	callq  0x400400 <printf@plt>
   0x0000000000400583 <+31>:	mov    -0x4(%rbp),%eax
   0x0000000000400586 <+34>:	add    $0x1,%eax
   0x0000000000400589 <+37>:	leaveq 
   0x000000000040058a <+38>:	retq   
End of assembler dump.

Here is another example of disas, this time of our function foo(). Let's take a look at the registers before we step into foo(). Registers are specific to CPU architectures. Let's print them:

(gdb) info registers
rax            0x400526	4195622
rbx            0x0	0
rcx            0x0	0
rdx            0x7fffffffe478	140737488348280
rsi            0x7fffffffe468	140737488348264
rdi            0x1	1
rbp            0x7fffffffe380	0x7fffffffe380
rsp            0x7fffffffe360	0x7fffffffe360
r8             0x400600	4195840
r9             0x7ffff7de7ab0	140737351940784
r10            0x846	2118
r11            0x7ffff7a2d740	140737348032320
r12            0x400430	4195376
r13            0x7fffffffe460	140737488348256
r14            0x0	0
r15            0x0	0
rip            0x40053c	0x40053c <main+22>
eflags         0x206	[ PF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0

If you haven't already guessed, this code was executed on a 64-bit Intel (x86-64) architecture. The first four registers rax, rbx, rcx, rdx are general purpose registers. They are typically referred to as the Accumulator, Base, Counter, and Data registers, respectively. The second four registers rsi, rdi, rbp, rsp are also general purpose registers but are typically used as pointers and/or indexes. The names for these registers are Source Index, Destination Index, Base Pointer and Stack Pointer. The pointer registers are called this because they typically store addresses which point to memory locations. The index registers are pointers commonly used to point to the source and destination of memory that needs to be read or written. The registers r8 through r15 are additional general purpose registers made available by the 64-bit architecture. The rip register is the instruction pointer, which points to the next instruction the processor will execute. EFLAGS consists of several bit flags that record the outcome of arithmetic and comparison instructions and control parts of the processor state. Lastly, cs, ss, ds, es, fs and gs are segment registers.