Basics of x86-64 assembly language

Assembly language is a low-level programming language that is closely related to machine code. Each instruction in an assembly program corresponds to a specific operation that the CPU can perform. Writing programs in assembly allows for fine-grained control over the computer's hardware, but it requires a deep understanding of the CPU's instruction set and the system's architecture.

[!IMPORTANT] You will rarely, if ever, need to write assembly code, but it might well happen, especially if you are going to optimise code, that you will have to look at the assembly code produced by your compiler. Also, studying a bit of assembly and simple cases such as those presented here will help you understand how the CPU works "under the hood" (well, at least a bit). Having some knowledge of assembly will be extremely helpful to sharpen your coding skills!

Below are three simple examples in x86-64 assembly: summing two numbers, a simple for loop, and making a function call. Each example includes the code, and instructions on how to assemble and link it using the GNU assembler (as) and GCC on a Linux system. Last, we will analyse in depth a "Hello, world!" program written in assembly, spending some time on the details of the mechanism behind invoking system calls, that is, requesting the Linux kernel to do something for us.

[!TIP] If you want to put your assembly skills to work after this chapter, you can have a look at the section on the effect of optimisation flags.

Summing Two Numbers

This example demonstrates how to add two numbers and store the result in a register.

.section .data
    # No data needed

.section .text
    .globl _start

_start:
    movq $5, %rax          # Load 5 into rax
    movq $10, %rbx         # Load 10 into rbx
    addq %rbx, %rax        # Add rbx to rax, result in rax

    # Exit the program
    movq $0, %rdi          # Exit code 0
    movq $60, %rax         # sys_exit system call
    syscall

Let's see what every line means:

.section .data: the read-and-write data section, where strings, numbers, etc. are stored. In this case, we don't need any, so this is empty!

.section .text: in assembly and compiled programs, the .text section is a dedicated region of memory where the executable code (machine instructions) is stored. This is the part of the program that contains the actual instructions the CPU executes.

_start: the _start label is the default entry point for programs that are written in assembly, and is the first code that gets executed when the operating system starts running the program. This is a "simpler" way of starting a program, unlike higher-level ones written in C, which start executing from the main function after some initialisation done by the C runtime (crt0 or crt1.o), such as setting up the stack and heap. Using this "simpler" way also means that it's not possible to use wrappers to the system calls such as those provided by the libc (more on this later, in the "Hello, world!" example).

After the _start label we have, first, three instructions:

    movq $5, %rax          # Load 5 into rax
    movq $10, %rbx         # Load 10 into rbx
    addq %rbx, %rax        # Add rbx to rax, result in rax

These lines use just two instructions:

  • movq to move a quadword (that is, 64 bits). In the first line, the source is an immediate value (the numerical value 5, denoted $5) and the destination is the 64-bit register RAX.
  • addq to add the binary values stored in the two registers and store the result in RAX.

[!NOTE] In assembly, an immediate value is a constant number embedded into the instruction itself by the assembler, as opposed to one loaded from, e.g., a register or a memory location.

The final part:

    # Exit the program
    movq $0, %rdi          # Exit code 0
    movq $60, %rax         # sys_exit system call
    syscall

is invoking the kernel, asking to execute the sys_exit system call. We will dissect this part later, in the "Hello, world!" example. For now, just consider that block as the way to exit the code.

How do you produce an executable from this assembly code? We first need to produce the object code with the assembler, then link it to make it executable. Let's say we save the assembly code in a file called sum.s, then this will assemble it, link it, and execute it.

> as -o sum.o  sum.s 
> ld -o sum    sum.o
> ./sum

Alternatively, one can use gcc to perform all these steps. As this is not a conventional program using the standard C runtime initialisation, we must provide the -nostartfiles flag:

> gcc -nostartfiles -o sum  sum.s

Of course, nothing happens at the moment, because we're not producing any output! More on this later in our "Hello, world!" example!

[!NOTE] The "dialect" of assembly used above is the so-called AT&T syntax. This is the one natively recognised by the GNU assembler, although it can also read the other major dialect, the Intel syntax, which would look like this:

     mov rax, 5      ; Load 5 into rax
     mov rbx, 10     ; Load 10 into rbx
     add rax, rbx    ; Add rbx to rax, result in rax

Simple for Loop

The following example demonstrates how to write a simple for loop that counts from 0 to 9.

.section .data
    # No data needed

.section .text
    .globl _start

_start:
    xorq %rcx, %rcx       # Set counter (rcx) to 0

loop_start:
    cmpq $10, %rcx        # Compare counter with 10
    jge loop_end          # If counter >= 10, jump to loop_end

    # Body of the loop (No operation, just incrementing counter)

    incq %rcx             # Increment counter
    jmp loop_start        # Jump back to start of loop

loop_end:
    # Exit the program
    movq $0, %rdi         # Exit code 0
    movq $60, %rax        # sys_exit system call
    syscall

This case is a bit more complex than the one shown before. Let's go through it line-by-line:

xorq %rcx, %rcx: a common trick to zero out a register with the exclusive-or logical operation.

cmpq $10, %rcx: compare the value stored in RCX with the immediate value 10. After the comparison, a conditional instruction should follow (jge in this case).

jge loop_end: the jge instruction is a conditional jump that evaluates the result of a preceding comparison (cmpq in our case). If the result of the comparison is "greater or equal" (that is, RCX is greater than or equal to ten), then jump to the label loop_end.

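[!IMPORTANT] cmpq stands for "compare quadword". It performs the internal subtraction %rcx - 10, discards the result, and sets the flags (see box below) accordingly. For the small values used here:

  • If %rcx is less than 10, the result is negative, so the Sign Flag (SF) is set; the Carry Flag (CF) is set as well, because the unsigned subtraction needs a borrow
  • If %rcx is equal to 10, the result is zero, so the Zero Flag (ZF) is set
  • If %rcx is greater than 10, the result is positive, so ZF, SF and CF are all cleared

jge jumps when SF is equal to the Overflow Flag (OF), which is the signed "greater or equal" condition. OF stays cleared here, because no signed overflow can occur with these values.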

[!NOTE] Flags in a CPU refer to specific bits in the status register (also called the flags register: EFLAGS on 32-bit x86 processors, RFLAGS on x86-64). Flags are set or cleared by various instructions to indicate the results of operations, and are critical for decision-making in assembly language because they affect how conditional branches and other logic are executed.
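To make the role of the flags more concrete, here is a minimal sketch in C that models what a 64-bit compare does to four of them, and how jge reads them afterwards. The names flags, cmp_flags and jge_taken are made up for this illustration (they are not part of any library), and the model only covers ZF, SF, CF and OF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the four status flags most relevant to cmp. */
struct flags {
    bool zf; /* Zero Flag: the result is zero */
    bool sf; /* Sign Flag: most significant bit of the result */
    bool cf; /* Carry Flag: the unsigned subtraction needed a borrow */
    bool of; /* Overflow Flag: the signed subtraction overflowed */
};

/* Model of "cmpq b, a" (AT&T operand order): compute a - b, keep only the flags. */
struct flags cmp_flags(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a;
    uint64_t ub = (uint64_t)b;
    uint64_t r = ua - ub; /* wraps modulo 2^64, like the hardware */
    struct flags f;

    f.zf = (r == 0);
    f.sf = (r >> 63) != 0;
    f.cf = ua < ub;
    /* Signed overflow: the operands have different signs and the
       sign of the result differs from the sign of a. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 63) != 0;
    return f;
}

/* jge is taken when SF == OF, that is, when a >= b as signed integers. */
bool jge_taken(struct flags f)
{
    return f.sf == f.of;
}
```

With this model, cmpq $10, %rcx corresponds to cmp_flags(rcx, 10): the jump is not taken while the counter is below 10, and it is taken as soon as the counter reaches 10.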

incq %rcx: if the jump is not executed, the next instruction just increments by one our counter, stored in register RCX.

jmp loop_start: after incrementing RCX, jump to the beginning of the loop.
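If you are more familiar with C, it may help to see the same control flow written with goto, so that each statement maps onto one of the assembly instructions above. This is only an illustrative sketch: the function name count_up_to and its limit parameter are made up here (the assembly hard-codes 10).

```c
/* Mirrors the assembly loop; returns the final value of the counter. */
long count_up_to(long limit)
{
    long counter = 0;          /* xorq %rcx, %rcx */

loop_start:
    if (counter >= limit)      /* cmpq $10, %rcx  followed by  jge loop_end */
        goto loop_end;

    /* Body of the loop (empty, as in the assembly version) */

    counter++;                 /* incq %rcx */
    goto loop_start;           /* jmp loop_start */

loop_end:
    return counter;
}
```

Calling count_up_to(10) runs the empty body for counter values 0 to 9 and returns 10, which is the value RCX holds when the assembly version reaches loop_end.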

Function Call

This example demonstrates how to define and call a simple function that adds two numbers and returns the result. The only new instructions introduced here are call and ret:

.section .data
    # No data needed

.section .text
    .globl _start

_start:
    movq $5, %rdi         # First argument (5) -> rdi
    movq $10, %rsi        # Second argument (10) -> rsi
    call sum              # Call sum function

    # Result is in rax

    # Exit the program
    movq $0, %rdi         # Exit code 0 -> rdi
    movq $60, %rax        # sys_exit system call -> rax
    syscall

sum:
    movq %rdi, %rax       # Move first argument (rdi) to rax
    addq %rsi, %rax       # Add second argument (rsi) to rax
    ret                   # return (result in rax)

[!NOTE] Here we need to introduce the convention on calling functions. The System V ABI (Application Binary Interface) specifies the following registers for function calls:

  • Argument Passing Registers: %rdi, %rsi, %rdx, %rcx, %r8 and %r9 are used to pass arguments from the 1st to the 6th.
  • Return Value Register: %rax is used to store the return value of a function (for both integer and pointer return values). If the return value is larger than 64 bits, additional registers may be used (for example, %rdx holds the upper half of a 128-bit integer), but typically %rax handles most cases.
  • The stack pointer %rsp: points to the top of the stack and is used to push and pop values (such as the return address, local variables, and extra arguments) during function calls.
  • The base pointer %rbp: is often used as a reference to access local variables but in modern compilers, it may be omitted.

When call is executed, the return address (the address of the instruction after the call) is automatically pushed onto the stack. The ret instruction pops this address off the stack to return control to the calling function.
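For comparison, this is what the same function could look like in C. It is only a sketch of the correspondence: the comments describe where the System V ABI places the arguments and the result, while the exact instructions a compiler emits depend on the compiler and on the optimisation flags (you can inspect them with gcc -S). The caller name sum_example is made up for this illustration.

```c
/* C counterpart of the assembly function sum. */
long sum(long a, long b)    /* a arrives in %rdi, b in %rsi */
{
    return a + b;           /* the result is handed back in %rax */
}

/* Counterpart of the code in _start that prepares the arguments and calls sum. */
long sum_example(void)
{
    return sum(5, 10);      /* movq $5, %rdi; movq $10, %rsi; call sum */
}
```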

Hello, world!

Let's make an example of a very simple assembly program that prints Hello, world! to the screen. First, we will compile it and run it, and then we will describe each line in detail. The more we proceed in the description of the code, the more we will add details, especially close to the end, where we will dissect it down to the machine code level. However, the initial part should give you a light introduction to the structure of an executable file.

The assembly code

.section    .data
    .msg:                      # msg is just a label, you can use anything else
    .string "Hello, world!\n"

.section    .text
    .globl  _start             # this is the default entry point

_start:
    movl    $14, %edx          # Length of the string "Hello, world!\n" (14 bytes)
    leaq    .msg(%rip), %rsi   # Go to the label 'msg' and load the string address
    movl    $1, %edi           # File descriptor 1 (stdout)
    movl    $1, %eax           # 1 is the syscall number for sys_write
                               # defined in, e.g.,
                               # /usr/include/x86_64-linux-gnu/asm/unistd_64.h
    syscall                    # Make the system call (x86-64 style)
                               # On older x86 (32-bit) systems, system calls
                               # were typically made with the 'int 0x80' interrupt.

    mov $60, %rax              # sys_exit (syscall number 60)
    xor %rdi, %rdi             # set the exit status to 0 (success)
    syscall                    # Make the system call

Paste this code into the file hw.s. You can compile it in different ways:

compiling it with gcc, which also takes care of linking (as before, -nostartfiles is needed because we provide our own _start)

> gcc -nostartfiles -o hw hw.s
> ./hw
Hello, world!

assembling with as and linking with ld

> as -o hw.o hw.s
> ld -o hw hw.o
> ./hw
Hello, world!

Notice that both hw.o (the 'object' file) and hw are ELF files, but the former is relocatable, while the latter is a linked executable:

> file hw.o
hw.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
> file hw
hw: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

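If you disassemble both ELF files, you will see that they differ only in their addresses (including the displacement in the lea instruction, which is left at zero in the object file and filled in by the linker):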

> objdump -S hw.o

hw.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_start>:
   0:   ba 0e 00 00 00          mov    $0xe,%edx
   5:   48 8d 35 00 00 00 00    lea    0x0(%rip),%rsi        # c <_start+0xc>
   c:   bf 01 00 00 00          mov    $0x1,%edi
  11:   b8 01 00 00 00          mov    $0x1,%eax
  16:   0f 05                   syscall
  18:   48 c7 c0 3c 00 00 00    mov    $0x3c,%rax
  1f:   48 31 ff                xor    %rdi,%rdi
  22:   0f 05                   syscall
> objdump -S hw

hw:     file format elf64-x86-64

Disassembly of section .text:

0000000000401000 <_start>:
  401000:   ba 0e 00 00 00          mov    $0xe,%edx
  401005:   48 8d 35 f4 0f 00 00    lea    0xff4(%rip),%rsi        # 402000 <.msg>
  40100c:   bf 01 00 00 00          mov    $0x1,%edi
  401011:   b8 01 00 00 00          mov    $0x1,%eax
  401016:   0f 05                   syscall
  401018:   48 c7 c0 3c 00 00 00    mov    $0x3c,%rax
  40101f:   48 31 ff                xor    %rdi,%rdi
  401022:   0f 05                   syscall

On 64-bit Linux systems, non-position-independent ELF executables are, by default, linked to be loaded starting at the address 0x400000. The ELF headers, which describe the structure of the executable, sit at the beginning of this address range (0x400000), and the actual program code starts after them, at an offset of 0x1000 (4096 bytes, or one memory page). This is why 0x401000 is a common starting address.

Line-by-line explanation

.section .data: switches to the read-and-write data section, where data such as strings are stored.

.msg: this label is a reference to the location of the string "Hello, world!\n" that comes after.

.string "Hello, world!\n": defines the string (including a newline) and stores it at the location of .msg. The string is null-terminated automatically.
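The automatic null terminator is the reason why the string occupies one byte more in memory than the 14 bytes we ask the kernel to write. The following C sketch shows the same distinction, since a C string literal is stored with a trailing '\0' in the same way; the names msg, stored_size and printable_length are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Same bytes the .string directive emits, including the trailing '\0'. */
static const char msg[] = "Hello, world!\n";

/* Bytes stored in memory: the visible characters plus the terminator. */
size_t stored_size(void)
{
    return sizeof msg;
}

/* Bytes worth writing: everything up to, but excluding, the terminator. */
size_t printable_length(void)
{
    return strlen(msg);
}
```

Here stored_size() gives 15 and printable_length() gives 14, the latter being the value loaded into EDX in the assembly code.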

.section .text: the .text section is where the executable code is stored, as described before. This is typically read-only and the code cannot modify it at runtime (this being enforced by the operating system). This protection ensures that executable code is not accidentally or maliciously altered. The .text section is typically marked as executable upon linking, meaning the CPU is allowed to execute instructions from this section. In a linked executable, the .text section typically starts at a well-defined address, which is determined by the linker during the linking phase. For example, in many Linux systems using ELF format, the .text section is placed at an address like 0x401000, as explained earlier.

.globl: this directive is used to declare a symbol (label) as global, meaning that it can be referenced from other files or outside the current assembly file.

_start: as described above, the _start label is the default entry point for programs that bypass the standard C runtime initialisation.

movl $14, %edx: stores the value 14 in the register EDX. This is the lower 32-bit part of the full, general purpose register RDX. The conventional naming of x86-64 general purpose registers, taking RDX as an example, is:

  • RDX: The full 64-bit register.
  • EDX: The lower 32 bits of RDX.
  • DX: The lower 16 bits of RDX.
  • DL: The lower 8 bits of RDX.
  • DH: The upper 8 bits of the lower 16 bits of RDX.

Here, we store a 32-bit integer (14). This is the length of the string "Hello, world!\n", including the newline. In this particular case, however, EDX plays a particular role (see below at the syscall section) as this represents the third argument passed to system calls.
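The relation between RDX and its sub-registers listed above is just a matter of selecting bits. As a rough illustration, the following C functions extract from a 64-bit value the parts that the CPU would expose as EDX, DX, DL and DH (the function names are made up for this sketch):

```c
#include <stdint.h>

/* EDX: the lower 32 bits of RDX */
uint32_t low32(uint64_t rdx) { return (uint32_t)rdx; }

/* DX: the lower 16 bits of RDX */
uint16_t low16(uint64_t rdx) { return (uint16_t)rdx; }

/* DL: the lower 8 bits of RDX */
uint8_t low8(uint64_t rdx) { return (uint8_t)rdx; }

/* DH: bits 8 to 15 of RDX, the upper half of DX */
uint8_t high8_of_low16(uint64_t rdx) { return (uint8_t)(rdx >> 8); }
```

For example, if RDX held 0x1122334455667788, then EDX would read 0x55667788, DX 0x7788, DL 0x88 and DH 0x77.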

leaq .msg(%rip), %rsi: the leaq instruction is the Load Effective Address instruction for 64-bit registers. It is commonly used to compute memory addresses or perform arithmetic without actually accessing memory. The instruction leaq computes the address or value of the memory operand and stores that computed address into the destination register.

The instruction leaq .msg(%rip), %rsi is used to compute the effective address of the label .msg relative to the Instruction Pointer register (RIP), and then store that computed address into the RSI register (which is a full 64-bit general purpose register: addresses are 64 bits long on a 64-bit machine...). The RSI register is also a special one, because it is used to pass the second argument to system calls (see the syscall section).

The %rip register contains the address of the next instruction to be executed. RIP-relative addressing calculates the memory address as an offset from the current instruction pointer, making it particularly useful in position-independent code (common in shared libraries and executables).
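The arithmetic behind RIP-relative addressing can be checked against the disassembly shown earlier. In the following sketch (the function name and its parameters are made up for this illustration), the target address is obtained by adding the signed 32-bit displacement to the address of the next instruction, that is, the address of the current instruction plus its length:

```c
#include <stdint.h>

/* Address referenced by an instruction that uses disp(%rip) addressing. */
uint64_t rip_relative_target(uint64_t insn_addr, uint64_t insn_len, int32_t disp)
{
    uint64_t next_rip = insn_addr + insn_len;     /* RIP points to the next instruction */
    return next_rip + (uint64_t)(int64_t)disp;    /* displacement is sign-extended */
}
```

For the lea instruction in the linked executable (address 0x401005, 7 bytes long, displacement 0xff4) this gives 0x40100c + 0xff4 = 0x402000, which is where objdump reports .msg to be.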

In this particular case, RSI holds the second argument passed to the system call, namely the address of the bytes to write (see the syscall section).

movl $1, %edi: stores the value 1 in the register EDI. This will be used to indicate the stdout file descriptor (stdin being typically 0, stdout 1 and stderr 2). EDI (and RDI) is a special register, because it is used to pass the first argument to system calls (see the syscall section).

movl $1, %eax: stores the value 1 in the register EAX. This is the number that identifies the system call sys_write, which writes bytes to a file descriptor. The syscall numbers are defined in some system header files, for example /usr/include/x86_64-linux-gnu/asm/unistd_64.h, where the first 7 syscalls are defined as:

#define __NR_read 0
#define __NR_write 1
#define __NR_open 2
#define __NR_close 3
#define __NR_stat 4
#define __NR_fstat 5
#define __NR_lstat 6

Here, since we know already that sys_write is associated with the numerical constant 1, we store it directly.

The EAX register (or the full RAX one) is a special one that is used by the syscall instruction to determine which system call to invoke.

syscall: execute the system call. The syscall instruction in x86-64 is a special CPU instruction used to transition from user mode to kernel mode, allowing user-space programs to request services from the operating system kernel. This instruction is crucial for executing system calls, which are controlled entry points into the kernel, enabling user programs to interact with hardware or perform privileged operations like file I/O, memory management, or process control. Notice that syscall is the modern replacement for the older int 0x80 interrupt instruction used in 32-bit systems.

In x86-64 based Linux systems, system calls use specific registers to pass the system call number and its arguments. The registers are loaded with values before invoking the syscall instruction. The RAX register contains the system call number, which tells the kernel which service is being requested. Other registers contain the arguments to the system call, as one can see from the Linux kernel source linux/arch/x86_64/entry.S:

/*
 * Registers on entry:
 * rax  system call number
 * rcx  return address
 * r11  saved rflags (note: r11 is callee-clobbered register in C ABI)
 * rdi  arg0
 * rsi  arg1
 * rdx  arg2
 * r10  arg3 (needs to be moved to rcx to conform to C ABI)
 * r8   arg4
 * r9   arg5
 */

When syscall is executed, the kernel takes over and runs the sys_write system call, passing to it the value 1 (stdout) as first argument (stored in EDI), the address of the string Hello, world! as second argument (stored in RSI), and the number of bytes to write, 14, as third argument (stored in EDX).

As we mentioned, this way we are not making use of the libc wrappers like write(), but are invoking the kernel's system call directly. This is roughly equivalent to the following C code:

#define _GNU_SOURCE       // make sure glibc declares syscall()
#include <unistd.h>       // for the syscall() function
#include <sys/syscall.h>  // for syscall numbers (SYS_write)

int main(void) {
    const char *message = "Hello, world!\n";
    long bytes_written;

    // Call sys_write through the generic syscall() function:
    // 1 is the file descriptor for stdout, 14 the number of bytes to write
    bytes_written = syscall(SYS_write, 1, message, 14);
    (void)bytes_written;  // the return value is ignored, as in the assembly version

    return 0;  // Exit the program
}

The system call sys_write writes the raw bytes of the string to the standard output file descriptor, starting from its initial address and ending 14 bytes afterwards.
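For contrast, this is roughly how the same output would be produced through the libc wrapper write() mentioned above, which takes care of loading the registers and executing syscall for us. The helper name write_message is made up for this sketch, and the length is computed with strlen() instead of being hard-coded.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes a null-terminated string to stdout
   through the libc wrapper. Returns the number of bytes written,
   or -1 on error. */
ssize_t write_message(const char *message)
{
    return write(STDOUT_FILENO, message, strlen(message));
}
```

Calling write_message("Hello, world!\n") asks the kernel to write 14 bytes to file descriptor 1, exactly as the assembly code does; on success the return value is the number of bytes actually written.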

Summarising, the registers used here (EAX, EDI, RSI, EDX) are used to:

  • EAX: select the system call (1 being sys_write)
  • EDI: pass the first argument to write (the stdout identifier 1)
  • RSI: pass the address where the first character (byte) of the string is stored
  • EDX: pass the number of bytes that need to be written to stdout.

mov $60, %rax: here we're getting ready to call another syscall, namely sys_exit (whose identifying number is 60), by storing the value 60 in RAX.

xor %rdi, %rdi: again, the RDI register is used to pass the first argument to the system call. The sys_exit syscall accepts only one argument, and calling xor %rdi, %rdi is a quick way to set it identically to zero.

[!NOTE] Using xor to set a register to zero is a typical pattern because it uses minimal resources. Using mov could potentially be slower and/or require more resources. In particular:

  • xor %rdi, %rdi
    • Operation: Clears the RDI register by XOR-ing it with itself, effectively setting RDI to 0.
    • Latency: On modern CPUs, this is an extremely fast instruction. Since no memory access is required and it only operates on the register itself, it typically takes 1 clock cycle.
    • Micro-optimization: The xor of a register with itself is recognized by modern processors as a "zeroing idiom". Many CPUs handle it already at the register-renaming stage, so it may not even occupy an execution unit, and it does not have to wait for the previous value of the register.
  • mov $0, %rdi
    • Operation: Moves the immediate value 0 into the RDI register.
    • Latency: The mov instruction with an immediate value usually takes 1 clock cycle as well on modern CPUs. However, it has to carry the immediate value (0) inside the instruction, so it requires slightly more resources compared to xor.
    • Instruction Size: The size of mov $0, %rdi is 7 bytes with the encoding used by the GNU assembler (since it has to encode the immediate value), against 3 bytes for xor %rdi, %rdi. Although this doesn't affect the latency directly, it can impact instruction fetching and decoding.

Also note that:

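  • xor %rdi, %rdi is 3 bytes long in machine code: (0x48, 0x31, 0xFF), as the disassembly above shows.
    • 0x48 is the REX prefix that tells the machine to operate on full 64-bit registers (RDI instead of EDI)
    • 0x31 is the opcode for a xor operation whose source operand is a register
    • 0xFF (11111111 in binary) is the so-called ModR/M Byte and encodes the mode of operation: 11 for register-to-register, then 111 twice to identify the RDI register twice.
  • mov $0, %rdi is 7 bytes long in machine code: (0x48, 0xC7, 0xC7, 0x00, 0x00, 0x00, 0x00).
    • 0x48 is the REX prefix that tells the machine to operate on full 64-bit registers (RDI instead of EDI)
    • 0xC7 is the opcode for a mov of a 32-bit immediate value
    • 0xC7 (11000111 in binary) is the ModR/M Byte: 11 for a register operand, 000 as an extension of the opcode (it completes the selection of the mov instruction), and 111 to specify the RDI register
    • 0x00, four times: the 32-bit immediate value (0), stored starting from its lowest byte and sign-extended to 64 bits by the CPU
  • The 32-bit forms are shorter because they need no REX prefix: xor %edi, %edi takes 2 bytes (0x31, 0xFF) and mov $0, %edi takes 5 bytes. They still clear the whole of RDI, because writing to a 32-bit register zeroes the upper 32 bits of the full register.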

Both instructions typically take 1 cycle on modern processors, but xor %rdi, %rdi is generally more efficient due to microarchitectural optimizations and smaller instruction size. This might not have a particular performance effect in this context, but it is a good excuse to explain how assembly code is translated to machine language!
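If you want to check the ModR/M bytes discussed above by yourself, the byte can be split into its three bit fields with a few shifts and masks. This is a minimal sketch: the struct and function names are made up for this illustration, while the field names mod, reg and rm follow the usual terminology of the Intel manuals.

```c
#include <stdint.h>

/* The three bit fields of a ModR/M byte. */
struct modrm {
    unsigned mod; /* bits 7-6: addressing mode (3 means register operand) */
    unsigned reg; /* bits 5-3: register number or opcode extension */
    unsigned rm;  /* bits 2-0: register (or memory) operand */
};

struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m;

    m.mod = (byte >> 6) & 0x3;
    m.reg = (byte >> 3) & 0x7;
    m.rm = byte & 0x7;
    return m;
}
```

Decoding 0xFF gives mod = 3, reg = 7, rm = 7 (RDI appears twice), while 0xC7 gives mod = 3, reg = 0, rm = 7.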

syscall: finally, we call sys_exit passing zero as an argument, signaling the outer world that everything was OK.

We are now at the end of our introductory tour of assembly language. We have also introduced the structure of an ELF executable, and some details on how assembly code is translated to machine code, ready to be executed by the CPU.