Section 1: Fundamentals of Assembly Language

Section 1 Summary

This section introduces the basics of low-level programming, the role of the assembler, and essential CPU concepts. By the end, you will have the foundational vocabulary and context needed for reading assembly code.

Lesson 1.1: Introduction to Low-level Programming Concepts

Learning Objectives

  • Define what assembly language is in the context of low-level programming.
  • Distinguish between high-level and low-level code, and explain why assembly is necessary.

Prerequisites

  • Basic programming knowledge (e.g., in C/C++).
  • Understanding of what a compiler does.

Key Concepts

  • Low-level vs. High-level: Assembly sits close to machine code and hardware; high-level languages hide those details.
  • Abstraction Layers: Higher-level languages vs. direct hardware instructions.
  • Assembly Language: A human-readable representation of machine instructions.

Detailed Explanation

Assembly language is a symbolic representation of the instructions executed directly by a CPU. Each instruction is encoded as a specific binary pattern: an opcode, followed by any bytes that describe its operands. Reading assembly bridges the gap between what high-level code says and what the CPU does.

Consider this progression from high-level to low-level:

// High-level C code
int result = a + b;

// Compiler translates to assembly (simplified)
mov eax, [a]      ; Load value of 'a' into EAX register
add eax, [b]      ; Add value of 'b' to EAX
mov [result], eax ; Store result back to memory

// Assembler translates to machine code (hexadecimal), with a, b, and result
// as stack locals at ebp-4, ebp-8, and ebp-12
8B 45 FC          ; mov eax, [ebp-4]
03 45 F8          ; add eax, [ebp-8]
89 45 F4          ; mov [ebp-12], eax
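
To make the mapping between the bytes and the mnemonics above more concrete, here is a minimal Python sketch that splits each three-byte encoding into its fields. The function name decode_modrm_disp8 is hypothetical, and the sketch assumes the specific layout used in this example (one opcode byte, one ModRM byte, one 8-bit displacement); real x86 decoding handles many more forms.

```python
def decode_modrm_disp8(encoding: bytes) -> dict:
    """Split a 3-byte 'opcode, ModRM, disp8' encoding into its fields."""
    opcode, modrm, disp = encoding
    return {
        "opcode": opcode,
        "mod": modrm >> 6,             # 0b01: an 8-bit displacement follows
        "reg": (modrm >> 3) & 0b111,   # register operand (0 = EAX)
        "rm": modrm & 0b111,           # base register (5 = EBP when mod is 0b01)
        "disp": disp - 256 if disp >= 128 else disp,  # sign-extend disp8
    }

for raw in ("8B 45 FC", "03 45 F8", "89 45 F4"):
    print(raw, decode_modrm_disp8(bytes.fromhex(raw)))
```

All three instructions share the same ModRM byte (0x45: EAX with [EBP+disp8]); only the opcode and the displacement (-4, -8, -12) change.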

Why Learn Assembly?

  • Performance Analysis: Understanding what the CPU actually executes
  • Debugging: Low-level debugging often requires reading disassembly
  • Reverse Engineering: Analyzing software without source code
  • System Programming: Writing drivers, operating systems, embedded code
  • Security: Vulnerability research and exploit development

Exercises & Practice Problems

Question: What is the difference between assembly language and machine language?

Answer: Assembly language uses symbolic mnemonics (like mov, add) and human-readable register names. Machine language is the binary encoding (opcodes) that the CPU executes directly. Assembly is translated to machine language by an assembler.

Question: Why might a developer need to read assembly code in real-world scenarios?

Answer: For performance tuning (understanding compiler optimizations), debugging complex issues (especially when stepping through disassembly), reverse engineering (analyzing software without source), security research (finding vulnerabilities), and embedded/systems programming where direct hardware control is needed.

Recommended Resources

  • Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture
  • "Introduction to Assembly Language" – MIT OpenCourseWare
  • "Programming from the Ground Up" by Jonathan Bartlett

Lesson 1.2: Role of the Assembler in the Development Process

Learning Objectives

  • Understand the role of assemblers in converting assembly to machine code.
  • Distinguish between assemblers, compilers, and disassemblers.

The Development Toolchain

The assembler is a crucial part of the software development process:

Source Code (.c, .cpp, .rs, etc.)
           ↓ [Compiler]
Assembly Code (.s, .asm)
           ↓ [Assembler]
Object Code (.o, .obj)
           ↓ [Linker]
Executable (.exe, ELF, Mach-O)

What Assemblers Do

  • Translation: Convert mnemonics to opcodes
  • Symbol Resolution: Handle labels and symbolic addresses
  • Addressing: Calculate memory addresses and offsets
  • Object File Generation: Create relocatable object files
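
The translation and symbol-resolution steps above can be sketched as a toy two-pass assembler in Python. This is an illustration under stated assumptions, not how any real assembler is implemented: it supports only nop (encoded as 90) and short jmp (EB followed by an 8-bit relative offset), the assemble function is a hypothetical name, and it does not check that jump targets are within rel8 range.

```python
def assemble(lines):
    """Two-pass assembly of a tiny subset: 'nop', 'jmp <label>', '<label>:'."""
    sizes = {"nop": 1, "jmp": 2}

    # Pass 1: record the byte offset of every label.
    labels, offset = {}, 0
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = offset
        else:
            offset += sizes[line.split()[0]]

    # Pass 2: emit bytes, resolving label references to relative offsets.
    code = bytearray()
    for line in lines:
        if line.endswith(":"):
            continue
        mnemonic, *operands = line.split()
        if mnemonic == "nop":
            code.append(0x90)
        elif mnemonic == "jmp":
            # rel8 is measured from the end of the 2-byte jmp instruction.
            rel = labels[operands[0]] - (len(code) + 2)
            code += bytes([0xEB, rel & 0xFF])
    return bytes(code)

program = ["start:", "nop", "jmp end", "nop", "end:", "jmp start"]
print(assemble(program).hex(" "))  # 90 eb 01 90 eb fa
```

Two passes are needed because "jmp end" refers to a label that has not been seen yet when the jump is first read.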

Common Assemblers

Assembler   Platform         Syntax   Use Case
NASM        Cross-platform   Intel    Learning, x86 development
GAS (as)    Unix/Linux       AT&T     GNU toolchain, GCC output
MASM        Windows          Intel    Microsoft development
YASM        Cross-platform   Intel    NASM-compatible

Syntax Differences: Intel vs AT&T

Two major syntax styles exist for x86 assembly:

Aspect              Intel Syntax               AT&T Syntax
Operand Order       mov dest, src              mov src, dest
Register Prefix     None: eax                  Percent: %eax
Immediate Prefix    None: 5                    Dollar: $5
Memory Addressing   [base+index*scale+disp]    disp(base,index,scale)
Size Suffixes       Explicit: DWORD PTR        Mnemonic: movl (l=long)

Key Differences

  • Operand Order: Intel (dest, src) vs AT&T (src, dest)
  • Prefixes: AT&T uses % for registers, $ for immediates
  • Suffixes: AT&T uses b/w/l/q for byte/word/long/quad
  • Memory: Intel [base+offset] vs AT&T offset(base)
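
The rules above are mechanical enough to sketch in Python. The converter below is a minimal illustration, not a real translator: the names att_operand and intel_to_att are hypothetical, and it assumes 32-bit, two-operand instructions whose operands are a register, an immediate, or a simple [base], [base+disp], or [base-disp] memory reference.

```python
import re

REGS32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}

def att_operand(op: str) -> str:
    """Convert one Intel-syntax operand to AT&T syntax."""
    op = op.strip()
    if op in REGS32:
        return "%" + op                        # register: add the % prefix
    m = re.fullmatch(r"\[(\w+)(?:([+-])(\w+))?\]", op)
    if m:                                      # memory: [base+disp] -> disp(%base)
        base, sign, disp = m.groups()
        prefix = "" if disp is None else ("-" if sign == "-" else "") + disp
        return f"{prefix}(%{base})"
    return "$" + op                            # anything else: treat as immediate

def intel_to_att(instr: str) -> str:
    """Swap operand order and add the 'l' (32-bit) suffix."""
    mnemonic, operands = instr.split(None, 1)
    dest, src = operands.split(",")
    return f"{mnemonic}l {att_operand(src)}, {att_operand(dest)}"

print(intel_to_att("mov ecx, [ebp-4]"))  # movl -4(%ebp), %ecx
print(intel_to_att("add eax, 5"))        # addl $5, %eax
```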

Exercises & Practice Problems

Question: Convert this Intel syntax to AT&T syntax: mov eax, [ebx+8]

Answer: movl 8(%ebx), %eax. Note the reversed operand order, the % register prefix, and the disp(base) memory addressing format.

Question: What's the difference between an assembler and a disassembler?

Answer: An assembler converts assembly language to machine code, while a disassembler does the reverse: it converts machine code back to assembly language. Disassemblers are used for reverse engineering and debugging.

Lesson 1.3: CPU Architecture - Registers, Memory, and Execution Model

Learning Objectives

  • Understand the x86-64 register set and the purpose of each register.
  • Grasp the basic CPU execution model and memory hierarchy.

x86-64 Register Set

x86-64 provides 16 general-purpose registers, each 64 bits wide:

64-bit   32-bit     16-bit     8-bit      Purpose
RAX      EAX        AX         AL         Accumulator (arithmetic, return values)
RBX      EBX        BX         BL         Base (general purpose)
RCX      ECX        CX         CL         Counter (loops, string operations)
RDX      EDX        DX         DL         Data (arithmetic, I/O)
RSI      ESI        SI         SIL        Source Index (string operations)
RDI      EDI        DI         DIL        Destination Index
RSP      ESP        SP         SPL        Stack Pointer
RBP      EBP        BP         BPL        Base Pointer (frame pointer)
R8-R15   R8D-R15D   R8W-R15W   R8B-R15B   Additional general-purpose registers

Special Purpose Registers

  • RIP: Instruction Pointer (program counter)
  • RFLAGS: Status flags (zero, carry, sign, etc.)
  • Segment Registers: CS, DS, ES, FS, GS, SS

Memory Hierarchy

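Understanding the memory hierarchy helps in reading assembly code, because the cost of an instruction depends heavily on whether its operands sit in registers, cache, or main memory. The sizes below are typical figures and vary by processor: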

CPU Registers     (fastest, smallest)
    ↓
L1 Cache         (very fast, ~32KB)
    ↓  
L2 Cache         (fast, ~256KB)
    ↓
L3 Cache         (fast, ~8MB)
    ↓
Main Memory      (slower, GBs)
    ↓
Storage          (slowest, TBs)

Basic CPU Execution Cycle

  1. Fetch: Get instruction from memory at RIP address
  2. Decode: Interpret the instruction opcode and operands
  3. Execute: Perform the operation
  4. Write-back: Store results to registers/memory
  5. Update RIP: Point to next instruction

; Example execution trace
mov rax, 42          ; 1. Fetch this instruction
                     ; 2. Decode: move immediate 42 to RAX
                     ; 3. Execute: load value 42
                     ; 4. Write-back: RAX = 42
                     ; 5. RIP += instruction_length

add rax, 8           ; Next instruction...
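
The five steps above can be sketched as a loop over a toy CPU in Python. This is a deliberately simplified model under stated assumptions: the run function and the tuple-based instruction format are hypothetical, RIP counts instructions rather than bytes, and real processors overlap these stages instead of running them strictly one after another.

```python
def run(program):
    """Execute a list of (mnemonic, dest, immediate) tuples on a toy CPU."""
    regs = {"rax": 0, "rip": 0}
    while regs["rip"] < len(program):
        mnemonic, dest, imm = program[regs["rip"]]   # fetch + decode
        if mnemonic == "mov":                        # execute
            result = imm
        elif mnemonic == "add":
            result = (regs[dest] + imm) & (2**64 - 1)  # wrap to 64 bits
        else:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        regs[dest] = result                          # write-back
        regs["rip"] += 1                             # advance to next instruction
    return regs

print(run([("mov", "rax", 42), ("add", "rax", 8)]))  # {'rax': 50, 'rip': 2}
```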

Exercises & Practice Problems

Question: What's the difference between RAX, EAX, AX, and AL?

Answer: They refer to different portions of the same register: RAX (all 64 bits), EAX (lower 32 bits), AX (lower 16 bits), AL (lower 8 bits). Writing to EAX zeroes the upper 32 bits of RAX automatically, whereas writing to AX or AL leaves the remaining upper bits unchanged.
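
The sub-register behavior in this answer can be modeled with a few bit masks. The Python sketch below is illustrative only; write_subreg is a hypothetical helper that returns the new 64-bit value of RAX after a write to one of its views.

```python
MASK64 = (1 << 64) - 1

def write_subreg(rax: int, name: str, value: int) -> int:
    """Return the new 64-bit RAX after writing `value` to RAX, EAX, AX, or AL."""
    if name == "rax":
        return value & MASK64
    if name == "eax":
        return value & 0xFFFFFFFF                            # upper 32 bits zeroed
    if name == "ax":
        return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)   # upper 48 bits kept
    if name == "al":
        return (rax & ~0xFF & MASK64) | (value & 0xFF)       # upper 56 bits kept
    raise ValueError(name)

rax = 0x1122334455667788
print(hex(write_subreg(rax, "al", 0xAA)))     # 0x11223344556677aa
print(hex(write_subreg(rax, "ax", 0xBBBB)))   # 0x112233445566bbbb
print(hex(write_subreg(rax, "eax", 0xCC)))    # 0xcc
```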

Question: Why does x86-64 have both general-purpose and special-purpose registers?

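Answer: General-purpose registers (RAX, RBX, etc.) can hold operands for most instructions, while special-purpose registers such as RIP and RFLAGS have fixed functions that the CPU hardware depends on and cannot be used as ordinary operands. RSP sits in between: it is encoded like a general-purpose register, but push, pop, call, and ret use it implicitly as the stack pointer, so in practice it is reserved for that role. This design balances flexibility with functionality.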

Lesson 1.4: Basic Assembly Instructions

Learning Objectives

  • Classify different instruction types (data movement, arithmetic, etc.).
  • Apply fundamental assembly instructions in short code snippets.

Prerequisites

  • Understanding of registers and CPU architecture.
  • Familiarity with basic binary and hexadecimal notation.

Key Concepts

  • Data Movement: mov, push, pop
  • Arithmetic: add, sub, mul, div
  • Logical/Bitwise: and, or, xor, not
  • Comparison/Testing: cmp, test

Detailed Explanation

Assembly instructions can be categorized into several fundamental types:

Data Movement Instructions

  • mov copies data from a source (register, memory, or immediate) to a destination (register or memory); the source is unchanged
  • push and pop store to and load from the stack, adjusting RSP
  • lea (Load Effective Address) computes an address without accessing memory

Arithmetic Instructions

  • Arithmetic instructions write their result to the destination operand, and most update RFLAGS
  • add, sub perform addition and subtraction
  • mul, imul for multiplication (unsigned/signed)
  • div, idiv for division (unsigned/signed)

Logical Instructions

  • Logical instructions perform bitwise operations
  • and, or, xor for boolean logic
  • not for bitwise complement (inverts every bit)
  • shl, shr for shifting bits left and right

Comparison Instructions

  • cmp computes dest - src, sets flags (in RFLAGS) for subsequent conditional jumps, and discards the result
  • test performs bitwise AND and sets flags without storing the result

; Example instruction sequence
mov rax, 10         ; rax = 10
add rax, 5          ; rax = 15
cmp rax, 20         ; computes 15 - 20 and sets flags (SF=1, CF=1); rax is unchanged
sub rax, 3          ; rax = 12 (also overwrites the flags set by cmp)
and rax, 0xF        ; rax = 12 & 15 = 12 (no change in this case)

Flag Effects

Many instructions affect the CPU flags register (RFLAGS):

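  • ZF (Zero Flag): Set if the result is zero
  • SF (Sign Flag): Set if the result is negative, i.e. its most significant bit is 1
  • CF (Carry Flag): Set if unsigned overflow occurs (a carry out of, or a borrow into, the most significant bit)
  • OF (Overflow Flag): Set if signed overflow occurs (the result does not fit in the signed range)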
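
As a rough check of these definitions, the following Python sketch computes the four flags for an add of two 64-bit values. The add_flags helper is hypothetical and models only add; other instructions set (or leave undefined) these flags in different ways.

```python
def add_flags(a: int, b: int, bits: int = 64) -> dict:
    """Model the result and ZF/SF/CF/OF of 'add a, b' at the given width."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    full = a + b                 # unbounded sum, used to detect the carry
    result = full & mask         # what actually lands in the register
    return {
        "result": result,
        "ZF": result == 0,
        "SF": bool(result & sign),
        "CF": full > mask,
        # Signed overflow: both inputs share a sign that the result lacks.
        "OF": bool((a ^ result) & (b ^ result) & sign),
    }

print(add_flags(0xFFFFFFFFFFFFFFFF, 1))   # wraps to 0: ZF and CF set
print(add_flags(0x7FFFFFFFFFFFFFFF, 1))   # largest positive + 1: SF and OF set
```

The two calls show why CF and OF are separate flags: the first add overflows only when the operands are read as unsigned, the second only when they are read as signed.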

Exercises & Practice Problems

Question: After executing add rax, rbx, which flags might be set?

Answer: Zero flag (ZF) if the result is 0, sign flag (SF) if the result is negative, overflow flag (OF) if there is signed overflow, carry flag (CF) if there is unsigned overflow.

Exercise: Write assembly that computes 3 * 5 + 2. Verify in a debugger that rax holds the correct result.

Solution:

mov rax, 3          ; rax = 3
mov rbx, 5          ; rbx = 5  
imul rax, rbx       ; rax = 3 * 5 = 15
add rax, 2          ; rax = 15 + 2 = 17

Question: What's the difference between mul and imul?

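Answer: mul performs unsigned multiplication, while imul performs signed multiplication. mul has only a one-operand form: with a 64-bit operand it multiplies RAX by that operand and stores the 128-bit product in RDX:RAX. imul has the same one-operand form plus two- and three-operand forms (such as imul rax, rbx above) that keep only the low 64 bits. The low 64 bits of the product are identical for signed and unsigned operands; only the upper half differs.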
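
A small Python model can show where the two instructions actually differ. The helpers mul64, imul64, and to_signed are hypothetical names; the sketch assumes the one-operand 64-bit forms, which produce a 128-bit product split across RDX (high half) and RAX (low half).

```python
MASK64 = (1 << 64) - 1

def to_signed(x: int) -> int:
    """Reinterpret a 64-bit pattern as a two's-complement signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def mul64(a: int, b: int):
    """Unsigned multiply: returns the (RDX, RAX) halves of the 128-bit product."""
    product = (a & MASK64) * (b & MASK64)
    return product >> 64, product & MASK64

def imul64(a: int, b: int):
    """Signed multiply: returns (RDX, RAX) as raw 64-bit patterns."""
    product = (to_signed(a) * to_signed(b)) & ((1 << 128) - 1)
    return product >> 64, product & MASK64

a = MASK64   # 0xFFFF...FFFF: 2**64 - 1 if unsigned, -1 if signed
print([hex(h) for h in mul64(a, 2)])    # ['0x1', '0xfffffffffffffffe']
print([hex(h) for h in imul64(a, 2)])   # ['0xffffffffffffffff', '0xfffffffffffffffe']
```

The low halves match, and only the high halves (RDX) differ.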

Lesson 1.5: Instruction Format and Syntax

Learning Objectives

  • Describe how x86-64 instructions are structured (mnemonic + operands).
  • Interpret opcodes, mnemonics, and operands in various disassembly outputs.

Prerequisites

  • Knowledge of x86-64 instruction classification.
  • Familiarity with a disassembler (e.g., objdump, gdb).

Key Concepts

  • Opcode: The byte or bytes of machine code that identify the operation
  • Mnemonic: Human-readable name (e.g., mov, add)
  • Operands: Register, immediate, or memory references
  • AT&T vs Intel Syntax: Differences in operand order, prefix usage, etc.

Detailed Explanation

Instruction Structure

In assembly source, an x86-64 instruction generally follows this format, with zero to three operands depending on the instruction:

[PREFIX] MNEMONIC [OPERAND1], [OPERAND2], [OPERAND3]

Syntax Differences: Intel vs AT&T

Two major syntax styles exist for x86 assembly:

Aspect              Intel Syntax               AT&T Syntax
Operand Order       mov dest, src              mov src, dest
Register Prefix     None: eax                  Percent: %eax
Immediate Prefix    None: 5                    Dollar: $5
Memory Addressing   [base+index*scale+disp]    disp(base,index,scale)
Size Suffixes       Explicit: DWORD PTR        Mnemonic: movl (l=long)

Examples Side by Side

; Intel Syntax (NASM, MASM, disassemblers)
mov eax, ebx          ; destination, source
add eax, 5            ; register, immediate  
mov eax, [ebx+4]      ; memory addressing
mov DWORD PTR [eax], 42   ; MASM/disassembler form; NASM writes: mov dword [eax], 42

# AT&T Syntax (GAS, GCC output)
movl %ebx, %eax       # source, destination
addl $5, %eax         # immediate, register
movl 4(%ebx), %eax    # memory addressing  
movl $42, (%eax)

Operand Types

  • Register Operands: CPU registers (rax, rbx, etc.)
  • Immediate Operands: Constant values embedded in the instruction
  • Memory Operands: References to memory locations

Machine Code Representation

Assembly instructions are encoded as machine code (opcodes):

Assembly:     mov eax, 42
Machine Code: B8 2A 00 00 00
              │  └─────────── 32-bit immediate value (42), stored little-endian
              └─── Opcode for "mov eax, immediate"
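
The encoding shown above can be reproduced with a short Python sketch. The function name encode_mov_r32_imm32 is hypothetical; the sketch covers only the "mov r32, imm32" form, whose opcode is B8 plus the register's number, followed by the immediate in little-endian byte order.

```python
import struct

# 32-bit registers in their hardware encoding order (eax = 0 ... edi = 7).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def encode_mov_r32_imm32(reg: str, imm: int) -> bytes:
    """Encode 'mov reg, imm' for a 32-bit register and 32-bit immediate."""
    opcode = 0xB8 + REG32.index(reg)
    # "<I" packs an unsigned 32-bit integer in little-endian order.
    return bytes([opcode]) + struct.pack("<I", imm & 0xFFFFFFFF)

print(encode_mov_r32_imm32("eax", 42).hex(" "))   # b8 2a 00 00 00
print(encode_mov_r32_imm32("ebx", 1).hex(" "))    # bb 01 00 00 00
```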

Disassembly Output Examples

Different tools show assembly in various formats:

objdump (AT&T syntax by default; -M intel switches to Intel syntax):

$ objdump -d program
  400546: 48 89 e5                mov    %rsp,%rbp
  400549: c7 45 fc 2a 00 00 00    movl   $0x2a,-0x4(%rbp)

gdb (can switch between syntaxes):

(gdb) set disassembly-flavor intel
(gdb) disas
   0x400546: mov    rbp,rsp
   0x400549: mov    DWORD PTR [rbp-0x4],0x2a

Exercises & Practice Problems

Question: In Intel syntax, is the destination typically on the left or right?

Answer: The destination is on the left (dest, src). This is the opposite of AT&T syntax where source comes first.

Question: What does the 'l' suffix mean in AT&T syntax (e.g., movl)?

Answer: The 'l' suffix indicates "long" or 32-bit operation. Other suffixes are 'b' (byte/8-bit), 'w' (word/16-bit), and 'q' (quad/64-bit).
