This post is a dive into how the boot process works on an x86-based CPU. There may still be some holes in my knowledge and things that I’ve misunderstood.

All of the boot code was tested using Bochs.


Real Mode - 16-bit

More about the boot process that I didn't cover here: https://yangbolong.github.io/2017/02/12/lab1/

When the CPU is powered on, it initialises itself into a known good state and begins executing instructions at the default starting address of 0xFFFFFFF0. This address lies within a region of memory that is mapped to ROM - specifically the one which contains the BIOS (Intel, 2021, chapter 9.10).

Intel, 2021


From this point onwards, the BIOS has control of the CPU & will begin performing tasks which get the computer ready for use, such as (Pellegrini, 2018, pp.18-19):

  • perform a power-on self-test
  • load & execute any boot configurations
  • initialise video adapters & any other devices
  • shadow-copy itself into RAM for faster access
  • identify the bootloader using the boot configuration & load it at 0x7C00

The figure below shows the state of RAM after a successful boot:

Pellegrini, 2018


You may have noticed that there hasn’t been much user intervention - the CPU has gotten itself into a known good state and is ready to execute whatever is at address 0x7C00 (provided that some form of bootable media was provided). This is great because it means that we are in real mode (AMD, n.d.)! Code can now be executed in a 16-bit environment.

Real Mode has its positives and negatives (OSDev, n.d. [a]) such as:

  • restricted to less than 1MiB of total usable space
  • access to BIOS interrupts which provide us with a collection of functions for drawing to the screen or changing CPU modes
  • programs only utilise one core
  • unprotected memory space

Below is a small code snippet which, if successfully assembled, should boot your computer in 16-bit real mode and write “Hello World!” to the screen.

Real Mode - Hello World Assembly Example

; nasm instruction
bits 16

; starting position
org 0x7c00 

entry:
    jmp boot

; ascii bytes: 10 = line feed (new line), 13 = carriage return, 0 = null-termination
boot_msg db 10, "Hello World!", 13, 10, 0

printer:
    lodsb               ; loads the byte at address DS:SI into AL and advances SI
    or al, al           ; sets the zero flag if AL is 0 (end-of-string)
    jz printer_end      ; if so, we can return/exit this function
    int 0x10            ; call interrupt 0x10 - video BIOS services
    jmp printer         ; loop

printer_end:
    ret ; return flow

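boot: 
    xor ax, ax          ; the BIOS does not guarantee the value of DS,
    mov ds, ax          ; so zero it to match org 0x7c00
    cld                 ; make lodsb walk forwards through the string

    mov si, boot_msg    ; move the address of the message into source register (SI)
    mov ah, 0x0e        ; select video services mode
                        ; -> "write text in teletype mode"
    call printer        ; call printer function to print string in SI

halt:
    hlt                 ; end execution
    jmp halt            ; an interrupt wakes the CPU up again, so halt again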

; a boot sector is 512 bytes: pad with zeroes up to byte 510
times 510 - ($ - $$) db 0

; boot signature (bytes 0x55, 0xAA) which marks the sector as bootable
dw 0xaa55

You may choose to do this differently, but I created a bootable ISO by assembling the above code with NASM and passing the output BIN file through a custom utility - I talk about this more in my post here.
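However the image is built, the two rules at the end of the listing (512 bytes in total, ending in the boot signature) are easy to check. Here's a minimal sketch in Python of that check. The function name and the placeholder code bytes are my own for illustration, and a real check would read the BIN file produced by NASM instead.

```python
BOOT_SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # dw 0xaa55 is stored little-endian: 0x55 first, then 0xAA


def is_boot_sector(image: bytes) -> bool:
    """Return True if the first 512 bytes end with the 0x55, 0xAA signature."""
    if len(image) < BOOT_SECTOR_SIZE:
        return False
    return image[510:512] == BOOT_SIGNATURE


# Build a fake sector the same way the listing does: code, zero padding, signature.
fake_code = b"\xeb\xfe"  # placeholder bytes standing in for the assembled code
sector = fake_code + bytes(510 - len(fake_code)) + BOOT_SIGNATURE

print(len(sector), is_boot_sector(sector))
```

The padding length mirrors `times 510 - ($ - $$) db 0`: whatever the code doesn't use up to byte 510 is filled with zeroes.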

If all goes well, you should see something like this:

16bit bootloader working



Protected Mode - 32-bit

In order to make the transition from 16-bit into 32-bit, there are two things that must first be done:

  • Enabling the A20 line
  • Loading the Global Descriptor Table (GDT)

The A20 Line

The 8086 is a 16-bit CPU, so it should only be able to address 2^16 bytes… right? Actually, it turns out that the CPU has a 20-bit wide address bus, allowing it to address 2^20 bytes instead (that’s 1MiB vs 64KiB) (UMBC, 2011). So… how are you meant to access all of it if the registers are narrower than the address bus?

Segment:Offset Addressing

Due to the size difference between the address bus and registers, Intel devised the segment:offset addressing method. This allowed the 8086 to access the 1MiB of RAM but it came with a quirk - the addressing method allowed various different combinations of segments & offsets to refer to the same absolute memory position.

Calculating the destination address is simple:

destination address = (segment * 0x10) + offset

To illustrate how different combinations can end up at the same address, we’ll use 0x7C00 as the destination:

segment   offset   calculation                 result
0007      7B90     (0x7 * 0x10) + 0x7B90       0x7C00
0008      7B80     (0x8 * 0x10) + 0x7B80       0x7C00
0009      7B70     (0x9 * 0x10) + 0x7B70       0x7C00
000A      7B60     (0xA * 0x10) + 0x7B60       0x7C00
0201      5BF0     (0x201 * 0x10) + 0x5BF0     0x7C00
01FF      5C10     (0x1FF * 0x10) + 0x5C10     0x7C00

…and so on.
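To double check the table, here's a small Python sketch of the same formula. The function name is my own; the segment and offset pairs are the ones listed above.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode address translation: (segment * 0x10) + offset."""
    return (segment << 4) + offset


pairs = [
    (0x0007, 0x7B90),
    (0x0008, 0x7B80),
    (0x0009, 0x7B70),
    (0x000A, 0x7B60),
    (0x0201, 0x5BF0),
    (0x01FF, 0x5C10),
]

for seg, off in pairs:
    print(f"{seg:04X}:{off:04X} -> {linear_address(seg, off):#06x}")
```

Shifting left by 4 bits is the same as multiplying by 0x10, which is why moving the segment up by one moves the address by 16 bytes.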

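A side effect of this scheme is that addresses near the top of the range overflow 20 bits: 0xFFFF:0x0010 works out to 0x100000, which needs a 21st address line. The 8086 only had 20, so the extra bit was dropped and the address wrapped around to 0x00000. This was called “memory wrap-around” and programs either intentionally or unintentionally relied on it (Necasek, 2018) - this meant bad things for backwards compatibility if a later CPU didn’t implement it!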

And such was the case - the 80286 had a 24-bit address bus, so addresses above 1MiB no longer wrapped around as they did on the 8086. To keep existing software working, IBM decided that it would be a good idea to implement a switch which would enable/disable the 21st address line (A20), and thus the A20 gate was born.
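The wrap-around is easier to see with numbers. The sketch below is my own simplified model (not how the hardware is actually wired): it computes the segment:offset address and forces bit 20 to zero when the gate is off, which is what makes 0xFFFF:0x0010 land back at address 0.

```python
def bus_address(segment: int, offset: int, a20_enabled: bool) -> int:
    """Address that reaches the bus, with A20 either passed through or forced to 0."""
    linear = (segment << 4) + offset  # can be as large as 0x10FFEF (21 bits)
    if not a20_enabled:
        linear &= ~(1 << 20)          # gate off: address bit 20 is held low
    return linear


print(hex(bus_address(0xFFFF, 0x0010, a20_enabled=False)))  # 0x0
print(hex(bus_address(0xFFFF, 0x0010, a20_enabled=True)))   # 0x100000
```

With the gate off, the first byte above 1MiB is the same byte as address 0, which is exactly the behaviour that old software expected.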

Enabling the A20 Line

By enabling the A20 line, the 21st address bit is no longer forced to zero. This means that a 32-bit CPU is able to successfully address 2^32 bytes, also known as 4GiB!

The OSDev Wiki describes multiple ways to enable the A20 line and the best way to go about it (OSDev, n.d. [b]), however in my implementation I only utilise the BIOS method.

The 32-bit Global Descriptor Table

A Global Descriptor Table (also known as the GDT) is loaded by the user into the CPU - it is a special data structure that describes controlled memory access and is required in order to move into protected mode.

Intel, 2021


The Intel Developers Manual describes the different sections (Intel, 2021, pp 3-10; 3-12) like so (paraphrased):

  • Segment Limit is a combination of two fields to form a 20 bit value. The segment size depends on the granularity (G) flag - if disabled then the granularity is 1 byte; if enabled then the granularity is 4KiB – this is what allows 4GiB of memory to be addressed.

  • Base Address defines the location of byte 0 of the segment within the 4GiB address space. This is put together from three base address fields to form a single 32bit value.

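  • Type Field defines the segment type and specifies what kind of access can be made on that segment. How it is interpreted depends on the S flag: for code and data segments its bits select code or data, read/write access, conforming/expand-down and accessed.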

  • S Flag specifies whether the segment descriptor is for a system segment S = 0 or code/data segment S = 1

  • DPL (Descriptor Privilege Level) Field is used to set the privilege level of the segment ranging from 0 to 3 - this relates to the privilege ring where ring 0 = kernel (most privileged) to ring 3 = user space (least privileged).

  • P Flag specifies whether the segment is present P = 1, or not P = 0.

  • D/B Flag is set to 1 for 32 bit code & data segments.

  • G Flag specifies the scaling of the segment limit field. When G = 0 then the limit is interpreted in byte units, when G = 1, it is interpreted in 4KiB units.

  • L Flag is used for indicating whether the segment contains native 64 bit code - since we’re trying to get into 32 bit, we set this to zero.

The GDT must contain at least three entries: a null segment, code segment and data segment.
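To see how those fields end up scattered across the 8 bytes of an entry, here's a rough Python sketch that packs a descriptor. The function and parameter names are my own; `access` is the byte holding P, DPL, S and Type, and `flags` is the 4-bit group holding G, D/B, L and AVL.

```python
import struct


def encode_descriptor(base: int, limit: int, access: int, flags: int) -> bytes:
    """Pack a segment descriptor into its 8-byte in-memory layout."""
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                # limit bits 0-15
        base & 0xFFFF,                                 # base bits 0-15
        (base >> 16) & 0xFF,                           # base bits 16-23
        access,                                        # P | DPL | S | Type
        ((flags & 0xF) << 4) | ((limit >> 16) & 0xF),  # flags + limit bits 16-19
        (base >> 24) & 0xFF,                           # base bits 24-31
    )


# Flat 4GiB ring 0 segments: base 0, limit 0xFFFFF, G = 1, D/B = 1
code32 = encode_descriptor(base=0, limit=0xFFFFF, access=0b10011010, flags=0b1100)
data32 = encode_descriptor(base=0, limit=0xFFFFF, access=0b10010010, flags=0b1100)

print(code32.hex(), data32.hex())
```

The byte order matches a sequence of dw/dw/db/db/db/db directives, so the output can be compared directly against a hand-written table.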

32-bit GDT Example

; GLOBAL DESCRIPTOR TABLE FOR 32 BIT MODE
; GDT32.asm

GDT32:
    .Null: equ $ - GDT32
    dq 0            ; defines 64 bits of zeroes for the null entry
    .Code: equ $ - GDT32
    dw 0xFFFF       ; segment limit
    dw 0            ; base address
    db 0            ; base address (again)
    
    ; [from right to left]
    ; 0 = accessed flag (set to 1 on first access by the cpu)
    ; 1 = readable segment
    ; 0 = 'conforming' - can less privileged code jump into this segment?
    ; 1 = executable (1 = code, 0 = data)
    ; 1 = descriptor type (S flag): 1 = code/data segment, 0 = system segment
    ; 00 = privilege level (00 = ring 0/kernel/os)
    ; 1 = is the segment present?
    db 0b10011010

    ; [from right to left]
    ; 1111 (0xF) = last bits in the segment limit
    ; 0 = 'available to system programmers' - the cpu ignores it
    ; 0 = L flag (64 bit code segment), must be zero for 32 bit
    ; 1 = size - 1 = 32bit, 0 = 16bit
    ; 1 = granularity - 0: limit is in 1 byte units, 1: limit is in 4KiB units
    ;           0xFFFFF units of 4KiB covers the full 4GiB
    db 0b11001111

    db 0            ; last remaining 8 bits on the base address
    .Data: equ $ - GDT32
    dw 0xFFFF       ; --|
    dw 0            ;   | - identical to code segment
    db 0            ; --|

    ; [from right to left]
    ; 0 - accessed flag
    ; 1 - write access?
    ; 0 - segment expands upwards from the base address
    ; 0 - code(1)/data(0) segment
    ; 1 - is a code/data segment?
    ; 00 - privilege level (ring 0)
    ; 1 - is the segment present?
    db 0b10010010

    ; [from right to left]
    ; 1111 - last bits in the segment limit
    ; 0 - 'available to system programmers'?
    ; 0 - intel reserved, should always be zero
    ; 1 - 'big'? should be set to allow for 4GB
    ; 1 - granularity
    db 0b11001111
    
    db 0
    .Pointer:
    dw $ - GDT32 - 1
    dd GDT32

How do we know if we have actually set it to address 4GiB?

  1. Take the two segment limit values and combine them: 0xF (upper 4 bits) & 0xFFFF (lower 16 bits) gives us 0xFFFFF
  2. Multiply this value by the granularity (if G = 0 then multiply by 0x1, if G = 1 then multiply by 0x1000): 0xFFFFF * 0x1000 = 0xFFFFF000
  3. When G = 1, the lowest 12 bits of an offset are not checked against the limit, so add 0xFFF to get the last addressable byte: 0xFFFFF000 + 0xFFF = 0xFFFFFFFF
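The same calculation as a quick Python sketch (the function name is my own). It assumes the usual reading of the G flag: with G = 1 the 20-bit limit counts 4KiB units and the low 12 bits of the result are all ones, and with G = 0 the limit is used as a byte count unchanged.

```python
def effective_limit(limit_20bit: int, granularity: int) -> int:
    """Highest valid offset in a segment, given the 20-bit limit and the G flag."""
    if granularity:
        # 4KiB units: shift left by 12 and fill the low 12 bits with ones
        return (limit_20bit << 12) | 0xFFF
    return limit_20bit  # byte units


limit = (0xF << 16) | 0xFFFF           # upper 4 bits + lower 16 bits = 0xFFFFF

print(hex(effective_limit(limit, 1)))  # 0xffffffff
print(hex(effective_limit(limit, 0)))  # 0xfffff
```

So with G = 0 the same limit value only covers 1MiB, and it is the granularity flag that stretches it to 4GiB.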

Booting into 32-bit

With the GDT ready to be used, we now only need to load it into the CPU using the lgdt instruction, set bit 0 (PE) of control register 0 to enable protected mode and finally perform a far jump to load CS with the new code segment selector (Intel, 2021, pp 9-13).
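The jump target and the segment registers are loaded with selectors rather than addresses. In the GDT listing, the `.Code` and `.Data` labels are byte offsets into the table, and since every entry is 8 bytes long those offsets also work as selectors with the low 3 bits left at zero. A small Python sketch of how a selector splits up (the function name is my own):

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector into its three fields."""
    return {
        "index": selector >> 3,     # which table entry (each entry is 8 bytes)
        "ti": (selector >> 2) & 1,  # table indicator: 0 = GDT, 1 = LDT
        "rpl": selector & 0b11,     # requested privilege level
    }


# Byte offsets of the entries in the GDT listing: Null = 0, Code = 8, Data = 16
for name, offset in (("Null", 0x00), ("Code", 0x08), ("Data", 0x10)):
    print(name, decode_selector(offset))
```

A ring 3 selector for the same entries would have the low two bits set, for example 0x1B for entry 3.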

bits 16 ; instruction for nasm
org 0x7c00

entry:
    jmp boot

%include "GDT32.asm"

boot:
    ; lgdt reads its operand through DS, so make sure DS matches org 0x7c00
    xor ax, ax
    mov ds, ax

    ; enabling a20 gate
    mov ax, 0x2401
    int 0x15

    ; changing to text mode
    mov ax, 0x3
    int 0x10

    cli

    ; load global descriptor table (gdt) with a pointer to the descriptor
    lgdt [GDT32.Pointer] 

    ; enabling protected mode
    mov eax, cr0
    or eax, 1
    mov cr0, eax

    ; far jump: loads CS with the 32 bit code selector and flushes the prefetch queue
    jmp GDT32.Code:now_protected_boot


bits 32 ; nasm instruction

printer:
    printer_loop:
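        lodsb                   ; load next character from DS:ESI into AL
        or al, al               ; end of string?
        jz printer_end
        mov ah, 0x0F            ; attribute byte: white text on black background
        mov word [ebx], ax      ; write character + attribute into vga memory
        add ebx, 2              ; each cell on screen takes 2 bytes
        jmp printer_loop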

    printer_end:
        ret


now_protected_boot:
    mov ax, GDT32.Data      ; --|
    mov ds, ax              ;   |
    mov es, ax              ;   |
    mov ss, ax              ;   | - loading up the segment registers with the data segment selector
    mov fs, ax              ;   |
    mov gs, ax              ; --|
    mov esp, 0x7c00         ; stack grows downwards from just below the bootloader

    mov esi, boot_msg
    mov ebx, 0xb8000 ; vga memory start
    call printer

    hlt

boot_msg    db "Hello World in 32 Bit!", 0
times 510 - ($ - $$) db 0
dw 0xaa55
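Without the BIOS to print for us, the 32 bit printer writes straight into VGA text memory. Each character cell is 2 bytes: the ASCII code in the low byte and a colour attribute in the high byte. Here's a small Python sketch of that layout. The helper names are mine, and I'm assuming the standard 80 column text mode that `mov ax, 0x3` / `int 0x10` selects.

```python
VGA_TEXT_BASE = 0xB8000
COLUMNS = 80  # text mode 3 is 80 columns by 25 rows


def vga_cell(char: str, attribute: int = 0x0F) -> int:
    """16-bit cell value: attribute in the high byte, ASCII code in the low byte."""
    return (attribute << 8) | ord(char)


def cell_address(row: int, col: int) -> int:
    """Address of a character cell; each cell takes 2 bytes."""
    return VGA_TEXT_BASE + 2 * (row * COLUMNS + col)


print(hex(vga_cell("H")))       # 0xf48
print(hex(cell_address(0, 1)))  # 0xb8002
print(hex(cell_address(1, 0)))  # 0xb80a0
```

The attribute 0x0F is white text on a black background, and stepping the pointer by 2 moves one cell to the right.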


If successful, you should see something like this:

Long Mode - 64-bit

We can build off the fact that we have existing code which takes us from 16 bit into 32 bit, and now move into 64 bit mode.

The 64-bit Global Descriptor Table

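It is based on the 32 bit GDT. The example below puts some values in the null segment instead of leaving it all zeroes, but the CPU never uses the first entry, so this has no effect. The change that matters is in the code segment: the L flag is set and the D/B flag is cleared, which marks it as 64 bit code.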
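The difference between the two code segments comes down to one byte: 0b11001111 in the 32 bit table versus 0b10101111 in the 64 bit one. Here's a short Python sketch that decodes that byte (the function name is my own):

```python
def decode_flags_byte(value: int) -> dict:
    """Decode the descriptor byte that holds G, D/B, L, AVL and limit bits 16-19."""
    return {
        "G": (value >> 7) & 1,
        "D/B": (value >> 6) & 1,
        "L": (value >> 5) & 1,
        "AVL": (value >> 4) & 1,
        "limit_high": value & 0xF,
    }


print(decode_flags_byte(0b11001111))  # 32 bit code segment
print(decode_flags_byte(0b10101111))  # 64 bit code segment
```

The 32 bit entry has D/B = 1 and L = 0, while the 64 bit entry swaps them to D/B = 0 and L = 1.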

64-bit GDT Example

; GLOBAL DESCRIPTOR TABLE FOR 64 BIT MODE
; GDT64.asm

; sources
; https://github.com/sedflix/lame_bootloader/
; https://wiki.osdev.org/Setting_Up_Long_Mode

GDT64:
    .Null: equ $ - GDT64
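    dw 0xFFFF            ; the cpu never reads the null entry,
    dw 0                 ; so these values have no effect
    db 0
    db 0
    db 1
    db 0
    .Code: equ $ - GDT64
    dw 0                 ; limit (low) - not checked in 64 bit mode
    dw 0                 ; base (low)
    db 0                 ; base (middle)
    db 10011010b         ; access: present, ring 0, code segment, readable
    db 10101111b         ; flags: G = 1, D = 0, L = 1 (64 bit code) + limit (high)
    db 0                 ; base (high)
    .Data: equ $ - GDT64 
    dw 0                 ; limit (low)
    dw 0                 ; base (low)
    db 0                 ; base (middle)
    db 10010010b         ; access: present, ring 0, data segment, writable
    db 00000000b         ; flags + limit (high)
    db 0                 ; base (high)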
    .Pointer:            
    dw $ - GDT64 - 1     
    dq GDT64

Booting into 64-bit

(The next portion is heavily based on the OSDev wiki page for entering long mode (OSDev, n.d. [c]))

In order to enter long mode, the CPU must have a suitable GDT loaded, and paging must be enabled with PAE switched on via the control registers and the page tables set up properly.

Long mode paging uses 4 levels of tables:

  • Page-Map Level-4 Table (PML4T) which forms the root of the hierarchy
  • Page-Directory Pointer Table (PDPT)
  • Page-Directory Table (PDT)
  • Page Table (PT)
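Each table holds 512 entries of 8 bytes, so every level consumes 9 bits of the virtual address and the last 12 bits are the offset into a 4KiB page. Here's a rough Python sketch of that split, assuming 4KiB pages and 48-bit virtual addresses (the function name is my own):

```python
def split_virtual_address(vaddr: int) -> dict:
    """Break a 48-bit virtual address into its 4-level paging indices."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,  # 9 bits: index into the PML4T
        "pdpt": (vaddr >> 30) & 0x1FF,  # 9 bits: index into the PDPT
        "pd": (vaddr >> 21) & 0x1FF,    # 9 bits: index into the PDT
        "pt": (vaddr >> 12) & 0x1FF,    # 9 bits: index into the PT
        "offset": vaddr & 0xFFF,        # 12 bits: byte within the 4KiB page
    }


print(split_virtual_address(0xB8000))
```

For the VGA text buffer at 0xB8000, the first three indices are all 0 and the page table index is 184 (0xB8), so only one table per level is needed to reach it.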

We can set up the tables like so (OSDev, n.d. [c]). This is an excerpt; the full code (excluding the 64 bit GDT) is after this:

mov edi, 0x1000 ; starting address of 0x1000 
mov cr3, edi    ; move base address of the PML4T into control register 3 (https://wiki.osdev.org/CPU_Registers_x86)

xor eax, eax ; set eax to 0
mov ecx, 4096 ; 4096 dwords = 16KiB, enough to cover all four tables

rep stosd ; for ECX times, store EAX value at whatever position EDI points to, incrementing/decrementing as you go
          ; (https://stackoverflow.com/questions/3818856/what-does-the-rep-stos-x86-assembly-instruction-sequence-do)
          ; this effectively sets the tables to zero

mov edi, cr3 ; restore the original starting address

; according to https://wiki.osdev.org/Setting_Up_Long_Mode , this will set up the pointers to the other tables
; each entry holds the address of the next table; the low bits are flags, so adding 0x0003 sets
;   bit 0 (page is present) and bit 1 (page is readable/writeable)
mov dword [edi], 0x2003
add edi, 0x1000
mov dword [edi], 0x3003
add edi, 0x1000
mov dword [edi], 0x4003
add edi, 0x1000

; at this stage:
; PML4T is at 0x1000
; PDPT is at 0x2000
; PDT is at 0x3000
; PT is at 0x4000

; used to identity map the first 2MiB (see https://wiki.osdev.org/Setting_Up_Long_Mode)
mov ebx, 0x00000003
mov ecx, 512

.set_entry:
    mov dword [edi], ebx
    add ebx, 0x1000
    add edi, 8
    loop .set_entry

; enable PAE paging by changing the control register value
mov eax, cr4
or eax, 1 << 5
mov cr4, eax

; setting the long mode bit and enabling paging (this enters us into compatibility mode)
mov ecx, 0xC0000080     ; magic value actually refers to the EFER MSR 
                            ;       -> 'extended feature enable register' : model specific register
rdmsr                   ; read model specific register
or eax, 1 << 8          ; set long-mode bit (bit 8)
wrmsr                   ; write back to model specific register

mov eax, cr0
or eax, 1 << 31 | 1 << 0         ; set PG bit (31st) & PE bit (0th)
mov cr0, eax
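To convince myself that these tables really do identity map the first 2MiB, the setup can be modelled in Python with a bytearray standing in for RAM. This is a simplified sketch under my own assumptions: it only looks at the present bit and the address bits of each entry, and the helper names are made up.

```python
import struct

PRESENT_RW = 0x3  # bit 0 = present, bit 1 = read/write

memory = bytearray(0x5000)  # zeroed, like the area cleared by rep stosd


def write_entry(addr: int, value: int) -> None:
    struct.pack_into("<Q", memory, addr, value)  # entries are 8 bytes each


def read_entry(addr: int) -> int:
    return struct.unpack_from("<Q", memory, addr)[0]


# Same layout as the assembly: PML4T -> PDPT -> PDT -> PT
write_entry(0x1000, 0x2000 | PRESENT_RW)
write_entry(0x2000, 0x3000 | PRESENT_RW)
write_entry(0x3000, 0x4000 | PRESENT_RW)
for i in range(512):
    write_entry(0x4000 + i * 8, (i * 0x1000) | PRESENT_RW)


def translate(vaddr: int, cr3: int = 0x1000) -> int:
    """Walk the four tables and return the physical address."""
    table = cr3
    for shift in (39, 30, 21, 12):
        entry = read_entry(table + ((vaddr >> shift) & 0x1FF) * 8)
        if not entry & 1:
            raise LookupError(f"page not present for {vaddr:#x}")
        table = entry & ~0xFFF  # strip the flag bits to get the next address
    return table | (vaddr & 0xFFF)


print(hex(translate(0xB8000)), hex(translate(0x7C00)))
```

Every address below 0x200000 translates to itself, and anything above it raises an error because only the first entry of each upper table is filled in. 512 page table entries of 4KiB each is where the 2MiB figure comes from.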

We now load the 64 bit GDT, whose code segment has the L flag set, and make a far jump into it.

Here’s a complete example of booting into real mode, switching to protected and then switching to long mode:

org 0x7c00

entry:
    jmp real_to_protected

%include "GDT32.asm"
%include "GDT64.asm"


bits 16 ; nasm instruction

; 16 bits to 32 bits
real_to_protected:

    ; enable a20 gate
    mov ax, 0x2401
    int 0x15

    ; change video mode
    mov ax, 0x3
    int 0x10

    cli
    lgdt [GDT32.Pointer]

    ; enable protected mode
    mov eax, cr0
    or eax, 1
    mov cr0, eax

    ; perform far jump
    jmp GDT32.Code:protected_to_long


[bits 32]
protected_to_long:

    ; set up registers
    mov ax, GDT32.Data
    mov ds, ax
    mov fs, ax
    mov gs, ax
    mov ss, ax

mov edi, 0x1000 ; starting address of 0x1000 
mov cr3, edi    ; move base address of the PML4T into control register 3 (https://wiki.osdev.org/CPU_Registers_x86)

xor eax, eax ; set eax to 0
mov ecx, 4096 ; 4096 dwords = 16KiB, enough to cover all four tables

rep stosd ; for ECX times, store EAX value at whatever position EDI points to, incrementing/decrementing as you go
          ; (https://stackoverflow.com/questions/3818856/what-does-the-rep-stos-x86-assembly-instruction-sequence-do)
          ; this effectively sets the tables to zero

mov edi, cr3 ; restore the original starting address

; according to https://wiki.osdev.org/Setting_Up_Long_Mode , this will set up the pointers to the other tables
; each entry holds the address of the next table; the low bits are flags, so adding 0x0003 sets
;   bit 0 (page is present) and bit 1 (page is readable/writeable)
mov dword [edi], 0x2003
add edi, 0x1000
mov dword [edi], 0x3003
add edi, 0x1000
mov dword [edi], 0x4003
add edi, 0x1000

; at this stage:
; PML4T is at 0x1000
; PDPT is at 0x2000
; PDT is at 0x3000
; PT is at 0x4000

; used to identity map the first 2MiB (see https://wiki.osdev.org/Setting_Up_Long_Mode)
mov ebx, 0x00000003
mov ecx, 512

.set_entry:
    mov dword [edi], ebx
    add ebx, 0x1000
    add edi, 8
    loop .set_entry

; enable PAE paging by changing the control register value
mov eax, cr4
or eax, 1 << 5
mov cr4, eax

; setting the long mode bit and enabling paging (this enters us into compatibility mode)
mov ecx, 0xC0000080     ; magic value actually refers to the EFER MSR 
                            ;       -> 'extended feature enable register' : model specific register
rdmsr                   ; read model specific register
or eax, 1 << 8          ; set long-mode bit (bit 8)
wrmsr                   ; write back to model specific register

mov eax, cr0
or eax, 1 << 31 | 1 << 0         ; set PG bit (31st) & PE bit (0th)
mov cr0, eax

    lgdt [GDT64.Pointer]
    jmp GDT64.Code:real_long_mode

[bits 64]

printer:
    printer_loop:
        lodsb
        or al, al ; if zero
        jz printer_exit

        mov ah, 0x0F            ; attribute byte: white text on black background
        mov word [rbx], ax      ; each vga cell is 2 bytes, so only write AX
        add rbx, 2
        jmp printer_loop

    printer_exit:
        ret

real_long_mode:
    cli

    mov ax, GDT64.Data
    mov ds, ax
    mov fs, ax
    mov gs, ax
    mov ss, ax

    mov rsi, boot_msg
    mov rbx, 0xb8000
    call printer

    hlt

boot_msg db "Hello World in 64 bit!",0
times 510 - ($ - $$) db 0
dw 0xaa55

Bibliography

Intel (2021) Intel 64/IA-32 Developer Manual Volume 3: System Programming. Intel. Available from https://www.intel.co.uk/content/www/uk/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html

Pellegrini, A. (2018) x86 Initial Boot Sequence. Universita Di Roma. Available from https://alessandropellegrini.it/didattica/2017/aosv/1.Initial-Boot-Sequence.pdf

AMD (n.d.) AMD64 Architecture Programming Manual Volume 2: System Programming. AMD. Available from https://www.amd.com/system/files/TechDocs/24593.pdf

OSDev (n.d. [a]) Real Mode. OSDev Wiki. Available from https://wiki.osdev.org/Real_Mode

UMBC (2011) Segments and Registers. University of Maryland, Baltimore County. Available from https://courses.cs.umbc.edu/undergraduate/CMSC211/fall01/burt/lectures/Chap12/segmentsOffsets.html

Necasek, M (2018) The A20-Gate Fallout. OS/2 Museum. Available from https://www.os2museum.com/wp/the-a20-gate-fallout/

OSDev (n.d. [b]) A20 Line. OSDev Wiki. Available from https://wiki.osdev.org/A20_Line

OSDev (n.d. [c]) Setting Up Long Mode. OSDev Wiki. Available from https://wiki.osdev.org/Setting_Up_Long_Mode