This post is a dive into how the boot process works on an x86-based CPU. There may still be some holes in my knowledge, and things that I’ve misunderstood.
All of the boot code was tested using Bochs.
Real Mode - 16-bit
More about the boot process that I didn't cover here: https://yangbolong.github.io/2017/02/12/lab1/
When the CPU is powered on, it initialises itself into a known good state and begins executing instructions at the default starting address of 0xFFFFFFF0. This address falls within the portion of memory that is mapped to a ROM, specifically the one that contains the BIOS (Intel, 2021, chapter 9.10).
From this point onwards, the BIOS has control of the CPU & will begin performing tasks which get the computer ready for use, such as (Pellegrini, 2018, pp.18-19):
- perform a power-on self-test
- load & execute any boot configurations
- initialise video adapters & any other devices
- shadowcopy itself into RAM for faster access
- identify the bootloader using the boot configuration & load it at 0x7C00
The figure below shows the state of RAM after a successful boot:
You may have noticed that there hasn’t been much user intervention: the CPU has gotten itself into a known good state and is ready to execute code at address 0x7C00 (provided that some form of bootable media was provided). This is great because it means that we are in real mode (AMD, n.d.)! Code can now be executed in a 16-bit environment.
Real mode has its positives and negatives (OSDev, n.d. [a]), such as:
- restricted to less than 1MiB of total usable space
- access to BIOS interrupts which provide us with a collection of functions for drawing to the screen or changing CPU modes
- programs only utilise one core
- unprotected memory space
Below is a small code snippet which, if successfully assembled, should boot your computer in 16-bit real mode and write “Hello World!” to the screen.
Real Mode - Hello World Assembly Example
; nasm instruction
bits 16
; starting position
org 0x7c00
entry:
jmp boot
; ascii bytes: 10 = new line, 13 = carriage return, 0 = null-termination
boot_msg db 10, "Hello World!", 13, 10, 0
printer:
lodsb ; loads a byte at address DS:SI into AL
or al, al ; if the byte at AL is 0 (end-of-string)
jz printer_end ; we can return/exit this function
int 0x10 ; call interrupt 0x10 - video BIOS services
jmp printer ; loop
printer_end:
ret ; return flow
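boot:
xor ax, ax ; the BIOS does not guarantee the segment register values,
mov ds, ax ; so zero DS to match 'org 0x7c00'
cld ; make lodsb walk forwards through the string
mov si, boot_msg ; point the source index register (SI) at the message
mov ah, 0x0e ; select video services function 0x0e
; -> "write text in teletype mode"
call printer ; call printer function to print the string at DS:SI
cli ; disable interrupts so that hlt is not woken up again
hlt ; end execution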
; bootloader has to be padded up to 512 bytes (510 + the 2-byte key below)
times 510 - ($ - $$) db 0
; magic bootable key
dw 0xaa55
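The padding and the magic key at the end of the listing can be checked outside of the emulator. Below is a minimal Python sketch of that check; the helper names (`pad_boot_sector`, `is_bootable`) are my own and not part of NASM or any BIOS interface, and the two-byte "program" is just an infinite loop used for illustration.

```python
BOOT_SECTOR_SIZE = 512
# 'dw 0xaa55' is stored little-endian, so the bytes on disk are 0x55 then 0xAA
BOOT_SIGNATURE = b"\x55\xaa"


def pad_boot_sector(code: bytes) -> bytes:
    """Pad raw machine code with zeroes to 510 bytes and append the signature.

    This mirrors 'times 510 - ($ - $$) db 0' followed by 'dw 0xaa55'.
    """
    if len(code) > BOOT_SECTOR_SIZE - 2:
        raise ValueError("boot code must fit in 510 bytes")
    return code + bytes(BOOT_SECTOR_SIZE - 2 - len(code)) + BOOT_SIGNATURE


def is_bootable(sector: bytes) -> bool:
    """Return True if the first sector ends with the 0x55, 0xAA signature."""
    return len(sector) >= BOOT_SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


if __name__ == "__main__":
    image = pad_boot_sector(b"\xeb\xfe")  # 'jmp $' - an infinite loop
    print(len(image), is_bootable(image))
```

The same `is_bootable` check could be pointed at the BIN file that NASM produces, by reading its first 512 bytes.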
You may choose to do this differently, but I created a bootable ISO by assembling the above code with NASM and passing the resulting BIN file through a custom utility. I talk about this more in my post here.
If all goes well, you should see something like this:
Protected Mode - 32-bit
In order to make the transition from 16-bit into 32-bit, there are two things that must first be done:
- Enabling the A20 line
- Loading the Global Descriptor Table (GDT)
The A20 Line
The 8086 is a 16-bit CPU, so it should be able to address 2^16 bytes… right? Actually, it turns out that the CPU has a 20-bit wide address bus, allowing it to address 2^20 bytes instead (that’s 64KiB vs 1MiB) (UMBC, 2011). So… how are you meant to reach all of that memory if the registers are narrower than the address bus?
Segment:Offset Addressing
Due to the size difference between the address bus and the registers, Intel devised the segment:offset addressing method. This allowed the 8086 to access the full 1MiB of RAM, but it came with a quirk: many different combinations of segments & offsets refer to the same absolute memory position.
Calculating the destination address is simple:
destination address = (segment * 0x10) + offset
To illustrate how different combinations can end up at the same address, we’ll use 0x7C00 as the destination:
segment | offset | calculation | result |
---|---|---|---|
0007 | 7B90 | (0x7 * 0x10) + 0x7B90 | 0x7C00 |
0008 | 7B80 | (0x8 * 0x10) + 0x7B80 | 0x7C00 |
0009 | 7B70 | (0x9 * 0x10) + 0x7B70 | 0x7C00 |
000A | 7B60 | (0xA * 0x10) + 0x7B60 | 0x7C00 |
0201 | 5BF0 | (0x201 * 0x10) + 0x5BF0 | 0x7C00 |
01FF | 5C10 | (0x1FF * 0x10) + 0x5C10 | 0x7C00 |
…and so on.
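The formula is easy to play with in a few lines of Python. This is only a sketch for checking the arithmetic in the table; `linear_address` and `aliases` are names I made up for this post.

```python
def linear_address(segment: int, offset: int) -> int:
    """destination address = (segment * 0x10) + offset"""
    return (segment * 0x10) + offset


def aliases(target: int, limit: int = 5) -> list:
    """Return up to `limit` (segment, offset) pairs that resolve to `target`."""
    found = []
    for segment in range(0x10000):          # every 16-bit segment value
        offset = target - segment * 0x10
        if 0 <= offset <= 0xFFFF:            # offset must fit in 16 bits
            found.append((segment, offset))
            if len(found) == limit:
                break
    return found


if __name__ == "__main__":
    # the rows from the table above
    for seg, off in [(0x7, 0x7B90), (0x201, 0x5BF0), (0x1FF, 0x5C10)]:
        print(f"{seg:04X}:{off:04X} -> {linear_address(seg, off):#06x}")
    # a few more combinations that land on 0x7C00
    for seg, off in aliases(0x7C00):
        print(f"{seg:04X}:{off:04X}")
```

Every segment value from 0x0000 up to 0x07C0 has exactly one matching offset, which is why there are so many ways to spell the same address.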
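Segment:offset addressing also makes it possible to form addresses just beyond 1MiB: the largest combination, 0xFFFF:0xFFFF, works out to 0x10FFEF. The 8086 only has 20 address lines, so the carry into the 21st bit was lost and these addresses wrapped around to the start of memory. This was called “memory wrap-around” and programs either intentionally or unintentionally relied on it (Necasek, 2018), which meant bad things for backwards compatibility if later CPUs didn’t implement it!
And such was the case: the 80286 has a wider address bus and did not wrap addresses around the way the 8086 did. To stay backwards compatible, IBM decided that it would be a good idea to add a switch which enables/disables the 21st address line, and thus the A20 switch was born.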
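To make the wrap-around concrete, here is a small Python model of what the A20 switch does to an address. It is a sketch under my own assumptions: `physical_address` is a made-up helper, and the gate is modelled simply as forcing address bit 20 (the 21st line) to zero.

```python
A20_BIT = 1 << 20  # the 21st address line, counting from bit 0


def physical_address(segment: int, offset: int, a20_enabled: bool) -> int:
    """Resolve a real-mode segment:offset pair, optionally masking line A20."""
    linear = (segment << 4) + offset
    if not a20_enabled:
        linear &= ~A20_BIT  # A20 held low: addresses past 1MiB wrap to the bottom
    return linear


if __name__ == "__main__":
    for a20 in (False, True):
        addr = physical_address(0xFFFF, 0x0010, a20)
        print(f"A20 enabled={a20}: FFFF:0010 -> {addr:#x}")
```

With the gate disabled, FFFF:0010 lands on address 0, exactly like it would on an 8086; with it enabled, the same pair reaches the first byte above 1MiB.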
Enabling the A20 Line
By enabling the A20 line, the 21st address bit is no longer forced to zero. This means that a 32-bit CPU is able to address its full 2^32 bytes, also known as 4GiB!
The OSDev Wiki describes multiple ways to enable the A20 line and the best way to go about it (OSDev, n.d. [b]); in my implementation I only use the BIOS method.
The 32-bit Global Descriptor Table
A Global Descriptor Table (also known as the GDT) is a special data structure, built by the programmer and loaded into the CPU, that describes how memory segments may be accessed. It is required in order to move into protected mode.
The Intel Developers Manual describes the different sections (Intel, 2021, pp 3-10; 3-12) like so (paraphrased):
- Segment Limit is a combination of two fields that form a 20-bit value. The segment size depends on the granularity (G) flag: if it is clear, the limit is counted in 1-byte units; if it is set, the limit is counted in 4KiB units, which is what allows 4GiB of memory to be addressed.
- Base Address defines the location of byte 0 of the segment within the 4GiB address space. It is put together from three base address fields to form a single 32-bit value.
- Type defines the segment type and specifies what kinds of access can be made to the segment. Its meaning depends on whether the descriptor is for a code, data or system segment.
- S Flag specifies whether the segment descriptor is for a system segment (S = 0) or a code/data segment (S = 1).
- DPL (Descriptor Privilege Level) sets the privilege level of the segment, ranging from 0 to 3. This relates to the privilege rings, where ring 0 = kernel (most privileged) and ring 3 = user space (least privileged).
- P Flag specifies whether the segment is present (P = 1) or not (P = 0).
- D/B Flag is set to 1 for 32-bit code & data segments.
- G Flag specifies the scaling of the segment limit field. When G = 0 the limit is interpreted in byte units; when G = 1 it is interpreted in 4KiB units.
- L Flag indicates whether the segment contains native 64-bit code. Since we’re trying to get into 32-bit mode, we set this to zero.
The GDT must contain at least three entries: a null segment, code segment and data segment.
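Because the base and limit are scattered across the 8-byte descriptor, I find it helps to see the packing written out. The Python sketch below builds one descriptor from its fields; `make_descriptor` and its parameter names are my own, and the field layout is my reading of the Intel manual rather than anything official.

```python
import struct


def make_descriptor(base: int, limit: int, access: int, flags: int) -> bytes:
    """Pack a segment descriptor into its 8-byte in-memory form.

    base   - 32-bit base address
    limit  - 20-bit segment limit
    access - P, DPL, S and Type packed into one byte
    flags  - G, D/B, L and AVL packed into one nibble
    """
    if not 0 <= limit <= 0xFFFFF:
        raise ValueError("limit must fit in 20 bits")
    if not 0 <= base <= 0xFFFFFFFF:
        raise ValueError("base must fit in 32 bits")
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                  # limit bits 0-15
        base & 0xFFFF,                                   # base bits 0-15
        (base >> 16) & 0xFF,                             # base bits 16-23
        access & 0xFF,                                   # access byte
        ((flags & 0xF) << 4) | ((limit >> 16) & 0xF),    # flags + limit bits 16-19
        (base >> 24) & 0xFF,                             # base bits 24-31
    )


if __name__ == "__main__":
    null = bytes(8)
    code = make_descriptor(0, 0xFFFFF, 0b10011010, 0b1100)
    data = make_descriptor(0, 0xFFFFF, 0b10010010, 0b1100)
    for name, entry in (("null", null), ("code", code), ("data", data)):
        print(name, entry.hex(" "))
```

The code entry comes out as `ff ff 00 00 00 9a cf 00`, which is the same byte sequence that the `dw`/`db` lines of a flat 4GiB code segment spell out by hand.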
32-bit GDT Example
; GLOBAL DESCRIPTOR TABLE FOR 32 BIT MODE
; GDT32.asm
GDT32:
.Null: equ $ - GDT32
dq 0 ; defines 64 bits of zeroes for the null entry
.Code: equ $ - GDT32
dw 0xFFFF ; segment limit
dw 0 ; base address
db 0 ; base address (again)
; [from right to left]
; 0 = accessed flag (set to 1 on first access by the cpu)
; 1 = readable segment
; 0 = 'conforming' - is less privileged code allowed to run this segment
; 1 = code or data segment (1 = code, 0 = data)
; 1 = segment is code/data segment? (true(1)/false(0))
; 00 = privilege level (00 = ring 0/kernel/os)
; 1 = is the segment present?
db 0b10011010
; [from right to left]
; 1111 (0xF) = last bits in the segment limit
; 0 = 'available to system programmers' but apparently the cpu ignores it anyway
; 0 = intel reserved, should always be zero
; 1 = size - 1 = 32bit, 0 = 16bit
; 1 = granularity - 0: access in 1 byte blocks, 1: access in 4KiB blocks
; with G = 1 the 20-bit limit 0xFFFFF is counted in 4KiB blocks: (0xFFFFF + 1) * 4KiB = 4GiB
db 0b11001111
db 0 ; last remaining 8 bits on the base address
.Data: equ $ - GDT32
dw 0xFFFF ; --|
dw 0 ; | - identical to code segment
db 0 ; --|
; [from right to left]
; 0 - accessed flag
; 1 - write access?
; 0 - segment expands upwards from the base address
; 0 - code(1)/data(0) segment
; 1 - is a code/data segment?
; 00 - privilege level (ring 0)
; 1 - is the segment present?
db 0b10010010
; [from right to left]
; 1111 - last bits in the segment limit
; 0 - 'available to system programmers'?
; 0 - intel reserved, should always be zero
; 1 - 'big'? should be set to allow for 4GB
; 1 - granularity
db 0b11001111
db 0
.Pointer:
dw $ - GDT32 - 1
dd GDT32
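How do we know that we have actually set it up to address 4GiB?
- Take the two segment limit values and combine them: 0xFFFF & 0xF gives us 0xFFFFF
- Scale this value according to the granularity flag (if G = 0 the limit is in bytes, so multiply by 0x1; if G = 1 the limit is in 4KiB units, so multiply by 0x1000): 0xFFFFF * 0x1000 = 0xFFFFF000
- With G = 1 the lowest 12 bits of an offset are not checked against the limit, so they count as all ones: 0xFFFFF000 + 0xFFF = 0xFFFFFFFF
The highest valid offset is therefore 0xFFFFFFFF, which makes the segment 4GiB in size.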
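Here is the same limit arithmetic as a short Python sketch. The function name is my own, and the scaling rule (shift left by 12 and fill the low 12 bits with ones when G = 1, use the raw value when G = 0) is how I understand the Intel manual's description of the granularity flag.

```python
def effective_limit(limit_low: int, limit_high: int, granularity: bool) -> int:
    """Return the highest valid offset for a segment.

    limit_low   - the 16-bit limit field (e.g. 0xFFFF)
    limit_high  - the upper 4 limit bits stored next to the flags (e.g. 0xF)
    granularity - the G flag
    """
    limit = ((limit_high & 0xF) << 16) | (limit_low & 0xFFFF)
    if granularity:
        # 4KiB units: the low 12 bits of an offset are not limit-checked
        return (limit << 12) | 0xFFF
    return limit  # byte units


if __name__ == "__main__":
    for g in (False, True):
        top = effective_limit(0xFFFF, 0xF, g)
        print(f"G={int(g)}: highest offset {top:#x}, size {top + 1:#x} bytes")
```

With G = 0 the same field values only describe a 1MiB segment, which is why the flag matters so much here.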
Booting into 32-bit
With the GDT ready to be used, we now only need to load it into the CPU using the lgdt instruction, set bit 0 of control register 0 (CR0) to enable protected mode and finally perform a far jump (Intel, 2021, pp 9-13).
bits 16 ; instruction for nasm
org 0x7c00
entry:
jmp boot
%include "GDT32.asm"
boot:
; enabling a20 gate
mov ax, 0x2401
int 0x15
; changing to text mode
mov ax, 0x3
int 0x10
cli
; load global descriptor table (gdt) with a pointer to the descriptor
lgdt [GDT32.Pointer]
; enabling protected mode
mov eax, cr0
or eax, 1
mov cr0, eax
; long (far) jump to load CS with the new code selector and clear the instruction pipeline
jmp GDT32.Code:now_protected_boot
bits 32 ; nasm instruction
printer:
printer_loop:
lodsb
or al, al
jz printer_end
or eax, 0x0F00
mov word [ebx], ax
add ebx, 2
jmp printer_loop
printer_end:
ret
now_protected_boot:
mov ax, GDT32.Data ; --|
mov ds, ax ; |
mov ss, ax ; | - loading up the segment registers with the data segment position
mov fs, ax ; |
mov gs, ax ; --|
mov esi, boot_msg
mov ebx, 0xb8000 ; vga memory start
call printer
hlt
boot_msg db "Hello World in 32 Bit!", 0
times 510 - ($ - $$) db 0
dw 0xaa55
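The 32-bit printer no longer calls the BIOS; it writes straight into VGA text memory, two bytes per character. The Python sketch below shows that encoding. The helper names are mine, and it assumes the 80x25 text mode selected by `mov ax, 0x3`, where each cell is a character byte followed by an attribute byte.

```python
import struct

VGA_TEXT_BASE = 0xB8000   # start of VGA text memory, as used in the listing
COLUMNS = 80              # assuming the 80x25 text mode (mode 3)
WHITE_ON_BLACK = 0x0F     # the attribute that 'or eax, 0x0F00' puts in AH


def vga_cell(char: str, attribute: int = WHITE_ON_BLACK) -> int:
    """Return the 16-bit value for one cell: attribute in the high byte."""
    return (attribute << 8) | ord(char)


def cell_address(row: int, col: int) -> int:
    """Return the address of the cell at (row, col); each cell is 2 bytes."""
    return VGA_TEXT_BASE + 2 * (row * COLUMNS + col)


def render(text: str, attribute: int = WHITE_ON_BLACK) -> bytes:
    """Return the bytes the printer loop would store for `text`."""
    return b"".join(struct.pack("<H", vga_cell(c, attribute)) for c in text)


if __name__ == "__main__":
    print(hex(vga_cell("H")))
    print(hex(cell_address(1, 0)))
    print(render("Hi").hex(" "))
```

This is why the loop adds 2 to EBX after every character: it steps over the character byte and its attribute byte.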
If successful, you should see something like this:
Long Mode - 64-bit
We can build off the fact that we have existing code which takes us from 16 bit into 32 bit, and now move into 64 bit mode.
The 64-bit Global Descriptor Table
It is based on the 32-bit GDT. The code segment now sets the L flag (and clears the D/B flag) to mark it as 64-bit code, and in this example the null segment also carries some information instead of being all zeroes.
64-bit GDT Example
; GLOBAL DESCRIPTOR TABLE FOR 64 BIT MODE
; GDT64.asm
; sources
; https://github.com/sedflix/lame_bootloader/
; https://wiki.osdev.org/Setting_Up_Long_Mode
GDT64:
.Null: equ $ - GDT64
dw 0xFFFF
dw 0
db 0
db 0
db 1
db 0
.Code: equ $ - GDT64
dw 0
dw 0
db 0
db 10011010b
db 10101111b
db 0
.Data: equ $ - GDT64
dw 0
dw 0
db 0
db 10010010b
db 00000000b
db 0
.Pointer:
dw $ - GDT64 - 1
dq GDT64
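To see what the binary literals in the two GDTs actually switch on, the bytes can be pulled apart field by field. This is a Python sketch with made-up helper names; the bit positions follow the descriptor fields described earlier in the post.

```python
def decode_access(byte: int) -> dict:
    """Split the access byte into P, DPL, S and the type bits."""
    return {
        "present": (byte >> 7) & 1,
        "dpl": (byte >> 5) & 0b11,
        "s": (byte >> 4) & 1,           # 1 = code/data, 0 = system
        "executable": (byte >> 3) & 1,  # 1 = code, 0 = data
        "dc": (byte >> 2) & 1,          # conforming / expand-down
        "rw": (byte >> 1) & 1,          # readable (code) / writeable (data)
        "accessed": byte & 1,
    }


def decode_flags(byte: int) -> dict:
    """Split the flags byte into G, D/B, L, AVL and the top 4 limit bits."""
    return {
        "g": (byte >> 7) & 1,
        "db": (byte >> 6) & 1,
        "l": (byte >> 5) & 1,
        "avl": (byte >> 4) & 1,
        "limit_high": byte & 0xF,
    }


if __name__ == "__main__":
    print("32-bit code flags:", decode_flags(0b11001111))
    print("64-bit code flags:", decode_flags(0b10101111))
    print("code access byte: ", decode_access(0b10011010))
```

Comparing the two flag bytes shows the difference between the tables: the 32-bit code segment has D/B = 1 and L = 0, while the 64-bit one has D/B = 0 and L = 1.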
Booting into 64-bit
(The next portion is heavily based on the OSDev wiki page for entering long mode (OSDev, n.d. [c]))
In order to enter long mode, the CPU must have a suitable GDT loaded, and PAE paging must be enabled via the control registers and set up with its own special data structures.
In long mode, PAE paging uses 4 levels of tables:
- Page-Map Level-4 Table (PML4T) which forms the root for PAE
- Page-Directory Pointer Table (PDPT)
- Page-Directory Table (PDT)
- Page Table (PT)
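Each of the four tables holds 512 entries, so each level consumes 9 bits of a virtual address, with the lowest 12 bits left over as the offset into a 4KiB page. A small Python sketch of that split (the function name and dictionary keys are my own):

```python
def split_virtual_address(address: int) -> dict:
    """Break a virtual address into its four table indices and page offset."""
    return {
        "pml4": (address >> 39) & 0x1FF,   # index into the PML4T
        "pdpt": (address >> 30) & 0x1FF,   # index into the PDPT
        "pdt": (address >> 21) & 0x1FF,    # index into the PDT
        "pt": (address >> 12) & 0x1FF,     # index into the PT
        "offset": address & 0xFFF,         # byte within the 4KiB page
    }


if __name__ == "__main__":
    for addr in (0x7C00, 0xB8000, 0x1FFFFF):
        print(hex(addr), split_virtual_address(addr))
```

Any address below 2MiB has zero for the first three indices, so a single PT hanging off entry 0 of each higher table is enough to cover it.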
We can set up the tables like so (OSDev, n.d. [c]). This is only an excerpt; the full code (excluding the 64-bit GDT) comes after it:
mov edi, 0x1000 ; starting address of 0x1000
mov cr3, edi ; move base address of the page tables into control register 3 (https://wiki.osdev.org/CPU_Registers_x86)
xor eax, eax ; set eax to 0
mov ecx, 4096
rep stosd ; for ECX times, store EAX value at whatever position EDI points to, incrementing/decrementing as you go
; (https://stackoverflow.com/questions/3818856/what-does-the-rep-stos-x86-assembly-instruction-sequence-do)
; this effectively sets the tables to zero
mov edi, cr3 ; restore the original starting address
; according to https://wiki.osdev.org/Setting_Up_Long_Mode , this will set up the pointers to the other tables
; the low bits of each entry are flags: 0x0003 sets bit 0 (the page is present)
; and bit 1 (the page is readable/writeable)
mov dword [edi], 0x2003
add edi, 0x1000
mov dword [edi], 0x3003
add edi, 0x1000
mov dword [edi], 0x4003
add edi, 0x1000
; at this stage:
; PML4T is at 0x1000
; PDPT is at 0x2000
; PDT is at 0x3000
; PT is at 0x4000
; used to identity map the first 2MiB (see https://wiki.osdev.org/Setting_Up_Long_Mode)
mov ebx, 0x00000003
mov ecx, 512
.set_entry:
mov dword [edi], ebx
add ebx, 0x1000
add edi, 8
loop .set_entry
; enable PAE paging by changing the control register value
mov eax, cr4
or eax, 1 << 5
mov cr4, eax
; setting the long mode bit and enabling paging (this enters us into compatibility mode)
mov ecx, 0xC0000080 ; magic value actually refers to the EFER MSR
; -> 'extended feature enable register : model specific register'
rdmsr ; read model specific register
or eax, 1 << 8 ; set long-mode bit (bit 8)
wrmsr ; write back to model specific register
mov eax, cr0
or eax, 1 << 31 | 1 << 0 ; set PG bit (bit 31) & PE bit (bit 0)
mov cr0, eax
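To convince myself that these stores really identity map the first 2MiB, I find it useful to replay them in Python and then walk the tables the way the CPU would. This is only a simulation sketch: the bytearray stands in for physical memory, and `build_tables` and `translate` are names I made up.

```python
import struct

PRESENT_RW = 0x3                      # bit 0 = present, bit 1 = readable/writeable
ADDRESS_MASK = 0x000FFFFFFFFFF000     # strips the flag bits from a table entry


def build_tables() -> bytearray:
    """Recreate the tables from the listing: PML4T, PDPT, PDT and PT."""
    memory = bytearray(0x5000)        # tables live at 0x1000-0x4FFF, all zeroed

    def write_entry(address: int, value: int) -> None:
        struct.pack_into("<Q", memory, address, value)   # entries are 8 bytes

    write_entry(0x1000, 0x2000 | PRESENT_RW)   # PML4T[0] -> PDPT
    write_entry(0x2000, 0x3000 | PRESENT_RW)   # PDPT[0]  -> PDT
    write_entry(0x3000, 0x4000 | PRESENT_RW)   # PDT[0]   -> PT
    for i in range(512):                       # PT: 512 pages of 4KiB = 2MiB
        write_entry(0x4000 + i * 8, (i * 0x1000) | PRESENT_RW)
    return memory


def translate(memory: bytearray, virtual: int, root: int = 0x1000):
    """Walk the four levels; return the physical address or None if unmapped."""
    table = root
    for shift in (39, 30, 21, 12):
        index = (virtual >> shift) & 0x1FF
        (entry,) = struct.unpack_from("<Q", memory, table + index * 8)
        if not entry & 1:                      # present bit clear
            return None
        table = entry & ADDRESS_MASK
    return table | (virtual & 0xFFF)


if __name__ == "__main__":
    memory = build_tables()
    for address in (0x7C00, 0xB8000, 0x1FFFFF, 0x200000):
        print(hex(address), "->", translate(memory, address))
```

Everything below 0x200000 translates to itself, including the boot sector at 0x7C00 and the VGA text buffer at 0xB8000, while the first address past 2MiB is unmapped.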
We now load the 64 bit GDT which has the 64 bit flags enabled, and make a long jump.
Here’s a complete example of booting into real mode, switching to protected and then switching to long mode:
org 0x7c00
entry:
jmp real_to_protected
%include "GDT32.asm"
%include "GDT64.asm"
bits 16 ; nasm instruction
; 16 bits to 32 bits
real_to_protected:
; enable a20 gate
mov ax, 0x2401
int 0x15
; change video mode
mov ax, 0x3
int 0x10
cli
lgdt [GDT32.Pointer]
; enable protected mode
mov eax, cr0
or eax, 1
mov cr0, eax
; perform long jump
jmp GDT32.Code:protected_to_long
[bits 32]
protected_to_long:
; set up registers
mov ax, GDT32.Data
mov ds, ax
mov fs, ax
mov gs, ax
mov ss, ax
mov edi, 0x1000 ; starting address of 0x1000
mov cr3, edi ; move base address of the page tables into control register 3 (https://wiki.osdev.org/CPU_Registers_x86)
xor eax, eax ; set eax to 0
mov ecx, 4096
rep stosd ; for ECX times, store EAX value at whatever position EDI points to, incrementing/decrementing as you go
; (https://stackoverflow.com/questions/3818856/what-does-the-rep-stos-x86-assembly-instruction-sequence-do)
; this effectively sets the tables to zero
mov edi, cr3 ; restore the original starting address
; according to https://wiki.osdev.org/Setting_Up_Long_Mode , this will set up the pointers to the other tables
; the low bits of each entry are flags: 0x0003 sets bit 0 (the page is present)
; and bit 1 (the page is readable/writeable)
mov dword [edi], 0x2003
add edi, 0x1000
mov dword [edi], 0x3003
add edi, 0x1000
mov dword [edi], 0x4003
add edi, 0x1000
; at this stage:
; PML4T is at 0x1000
; PDPT is at 0x2000
; PDT is at 0x3000
; PT is at 0x4000
; used to identity map the first 2MiB (see https://wiki.osdev.org/Setting_Up_Long_Mode)
mov ebx, 0x00000003
mov ecx, 512
.set_entry:
mov dword [edi], ebx
add ebx, 0x1000
add edi, 8
loop .set_entry
; enable PAE paging by changing the control register value
mov eax, cr4
or eax, 1 << 5
mov cr4, eax
; setting the long mode bit and enabling paging (this enters us into compatibility mode)
mov ecx, 0xC0000080 ; magic value actually refers to the EFER MSR
; -> 'extended feature enable register : model specific register'
rdmsr ; read model specific register
or eax, 1 << 8 ; set long-mode bit (bit 8)
wrmsr ; write back to model specific register
mov eax, cr0
or eax, 1 << 31 | 1 << 0 ; set PG bit (bit 31) & PE bit (bit 0)
mov cr0, eax
lgdt [GDT64.Pointer]
jmp GDT64.Code:real_long_mode
[bits 64]
printer:
printer_loop:
lodsb
or al, al ; if zero
jz printer_exit
or rax, 0x0F00
mov qword [rbx], rax
add rbx, 2
jmp printer_loop
printer_exit:
ret
real_long_mode:
cli
mov ax, GDT64.Data
mov ds, ax
mov fs, ax
mov gs, ax
mov ss, ax
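xor rax, rax ; clears out register RAX - the printer stores all 8 bytes of RAX to video memory, so stale upper
; bits spill into the following cells (if commented out then weird orange square is drawn at the end of the string)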
mov rsi, boot_msg
mov rbx, 0xb8000
call printer
hlt
boot_msg db "Hello World in 64 bit!",0
times 510 - ($ - $$) db 0
dw 0xaa55
Bibliography
Intel (2021) Intel 64/IA-32 Developer Manual Volume 3: System Programming. Intel. Available from https://www.intel.co.uk/content/www/uk/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
Pellegrini, A. (2018) x86 Initial Boot Sequence. Universita Di Roma. Available from https://alessandropellegrini.it/didattica/2017/aosv/1.Initial-Boot-Sequence.pdf
AMD (n.d.) AMD64 Architecture Programming Manual Volume 2: System Programming. AMD. Available from https://www.amd.com/system/files/TechDocs/24593.pdf
OSDev (n.d. [a]) Real Mode. OSDev Wiki. Available from https://wiki.osdev.org/Real_Mode
UMBC (2011) Segments and Registers. University of Maryland, Baltimore County. Available from https://courses.cs.umbc.edu/undergraduate/CMSC211/fall01/burt/lectures/Chap12/segmentsOffsets.html
Necasek, M. (2018) The A20-Gate Fallout. OS/2 Museum. Available from https://www.os2museum.com/wp/the-a20-gate-fallout/
OSDev (n.d. [b]) A20 Line. OSDev Wiki. Available from https://wiki.osdev.org/A20_Line
OSDev (n.d. [c]) Setting Up Long Mode. OSDev Wiki. Available from https://wiki.osdev.org/Setting_Up_Long_Mode