Building a CPU in Verilog

1 Introduction

Field Programmable Gate Arrays (FPGAs) are complex, re-configurable devices which can be programmed to implement application or functionality requirements as digital circuits - this nature makes them very appealing to various markets, as they can be adapted to support different roles simply by being reprogrammed (Xilinx, n.d. (a)).

In the context of embedded systems, Nakano and Ito (2008) state that it is important to write efficient programs to control the system, yet students often understand high-level languages only from the perspective of the program, lacking knowledge of how a formula is evaluated or how advanced programming features such as pointers and stacks work at the hardware level. Despite their focus on embedded systems, the benefits of understanding advanced features and being able to write smaller, more efficient programs will always be desirable, no matter how powerful computers become in the future.

This project aims to explore the knowledge required to create a solution in Verilog, to apply that knowledge to create a working processor, and to understand how such solutions provide a deeper understanding of the fundamental link between hardware and software. This exploration is beneficial as it utilises hardware which is abundant in medical equipment, automobiles, industrial control systems and many other fields (Tong et al, 2006) - this provides a perspective on what types of implementations are possible with such powerful and versatile hardware.

Furthermore, thanks to the abstractions provided by using a hardware description language (HDL), students would be able to draw from existing knowledge they may have of control flow, arithmetic operations and functions from languages such as Python, C# or C++ alongside using vendor-specific design suites to create and understand a design with relative ease.

2 Literature Review

2.1 Background

Nakano and Ito (2008) state that “for developing efficient programs, it is necessary to understand how programs are executed” elaborating later that “it is difficult to understand a […] ‘pointer’ if one does not know how it is implemented in a machine language” and suggest the idea of using FPGAs to teach students not only how pointers work, but to also introduce them to processor architecture, assembler & compiler design and assembler programming. Similarly, Lee et al (2012) also emphasise the importance of learning CPU design by stating that the “students acquire a deeper understanding of computer organization” and that it “profoundly increases their confidence with hardware” whilst also simultaneously providing a greater understanding of software optimisation. Furthermore, Yıdız et al (2018) make the point that without understanding the foundations of processors, it is not possible to understand advanced concepts such as “pipelining, memory hierarchy, branch prediction, caching etc.”, all extremely important concepts which have enabled the high computation power of modern-day processors.

These authors, and many more as cited by Yıdız et al (2018), have recognised the importance of teaching processor design and implementation, and the skills which students can gain from having such a deep understanding of a fundamental piece of modern computers. The goal of this project is to achieve exactly that – to understand processors at the lowest possible level by creating a complete, working CPU using Verilog, burning it onto an FPGA and going through the entire bug fixing and testing stage to ensure it works as expected.

The success of these proposed courses can be seen in the conclusions of the papers; Nakano and Ito (2008) had 18 students enrolled, with 15 partaking in a final exam to confirm their understanding of the material – they concluded that the students “have learned and understood CPU design […] very well.” (p. 728). Similarly, Lee et al (2012) performed a more comprehensive study into the effectiveness of their own proposed course, finding that the sentiment of 29 students regarding the statement “I understand CPU and computer systems much better than before” was a solid “agree”, illustrating the effectiveness of using FPGAs to learn CPU design (p. 347, table VII & table IX). The University of Lincoln’s own Computer Architecture module may benefit from this work by using it as the basis for new practical workshops in the future.

Implementations of FPGA-based CPUs (otherwise known as “soft processors” (Microsemi, n.d.)) vary between authors: some opt for different word sizes such as 8, 16 or 32 bits, others for different instruction set paradigms such as CISC (complex instruction set computer) or RISC (reduced instruction set computer), and some for a custom ISA (instruction set architecture), as is the case with Nakano and Ito (2008) and their TINYCPU – and this is not an exhaustive list of possible features and implementation differences. Possibly the most influential factor in an implementation is the choice of instruction set – which dictates exactly what the processor is (or is not) capable of doing – and of the paradigm (such as CISC or RISC), which decides exactly how those instructions are executed by influencing the design of the hardware.

For example, a CISC-based implementation will have instructions which take multiple clock cycles to complete, requiring microprograms/microcode in order to coordinate the execution of each instruction correctly. On the other hand, a RISC-based implementation does away with complex multi-cycle instructions and the microcode hardware in favour of hardwired logic, boiling the instructions down into simpler counterparts with a focus on executing approximately one instruction per clock cycle (Fox, n.d.; Elijhani and Kepuska, 2021).

2.2 Analysis of Existing Work

This section explores works which are relevant to the topic proposed by the project, and aims to derive a greater understanding of how an implementation may be created. It looks into the methodology, design decisions and reasoning of those authors to understand the advantages and disadvantages of their approaches. The goal is to build a reference or knowledge base from which informed design decisions can be made.

2.2.1 8-bit CISC

The implementation of the CPU and its supporting components, such as the random access memory (RAM), by Yhang and Bao (2011) is very similar to what one might see in a modern-day x86-64 CPU – it shares its data/address bus between the arithmetic logic unit (ALU), registers and the RAM of the system, instead of using multiplexers and many wires to connect everything together. However, using a bus also increases complexity, as the actions of the individual components must be coordinated in order to avoid race conditions.

More importantly, the factor which makes this architecture CISC based is its use of a microprogram controller – a piece of hardware which is used to coordinate instruction execution (as mentioned briefly above).

Their implementation uses a micro-instruction word width of 24 bits, consisting of two distinct field groups – an address field containing the address of the next micro-instruction, and an “operation controlling field” which is used to issue “micro commands”; that is, these fields set the necessary control lines within the processor in order to achieve the desired instruction. Unfortunately, their table demonstrating the format of the micro-instruction is slightly cryptic; however, it is only the theory which is important. Furthermore, their flowchart indicating each step of the microprogram is an ingenious way of visualising the steps required, and definitely something to use in this project (Yhang and Bao, 2011).

2.2.2 16-bit Architecture

The approach taken by Nakano and Ito (2008) with their TINYCPU is certainly very interesting: instead of taking the traditional approach of including an accumulator or a collection of registers for the developer to use, they opted for a stack – a data structure where information is “pushed” onto it and “popped” off where required, creating a last-in first-out arrangement (Koopman, 1989). How a stack is implemented is out of scope for this report, but it is an unconventional (yet clearly workable) way of implementing a processor, judging simply by the number of authors who opt for a register/accumulator-based processor instead.

As a result of this stack architecture, the operation codes (opcodes) for the TINYCPU lack any operands – for load, store, arithmetic and logic operations, the element on top of the stack (and sometimes the element immediately below it) is used. This is quite a novel idea compared to the others; however, it would unfortunately be incompatible with the design plan set out in section 3.

One thing of significance to note is their implementation of the division operation, or rather, the lack of a specific opcode for it. As an example, Nakano and Ito (2008) implement an algorithm to test the Collatz conjecture. The theory behind the conjecture is not important, but the algorithm is: provided with a number n, if n is odd then n = 3n + 1, else if n is even then n = n / 2. It can be seen from the translated assembly code that they use bit-shifting operations to divide or multiply by a power of 2, though it is unclear how this would function if one wished to multiply or divide by a number which is not a power of 2 – it is a possible compiler optimisation if required, however (p. 726).
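To briefly illustrate the technique (this sketch is not taken from the TINYCPU source), both Collatz steps can be expressed in Verilog using only shifts and adds, with no multiply or divide hardware:

// One Collatz step: 3n + 1 = (n << 1) + n + 1 when n is odd, n >> 1 when even.
// Widths are illustrative; overflow of 3n + 1 is ignored for simplicity.
module collatz_step #(parameter WIDTH = 16) (
    input  wire [WIDTH-1:0] n,
    output wire [WIDTH-1:0] next_n
);
    assign next_n = n[0] ? ((n << 1) + n + 1) : (n >> 1);
endmodule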

2.2.3 Video Graphics Array (VGA) Theory

VGA is an analog video hardware specification introduced by IBM which has reached far and wide, and it is still supported by a large number of computer monitors. Implementing the VGA standard is relatively simple, which makes it excellent for creating some meaningful output. However, it is important to remember that an FPGA is a purely digital device while the VGA video signal is analog – a physical digital-to-analog converter (DAC) circuit is required (Monk, 2017, p. 125).

A VGA signal is easy to understand – there are three wires for the colours (red, green and blue), one wire for the horizontal sync and another wire for the vertical sync. The RGB wires are analog and are expected to be driven in a range of 0V to 0.7V, whilst the horizontal and vertical sync wires are digital and can be driven between 0V to 3.3V (Monk, 2017, p. 125).

A core component of a VGA driver is the pixel clock – it drives the entire system forward and, when combined with counters which keep track of the x and y positions, determines what action the driver should be performing. For a resolution of 640x480 at 60Hz the pixel clock is exactly 25.175MHz (TinyVGA, 2008) – this corresponds to scanning the full 800x525 pixel grid (including blanking intervals) roughly sixty times per second – and a faster clock would allow for a higher resolution.

2.2.4 Summary

The exploration of existing works by authors such as Nakano and Ito (2008) and Yhang and Bao (2011) helps highlight the different possible approaches to creating a novel CPU architecture – their vastly different methods provide an insight into how an architecture may be structured and how this directly influences the available instruction set.

In particular, the microprogram controller which Yhang and Bao (2011) present is of special interest: it is easier to understand the micro-steps/micro-instructions which the processor executes than to understand how to coordinate an entire instruction execution within one clock cycle – at least as a beginner with FPGAs and HDLs.

The brief introduction to the VGA standard illustrates how simple the implementation of a video interface can be, given that external hardware is provided to draw the colours properly. The information gained from this research and the sources gathered is planned to be used to implement a rudimentary memory-mapped display driver – something akin to a graphics processor.

3 Design and Development

This chapter covers the design process of the CPU. Section 3.1 covers the basic theory behind CPU architecture and organisation, with an analysis of how it may affect the design and an examination of existing processors to determine their type. Section 3.2 applies that theory to design the modules which become the sub-components of the CPU, with justification of why one approach was taken over another where applicable. Section 3.3 then realises these modules into the final design, providing a concise showcase which uses Verilog and diagrams to demonstrate their function.

3.1 Design

A common goal between both the original project proposal and the final artefact was the creation of a working CPU – however, it was important to decide on the details of its function before any HDL could be written. It could be designed as a general-purpose CPU such as the MOS 6502 or the Manchester Baby, or as a specialised processor akin to the floating-point unit (FPU) of older hardware. Implementing a specialised processor was decided to be unreasonable: it is essentially a dedicated circuit for one function, with the idea that it performs that particular function very well and very quickly, whereas a general-purpose CPU should be able to perform the same function, albeit at the cost of speed (PCMag, n.d. (a)).

Once the purpose of the CPU is decided, an abstract model can be created which defines precisely what the CPU is capable of – ARM (n.d. (a)) defines this as an “Instruction Set Architecture” (ISA) and states that the “ISA defines the supported data types, the registers, how hardware manages main memory, key features […], which instructions a microprocessor can execute, and the input/output model”. It is clear from this definition that creating an ISA is an important step to realising the goal of creating a CPU and therefore careful consideration must be given to the design.

3.1.1 Internal Organisation

Firstly, the internal storage of the CPU must be decided - this dictates how the CPU stores and manipulates data. There are multiple approaches to this: a stack architecture, an accumulator architecture, or a general-purpose register (GPR) architecture. To make an informed decision on which would be appropriate to use, the advantages and disadvantages must be recognised. Null and Lobur (2018, p. 582) summarise these different architectures as:

  • Stack architectures use a stack (a last-in first-out (LIFO) memory model) to execute instructions, with operands found on the top of the stack. It is a simple model, but the stack does not support random access, reducing the ability to create efficient code and eventually becoming a bottleneck.
  • Accumulator architectures store one operand in a specially designated register, which allows for reduced complexity and short instructions; however, the storage is temporary, and the register experiences increased traffic on every instruction executed.
  • GPR architectures are the most common, using multiple registers (known as a register file or a register set) to execute a program. The additional registers allow for more efficient code and allow for faster execution time by reducing the number of times memory has to be accessed.

Null and Lobur (2018, p. 583) further explain that the GPR architecture type “can be broken down into three classifications, depending on where the operands are located” – this is another indicator of a processor being either CISC or RISC, as explained later. These three classifications are summarised below:

  • Memory-memory: storing operands in memory and allowing an instruction to perform an operation without requiring any operand in the register.
  • Register-memory: at least one operand is in a register and the other is in memory.
  • Load-store: moving data from memory into registers before any operations can be performed.

Two of these three GPR architecture types in particular can be seen in modern processors – ARM (2005, p. A1-2) describes its architecture as being of the load-store type which is evident given that only the load and store instructions can modify memory. In contrast, Intel (and consequently AMD as they both implement the x86 ISA) are of the register-memory type as its ISA allows for the modification of memory directly but only if the other operand is present in a register (Null and Lobur, 2018, p. 627-628).

It was decided to use the GPR load-store type for this project, as it provides a reasonable number of registers to the programmer while also decreasing complexity in terms of memory access – this allows for more RISC-like behaviour from the CPU, and has the benefit of making the instruction set clearer compared to the many possible permutations of memory access allowed by the x86 ISA; simple load and store instructions handle all memory access.

3.1.2 Key Features and I/O

To improve the chances of success, an attempt should be made to keep the design of the CPU as simple as possible – complex functionality should not be implemented before the CPU itself is able to execute instructions without fault. As a result, very little time was allocated to implementing I/O, as the core functionality was expected to take most of the time available. Had there been more time, a keyboard and video interface could have been implemented, as the lack of them is a hindrance to the user.

A keyboard implementation can take many forms, perhaps utilising either the UART or PS/2 standard to transmit information; it is also the most complex to implement, as care has to be given to how a keyboard event is handled via an interrupt. A video interface is much simpler to create, as it only has to read from video memory to generate a picture, so this is the first interface to be implemented provided there is enough time remaining. The video interface chosen is VGA, due to the limited clock provided by the FPGA development board.

3.1.3 Instruction Set and Memory Model

There are multiple memory models which can be employed; the two in focus here are the flat memory model and the segmented memory model. Intel (2021, volume 1, p. 3-7) describes the flat memory model as memory appearing to a program “as a single, continuous address space” running from 0 to 2^N - 1, where N is the word size of the system. The segmented memory model instead splits the physical memory into groups which are then “mapped into the processor’s linear address space”; to access the data within a segment, “the processor […] translates each logical address [(consisting of a segment selector and offset)] into a linear address”. The reason for doing this is to “increase the reliability of programs and systems”; however, as this project is creating an experimental CPU which will not be used beyond the project itself, the long-term reliability of programs and the systems running them is not an important consideration.

Therefore, the CPU is to use a flat memory model, which can be indexed by the control unit incrementing the Program Counter (PC), manipulated by the user using jump instructions, or accessed by using the previously mentioned GPRs as memory data registers (MDR) and memory address registers (MAR).
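A minimal sketch of such a program counter is given below; the module and signal names are illustrative rather than taken from the project source, but the behaviour (increment during sequential execution, load with a new address on a jump) matches the description above.

// Program counter sketch: counts sequentially, or is loaded by a jump.
module program_counter #(parameter ADDR_WIDTH = 16) (
    input  wire                  clock, reset, increment, load,
    input  wire [ADDR_WIDTH-1:0] jump_address,
    output reg  [ADDR_WIDTH-1:0] address
);
    always @ (posedge clock) begin
        if (reset)          address <= 0;
        else if (load)      address <= jump_address; // jump taken
        else if (increment) address <= address + 1;  // sequential execution
    end
endmodule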

Verilog allows the parameterisation of module instances, such as creating registers of varying widths without the need to rewrite the code for each particular width – one module can be coded once and adapted to any situation, i.e., a 16-bit register and an 8-bit register can share the same module code so that the behaviour of those registers is identical. The “word” size of a system typically describes the natural unit of data for a given CPU (Fox, n.d.) – it specifies the width of data which the processor can handle at once, so the importance of parameterisation is evident, as it allows for experimental changes to the CPU word size.
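For example, a register might be parameterised as in the following minimal sketch (module and port names are illustrative, not taken from the project source); instantiating it with different values of WIDTH yields 8-bit and 16-bit registers from the same code:

// Parameterisable register: width chosen at instantiation time.
module register #(parameter WIDTH = 8) (
    input  wire             clock, write_enable,
    input  wire [WIDTH-1:0] data_in,
    output reg  [WIDTH-1:0] data_out
);
    always @ (posedge clock) begin
        if (write_enable)
            data_out <= data_in;
    end
endmodule

// Usage: register #(.WIDTH(8))  reg_a  (...);
//        register #(.WIDTH(16)) reg_pc (...);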

The advantages and disadvantages of one word size over another only become apparent when there are functions of the CPU which would heavily benefit from a particular size. Modern CPUs often have word sizes of 32 or 64 bits, which seemed excessive for this project, so some research was conducted into popular retro CPUs. The Intel 8080, MOS 6502, Zilog Z80 and Motorola 6800 all used 8-bit words and were successful, powering various computers of the late 1970s and 1980s such as the BBC Micro, the Commodore PET and the Nintendo Entertainment System, to name a few.

The CPU is to use an 8-bit word as its natural unit, and therefore the registers, ALU and other components need to be created appropriately. However, it is not sufficient to encode everything into 8 bits, as that leaves no instruction field, meaning the processor would not know what operation to perform. A solution is to encode the instruction into a bitfield larger than the word size of the CPU, providing additional bits that can be allocated to encoding the instruction – the size of this bitfield depends on how many instructions are to be implemented and how the opcode is decoded; for example, the Intel 8080 uses an 8-bit word internally but fetches instructions which can span multiple bytes, whereas the MOS 6502 requires two cycles to fetch both the 8-bit instruction and its operand if necessary (MOS, 1976, p. 60).

The final decision is for the CPU to use an 8-bit word size with 16-bit-wide instruction encoding – this allows for mathematical operations on numbers between 0 and 255 (unsigned) or -128 and 127 (signed), and for 256 unique instructions. Knowing the allowed size of the opcode now allows for the creation of the instruction encoding – it must consider multiple types of instructions, such as those which operate between registers, those which use an immediate value, jumps, and short instructions such as halt and no-operation (NOP) – the encoding of these instruction types is presented in figure 1.

Figure 1: Instruction types
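As a rough sketch of how such an encoding can be decoded – assuming the field placement used elsewhere in this report, with the opcode in the high byte and either an 8-bit immediate or a register selector in the low byte – the fields might be extracted as follows:

// Illustrative field extraction for the 16-bit encoding; the exact layout
// of the register fields is an assumption based on figure 19.
module instruction_fields (
    input  wire [15:0] instruction,
    output wire [7:0]  opcode, immediate,
    output wire [2:0]  reg_select
);
    assign opcode     = instruction[15:8]; // e.g. 16'h2805: opcode 28 = mov a, immediate
    assign immediate  = instruction[7:0];  // the immediate operand (05)
    assign reg_select = instruction[2:0];  // register index for register-register forms
endmodule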

3.1.4 Summary

The CPU is to be designed with general-purpose registers, utilising a load-store architecture similar to that of ARM processors. It will operate on 8-bit words and use a 16-bit address space which will be linearly mapped to physical memory; each 16-bit word in memory will either encode data (such as a variable) or an instruction. There will be limited I/O with the processor, but if time permits, then there will be a focus on implementing video first. The instruction set of the CPU should reflect its general-purpose nature.

3.2 Design Decisions

This section explores the design decisions made for the core modules, and the rationale behind them.

3.2.1 Control Unit

A control unit coordinates the actions of the CPU, such as instruction fetching/decoding and the execution of an instruction (Fox, n.d.). Most importantly, the two design paradigms, RISC and CISC, come into play here – they affect the design of the control unit and its method of executing an instruction.

ARM, which implements a RISC architecture, has a set of different instruction types which are densely packed with information in order to tell the processor how to execute them appropriately (ARM, 2005, p. A3-2) – this avoids the use of firmware such as microcode and allows ARM to use hardwired logic to accelerate instruction execution. The information density and the variability of the instructions would bring too much complexity to the project, so a more traditional approach with microcode was taken.

The disadvantages of microcode are that it increases the execution time of an instruction, as multiple clock cycles are required to step through the micro-instructions which make up one instruction (as shown by the codeblock below), and that it requires the developer to write the firmware, demanding additional testing to ensure that each instruction executes as expected and does not interfere with the others.

0800 // nop @ 0 - 0
8800 // hlt @ 1 - 1
0016 0808 // mov reg1, reg2 @ 2 - 3
0026 0800 // cmp reg1, reg2 @ 4 - 5

In order to coordinate the actions of all components, not only within the control unit but also other sub-components of the CPU, a finite state machine (FSM) is used - this is represented as the “Execution Driver” in figure 4. Wilson and Mantooth (2013, p. 174) state that the “idea of an FSM is to store a sequence of different unique states and transition between them depending on the values of the inputs and the current state of the machine” and that an FSM can be one of two types: “Moore (where the output of the state machine is purely dependent on the state variables) and Mealy (where the output can depend on the current state variable values and the input values)”.

3.2.2 Arithmetic Logic Unit (ALU)

The ALU is the heart of all mathematical operations, both arithmetic and logical – a design may implement integer addition, subtraction, multiplication and division alongside logical operations such as AND, OR, NOT or XOR. Furthermore, the ALU modifies a status (or “flag”) register, which stores information about the outcome of an operation, such as whether it overflowed, resulted in zero, produced a carry or gave a negative value (Fox, n.d.).

Thanks to the high-level abstractions provided by Verilog, the code for the ALU was easy to write, and its C-like syntax is easy to understand. Constructs such as functions allow the developer to re-use code wherever necessary, without worrying about how the final circuit will turn out, as this is handled by Quartus Prime. However, this abstraction comes at a significant cost – there is simply no transparency over what has been synthesised onto the silicon, meaning users have to trust that it is safe, secure and will function as described.

To ensure that the flags were set appropriately for whichever operation was executed, the Intel Software Architectures Manual (2021, p. A-1) was referenced – this ensured that the behaviour was consistent across two different types of architectures, which would make the operation of the ALU predictable.

As the ALU does not differentiate between signed and unsigned arithmetic, its functions are easy to implement, and it is down to the software programmer to interpret the result of the calculation using the sign flag, as illustrated in figures 5 and 6. This is possible due to two’s complement – a method which uses the most significant bit of the word to indicate whether the value is positive or negative – where the negative of a value is calculated by taking its bits, inverting them, and adding one (Fox, n.d.). For example, in 8 bits, -5 is obtained by inverting 00000101 to 11111010 and adding one, giving 11111011.

3.2.3 Memory, Memory Controller and Datapath Controller

Memory can be seen in all computing systems – it stores the program to be executed and any data required for it; however, there are multiple methods of organising this system. Eigenmann and Lilja (1998) explain that a “Von Neumann” type organisation is one where “instructions and data are stored together in common memory”, which distinguishes it from the “Harvard” type, “in which separate memories are used to store instructions and data”. Typically, Harvard-type architectures can be found in microcontrollers in embedded systems such as appliances or toys, and many non-Von Neumann-type architectures can be found in specialist applications ranging from image and digital signal processing to implementing neural networks, cellular automata and cognitive computers in silicon (Null and Lobur, 2018, p. 136 - 137).

As the aim of this project is to build a general-purpose processor, utilising an architecture such as Harvard would be nonsensical, as storing the instructions and data separately would result in increased complexity; for this reason, the processor uses a Von Neumann architecture. Null and Lobur (2018, p. 131) describe it as a system which has the following characteristics:

  • Has a control unit, ALU, registers, program counter, a memory system which contains the programs and an I/O system.
  • Has the capacity to carry out sequential instruction processing.
  • Contains a single path, either physically or logically, between the main memory system and the control unit of the processor.

Aside from deciding the memory architecture of the system, some thought must be given to how the CPU can translate an instruction into a memory access. For this, a memory controller is proposed – this will sit before the memory and act as an interface which will handle the loading and storing of values in the right memory block.

Furthermore, the implementation of the jump instruction is unique compared to the others (discussed further in section 3.2.4) and therefore requires additional logic in order to execute correctly. To obtain the right address, the system must first peek ahead in memory by one address and fetch that value; however, with a single output line from the system memory, this could end up overwriting the current instruction. For this reason, the datapath controller handles separating instructions from addresses and any miscellaneous data being loaded.

3.2.4 Decision Unit

The “Decision Unit” is a circuit which determines the correct address to jump to when a jump instruction is executed, based on the state of the instruction and the ALU flags register.

An oddity of the system is that it implements an absolute addressing mode, so the entire address must be specified, unlike in relative addressing mode, where the current instruction address is taken into account (IBM, 2022). To ensure that the programmer has access to the entire 2^16-word address space, additional circuitry (as described in section 3.2.3) was implemented to allow the system to read 32 bits specifically for jump instructions – 16 bits describe the opcode as usual, and the other 16 bits are the address to jump to if the conditions are met.

Furthermore, if a conditional jump fails, it is important to ensure that the CPU does not accidentally execute the value of the address; even on a failed conditional jump, the program counter is therefore incremented by two, skipping both the instruction and its address value.

For conditional jumps, the decision unit checks the flag states in a similar manner to the x86 implementation, as shown by the jump table from the Intel Software Architectures Manual (2021, p. 7-16), reproduced in figure 2. This makes the behaviour of the jump instructions and the decision unit predictable and consistent across two different types of architecture, similar to the ALU.

Figure 2: x86 jump table

The processor implements all jumps except for JNP/JNO, JP/JPE, JCXZ or JECXZ as it does not contain the hardware specific to those jumps.

3.2.5 Video Graphics Array (VGA)

Memory available on the FPGA is limited, so in order to draw colours onto the screen, either an additional memory module had to be installed (requiring an interface with external hardware) or the resolution had to be reduced, with the remaining space used for colour information. As such, it was decided to drive the VGA signal at a resolution of 320x240 – exactly half of the industry-standard 640x480. This is not a recognised signal, so whilst the driver runs the monitor at 640x480, additional logic is required to upscale the image so that it is displayed properly. Each pixel is allocated 4 bits, totalling 16 unique colours with the help of additional circuitry.

The upscaling logic uses the x and y position counters to determine which address of the video memory to index, and presents the output as pixel information which is fed into a colour look-up table and finally output to the monitor. The upscaling logic also acts as a buffer: the memory can only return a 16-bit value, and at a colour depth of 4 bits (4 bits per pixel), each new word from the video memory encodes 4 pixels at once. The logic must therefore upscale the output and coordinate video memory access based on the incoming x and y counters.
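A sketch of this address calculation is given below, under the assumptions stated in this section (a 320x240 framebuffer, 4 bits per pixel, and 16-bit VRAM words holding 4 pixels each); the module and signal names are illustrative:

// Maps the 640x480 screen counters onto the 320x240, 4bpp framebuffer.
module vram_addresser (
    input  wire [9:0]  x, y,          // 640x480 screen position counters
    output wire [14:0] vram_address,  // indexes the 19,200-word VRAM
    output wire [1:0]  pixel_select   // which 4-bit pixel within the word
);
    wire [8:0] fb_x = x[9:1]; // divide by 2: 640 -> 320
    wire [8:0] fb_y = y[9:1]; // divide by 2: 480 -> 240
    assign vram_address = fb_y * 80 + fb_x[8:2]; // 320 / 4 = 80 words per row
    assign pixel_select = fb_x[1:0];             // 4 pixels per 16-bit word
endmodule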

3.3 Final Design

3.3.1 Control Unit

Microcode was the choice for this project, as it is easier to grasp the concept that multiple micro-instructions make up one entire instruction - these micro-instructions set the control lines for the CPU sub-components, and can be easily converted into regular bytes which are written to the microcode read-only memory (ROM).

The microcode format is presented in figure 3, showcasing the bitfield available for a micro-instruction to manipulate in order to achieve its specified goal. It consists of multiple fields which either set flags for functions such as jumping or finishing an instruction, or control which mode a module is set to. A micro-instruction may manipulate the memory controller, datapath controller and the ALU (all discussed in later sections) to perform actions such as loading, adding and jumping.

Figure 3: Microcode format

The microcode ROM is only one module in an entire system which makes up the control unit for this CPU, and therefore we must look at the rest of the system, as shown in figure 4.

Figure 4: Simplified system overview

The “Opcode Translator ROM” receives an instruction, decodes the opcode and uses it as an index into its own memory – this returns the appropriate starting value for the microsequencer to begin executing the instruction. As instructions can take multiple cycles to execute, the microsequencer acts as a counter, counting from the provided index until the instruction finished flag is raised. This also enables truly variable execution-length instructions without the risk of the microsequencer indexing illegal micro-instructions in the microcode ROM.
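A minimal sketch of such a microsequencer is shown below; the names and widths are illustrative rather than taken from the project source, but the behaviour (load a start index, then count until the finished flag is raised) follows the description above:

// Microsequencer sketch: loaded with a start index, then steps through
// the micro-instructions of the current instruction.
module microsequencer #(parameter ADDR_WIDTH = 8) (
    input  wire                  clock, load, count_enable,
    input  wire [ADDR_WIDTH-1:0] start_index, // from the opcode translator ROM
    input  wire                  instruction_finished,
    output reg  [ADDR_WIDTH-1:0] microcode_address
);
    always @ (posedge clock) begin
        if (load)
            microcode_address <= start_index;           // begin a new instruction
        else if (count_enable && !instruction_finished)
            microcode_address <= microcode_address + 1; // next micro-instruction
    end
endmodule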

The execution driver module is a Mealy-type FSM – it tracks the current state of the machine using a register, but also has multiple input signals which can cause the machine to switch states, as illustrated by the “instruction finish flag” line in figure 4. The possible states of the execution driver, with a description of each, are presented in table 1.

State Value | State Name | Description
0 | Cold Boot | Initial state – buffering for a clock cycle to ensure the instruction is ready.
1 | Loading Instruction | Loading the microsequencer with the correct value and enabling the microcode ROM.
2 | Long/Short Instruction Execution | Checking whether the instruction finish flag is raised – if true, the machine executed a short instruction (one micro-instruction); otherwise, it is executing a long instruction (two or more micro-instructions).
3 | Reset or Continue | If the instruction was short, all the control lines for the program counter, microsequencer and microcode are set appropriately. Otherwise, if the instruction is long, the machine remains in this state until the finished flag is raised.
4 | Finish Execution | Prepares the machine for the next instruction.
5 | Halt | Prevents execution from occurring – this state can only be exited by resetting the processor.
6 | Jump | Handles the control lines for loading the microsequencer and program counter with the new index and address, respectively.

Table 1: Execution machine states
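A skeleton of such a state machine, using the states from table 1, might look as follows; the transition conditions are simplified and the many control-line outputs are elided, so this is an illustrative sketch rather than the project source:

// Simplified execution driver skeleton based on table 1.
module execution_driver (
    input  wire       clock, reset, instruction_finished, is_halt, is_jump,
    output reg  [2:0] state
);
    localparam COLD_BOOT = 3'd0, LOADING = 3'd1, EXEC_CHECK = 3'd2,
               RESET_OR_CONTINUE = 3'd3, FINISH = 3'd4, HALT = 3'd5, JUMP = 3'd6;

    always @ (posedge clock) begin
        if (reset) state <= COLD_BOOT;
        else case (state)
            COLD_BOOT:         state <= LOADING;
            LOADING:           state <= EXEC_CHECK;
            EXEC_CHECK:        state <= RESET_OR_CONTINUE;
            RESET_OR_CONTINUE: state <= instruction_finished
                                        ? (is_jump ? JUMP : FINISH)
                                        : RESET_OR_CONTINUE;
            JUMP:              state <= FINISH;
            FINISH:            state <= is_halt ? HALT : LOADING;
            HALT:              state <= HALT; // exit only via reset
            default:           state <= COLD_BOOT;
        endcase
    end
endmodule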

3.3.2 Arithmetic Logic Unit (ALU)

The ALU is implemented using several high-level Verilog constructs, namely function definitions and case statements. Functions are best suited to repetitive actions, such as when multiple operations modify the flags register, as illustrated by the function definition in figure 5 and a function call within an ALU operation in figure 6.

function is_signed (input [WORD_SIZE-1:0] A); begin
    is_signed = A[WORD_SIZE-1]; // the most significant bit holds the sign
end
endfunction
Figure 5: Using Verilog functions
// add = a + b
6: begin
    output_C = input_A + input_B;
    flags[ZERO] = is_zero(output_C);
    flags[SIGN] = is_signed(output_C);
    flags[CARRY] = has_carry_add(input_A, input_B);
    flags[OVERFLOW] = has_overflow(input_A, input_B);
end
Figure 6: Updating ALU flags after an operation

As shown by figure 6, flags are modified by calling a function to determine the state of the flag, and then this value is used to update the appropriate bit in the flags register. It is important that these values are updated correctly depending on which ALU operation was executed, as these values are referenced when executing conditional jump instructions.

Furthermore, the case construct provides an easy method of implementing multiple operation codes acting on any number of incoming signals. In this CPU, the ALU is controlled by the microcode and selects the appropriate operation based on the instruction. An example of this is given in figure 7.

always_comb begin
    case(mode_select)

        // do nothing
        0: output_C = output_C;

        // move B into C
        1: output_C = input_B;

        // compare a & b
        2: begin
            temp = input_A - input_B;
            $display("ALU CMP TEMP A - B = C --- %h - %h = %h", input_A, input_B, temp);
            flags[ZERO] = is_zero(temp);
            flags[SIGN] = is_signed(temp);
            flags[CARRY] = has_carry_subtract(input_A, input_B);
            flags[OVERFLOW] = has_overflow_subtract(input_A, input_B);
        end
Figure 7: Verilog switch-case within the ALU

3.3.3 Memory, Memory Controller and Datapath Controller

The main system memory is implemented as a single-port RAM, as illustrated in figure 8; its size can be reconfigured based on how it is instantiated in Verilog. However, as the processor is designed to work with 16-bit instructions, the RAM has a 16-bit word width with 2^16 total addressable locations – this translates into 65,536 words, or a total of 128KiB of memory.

Figure 8: Single port RAM block
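A single-port RAM of this configuration can be inferred in Verilog as sketched below (parameter and port names are illustrative, not taken from the project source):

// Inferable single-port RAM: 16-bit words, 2^16 locations by default.
module single_port_ram #(parameter WORD_WIDTH = 16, ADDR_WIDTH = 16) (
    input  wire                  clock, write_enable,
    input  wire [ADDR_WIDTH-1:0] address,
    input  wire [WORD_WIDTH-1:0] data_in,
    output reg  [WORD_WIDTH-1:0] data_out
);
    reg [WORD_WIDTH-1:0] memory [0:(1 << ADDR_WIDTH)-1];

    always @ (posedge clock) begin
        if (write_enable)
            memory[address] <= data_in;
        data_out <= memory[address]; // registered read, as in FPGA block RAM
    end
endmodule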

A requirement for the processor is for it to generate a valid VGA signal and to successfully draw an image on the screen – this is discussed further in section 3.3.6, but it is relevant here because the image is memory mapped, meaning video memory (VRAM) must also be allocated on the FPGA. The requirements are a resolution of 320x240 with 4-bit colour depth – this results in 320 x 240 x 4 = 307,200 bits for the image, which is 37.5KiB, or 19,200 words.

Furthermore, the VRAM is unique in its construction when compared to the system RAM, as it is a dual-port RAM module as opposed to a single-port one – that is, it can be driven by two clocks, with two different addresses reading and writing data (nandland, n.d.). A dual-port RAM module was created for use as VRAM because it allows mismatched clocks to operate on the RAM block, meaning the function of the display output is not hindered if the CPU is run at anything other than the frequency required by the VGA signal.

Figure 9: Dual port RAM block
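A dual-clock, dual-port RAM in the spirit of this VRAM can be sketched as follows (sizes and names are illustrative): the CPU side writes on its own clock, while the VGA side reads on the pixel clock.

// Dual-port RAM sketch: independent read and write clocks.
module dual_port_ram #(parameter WORD_WIDTH = 16, ADDR_WIDTH = 15) (
    input  wire                  write_clock, write_enable,
    input  wire [ADDR_WIDTH-1:0] write_address,
    input  wire [WORD_WIDTH-1:0] data_in,
    input  wire                  read_clock,
    input  wire [ADDR_WIDTH-1:0] read_address,
    output reg  [WORD_WIDTH-1:0] data_out
);
    reg [WORD_WIDTH-1:0] memory [0:(1 << ADDR_WIDTH)-1];

    always @ (posedge write_clock)
        if (write_enable) memory[write_address] <= data_in; // CPU-side write port

    always @ (posedge read_clock)
        data_out <= memory[read_address];                   // VGA-side read port
endmodule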

Memory is no good if the processor has no method of accessing the data stored in it, so a dedicated memory controller module was created in order to coordinate access between the system RAM and the VRAM. Additional functionality was implemented in this module to assist with operations such as jumping to different addresses (explained in section 3.2.4), storing values to memory and loading values from memory, by setting the correct control lines.

Figure 10: Memory Controller block

Further logic was implemented in the form of a datapath controller – this module connects to the outputs of both the system RAM and the VRAM and routes information onto the appropriate busses. The datapath controller acts similarly to a multiplexer, dictating the flow of data depending on an input; it is vital that information is correctly routed, as a loaded value should not cause the current instruction line to change, which could cause unexpected behaviour. In conjunction with the aforementioned memory controller module, the two work together to enable the jump, load and store classes of instructions.

Figure 11: Datapath Controller block

3.3.4 Decision Unit

Similar to the ALU, the decision unit is built around a case statement which is triggered when the right opcodes are executed. It requires the current address from the program counter, the destination address fetched from memory and the ALU flags as inputs before it can decide on the correct address to jump to. As mentioned in section 3.2.4, if a conditional jump instruction fails, the program counter still needs to be incremented by two so that the address word of the jump instruction is not accidentally executed. This is due to the jump instruction being 32 bits long instead of the regular 16 bits; the result of a failed conditional jump is therefore always program_counter = program_counter + 2, as shown in figure 12.

Figure 12: Decision Unit block

Figure 13 shows how the logic for a conditional and unconditional jump is implemented in the decision unit.

always @ (*) begin // sensitivity list widened to include program_counter_address
    case (instruction[15:8])

        JMP: new_address = peek_jump_address;

        JE:  new_address = (flags[ZERO] == 1) ? peek_jump_address : program_counter_address + 2;
        JNE: new_address = (flags[ZERO] == 0) ? peek_jump_address : program_counter_address + 2;
Figure 13: Jump instruction implementation in the decision unit

Two simple programs are presented in figures 14 and 15, with their accompanying ModelSim simulation waveforms in figures 16 and 17 respectively; these figures help to illustrate the behaviour described previously. Each program simply moves a value into a register (line 1) and then compares the value in the register with an immediate value (line 2); the result of the comparison (which has modified the ALU flags) decides which address the processor jumps to.

mov_cmp.hex

1 2805 // mov a, 5
2 3005 // cmp a, 5
3 1500 // je 0008
4 0008 // (address)
5 0000 // nop
6 0000 // nop
7 0000 // nop
8 0000 // nop
9 0100 // halt (jump target)
Figure 14: Successful conditional jump

mov_cmp.hex

1 2805 // mov a, 5
2 3006 // cmp a, 6
3 1500 // je 0008
4 0008 // (address)
5 0100 // halt
6 0000 // nop
7 0000 // nop
8 0000 // nop
9 0100 // halt (jump target)
Figure 15: Failed conditional jump
Figure 16: Successful conditional jump waveform (note the program counter)
Figure 17: Failed conditional jump waveform (note the program counter)

It can be seen from the simulation waveform figures 16 and 17 that the CPU has successfully jumped to the appropriate address given the result of the previous instructions.

3.3.5 Glue Logic

PCMag (n.d. (b)) defines “glue logic” as “a simple logic circuit that is used to connect complex logic circuits together” – instances of these circuits can be seen all throughout the project in different forms. For example, one circuit acts as decision logic for the multiplexers, helping select the correct input based on the instruction, as seen in figure 19; another acts as a look-up table providing 16 different colours for the video output, as seen in figure 20; and another handles the logic required for the halt instruction, as shown by figure 18.

module halt_check # (
    parameter OPCODE_SIZE = 8
) (
    input wire [OPCODE_SIZE-1:0] opcode,
    output reg result
);
    always_comb begin
        if (opcode == 8'h01) begin
            result = 0;
        end else begin
            result = 1;
        end
    end
endmodule
Figure 18: Check whether a halt instruction has been issued
module muxer_select_decision_logic_b # (
    parameter WORD_SIZE = 16,
    parameter DEFAULT_IMM_SELECTOR_VALUE = 0
) (
    input wire [WORD_SIZE-1:0] instruction,
    input wire imm_flag, clock,
    output reg [3:0] mode_select
);

    always @ (instruction) begin

        // if the instruction contains an immediate value
        if (instruction[15:8] >= 8'h28 & instruction[15:8] <= 8'h8f) begin
            mode_select <= DEFAULT_IMM_SELECTOR_VALUE;
        // if the instruction does not contain an immediate value but affects the registers
        end else if (instruction[15:8] >= 8'h06 & instruction[15:8] <= 8'h13) begin
            mode_select <= {1'b0, instruction[2:0]};
        end else begin
            mode_select <= 'z;
        end

    end
endmodule
Figure 19: Logic to select the correct value to load into the ALU
module ega_colour_palette_logic (
    input wire [3:0] index,
    output reg [5:0] colour
);

    always_comb begin

        case (index)
            // https://en.wikipedia.org/wiki/Enhanced_Graphics_Adapter#Color_palette

            // formatting is rgbRGB where lowercase = low intensity, uppercase = high intensity
            // 5 4 3 2 1 0
            // r g b R G B
            0:  colour <= 6'b000000; // black
            1:  colour <= 6'b000001; // blue
            2:  colour <= 6'b000010; // green
            3:  colour <= 6'b000011; // cyan
            4:  colour <= 6'b000100; // red
            5:  colour <= 6'b000101; // magenta
            6:  colour <= 6'b010100; // brown
            7:  colour <= 6'b000111; // white/light grey
            8:  colour <= 6'b111000; // dark grey / bright black
            9:  colour <= 6'b111001; // bright blue
            10: colour <= 6'b111010; // bright green
            11: colour <= 6'b111011; // bright cyan
            12: colour <= 6'b111100; // bright red
            13: colour <= 6'b111101; // bright magenta
            14: colour <= 6'b111110; // bright yellow
            15: colour <= 6'b111111; // bright white
            default: colour <= 0;

        endcase

    end

endmodule
Figure 20: Lookup table to convert pixel information into appropriate colour lines

3.3.6 Video Graphics Array (VGA)

The clock required for VGA is achieved by using a phase-locked loop (PLL) within the FPGA, which is provided as an intellectual property (IP) package by Altera. Counters which track the x and y position on the screen are updated on each rising edge of the pixel clock. It is important to note that despite running the screen at 640x480, the counters are required by the specification to count above those values – this is due to the need for blanking intervals, as the signal was originally developed with CRT monitors in mind, which required time to reposition the electron gun so that the image was drawn correctly; these blanking intervals are illustrated in figure 21.

Figure 21: VGA timing diagram (DigiKey, 2021)

TinyVGA (2008) provides the timing information required to correctly drive the VGA display, such as the counts for the front & back porch and precisely how long the sync pulse must be. A Verilog implementation for a VGA driver can be seen in figure 22.

module vga_driver (
    input wire clock_25mhz,
    output reg [9:0] x, y,
    output wire hsync, vsync, in_active_area // continuously assigned, so declared as wires
);

    initial begin
        x <= 0; y <= 0;
    end

    assign in_active_area = (x <= 639 && y <= 479);
    assign hsync = ~(x >= 656 && x <= 751); // 96-pixel sync pulse after the 16-pixel front porch
    assign vsync = ~(y >= 490 && y <= 491); // 2-line sync pulse after the 10-line front porch

    always @ (posedge clock_25mhz) begin

        if (x == 799) begin
            x <= 0;
            y <= (y == 524) ? 0 : y + 1;
        end else begin
            x <= x + 1;
        end

    end

endmodule
Figure 22: Verilog implementation of a VGA driver

From figure 22 it can be seen that the module uses both continuous assignment, via the “assign” keyword, and sequential logic driven by the clock. On every positive edge of the clock, the module generates an output determined by the state of the x and y registers: x = 799 indicates the end of a line, so the x counter is reset and y is incremented, unless y = 524, which indicates the end of the screen, in which case y is also reset.

The continuously driven values ensure that the signals are always valid, which is vital for the h-sync and v-sync signals, as these are how the driver synchronises with the monitor itself.

4 Evaluation

4.1 Testing

The importance of testing cannot be overstated for any FPGA project – the time taken for the design suite to compile, synthesise and fit the design is considerable compared to simply running a simulation; therefore, testing was performed in two ways – via Verilog “testbenches” with ModelSim, and by physically burning the design onto the FPGA and connecting to its pins.

Firstly, Verilog testbenches were created for every module present in the system; these testbenches are similar to the modules themselves, but the code instead instantiates the module under test and probes its input and output ports. In conjunction with simulation software, it is possible to analyse the input and output waveforms to determine whether the module is functioning as expected.
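As a small illustration, a testbench for the halt_check module shown in figure 18 might look as follows (a minimal sketch; the project’s actual testbenches are more involved):

// Testbench sketch: instantiate halt_check, drive its input, observe its output.
`timescale 1ns/1ns
module halt_check_tb;
    reg  [7:0] opcode;
    wire       result;

    halt_check dut (.opcode(opcode), .result(result));

    initial begin
        opcode = 8'h00; #10; // nop: result expected to stay high
        $display("opcode=%h result=%b", opcode, result);
        opcode = 8'h01; #10; // halt: result expected to go low
        $display("opcode=%h result=%b", opcode, result);
        $finish;
    end
endmodule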

The difficulty of this testing scales as the project becomes more complex, as testbenches must be made which represent the final system in order to analyse how each module interacts with the others – this results in a significant increase in the number of signals which need to be analysed to debug behaviour, increasing both complexity and the time required. Furthermore, this type of analysis simply cannot cover all possible input permutations, which may result in some edge-case bug going undiscovered – it is possible to write specific test cases for these edge cases, but that requires the developer to be aware of them.

The second method of testing utilised in this project was used only for determining whether the VGA signal and colours were being generated properly – this was quickly determined by whether or not the monitor would successfully recognise the signal. If successful, the test showed that the VGA driver and VRAM worked as expected; otherwise, the traditional functional analysis described above was used to determine the exact fault in the system.

Figure 23: Successful drawing of a test pattern using an FPGA

An alternative to these methodologies would be “formal verification, where it is rigorously proved that the system functions correctly on all possible inputs” (Harrison, n.d.). Harrison (n.d., p. 2) continues, stating that “this involves forming mathematical models of the system and its intended behaviour, and linking the two”. This form of testing is much more likely to find issues with the design compared to functional analysis; however, the skills and knowledge needed to determine which type of formal verification to use, how to apply it appropriately to the project and how to interpret its results are far outside the scope of this project. This means that, despite all the testing attempts, there is a chance that a bug remains in the design which simply has not been discovered because the right conditions for it have not been met.

4.2 Operation

It is impossible to state concretely whether the CPU will function as expected for every possible input permutation, as discussed in section 4.1; despite this, we can write code, analyse the output and determine from that whether the CPU operates in a predictable manner.

A simple program is illustrated in figure 24 - it loads register A with 0 and register B with 1, performs an addition A = A + B and then halts.

simple_add.hex

1 2800 // mov a, 0
2 2901 // mov b, 1
3 0b01 // add a, b
4 0100 // halt
Figure 24: Simple addition program
Figure 25: Simple addition program waveform

It can be seen from the waveforms presented in figure 25 that the outcome of the program is exactly as expected – register A is initialised to the value 0, the CPU then performs 0 + 1 = 1, and the result is stored in register A.

A variant of the previous program is presented in figure 26 – this program initialises register A with the value 0, and then repeatedly adds 1 to register A via the use of jumps.

loop_add.hex

1 2800 // mov a, 0
2 2901 // mov b, 1
3 0b01 // add a, b
4 1400 // jmp
5 0002 // (address)
6 0100 // halt
Figure 26: Looping addition program
Figure 27: Looping addition program waveform

The final waveform for the program presented in figure 27 showcases that the CPU is capable of handling unconditional jumps – due to the way the CPU is designed, this functionality also extends to conditional jumps.

A more complex program is presented in figure 28 – it implements the Fibonacci sequence as described by Chandra and Weisstein (2022): F(n) = F(n-1) + F(n-2), where F(1) = F(2) = 1 and, by convention, F(0) = 0. The expected output is 1, 1, 2, 3, 5, 8, 13, 21 … and so on, growing without bound. This sequence is implemented in figure 28, and the output of the program is again presented as a waveform in figure 29.

fibonacci.hex

1 2800 // mov a, 0
2 2901 // mov b, 1
3 0621 // mov c, b
4 0b10 // add b, a
5 0602 // mov a, c
6 0621 // mov c, b
7 0b10 // add b, a
8 30e9 // cmp a, 233
9 1500 // je
10 000c // (address)
11 1600 // jmp
12 0004 // (address)
13 0620 // mov c, a
14 0100 // halt
Figure 28: Fibonacci program
Figure 29: Fibonacci program waveform

It can be seen from the contents of register C that the CPU has successfully executed the Fibonacci sequence program, reaching the highest value of the sequence that fits into an 8-bit register – a comparison instruction and a conditional jump are used to detect this state and jump to the halt instruction to prevent an overflow. This program exercises multiple registers, arithmetic operations with both immediate and register-stored values, the comparison instruction, and both conditional and unconditional jumps.

Furthermore, the CPU implements instructions (see appendix A) which enable it to load and store values in both system and video memory – a use of the load instruction on system memory is showcased below, and the principle is the same for the other types of memory access instructions.

load_store.hex

1 2c00 // mov e, 0
2 2d04 // mov f, 4
3 2801 // mov a, 1
4 0200 // load
5 0b50 // add f, a
6 0300 // store
7 0100 // halt
8 beef // data
9 0000 // target memory location
Figure 30: Load and Store program
Figure 31: Load and Store program waveform
Figure 32: System memory contents after the execution of the program - note the new value "beef" at address 8

The steps taken to execute the program can be seen by the “control line” signal, present in every waveform above; these signals correspond to the microcode format presented in figure 3, and can be decoded in order to determine exactly what micro-instruction the CPU was performing to achieve the operation requested.

As mentioned in section 3.2.1, microcode increases execution time, as multiple clock cycles are required to step through each micro-instruction – as a result, it is difficult to determine the performance of the CPU, as it depends entirely on the composition of the program. For example, a NOP instruction has one micro-instruction, an ADD instruction has two, a LOAD has five and any jump instruction has three. Therefore, a program composed purely of NOPs would execute considerably faster than a program composed of LOADs – but such programs are unrealistic, as they are non-functional in the sense that they do not compute anything. A more realistic expectation of performance falls somewhere between the time it takes to execute a NOP and the time it takes to execute a LOAD instruction.

This performance can be quantified by calculating the number of instructions which the processor executes per second. IBM (1995, 6-9) provides an equation for MIPS (short for “million instructions per second”) as follows: MIPS = clock rate / (clocks per instruction x 10^6)

An assumption will be made that the clock rate will be fixed to 50MHz (50,000,000Hz) as this is the maximum clock speed available on the FPGA development board. The number of cycles an instruction takes to execute is determined from the simulation waveforms.

A NOP instruction takes 3 clock cycles to execute, therefore a program consisting of all NOPs (or equivalent-length instructions) has a MIPS of: MIPS(NOP) = 50,000,000 / (3 x 10^6) = 16.67

An ADD instruction takes 4 clock cycles to execute, therefore a program consisting of only ADDs (or equivalent-length instructions) has a MIPS of: MIPS(ADD) = 50,000,000 / (4 x 10^6) = 12.50

A LOAD instruction takes 7 clock cycles to execute, therefore a program consisting of only LOADs (or equivalent-length instructions) has a MIPS of: MIPS(LOAD) = 50,000,000 / (7 x 10^6) = 7.14

This fails to account for the fact that both the E and F registers must be loaded with the appropriate address before a LOAD instruction can occur; the true MIPS is therefore lower, as it takes 15 cycles to execute the two move instructions and the load: MIPS(MOVE&LOAD) = 50,000,000 / (15 × 10^6) = 3.33
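
These single-instruction figures can be combined into an estimate for a whole program. The expression below is a generalisation made for this report (the notation p_i and CPI_i is introduced here, not taken from IBM (1995)): the average cycles per instruction is the mix-weighted sum of each instruction type's cycle count.

\[
\overline{\mathrm{CPI}} = \sum_i p_i \,\mathrm{CPI}_i,
\qquad
\mathrm{MIPS} = \frac{f_{\mathrm{clk}}}{\overline{\mathrm{CPI}} \times 10^6}
\]

For example, a program that is half ADDs (4 cycles) and half LOADs (7 cycles) has an average CPI of 5.5, giving MIPS = 50,000,000 / (5.5 × 10^6) ≈ 9.09 – between the NOP and LOAD extremes, as expected.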

The MIPS metric itself can be misleading, as it does not account for some key factors when it comes to performance as described by IBM (1995):

  • MIPS cannot be used to compare machines with different ISAs.
  • MIPS cannot compare different programs on the same computer, as the instructions used to execute each program could vary drastically.

As a result, it is difficult to accurately compare the performance of this CPU to others in its class – perhaps a combination of multiple metrics would prove to be a better judge of performance.

5 Conclusion

To conclude, the project successfully achieves the aims set out in the introduction and presents a functional CPU capable of executing user programs without fault, as presented in sections 3.3.4 and 4.2. As a result of this success, invaluable knowledge about CPU organisation was gained, leading to a greater understanding of the fundamental link between hardware and software.

The flexibility provided by FPGAs comes with an incredibly steep learning curve – not only does the developer have to learn the tools at hand and understand how those tools assist in reaching the goal, but they must also be able to explore the problem space of their particular task and derive a series of steps which can successfully lead them to the end goal. In the context of this project, an understanding of the inner workings of a CPU was gained by researching multiple CPU architectures and examining their behaviour, available architecture documents and other resources available around the internet – the culmination of this information was then used to influence the final design of the CPU.

The architecture of a processor is often hidden away from the end user behind mountains of abstraction and high-level languages; as a result, people often have a weak understanding of how certain concepts, such as pointers, work if they do not understand how they are implemented in machine language (Nakano and Ito, 2008). Furthermore, the concept of the fetch-decode-execute cycle only abstracts the CPU further, boiling the entire process down into three words and failing to convey the complexity of each stage and the coordination required.

Through the creation of the CPU, a much richer understanding of how a computer system works as a unit was gained; most importantly, it showcased the intricacies of how a CPU performs its tasks, illustrating how an instruction set can be mapped onto an underlying architecture. The synergy between the instruction set and the silicon below it is often overlooked, but without it modern-day computing would not be a reality.

6 Reflective Analysis

It is important to note that the project deviated from the original proposal of attempting to make cycle-accurate versions of the MOS 6502 and the Manchester Baby – the change was a natural progression after seeing how quickly I had implemented a traffic light circuit based on a counter and decoder combination. Despite being nowhere near the complexity of a CPU, it showed me that the language was easy to work with and that I understood, to an extent, how circuits can be made.

Ultimately the project was a success – the processor is capable of computation, as shown by the Fibonacci sequence, and its output has a meaningful use as it can be stored into VRAM to be displayed on a monitor. Unfortunately, the I/O is very rudimentary and would have definitely benefited from some form of input – a keyboard at minimum – to enable more possibilities such as playing a game or navigating menus. This additional feature would require modifying the processor to implement an interrupt handler, along with a method to handle the input itself – such as memory-mapping the keyboard or polling it.

The development side of the project provided me with invaluable knowledge of how to design and test circuits from scratch using HDLs; learning a vendor suite also means that I am now aware of the process it takes to transfer a design from HDL onto physical silicon. Despite being locked into one vendor for the entirety of the project, it has provided transferable skills such as debugging, choosing appropriate pin assignments, and an overall understanding of what Verilog is capable of.

I think the success of the project can be partially attributed to the abstractions and ease of use that Verilog provides – the C-like syntax certainly helps in easing the learning curve (at least in my case), but there are new ideas to get used to, such as sensitivity lists, having to declare the direction of a port (i.e. input or output), understanding the difference between blocking and non-blocking assignments, and getting past the quirk of using “begin” and “end” instead of the curly brackets found in every other C-like language.
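
As a brief illustration of those quirks, consider the generic sketch below – it was written for this report and is not code taken from the project CPU:

module quirks_demo (
    input  wire       clk,   // port directions must be declared
    input  wire [7:0] in_a,
    input  wire [7:0] in_b,
    output reg  [7:0] sum
);
    reg [7:0] a = 8'h01;
    reg [7:0] b = 8'h02;

    // Sensitivity list: @(*) re-evaluates this block whenever any signal
    // it reads changes; blocking assignment (=) is the convention for
    // combinational logic. Note "begin"/"end" in place of curly brackets.
    always @(*) begin
        sum = in_a + in_b;
    end

    // Non-blocking assignments (<=) sample both right-hand sides at the
    // clock edge before updating, so a and b swap values; with blocking
    // assignments (=) both would end up holding the same value.
    always @(posedge clk) begin
        a <= b;
        b <= a;
    end
endmodule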

Furthermore, debugging is a completely different experience when compared to regular software development; there is no concept of a breakpoint or stepping through the code, so you are forced to use simulation software to plot waveforms and analyse the behaviour to make sense of it – this gets really difficult in a complex system, where there are many sub-modules and possibly 20 to 30+ different signals to keep track of.
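
For context, the general shape of that workflow is sketched below. This is a generic testbench (not one of the project's actual testbenches, and the instantiated module is hypothetical): it drives a clock and dumps every signal transition to a VCD file, which a waveform viewer such as the one in ModelSim can then display.

`timescale 1ns / 1ps
module tb;
    reg clk = 1'b0;

    always #10 clk = ~clk;        // 50 MHz clock: toggle every 10 ns

    // cpu dut (.clk(clk), ...);  // the design under test would be
                                  // instantiated and wired up here

    initial begin
        $dumpfile("trace.vcd");   // record all signal transitions...
        $dumpvars(0, tb);         // ...for everything beneath tb
        #100000 $finish;          // run for a while, then stop
    end
endmodule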

Given another chance to do this project, there are a multitude of things I would change about this particular implementation.

The CPU architecture is incredibly simple – it lacks any form of the traditional buses you would see in a typical CPU on the market right now, which makes connecting components like the ALU and registers together incredibly tedious – roughly three wires must be connected for each to function, and that is before reworking the multiplexers and de-multiplexer at both ends of the ALU.

Furthermore, the CPU is not pipelined – every instruction has to finish before the next one is even fetched. This increases execution time: while one instruction is being executed, another could already have been fetched and decoded. For a small-scale project the execution time does not matter much, as the CPU will never be used in a time-sensitive situation, but pipelining is a key feature of modern-day CPUs and it is important to understand how it works.
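
As a rough sketch of the concept only (it is generic, and deliberately ignores hazards such as a jump invalidating an already-fetched instruction), the fetch logic below reads a new instruction on every clock edge while the execute stage consumes the one fetched on the previous cycle:

module pipeline_sketch (
    input  wire        clk,
    input  wire [15:0] imem_data,  // instruction word at the current PC
    output reg  [7:0]  pc
);
    // IF/EX pipeline register: holds the previously fetched instruction
    // so it can be executed while the next fetch is already in flight.
    reg [15:0] fetched;

    initial pc = 8'h00;

    always @(posedge clk) begin
        fetched <= imem_data;  // fetch stage: grab the next instruction
        pc      <= pc + 8'd1;  // ...and advance the program counter
        // an execute stage (not shown) would consume `fetched` here,
        // in parallel with the fetch above
    end
endmodule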

Additionally, the CPU lacks any form of floating-point operations, making floating-point maths essentially impossible unless fixed-point arithmetic is emulated – however, it is not quite clear how feasible that would be, or how easy it would be to implement with the architecture's current instruction set.
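
To illustrate the fixed-point idea on its own (a standalone Verilog sketch of the concept, not a proposal for how it would map onto the existing instruction set), an unsigned Q4.4 format keeps 4 integer and 4 fraction bits and reuses plain integer hardware:

module q44_mul (
    input  wire [7:0] a,   // Q4.4 operand, e.g. 8'h18 represents 1.5
    input  wire [7:0] b,   // Q4.4 operand, e.g. 8'h20 represents 2.0
    output wire [7:0] p    // Q4.4 product (overflow bits truncated)
);
    wire [15:0] full = a * b;  // Q8.8 intermediate result
    assign p = full[11:4];     // realign the binary point back to Q4.4
endmodule

With the example values, full = 16'h0300 and p = 8'h30, i.e. 1.5 × 2.0 = 3.0.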

Although it would be a very advanced feature to implement, some form of branch prediction would be fascinating to see – unfortunately, there is not much information available about how a branch predictor is implemented, so it is unclear how one would fit in with the project processor.

Unlike x86 with its MASM and NASM, the project CPU lacks any form of an assembler – the key component in translating assembly code into machine-specific code, and one which could eventually provide the foundations of a toolchain for higher-level, more abstracted languages. Unfortunately, building one was not in the scope of this project, as it would take too much time away from other studies and from the CPU itself. This means that any program written for the CPU has to be written in its machine code, which is both tedious and very error-prone – for example, modern assemblers provide features such as labels, which allow the programmer to denote a particular memory location with a symbol they can refer to later; with this CPU there are no such features, since you are working with machine code directly, so it is up to you to ensure that the correct memory locations are being referenced. Not only is writing the code tedious, but debugging is only possible with software like ModelSim, plotting the waveforms and then analysing them to see what is wrong.

I do not think that I would have learned as much about CPU design if I had copied an existing model – there are countless intricate details which need to be considered for all of the digital machinery to function in harmony.

The CPU is fully open-source and is available at https://github.com/krisjdev/tau-processor.

Bibliography

Xilinx (n.d. (a)). “What is an FPGA?”. Available from https://www.xilinx.com/products/silicon-devices/fpga/what-is-an-fpga.html (accessed 9 November 2021).

Nakano, K. and Ito, Y. (2008). “Processor, Assembler, and Compiler Design Education using an FPGA”. Available from https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4724385 (accessed 14 November 2021).

Tong, Jason G., Anderson, Ian D. L. and Khalid, Mohammed A. S. (2006). “Soft-core Processors for Embedded Systems”. Available from https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4243676 (accessed 15 November 2021).

Lee, J. H., Lee, S. E., Yu, H. C. and Suh, T. (2012). “Pipelined CPU Design With FPGA in Teaching Computer Architecture”. Available from https://ieeexplore.ieee.org/document/6093707 (accessed 18 February 2022).

Yıdız, A., Ugurdag, H. F., Aktemur, B., İskender, D. and Gören, S. (2018). “CPU Design Simplified”. Available from https://ieeexplore.ieee.org/document/8566475 (accessed 18 February 2022).

Microsemi. (n.d.). “Soft CPUs”. Available from https://www.microsemi.com/product-directory/mi-v-ecosystem/5093-other-cpus (accessed 19 February 2022).

Fox, C. (n.d.). “Computer Architecture: from the stone age to the quantum age”. San Francisco: No Starch Press, in press.

Eljhani, M. M. and Kepuska, V. Z. (2021). “Reduced Instruction Set Computer Design on FPGA”. Available from https://ieeexplore.ieee.org/document/9464409 (accessed 19 February 2022).

Yhang, Y. and Bao, L. (2011). “The Design of an 8-bit CISC CPU Based on FPGA”. Available from https://ieeexplore.ieee.org/document/6040633 (accessed 20 February 2022).

Koopman, P. (1989). “What is a stack?”. Available from https://users.ece.cmu.edu/~koopman/stack_computers/sec1_2.html (accessed 19 February 2022).

Monk, S. (2017). “Programming FPGAs: Getting Started with Verilog”. New York: McGraw Hill.

TinyVGA. (2008). “VGA Signal 640 x 480 @ 60 Hz Industry standard timing”. Available from http://www.tinyvga.com/vga-timing/640x480@60Hz (accessed 15 May 2022).

PCMag. (n.d. (a)). “specialized processor”. Available from https://www.pcmag.com/encyclopedia/term/specialized-processor (accessed 2 May 2022).

ARM. (n.d. (a)). “Instruction Set Architecture (ISA)”. Available from https://www.arm.com/glossary/isa (accessed 3 May 2022).

Null, L. and Lobur, J. (2018). “Essentials of Computer Organization and Architecture”, Fifth Edition. Burlington: Jones & Bartlett Learning.

ARM. (2005). “ARM Architecture Reference Manual”. Available from https://documentation-service.arm.com/static/5f8dacc8f86e16515cdb865a (accessed 5 May 2022).

Intel. (2021). “Intel 64 and IA-32 Architectures Software Developer’s Manual – Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4”. Available from https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html (accessed 14 April 2022).

MOS. (1976). “MCS6500 MICROCOMPUTER FAMILY PROGRAMMING MANUAL”. Available from http://archive.6502.org/books/mcs6500_family_programming_manual.pdf (accessed 20 May 2022).

Wilson, P. and Mantooth, H. A. (2013). “Model-Based Engineering for Complex Electronic Systems”. Oxford: Elsevier.

Eigenmann, R. and Lilja, D. (1998). “Von Neumann Computers”. Available from http://public.callutheran.edu/~reinhart/CSC521/Week3/VonNeumann.pdf (accessed 7 May 2022).

IBM. (2022). “Absolute addressing”. Available from https://www.ibm.com/docs/en/aix/7.2?topic=addressing-absolute (accessed 10 May 2022).

nandland. (n.d.). “What is a Block RAM (BRAM) in an FPGA? Tutorial for beginners”. Available from https://www.nandland.com/articles/block-ram-in-fpga.html (accessed 10 May 2022).

PCMag. (n.d. (b)). “glue logic”. Available from https://www.pcmag.com/encyclopedia/term/glue-logic (accessed 10 May 2022).

Harrison, J. (n.d.). “Formal verification”. Available from https://www.cl.cam.ac.uk/~jrh13/papers/mark10.pdf (accessed 16 May 2022).

Chandra, P. and Weisstein, E. W. (2022). “Fibonacci Number”. Available from https://mathworld.wolfram.com/FibonacciNumber.html (accessed 17 May 2022).

IBM. (1995). “6 Measuring Performance”. Available from https://u.cs.biu.ac.il/~wisemay/co/co6.pdf (accessed 18 May 2022).

Appendix A

Presented below is the implemented instruction set of the processor. Some notes:

  • For instructions where there is an immediate value, the opcode range indicates which register the CPU will target. For example, mov a, imm8 has an opcode of 28h, and mov b, imm8 has an opcode of 29h – this is repeated for all registers (8 total – A, B, C, D, E, F, G & H); a decoding sketch is given after these notes.
  • Some instructions combine the values of registers for their own purposes; this is presented with curly braces, e.g. {E, F} indicates that the contents of the E and F registers are concatenated together, i.e. if E = 0xbe and F = 0xef then {E, F} = 0xbeef.
  • There are four ALU flags - Zero (Z), Sign (S), Carry (C) and Overflow (O). They are referred to by their letter.
  • On a failed conditional jump, the program counter will always be incremented by two.
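
The first two notes map directly onto Verilog. The sketch below is illustrative only (it is not the project's actual decoder) and assumes nothing beyond what the notes state: the low three bits of an immediate-form opcode select the target register, and {E, F} is a plain concatenation.

module decode_sketch (
    input  wire [7:0]  opcode,
    input  wire [7:0]  E, F,       // current register contents
    output wire [2:0]  reg_index,  // 0 = A, 1 = B, ..., 7 = H
    output wire [4:0]  op_group,   // e.g. 5'b00101 selects mov reg, imm8 (28h-2Fh)
    output wire [15:0] address
);
    assign reg_index = opcode[2:0];  // which register the opcode targets
    assign op_group  = opcode[7:3];  // which operation the opcode performs
    assign address   = {E, F};       // if E = 8'hbe and F = 8'hef,
                                     // then address = 16'hbeef
endmodule
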
Opcode (Base 16) Mnemonic Action ALU Flags Affected
00 nop Nothing is executed for one machine cycle None
01 hlt Stops execution completely - machine must be reset None
02 load Indexes system memory at address {E, F} and stores the value in {G, H} None
03 store Stores value of {G, H} in system memory at address {E, F} None
04 loadv Indexes video memory at address {E, F} and stores value in {G, H} None
05 storev Stores value of {G, H} in video memory at address {E, F} None
06 mov reg1, reg2 reg1 = reg2 None
07 cmp reg1, reg2 Compare if reg1 == reg2 by performing reg1 - reg2. Does not affect the registers Z, S, C, O
08 test reg1, reg2 Performs bitwise AND on reg1 and reg2 Z, S. C & O = 0
09 shl reg1, reg2 reg1 = reg1 << reg2 Z, S, C
0A shr reg1, reg2 reg1 = reg1 >> reg2 Z, S, C
0B add reg1, reg2 reg1 = reg1 + reg2 Z, S, C, O
0C adc reg1, reg2 reg1 = reg1 + reg2 + C Z, S, C, O
0D sub reg1, reg2 reg1 = reg1 - reg2 Z, S, C, O
0E sbb reg1, reg2 reg1 = reg1 - (reg2 + C) Z, S, C, O
0F mul reg1, reg2 reg1 = reg1 * reg2 Z, S
10 and reg1, reg2 reg1 = reg1 & reg2 Z, S. C & O = 0
11 or reg1, reg2 reg1 = reg1 | reg2 Z, S. C & O = 0
12 xor reg1, reg2 reg1 = reg1 ^ reg2 Z, S. C & O = 0
13 not reg1 reg1 = ~reg1 None
14 jmp [address] Unconditional jump - program counter = address None
15 je [address] program counter = address if Z = 1 None
16 jne [address] program counter = address if Z = 0 None
17 jc [address] program counter = address if C = 1 None
18 jnc [address] program counter = address if C = 0 None
19 js [address] program counter = address if S = 1 None
1A jns [address] program counter = address if S = 0 None
1B jo [address] program counter = address if O = 1 None
1C jno [address] program counter = address if O = 0 None
1D ja [address] program counter = address if C | Z = 0 None
1E jae [address] program counter = address if C = 0 None
1F jb [address] program counter = address if C = 1 None
20 jbe [address] program counter = address if C | Z = 1 None
21 jg [address] program counter = address if (S ^ O) | Z = 0 None
22 jge [address] program counter = address if S ^ O = 0 None
23 jl [address] program counter = address if S ^ O = 1 None
24 jle [address] program counter = address if (S ^ O) | Z = 1 None
25, 26, 27 undefined nop None
28 to 2F mov a-h, imm8 reg1 = imm8 None
30 to 37 cmp a-h, imm8 Compare if reg1 == imm8 by performing reg1 - imm8. Does not affect registers Z, S, C, O
38 to 3F test a-h, imm8 Performs bitwise AND on reg1 and imm8 Z, S. C & O = 0
40 to 47 shl a-h, imm8 reg1 = reg1 << imm8 Z, S, C
48 to 4F shr a-h, imm8 reg1 = reg1 >> imm8 Z, S, C
50 to 57 add a-h, imm8 reg1 = reg1 + imm8 Z, S, C, O
58 to 5F adc a-h, imm8 reg1 = reg1 + imm8 + C Z, S, C, O
60 to 67 sub a-h, imm8 reg1 = reg1 - imm8 Z, S, C, O
68 to 6F sbb a-h, imm8 reg1 = reg1 - (imm8 + C) Z, S, C, O
70 to 77 mul a-h, imm8 reg1 = reg1 * imm8 Z, S
78 to 7F and a-h, imm8 reg1 = reg1 & imm8 Z, S. C & O = 0
80 to 87 or a-h, imm8 reg1 = reg1 | imm8 Z, S. C & O = 0
88 to 8F xor a-h, imm8 reg1 = reg1 ^ imm8 Z, S. C & O = 0