Writing an Assembler Part 1 - Learn, Tinker, Build

Introduction

As part of the series on The Art of Computer Programming I have been creating assembly programs for the various algorithms and problem sets. For the early entries it is fairly easy to convert the assembly by hand into the machine code needed for entry into the Altair 8800. As the tasks progress, however, it will make more sense to generate the machine code with an assembler. I could locate an assembler that is already available and save myself the step of writing my own, but that doesn’t sound as fun, and at this point I prefer octal output for easy entry via the front panel.

The reason for building this assembler in Ruby, as opposed to C/C++ or some other language, is my personal familiarity. I use Ruby as part of my day job and am quite familiar with using it to solve complex and unusual problems. I may come back and rewrite this in C at some point, but that will probably be when I reach the point of feeding the machine code via a binary transfer as opposed to the front panel.

Requirements

The Assembler

The requirements for this assembler will be pretty simple and somewhat unique to the way the machine code is being used.

Take in Assembly code as a file
Output machine code in Octal for front panel entry
Support for Labels
Support for Header files

The first requirement is fairly self-explanatory, as there must be some way to get the assembly code to the assembler to be converted to machine code. This will be as simple as taking the location of the assembly code file as a parameter to the assembler and reading the file into memory. If the assembly programs being assembled get very large, it might become necessary to read only parts of the file into memory and parse the program in sections. Given that the memory available on the 8800 is much smaller than that of the desktop assembling the code, that won’t be a factor at this time.

The second requirement is also fairly straightforward at first glance, but on closer inspection it raises the question of where to send the output. The easiest solution would be to print the output to the terminal and then just use a redirect to send it to a file if needed. The better route would be to take a parameter to the assembler to let it know whether to write to a file or to the terminal. This will keep the assembler in line with how most standard assemblers and compilers operate. This parameter will be optional, and the default will be to write the output to a file with a name matching the assembly file passed to the assembler.
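As a rough illustration of that optional parameter, here is a minimal sketch using Ruby's standard OptionParser. The flag names (-o/--output and --stdout), the .oct extension, and the parse_options helper are all assumptions made for this example, not decisions that have been made for the assembler.

```ruby
require 'optparse'

# Hypothetical option handling: the flag names and the ".oct" extension
# are illustrative assumptions only.
def parse_options(argv)
  options = { stdout: false, output: nil }

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: assembler [options] SOURCE'
    opts.on('-o', '--output FILE', 'Write the octal listing to FILE') do |file|
      options[:output] = file
    end
    opts.on('--stdout', 'Print the listing to the terminal') do
      options[:stdout] = true
    end
  end

  # parse returns the non-option arguments; the first one is the source file.
  source = parser.parse(argv).first
  raise ArgumentError, 'no assembly file given' if source.nil?

  # Default: a file named after the source, written to the current directory.
  unless options[:stdout]
    options[:output] ||= File.basename(source, '.*') + '.oct'
  end
  options.merge(source: source)
end
```

With these assumptions, parse_options(['loop.asm']) picks loop.oct as the output file, while adding --stdout leaves the output file unset so the listing can go to the terminal.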

Supporting labels is a very important feature for the assembler as it will allow loops and jumps to function. Without labels the programmer would need to keep track of where the program is in memory to point back to the instruction they want to jump to. In order for this feature to work the assembler will need to keep a running list of the labels encountered and what memory address they are being placed in. The functioning of this feature is directly tied to the requirement of the assembly file to start with a header telling the assembler where the program should start in memory, otherwise the assembler will have to assume address zero is the starting point.
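The bookkeeping described above could look something like the following sketch. The LabelTable class and its method names are hypothetical, and storing names upper-cased is just one assumed way of ignoring case.

```ruby
# Hypothetical label table: maps each label name to the address of the
# next byte that will be emitted.
class LabelTable
  attr_reader :address

  def initialize(origin = 0)
    @address = origin # running address; zero when no start is given
    @labels = {}
  end

  # Record a label at the current running address.
  def define(name)
    key = name.upcase
    raise ArgumentError, "duplicate label #{name}" if @labels.key?(key)

    @labels[key] = @address
  end

  # Move the running address past the bytes just emitted.
  def advance(byte_count)
    @address += byte_count
  end

  # Returns the stored address, or nil if the label is not known yet.
  def lookup(name)
    @labels[name.upcase]
  end
end

table = LabelTable.new(0)
table.define('Start')
table.advance(2) # a two-byte instruction was emitted
table.define('Loop')
```

After these calls, looking up Loop gives address 2 and Start gives address 0, regardless of the case used in the lookup.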

The last major feature, supporting header files, will be just as simple as loading the assembly file. The major difference is that instead of taking the file locations from the command line, the assembler will find them in include statements in the assembly code. The real trick with this requirement will be determining where in memory to store the contents of the header files. Once that has been decided, there is also an opportunity for an optimization: only outputting the sections of the header files that the assembler sees being used. That optimization will be more of a stretch feature, since the first few programs won’t even need header files.

The Assembly Code

Like the assembler, the assembly files will also have some requirements around how they are structured so they can be parsed correctly.

Consistent header format with initialization information
Styling

The header section of the assembly file will contain information about the CPU to be assembled for, the header files to include, and where in memory to start the program. To start with, the assembler will only work for the 8080 CPU, but the plan is to expand this to incorporate the other CPUs I have available. The include statements will tell the assembler which headers to load and where to find them. The final important piece of information is where to place the start of the program in memory. For a program that has no jumps the start location isn’t important, but in order for the assembler to write jumps it needs to know where in memory the code is being placed.

The styling of lines will be important since it will determine the difference between a code line and a label line. For lines that contain code the mnemonic will be offset from the start of the line by 4 spaces. Lines that are labels will start immediately with the label name and end with a colon. For both types of lines the assembler will ignore case, which means case cannot be used for uniqueness for labels. For comments a semicolon encountered at any point on a line will set the remainder of the line as a comment to be ignored by the assembler.
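To make these styling rules concrete, here is a minimal sketch of how a single line might be classified. The classify_line helper and the symbols it returns are assumptions for illustration. Only label names are upper-cased here, and case folding for code lines is left to a later step.

```ruby
# Hypothetical line classifier based on the styling rules above.
# Returns a two-element array: the kind of line and its cleaned-up text.
def classify_line(line)
  code = line.sub(/;.*/, '').rstrip # drop any comment and trailing space
  return [:blank, nil] if code.strip.empty?

  if code =~ /\A(\w+):\z/
    # A label starts in the first column and ends with a colon.
    [:label, Regexp.last_match(1).upcase]
  elsif code.start_with?('    ')
    # A code line has its mnemonic offset by four spaces.
    [:code, code.strip]
  else
    [:unknown, code]
  end
end
```

Under these assumptions a line such as "Loop:" is reported as a label named LOOP, an indented "DCR A" line is reported as code, and a line holding only a comment is treated as blank.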

Step-wise Example

Example Assembly File

The following example file will be used in the subsequent step-by-step walk-through of how the assembler will parse and interpret the assembly code into machine code.

    PROCESSOR 8080
  
    INCLUDE 'macro.h'
  
    SEG code      ; Start of code segment
    ORG 0000H     ; Where in memory to start the program
  
 ; Simple program to run a loop 100 times
 Start:           ; Label to start the program
    MVI A, 100    ; Store 100 in the Accumulator
 Loop:
    DCR A         ; Decrement the Accumulator
    JNZ Loop      ; Jump to Loop if Accumulator not zero
  
    HLT

This program gives a nice, simple combination of comments, labels, and a loop for the assembler to work through.

Line by Line Assembling

The next step in this exercise is to step through each line and look at what the resulting output should be along with the information the assembler should be keeping track of.

    PROCESSOR 8080

This line tells the assembler to load the mnemonics for the 8080 to be used for this file. In the first iteration of the assembler 8080 will be the only option so this line could be ignored. That being said the functionality for the assembler to read and be able to load other mnemonics will be present so that it will be easier to expand later.

    INCLUDE 'macro.h'

This line tells the assembler to load the macro.h file and parse the code found there. For the first iteration of the assembler this will be handled by adding the file to an array of files to be parsed. The original assembly file will be parsed to completion before any of the included files. This does mean that the assembler will come across labels that have not yet been given a memory address. Those will be stored in a hash to be matched up with locations after the assembler’s first pass.
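One possible shape for that bookkeeping is sketched below. The ParseState class, its method names, and the main.asm file name are hypothetical. The hash maps each label that has no address yet to the output positions that will need patching later.

```ruby
# Hypothetical state kept while parsing: a queue of files still to be
# parsed and a hash of labels that were used before being defined.
class ParseState
  attr_reader :pending_files, :unresolved

  def initialize(main_file)
    @pending_files = [main_file] # the main file is parsed first
    @unresolved = Hash.new { |hash, label| hash[label] = [] }
  end

  # Called for each INCLUDE line; a file is only queued once.
  def include_file(path)
    @pending_files << path unless @pending_files.include?(path)
  end

  # Remember where in the output a not-yet-known address belongs.
  def note_unresolved(label, output_index)
    @unresolved[label.upcase] << output_index
  end
end

state = ParseState.new('main.asm')
state.include_file('macro.h')
state.note_unresolved('Delay', 4) # "Delay" is a made-up label name
```

Queuing included files behind the main file matches the order described above, and the list of positions per label gives the second pass everything it needs to fill in the addresses.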

    SEG code      ; Start of code segment

This line alerts the assembler that this is starting a code section. To start with there will be options for code and data. This provides a way to logically split the functional code from static data and allows them to be stored in separate sections of memory. This line also includes the first comment which will be ignored by the assembler.

    ORG 0000H     ; Where in memory to start the program

This line tells the assembler where to start placing code in memory. As part of this line the starting address is listed in hexadecimal as noted by the H at the end of the address location.
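Reading that address could be handled by a small helper along these lines. The parse_number name is hypothetical, and treating a literal with no suffix as decimal is an assumption based on the 100 used later in the example.

```ruby
# Hypothetical literal parser: a trailing H marks hexadecimal, and a
# bare run of digits is assumed to be decimal.
def parse_number(text)
  case text.strip
  when /\A(\h+)H\z/i then Regexp.last_match(1).to_i(16)
  when /\A(\d+)\z/ then Regexp.last_match(1).to_i(10)
  else raise ArgumentError, "unrecognised number: #{text}"
  end
end
```

With this sketch, 0000H parses to 0, 00FFH parses to 255, and 100 parses to one hundred.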

 ; Simple program to run a loop 100 times

This line is just a comment and will be ignored by the assembler.

Start:           ; Label to start the program

This is the first real line of the program. This line is a label, so the assembler will place the label name and its location of 0000H into the labels array in case it is referenced later. This line does not add anything to the output since it does not contain any operations or data.

    MVI A, 100    ; Store 100 in the Accumulator

Now we have reached an instruction. The initial output style will be one address per line in octal, to make setting the switches on the 8800 easier, so this line will add 076 followed by 144.
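Those two values can be derived rather than looked up. On the 8080 the move-immediate opcode has the bit layout 00 DDD 110, where DDD is the destination register code, and it is followed by the data byte. The sketch below assumes hypothetical helper names (encode_move_immediate and to_octal) and is not the assembler's final design.

```ruby
# 8080 register codes used in the DDD field of an opcode.
REGISTER_CODES = { 'B' => 0, 'C' => 1, 'D' => 2, 'E' => 3,
                   'H' => 4, 'L' => 5, 'M' => 6, 'A' => 7 }.freeze

# Hypothetical encoder for a move-immediate instruction: opcode
# 00 DDD 110 followed by the one-byte value.
def encode_move_immediate(register, value)
  raise ArgumentError, 'value must fit in one byte' unless (0..255).cover?(value)

  opcode = (REGISTER_CODES.fetch(register.upcase) << 3) | 0b110
  [opcode, value]
end

# Three octal digits per byte, matching the front panel switch groups.
def to_octal(byte)
  format('%03o', byte)
end

encode_move_immediate('A', 100).map { |byte| to_octal(byte) } # ["076", "144"]
```

The 2-3-3 bit grouping of the opcode is also why octal is convenient here: the middle octal digit of 076 is the register code for A, and the last digit is the fixed 110 pattern.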

Loop:

This is the second label to be added to the labels array. This time the associated address will be 0002H.

    DCR A         ; Decrement the Accumulator

This will add 075 to the output.

    JNZ Loop      ; Jump to Loop if Accumulator not zero

This jump instruction will make use of the label Loop that was stored two lines ago. The 8080 expects the jump address low byte first, so the output will have 302, 002, and 000 added.
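A minimal sketch of producing those three bytes follows. The 8080 takes the 16-bit target address low byte first after the opcode. The encode_jnz helper is a hypothetical name used only for this example.

```ruby
JNZ_OPCODE = 0o302 # jump if the zero flag is not set

# Hypothetical encoder: opcode, then the low byte of the address,
# then the high byte.
def encode_jnz(address)
  raise ArgumentError, 'address out of range' unless (0..0xFFFF).cover?(address)

  [JNZ_OPCODE, address & 0xFF, (address >> 8) & 0xFF]
end

encode_jnz(0x0002).map { |byte| format('%03o', byte) } # ["302", "002", "000"]
```

For an address above 00FFH the ordering becomes visible: 1234H would be emitted as the opcode followed by 34H and then 12H.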

    HLT

This is the final line and will halt execution at this point. The output will close out with 166.

Expected Output

Once the assembly is completed the assembler will have output a file that has the memory address and contents listed in two columns. This output is specifically tailored for hand entry on the 8800. For the example code the output will look as follows.

Address    Value
0000H      076
0001H      144
0002H      075
0003H      302
0004H      002
0005H      000
0006H      166

This output does not include anything from the loaded ‘macro.h’ file as there was no reference to any of the labels in that file.
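The two-column layout can be generated directly from the assembled bytes. The sketch below assumes a hypothetical format_listing helper that takes the starting address and an array of bytes. The column spacing simply mirrors the table above.

```ruby
# Hypothetical listing formatter: hexadecimal address with an H suffix
# in the first column and the byte value in octal in the second.
def format_listing(origin, bytes)
  rows = bytes.each_with_index.map do |byte, offset|
    format('%04XH      %03o', origin + offset, byte)
  end
  (['Address    Value'] + rows).join("\n")
end

program = [0o076, 0o144, 0o075, 0o302, 0o002, 0o000, 0o166]
puts format_listing(0x0000, program)
```

Keeping the formatting in one small function should make it easy to add other output styles later without touching the code that does the assembling.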

Part 2

In part 2 we will begin work on an MVP to handle meeting the requirements of the example file. Part 2 will be linked here once it has been posted.