Writing an Assembler Part 2 - Learn, Tinker, Build

This is a continuation of the series on writing an 8080 assembler in Ruby. Part 1 explained the basics of how assembly code is converted into machine code to be entered into the Altair 8800. This part goes through deciding on an MVP and creating a working script.

MVP

The minimum viable product for version one will not meet every requirement listed for the final assembler. The focus is on reading an assembly code file and outputting the machine code for toggle entry. To that end, version one makes two simplifications: it assumes every program starts at address 0000H, and it does not support included header files.

From the input file side there will be some simplifications made. These simplifications are chosen so that they will not cause issues in later versions. In all cases the input for a previous version should work on the subsequent versions. For this version the input file will start immediately with code and not have declarations on processor, includes, segment, or start location.

Example File

The following produces the same result as the file in the first part, but simplified to work with our MVP assembler. This file and its subsequent output will be our check for basic working functionality.

; Simple program to run a loop 100 times
 Start:           ; Label to start the program
    MOVI A, 100   ; Store 100 in the Accumulator
 Loop:
    DCR A         ; Decrement the Accumulator
    JNZ Loop      ; Jump to Loop if Accumulator not zero
  
    HLT

The expected output of the MVP should be as follows.

Address    Value
0000H      076
0001H      144
0002H      075
0003H      302
0004H      002
0005H      000
0006H      166

Structuring the Code

Before we begin writing any code it can be helpful to work out a basic structure for how we want the code laid out. Given the simplifications made for the MVP it is possible to just start writing right away based on the defined spec, but that may lead to more work when we add additional features.

Opcode Lookup

One of the most important parts of the code is a way for it to match up the opcode mnemonic with its octal code representation. To handle that there will need to be a lookup block. This block will be more than just a straight lookup since some of the opcodes have differing resultant octal codes based on passed parameters. Overcoming that can be accomplished in a few different ways.

One would be to set up each opcode as a function call and pass its parameters to the function. The other option is to have the opcode lookup provide parameters to pass to a smaller set of generic functions based on need. Since not every opcode varies based on its parameters, this reduces the number of functions that must be written. It also allows for quickly adding new opcodes, with new functions needed only when there are new variations in how the output is produced.

File Read/Write

The decision to be made here is on whether or not to perform the read/write actions as needed or all at once. Performing as needed has the advantage of not needing to persist the data for a long time in volatile storage. Handling all at once means that files are not blocked except for a very short period of time. There is also the advantage of only a very small chance of receiving a partially completed result. From that perspective I feel that handling the read/write operations all at once is best for now. The point where that may change would be if the files being read reach a size where maintaining both input and output in memory becomes burdensome.

Label Handling

The final major decision to be made for the MVP is how to store and handle labels. For this the best option to me is to place the labels and their addresses in a hash to allow for easy lookup and replacement. For labels that appear before their use this is fairly straightforward. For labels that are used before they are defined we will need to implement a multiple pass system. Under this system we will keep the label in place if we don’t know the address. After the entire code has been processed through for one pass then the location of all the labels should be known. At that point a second pass will be performed that will update all the labels that were unknown the first time around.
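To make the multiple pass idea concrete, here is a minimal sketch of one common variant: the first pass only records label addresses, and the second pass substitutes them. It is not the code built later in this article. SIZES is a hypothetical table of instruction lengths in bytes, resolve_labels is a hypothetical name, and JMP is included only to show a forward reference.

```ruby
# Hypothetical instruction lengths in bytes, just enough for the sample.
SIZES = { 'MOVI' => 2, 'DCR' => 1, 'JNZ' => 3, 'JMP' => 3, 'HLT' => 1 }.freeze

def resolve_labels(lines)
  labels = {}
  address = 0

  # Pass 1: record each label's address and step past each instruction.
  lines.each do |line|
    if line.end_with?(':')
      labels[line.chomp(':')] = address.to_s(16).rjust(4, '0')
    else
      address += SIZES.fetch(line.split.first)
    end
  end

  # Pass 2: drop the label lines and swap label operands for addresses.
  lines.reject { |line| line.end_with?(':') }.map do |line|
    mnemonic, *operands = line.delete(',').split
    operands = operands.map { |op| labels.fetch(op, op) }
    [mnemonic, *operands].join(' ')
  end
end

program = ['JMP End', 'Loop:', 'DCR A', 'JNZ Loop', 'End:', 'HLT']
puts resolve_labels(program)
# JMP 0007
# DCR A
# JNZ 0003
# HLT
```

Here End is used before it is defined, yet it resolves correctly because every label address is known by the time the second pass runs.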

Writing the Assembler

Initializing

The first section to be built is the general initialization. This stores the input and output file locations in variables and sets up the initial state of the assembly code, machine code, and label variables. It is also where the opcode lookup is loaded, so that the lookup can be housed in a separate file.

require_relative 'opcode.rb'

input_file = ARGV[0]
output_file = ARGV[1]

assembly = Array.new
machine = Array.new
labels = Hash.new

The reason for handling the opcode lookup in a separate file is to make long term maintenance easy, as well as facilitate reuse if we decide to completely rewrite the main code. Input and output files are pulled directly from the command line parameters. This makes it easy to work with for an MVP, but will need to be reworked in later iterations. For the state variables, arrays track the code as it is translated, and a hash holds the labels so that lookups can be performed.

File Read and Store

Once the initialization is complete, the first real step is to read in the assembly file. During the read I also clean the code by removing comments, empty lines, and extra spaces. The result stored in the assembly array is just the labels and code we need to process.

File.readlines(input_file).each do |line|
  trimmed = line.split(';')[0].to_s.strip  # to_s guards against a nil first field
  assembly.push(trimmed) unless trimmed.empty?
end

This block iterates over the contents of the assembly file line by line. Each line is split on the comment delimiter to drop the comment, then stripped of surrounding whitespace. Finally, the line is pushed to the assembly array only if any text remains after processing.
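To see the cleaning step in isolation, the same logic can be run against a few in-memory lines instead of a file. This is only a sketch: clean_lines is a hypothetical helper name, and the to_s call is a small guard so that a final line holding nothing but a semicolon does not raise an error.

```ruby
# Hypothetical helper that applies the cleaning rules to an array of lines.
def clean_lines(lines)
  lines.map { |line| line.split(';')[0].to_s.strip }
       .reject(&:empty?)
end

sample = [
  "; Simple program to run a loop 100 times\n",
  " Start:           ; Label to start the program\n",
  "    MOVI A, 100   ; Store 100 in the Accumulator\n",
  "\n",
  "    HLT\n"
]

p clean_lines(sample)  # ["Start:", "MOVI A, 100", "HLT"]
```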

Line Parsing

Now that the file has been read into a variable we need to process and parse it. The most important part of this section is in setting up the addressing and separating labels from opcodes. Once separated the opcodes are broken into the mnemonic and the parameters.

address = 0

assembly.each do |step|
  if step.include?(':')
    label = step.split(':')[0]
    labels[label] = address.to_s(16).rjust(4, '0')

Labels are separated from opcodes based on the presence of a colon. Each label is then stored in the labels hash along with its address, as a four-digit hex string, so it can be substituted later.
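The address formatting is worth a quick look on its own. In this small sketch, format_address is a hypothetical name wrapping the same to_s and rjust calls used above.

```ruby
# Hypothetical wrapper around the address formatting used for labels.
def format_address(address)
  address.to_s(16).rjust(4, '0')
end

puts format_address(0)    # 0000
puts format_address(2)    # 0002
puts format_address(255)  # 00ff
```

Note that to_s(16) produces lowercase hex digits, so any address above 0009H will appear in lowercase in the label hash and in the output.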

arr = step.gsub(',', '').split(' ')
opcodes = nil

case arr.size
  when 1
    opcodes = translate(arr[0])
  when 2, 3
    opcodes = translate(arr[0], arr[1..-1], labels)
  else
    raise "Invalid line #{step}"
end

To handle an opcode we first split the line into the mnemonic and its parameters by removing any commas and splitting on spaces. The size of the resulting array then determines which arguments are passed to the translation function. This function returns the binary values representing the opcode and its parameters.
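As an illustration of the split, here is what the three line shapes from the example file break down into. The tokenize helper is a hypothetical name and not part of the assembler itself.

```ruby
# Hypothetical helper: split a cleaned line into a mnemonic and its parameters.
def tokenize(step)
  mnemonic, *params = step.gsub(',', '').split(' ')
  [mnemonic, params]
end

p tokenize('HLT')          # ["HLT", []]
p tokenize('DCR A')        # ["DCR", ["A"]]
p tokenize('MOVI A, 100')  # ["MOVI", ["A", "100"]]
```

The parameters stay as strings at this point, so it is up to the translation step to interpret them as registers, data, or addresses.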

opcodes.each do |opcode|
  addr_s = address.to_s(16).rjust(4, '0')
  data_s = opcode.to_i(2).to_s(8).rjust(3, '0')
  line = [addr_s, data_s]

  address += 1

  machine.push(line)
end

The final part in processing the opcodes is to combine them with the address they will go in. This is done along with converting the binary data to octal for easy entry. Finally both address and data are stored to the machine array we initialized at the beginning.
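For a concrete case, take the JNZ Loop instruction from the example file. The sketch below assumes translate has already returned its three binary strings and that the instruction starts at address 3; rows_for is a hypothetical helper that mirrors the loop above.

```ruby
# Hypothetical helper mirroring the address and octal conversion loop.
def rows_for(start_address, bytes)
  bytes.each_with_index.map do |byte, offset|
    addr_s = (start_address + offset).to_s(16).rjust(4, '0')
    data_s = byte.to_i(2).to_s(8).rjust(3, '0')
    [addr_s, data_s]
  end
end

# JNZ opcode, then the low and high bytes of the Loop address (0002H).
jnz_loop = ['11000010', '00000010', '00000000']

rows_for(3, jnz_loop).each { |addr, data| puts "#{addr}H  #{data}" }
# 0003H  302
# 0004H  002
# 0005H  000
```

These three rows match addresses 0003H through 0005H in the expected output.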

Opcode Translation

Opcode Data

In order to translate the opcodes to their respective machine code we need to store some data about them. This data includes the mnemonic, machine code, number of parameters, and whether a memory address or data byte is associated. This is all stored in a lookup table for the translate function to use.

OP_CODES = {
    'MOV' => {
        params: 2,
        memory: false,
        data: false,
        bin: '01DDDSSS'
    },
    'MOVI' => {
        params: 2,
        memory: false,
        data: true,
        bin: '00DDD110'
    },
    'LDA' => {
        params: 1,
        memory: true,
        data: false,
        bin: '00111010'
    },
    'STA' => {
        params: 1,
        memory: true,
        data: false,
        bin: '00110010'
    },

Register Data

The next piece of lookup data we need is for the registers. This is a simple lookup for the various register bits that are set in the machine code. These will be used to replace instances of SSS or DDD in the machine code for the opcodes.

REGISTERS = {
    'A' => '111',
    'B' => '000',
    'C' => '001',
    'D' => '010',
    'E' => '011',
    'H' => '100',
    'L' => '101',
    'M' => '110'
}
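To show how the placeholders get filled, here is a small sketch using the MOV pattern from the opcode table. REGISTER_BITS is a hypothetical two-entry subset of the register lookup, and fill_registers is a hypothetical helper name.

```ruby
# Hypothetical subset of the register lookup, enough for one example.
REGISTER_BITS = { 'A' => '111', 'B' => '000' }.freeze

# Replace the DDD and SSS placeholders with register bit codes.
def fill_registers(pattern, destination, source = destination)
  pattern.sub('DDD', REGISTER_BITS.fetch(destination))
         .sub('SSS', REGISTER_BITS.fetch(source))
end

mov_b_a = fill_registers('01DDDSSS', 'B', 'A')

puts mov_b_a                                # 01000111
puts mov_b_a.to_i(2).to_s(8).rjust(3, '0')  # 107
```

So MOV B, A becomes the single octal byte 107.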

Translate Function

The translate function for the opcodes breaks the problem down based on the number of parameters the opcode has. The function returns the resulting machine code as an array for further processing. For opcodes that have no parameters there is just a simple lookup and return of the machine code.

def translate(opcode, params = nil, labels = nil)
    lookup = OP_CODES[opcode]

    case lookup[:params]
        when 0
            [lookup[:bin]]

With a single parameter, additional information is needed depending on whether the parameter is a memory address, data, or a register. Data is converted to binary, giving a machine code array two entries long. A memory address is first looked up if it is a label, then split into a high and a low byte, giving an array three entries long with the low byte first. For a register, the register placeholder is replaced in the machine code for the opcode and the resulting array is a single entry.

when 1
    if lookup[:data]
        data_byte = "%08b" % params[0]
        [lookup[:bin], data_byte]
    elsif lookup[:memory]
        address = labels[params[0]]
        address ||= params[0]

        bin = "%016b" % address.to_i(16)
        low = bin[8..-1]
        high = bin[0..7]

        [lookup[:bin], low, high]
    else
        register = REGISTERS[params[0]]

        bin = lookup[:bin].gsub('DDD', register).gsub('SSS', register)
        [bin]
    end
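The address split is the easiest part of this branch to get backwards, so here it is on its own. This is a sketch with a hypothetical helper name, split_address, that returns the bytes in the order they are emitted: low first, then high.

```ruby
# Hypothetical helper: split a hex address string into [low, high] bytes.
def split_address(hex)
  bin = format('%016b', hex.to_i(16))
  [bin[8..-1], bin[0..7]]
end

low, high = split_address('0002')
puts low   # 00000010
puts high  # 00000000

low, high = split_address('12ab')
puts low   # 10101011
puts high  # 00010010
```

The 8080 expects the low byte of an address to follow the opcode, which is why the array is built as opcode, low, high.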

The final opcode type to deal with is two parameters. In this case the options are a register and data or two registers. For a register and data the register will be inserted into the machine code for the opcode and the data as a second byte. With two registers the resulting machine code is just a single entry with the two registers entered in their respective locations.

when 2
    if lookup[:data]
        register = REGISTERS[params[0]]
        bin = lookup[:bin].gsub('DDD', register)

        data_byte = "%08b" % params[1]

        [bin, data_byte]
    else
        destination = REGISTERS[params[0]]
        source = REGISTERS[params[1]]

        bin = lookup[:bin].gsub('DDD', destination)
        bin = bin.gsub('SSS', source)

        [bin]
    end

Finally I close out the method with a catch in case there is bad data. This exception will point to an issue with the data in the opcode lookup hash.

else
    raise "Bad opcode #{opcode}. Check the lookup."

File Output

Now that all of the opcodes are translated to machine code and paired with addresses, we need to write the output. We open the output file for writing, write the headers for the two columns, then loop through the machine code and write out each row.

File.open(output_file, 'w') do |file|
  file.puts("Address\tValue")

  machine.each do |line|
    file.puts("#{line[0]}H\t#{line[1]}")
  end
end

Conclusion

That concludes the code needed to build up the MVP assembler. In order to run the code we use the following command. You will need to specify the input and output files you want. If the output file already exists it will be overwritten, so be careful. For the assembly code at the beginning of this article the output will match what is expected.

ruby assembler.rb input.asm output.o

From here the code can be expanded to better encompass the entirety of the features listed in part one.
