A Brief z80 Assembly Tutorial

Chapter 1

In this brief tutorial I'm aiming to make a small game for the ZX Spectrum, written 100% in assembler. I've done a bunch of projects for the speccy using SDCC; while I couldn't completely escape assembly, the C compiler did a lot of heavy lifting for me. In other words, I traded control for convenience.

Now, since I've never actually done a 100% assembler project before (ignoring some tiny DOS TSR experiments), this is a learning experience for me as well. A lot of the things I'll do here are likely not going to be optimal, but what I present is what happened to work for me. All the source code is free to use in whatever way you wish, with no warranty whatsoever.

Target

Like I mentioned, the target is to make a game for the ZX Spectrum. More specifically, I'll be targeting a 48k subset of the 128k spectrum; I'll ignore the bank switching memory things of the 128k, but I want to use its audio chip. The result will also work on the ZX Spectrum Next, but won't use any of its advanced features.

For development, I'm using sjasmplus assembler, because it's convenient and can output .tap files all by itself - and if need be, scales up to other devices too, including Next's extended instruction set. As an emulator I'll be using the good old Fuse, which handles the pre-Next speccys so well that testing against actual hardware is rarely, if ever, necessary.

We won't be returning to BASIC, or using ROM calls at all, which frees the maximum amount of RAM (as well as the IY register); we will want to use the frame interrupt (if for no other reason, to have something to sync audio with), which means we have to set up an interrupt trampoline.

Which leads us to..

Budgeting

Looking at the speccy memory map, the bottom 16k is ROM, which we can't do anything about, followed by 6144 bytes of display bitmap plus 768 bytes of color, leaving 65536-(16384+6144+768)=42240 bytes of space for everything including code, interrupt trampoline, stack, assets, buffers and variables. I'm not sure if we can even use all of that as I don't know what trickery sjasmplus' .tap writer does (mackarel could give us basically all of it, but let's not go there if we don't need to).
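
The same arithmetic can be left to the assembler. This is a minimal sketch, assuming sjasmplus' expression syntax and its ASSERT directive; the constant names are made up for this example and aren't part of the framework:

ROM_SIZE    EQU 16384               ; 0x0000-0x3fff, not ours to use
BITMAP_SIZE EQU 6144                ; 0x4000-0x57ff, display bitmap
ATTR_SIZE   EQU 768                 ; 0x5800-0x5aff, color
FREE_RAM    EQU 65536 - (ROM_SIZE + BITMAP_SIZE + ATTR_SIZE)
        ASSERT FREE_RAM == 42240    ; fails the build if the sum is off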

To make things slightly worse, the RAM in the bottom 32k of the address space (0x4000-0x7fff) is "contended ram", meaning it's shared between the CPU and the graphics circuitry. In Amiga terms, that bottom part is "chip" ram, and the top is "fast" ram. So we'll want most, if not all, of our stuff to live in the top part.

To make things simple(r), let's use the bottom 9472 bytes for things we want out of the way, namely interrupt trampoline and stack, with room to spare if we're really tight on variable space. That leaves the top 32k as free estate for whatever we want.

Except, we can't. Well, we could, but that would expose a hardware bug, which can't be seen in Fuse. Remember when I said you hardly ever need to test on real hardware? Oh, those sweet, innocent times...

If the interrupt address table is in the contended memory, the real hardware will show snow on the screen, which does not happen on most emulators. Luckily only the 257 bytes of the table itself need to be in the upper 32k.

If the 32k sounds like a very small amount of memory for everything, consider this: There was a 16k version of the ZX Spectrum, which simply didn't have the chips for the top 32k of memory, and some very, very successful games were made for that setup - using those mere 9472 bytes of available RAM! (Just the plain text of this single web page, which didn't take me all that long to write, is way longer than 9472 bytes). So the 32k we'll be filling is actually pretty big. Don't worry, z80 code is pretty compact - it's assets (graphics, text, etc) that require a lot of memory.

Another dimension of budgeting is CPU use. What can be done in a frame?

There are 69888 clocks per frame on a standard 48k ZX Spectrum. Each opcode takes 4-23 clocks, plus if you happen to get contended, you may have to add up to 6 clocks to the instruction while waiting for the CPU to get access to whatever resource was contended. So at best we're talking 69888/4=17472 instructions per frame and at the worst we're talking 69888/29=2409 instructions per frame. Ouch. Remember that the screen was 6144 bytes, so just doing a linear copy from an offscreen buffer takes more than a whole frame of time. So double buffering is (largely) out of the picture.
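
To put a number on that copy, here's a sketch of the most straightforward way to do it. BACKBUF is a hypothetical offscreen buffer address picked just for this example; nothing like it exists in the framework:

SCREEN  EQU $4000                   ; Location of screen
BACKBUF EQU $c000                   ; hypothetical offscreen buffer

        ld  hl,     BACKBUF         ; source
        ld  de,     SCREEN          ; destination
        ld  bc,     6144            ; byte count
        ldir                        ; 21 clocks per byte copied, 16 for the last one

That's roughly 6144*21=129024 clocks, or about 1.85 frames, before any contention delays from writing to the screen memory are added on top.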

The 128k speccy actually has hardware support for double buffering, but, like many developers at the time these devices were relevant, we're not going there.

Graphics Limitations

Let's not forget the graphics limitations of the speccy. The screen is a 256x192 bitmap with one bit per pixel, meaning each pixel can only have one of two colors, which the speccy calls "ink" and "paper" for foreground and background. These two colors can be set for each 8x8 pixel area on the screen, out of 8 possible colors. Additionally the 8x8 area can be set to be "bright", which affects both the ink and the paper color; bright black remains black.

The final bit in the color definition is blink (the speccy itself calls it "flash"), which causes ink and paper to swap every 16 frames. The blinking may be annoying (as we'd probably prefer a separate bright bit for ink and paper), but on the other hand, it's the only bit of hardware accelerated rendering we'll be getting.

This means that either things move in 8 pixel steps, or you have the infamous color clash effect.
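
As a sketch of how one color byte is put together: bit 7 is blink, bit 6 is bright, bits 5-3 are the paper color and bits 2-0 are the ink color, with the colors numbered black, blue, red, magenta, green, cyan, yellow and white from 0 to 7. The constant names other than COLOR are made up for this example:

COLOR   EQU $5800                   ; Location of color array
BLINK   EQU %10000000               ; bit 7
BRIGHT  EQU %01000000               ; bit 6
BLUE    EQU 1
YELLOW  EQU 6

        ld  a,      BRIGHT | (BLUE << 3) | YELLOW   ; bright yellow ink on bright blue paper (0x4e)
        ld  (COLOR), a              ; top left 8x8 cell of the screen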

So what kind of game would we be making, given all these limitations? We'll see in chapter two, I suppose; but for now, we have work to do to make the initial framework that handles everything we've talked about so far.

Initial Framework

Download the tools and initial source code here: specasm_ch01.zip. All subsequent chapters will add to the files here, so in order to compile, you'll need this (except for the last chapter that has everything again).

What's in the zip?

Directory "fuse" contains the fuse speccy emulator, and "sjasmplus-1.18.2.win" contains the sjasmplus assembler. Both of these are for windows, but as both are open source, builds can be found for various other platforms too.

The "z80code.txt" contains a listing of the z80 opcodes. I strongly recommend printing this on paper to have it handy without having to look it up on screen. I keep leafing through it all the time.

"m.bat" is for compiling and "r.bat" is for running our program. I recommend having a command line / terminal window open and running from there, so you can see error messages.

"m.bat" contains:

sjasmplus-1.18.2.win\sjasmplus.exe test.s --lst=test.lst --sym=test.sym --raw=test.raw

which runs the assembler on our source file and additionally asks it to generate listing, symbol and raw outputs. We'll take a look at these a bit later. The "r.bat" simply runs fuse and gives the .tap file as parameter.

Finally, "test.s" is our initial framework. We'll go through that momentarily.

Running the "m.bat" should yield the following output:

SjASMPlus Z80 Cross-Assembler v1.18.2 (https://github.com/z00m128/sjasmplus)
Pass 1 complete (0 errors)
Pass 2 complete (0 errors)
test.s(47): warning: [SAVETAP] Tape file will not contains data from 0x5B00 to 0x5E00
Pass 3 complete
Errors: 0, warnings: 1, compiled: 48 lines, work time: 0.016 seconds

Note that the .tap file written won't contain memory from 0x5B00 to 0x5E00; that won't matter as our code will start at address 0x8000.

After running we have some new files: "test.lst" which is a detailed listing of the opcodes generated, exact addresses of everything, etc. This file is a critical debugging tool when figuring out what's going wrong when it eventually does.

The next two files are mostly for fun - "test.sym" lists the symbols and their addresses, which may be useful in some cases, but probably not in the scope of this tutorial. "test.raw" has just the raw bytes of our code, to see how big the program is - currently we're using a massive 58 bytes.

Finally, "test.tap" contains our code in an emulator-runnable form. This is clearly not an optimal .tap file as it's over 32k in size, but it does its job.

So what do we do with those 58 bytes?

Initial Code

To get some context for further discussion, let's first go through the code, chunk by chunk, top down:

        DEVICE ZXSPECTRUM48         ; Device setting for sjasmplus (.tap writing etc)
SCREEN  EQU $4000                   ; Location of screen
COLOR   EQU $5800                   ; Location of color array

This first chunk doesn't generate any code. We tell the assembler that we're targeting this specific device, which affects some pseudoinstructions like saving of the tap file. Additionally, we define a couple of constants which will become useful eventually.

        ORG $8000                   ; Let's start our code at 32k
        di                          ; Disable interrupts
        ld  sp,     0x8000          ; Set stack to grow down from our code
        ld  de,     0xfe00          ; im2 vector table lives at 0xfe00, in uncontended memory
        ld  hl,     0xfdfd          ; where interrupt will point at
        ld  a,      d
        ld  i,      a               ; vector is read from 0xfe?? where ?? is random
        ld  a,      l               ; we need 257 copies of the address byte (0xfd)
rep_isr_setup:
        ld  (de),   a 
        inc e
        jr  nz,     rep_isr_setup
        inc d                       ; just one more
        ld  (de),   a
        ld  de,     isr
        ld  (hl),   0xc3            ; 0xc3 = JP
        inc hl
        ld  (hl),   e
        inc hl
        ld  (hl),   d
        im  2                       ; set the interrupt mode
        ei                          ; Enable interrupt

Here's where most of our bytes go - we build the interrupt service routine jump table as well as code that jumps into our actual interrupt service routine. The first line of the chunk specifies where the following bits of code should go in the memory map; you can sprinkle these wherever you need, but usually you'll only use them to specify the entry point and possibly to align some tables for optimization purposes (for 8 bit indexing).

We disable interrupts right off the bat to avoid a situation where the system interrupt blows something up. After setting up our stack, we build the 257 byte table of identical bytes that will say where the interrupt should jump to; this is an unfortunate artefact of how the z80 and the speccy itself were designed. The bottom 8 bits of the interrupt table index come from the data bus, so they may be anything; a little bit of additional logic might have made it possible to always have that be zero, but either the hardware designers didn't think of it or didn't care.

This is the only place where we'll be accessing the special register "I". That doesn't really matter as the only things you can do with that 8 bit register are to write register "A" to it or to read its contents to register "A".. I suppose if you are really hurting for registers in your interrupt service routine you might use that as a temporary storage, as it's marginally faster (and smaller) than writing to a memory location.. but I digress.

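To rephrase: in the interrupt mode we are interested in, the CPU combines the register "I" (high byte) and a (random) byte from the data bus (low byte) into an address, from which it reads a 16 bit, or 2 byte, address to jump to. If the random byte happens to be 0xff, the second byte is read from the start of the next 256 byte page, which is why the table needs 257 bytes and not 256.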

After building the table we write a few bytes to the place the table points at to generate a jump instruction to our actual interrupt service routine, which we're free to place wherever we want. These bytes could be down in the contended memory, and we might relocate them later on if we find ourselves hurting for space, as having them up in the top 32k makes the memory a bit of a swiss cheese. We still have over 30k of space to fill before we get there, though.

To finish off our interrupt setup, we set the interrupt mode to the one that actually uses our trampoline and enable interrupts.
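
To make the layout concrete, here's roughly what the top of memory looks like once the setup code has run, written as if the assembler had emitted it directly. This is only an illustration of the end result, not what the framework does (it builds everything at run time), and it assumes sjasmplus' BLOCK directive for filling memory. The "isr" here is a stand-in stub:

        ORG 0x8000
isr:                                ; stand-in for the real routine
        ei
        reti

        ORG 0xfdfd
        jp  isr                     ; 0xfdfd-0xfdff: the trampoline, 0xc3 plus the address
        ORG 0xfe00
        BLOCK 257,  0xfd            ; 0xfe00-0xff00: any two adjacent bytes read as 0xfdfd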

Next up, main loop:

mainloop:
        ld  hl, SCREEN
        ld  a, (framecounter)
        ld  (hl), a
        
        jp mainloop

Our main loop doesn't do much yet. We take the bottom 8 bits of our frame counter and draw it on the screen, indefinitely.
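
As written, the loop rewrites the same byte thousands of times per frame. One possible variation, which is not what the framework does, is to sleep with HALT until the next interrupt, so that the loop body runs once per frame. The sketch below assumes the interrupt setup from above has already run; with interrupts disabled, HALT would simply wait forever. The counter at the end is a stand-in for the framework's own:

SCREEN  EQU $4000                   ; Location of screen

mainloop:
        halt                        ; wait here until the next interrupt has been handled
        ld  a,      (framecounter)
        ld  (SCREEN), a             ; draw the low 8 bits of the counter
        jr  mainloop

framecounter:
        db  0, 0                    ; stand-in for the framework's counter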

isr:                    ; This will be called ~50Hz
        push bc
        ld   bc, (framecounter)
        inc  bc
        ld   (framecounter), bc
        pop bc
        ei
        reti            ; Return from interrupt

The ISR (interrupt service routine) we spent a lot of time setting up earlier simply increments our 16-bit frame counter and, after enabling interrupts again, returns. If we didn't enable interrupts here, we'd only run the interrupt routine once.

Little bit of esoteric trivia: the EI instruction is special in that interrupts only get enabled after the instruction following it, meaning that interrupts, in this case, only get enabled after we return from the interrupt. If this wasn't the case, a very long interrupt routine might cause the interrupt to trigger again and again before returning, leading to stack overflow. The counterpart DI takes effect immediately, so an EI directly followed by a DI never lets an interrupt through.

RETI is "return from interrupt" and as far as z80 itself is concerned, works exactly the same as a plain RET apart from taking one more byte and 4 more clocks to execute. Some external peripherals may be looking for RETI being executed, though. There's no such thing on the spectrum, but using RETI is still considered good form.

framecounter:
        db 0,0

        SAVETAP "test.tap", $8000   ; Save the assembled program as a tap file

Finally, we have a little bit of data - our frame counter bytes - as well as the assembler pseudoinstruction to write out the tap file.

If you run the r.bat, you should see fuse start up and load the tap file, and then the first 8 pixels of the screen cycle through all 256 bit combinations.

Assembly Basics

Let's look at the various instructions we've seen so far, just to see what's what.

    ld X, Y

The "ld" is short for "load". In the x86 world the same instruction is "mov" for "move". This is practically "set the register X to the value Y", where the "value" may itself be a constant, label, register, or if we want to do pointers..

    ld X, (Y)

..loads the value in memory address pointed by Y to the register X.
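
A few concrete forms, as a sketch; "counter" is a hypothetical variable made up for this example. The commented-out line is there to show that not every combination you might expect actually exists:

SCREEN  EQU $4000                   ; Location of screen

        ld  a,      42              ; constant into A
        ld  b,      a               ; register to register
        ld  hl,     SCREEN          ; 16 bit constant into the HL pair
        ld  a,      (hl)            ; read the byte HL points at
        ld  (hl),   b               ; write B to the address in HL
        ld  a,      (counter)       ; read from a fixed address
        ld  (counter), a            ; and write it back
        ; ld b,     (counter)       ; does not exist: only A does 8 bit loads from a fixed address
        ret

counter:
        db  0                       ; hypothetical variable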

Speaking of registers, the z80 has the following registers:

    A F     A' F'
    B C     B' C'
    D E     D' E'
    H L     H' L'
    IX
    IY
    I R
    SP

"A" being the most general purpose accumulator. It's 8 bit, and most of the interesting instructions use it. You'll find yourself shuffling data around just to use this register. "F" is the flags register containing things like carry and zero bits; you'll never access it directly. The next registers - B,C,D,E,H,L can be used either as separate 8 bit registers or paired into 16 bit ones; B and C become BC, etc. For stack purposes, A and F combine into AF, so if you want to push A to the stack, you push it along with F.

Different instructions prefer to use specific registers and register pairs, which are largely not interchangeable. IX and IY are special index registers which are slower than the others, but useful in their specific uses.

In addition we have the special register "I" for the interrupt jump table, and "R", the memory refresh counter which ticks up with every instruction fetch and can be used for (very poor) random number generation or as a tool for copy protection if you're into that sort of thing.. and then there are shadow versions of AF and BCDEHL, usually marked with an apostrophe (so the shadow of "F" is "F'"). There are a couple of instructions for switching which version of the register set is in use. This is sometimes useful, but less than you'd think at first, as moving data between the register sets is complicated.

Finally there's stack pointer, which is also a 16 bit register. You rarely need to interact with it directly apart from setting up where you want your stack to be, except if you want to abuse the stack for data copying.

The z80 instruction set is not symmetric, meaning that there's a lot of "holes"; you may feel that it's logical to do some operation and then find that it doesn't exist, and you have to write some workaround (such as passing a value through the stack) to get values where you want.
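
Two small examples of going through the stack, as a sketch. Neither transfer exists as a single CPU instruction:

        push ix
        pop  hl                     ; HL = IX; the CPU has no "ld hl, ix"

        push af
        pop  bc                     ; B = A, C = flags; one of the few ways to get at F as data

Each push/pop pair costs more clocks than a plain register load, so this is a workaround rather than something to reach for by default.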

If you browse through the instruction set, you'll find that a bunch of things you'd expect are missing, notably division, multiplication and a barrel shifter (i.e, shift x by y). You can't rotate bits by a variable amount via a single instruction, which you may have gotten used to on x86. The workaround for the missing bits is to do things the hard way, loop a lot, use precalculated values, lookup tables, or just figure out some other way to get where you wanted.
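
As an example of "the hard way", here's a sketch of the classic shift-and-add multiplication of two 8 bit values. The label names and the choice of registers are just for this example; it's not part of the framework:

; HL = H * E. Clobbers B and D.
mul8:
        ld  d,      0               ; DE = multiplicand as a 16 bit value
        ld  l,      d               ; HL = multiplier in the top byte, result grows from the bottom
        ld  b,      8               ; one round per multiplier bit
mul8_loop:
        add hl,     hl              ; shift left; the next multiplier bit falls into carry
        jr  nc,     mul8_skip
        add hl,     de              ; bit was set: add the multiplicand
mul8_skip:
        djnz mul8_loop              ; decrement B, loop until it hits zero
        ret

Calling this with H=3 and E=5 leaves 15 in HL. A lookup table would be faster, at the cost of memory.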

Let's look closer at the interrupt table building loop; here are the most critical bits of it again:

        ld  de,     0xfe00          ; im2 vector table lives at 0xfe00, in uncontended memory
rep_isr_setup:
        ld  (de),   a 
        inc e
        jr  nz,     rep_isr_setup    

First we set the value of the "DE" register pair. Note that the bottom byte (and hence, the "E" register) is zero. We write the value in register "A" into the address pointed by "DE", increment "E" (not "DE"!), and jump back to the rep_isr_setup label in case "NZ", or "not zero". The opcode "JR" (jump relative) takes less space to encode than the opcode "JP", but has shorter range, so while either could be used in this case, "JR" is smaller. Once the register "E" rolls over and becomes zero again, the test fails and the jump isn't taken.

Note that while there is an "INC DE" instruction, it wouldn't work here, since that instruction doesn't update flags.. and we wanted a loop of 256 rounds, not 64k, in any case. The alternative would be to spend a bunch of instructions to see if we've hit a specific value, but that's much more wasteful. Overall, you'll want to get away with 8 bit operations on this 8 bit CPU.

Since we actually wanted 257 copies, not 256, the code continues:

        inc d                       ; just one more
        ld  (de),   a

Note that we increment "D", not "E" here; "E" is already zero, so we're back to the initial value of "DE"; incrementing "D" will move forward 256 bytes, which is what we wanted.
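
For comparison, the same 257 byte fill can also be done with the block copy instruction. This is an alternative sketch, not what the framework uses, and it relies on the common trick of pointing the destination one byte past the source so that each copied byte becomes the source of the next copy:

        ld  hl,     0xfe00          ; first byte of the table
        ld  de,     0xfe01          ; second byte of the table
        ld  bc,     256             ; 256 more copies after the first
        ld  (hl),   0xfd            ; seed the first byte
        ldir                        ; copy each byte onto the next, up to 0xff00

By my count this is a few bytes bigger than the loop and runs at about the same speed, and it also clobbers HL, which the setup code wants pointing at 0xfdfd afterwards. So the hand written loop is a reasonable choice here.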

Since the ISR can trigger at any moment, we need to take care to preserve the running state of the system.

        push bc
        ld   bc, (framecounter)
        inc  bc
        ld   (framecounter), bc
        pop bc    

Since we're overwriting the "BC" register, we push it to stack, use it, and pop it back before returning. Remembering that "INC BC" doesn't mess with the flags, we don't need to preserve the flags register. If we wanted to make sure we're saving absolutely everything, we'd do this:

        push af
        push bc
        push de
        push hl
        push ix
        push iy
        ex af, af'
        push af
        exx
        push bc
        push de
        push hl

        ; your code here	

        pop hl
        pop de
        pop bc
        pop af
        exx
        ex af,af'
        pop iy
        pop ix
        pop hl
        pop de
        pop bc
        pop af    

..which is what one needs to do when writing C code, as you never know what the compiler might do! That is one example of trading control for convenience.
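
Between those two extremes, you save exactly what the routine touches. As a sketch, here's a hypothetical variant of our ISR that increments the counter one byte at a time; since "INC (HL)" does change the flags, AF has to be preserved along with HL:

isr:
        push af                     ; "inc (hl)" below changes the flags
        push hl
        ld   hl,    framecounter
        inc  (hl)                   ; bump the low byte
        jr   nz,    isr_done        ; no wrap, so the high byte stays as it is
        inc  hl
        inc  (hl)                   ; low byte wrapped to zero: carry into the high byte
isr_done:
        pop  hl
        pop  af
        ei
        reti

framecounter:
        db  0, 0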

That's It For Now

And that's our initial framework and code. In the next bit we'll probably get to making a game.

Any comments etc. can be emailed to me.