A Brief Z80 Assembly Tutorial

Chapter 3

Logically the next step would be to handle player input so we could see the player sprite walking about, so let's ignore that and look at performance instead.

A handy way to see how much of the frame time a routine takes is to change the border color. Since we're racing the beam all the time whether we want to or not, changing the border color on the fly will show exactly where in the frame we are (unless we're in the vertical or horizontal blank, but let's not split hairs here).

Border Color

To see how much of the frame our map drawing takes, we need to make a couple of small changes to the main loop, like so:

mainloop:
        ld a, 2
        out (0xfe), a    

We'll start the main loop by writing the value 2 to output port 0xfe to make the border red. If you're unfamiliar with the ZX Spectrum, this port will become very familiar soon enough, as everything is tied to it, from border color to keyboard to audio output and even tape loading. We won't dive too deeply into it yet; it's sufficient to say that the bottom 3 bits control the border color, so we're safe to just write small values into it.

        jr nz, maploop
       
        ld a, 0
        out (0xfe), a        
        halt
        jp mainloop

To end the main loop, we set the border color to black and instruct the CPU to halt operations until an interrupt happens. As you may recall, the interrupt (which we have enabled) occurs at the start of the frame. If the hardware designers had had a tiny bit more foresight, the interrupt would happen at the end of the drawn pixels instead, as that would let programs do the maximum amount of work during the non-drawing section of the screen, but you can't have everything. (Well, you couldn't, until the Spectrum Next, but that's a different beast in various ways.)
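The same trick extends to timing more than one routine: give each section of the loop its own border color and the border shows a stripe per section. The following is only a sketch; drawmap and gamelogic are hypothetical routine names, not part of this tutorial's source. The color values are the standard ones (0 black, 1 blue, 2 red, 3 magenta, 4 green, 5 cyan, 6 yellow, 7 white).

mainloop:
        ld a, 2             ; red stripe: time spent drawing the map
        out (0xfe), a
        call drawmap        ; hypothetical routine

        ld a, 4             ; green stripe: time spent in game logic
        out (0xfe), a
        call gamelogic      ; hypothetical routine

        ld a, 0             ; black: idle until the next frame
        out (0xfe), a
        halt
        jp mainloop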

If you want the device to freeze so that it requires a power cycle - and you might, for debugging purposes - all you have to do is execute DI followed by HALT, which will cause the CPU to wait for an interrupt that will never happen.
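As a small sketch of that debugging trick, a "we got here" trap might look like this. The border color is an arbitrary choice, just to make the freeze point visible.

        ld a, 6             ; yellow border as a visible marker (arbitrary color)
        out (0xfe), a
        di                  ; disable maskable interrupts
        halt                ; wait for an interrupt that can no longer arrive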

Anyway, after these small changes we can compile and run to see just how much of the frame we're using, and..

Oops. We're using more than a frame. Just how much more is hard to say from the blinking, but I'd guess we're using about 150% of the time we have.

Sorry to the epileptics out there. Just... scroll it... out of the way, okay?

This isn't horribly useful for us, so let's reduce the workload a bit by halving the map we're drawing:

        ld hl, map+63
        ld bc, 0x0804
maploop:    

The easiest way to do that is to tell it to render four tile lines instead of eight. This leads to what we wanted:

A couple of things of note here. First, you can see that we've rendered the bottom half of our map, but it's in the place of the top half. That's because we're drawing the map from end to beginning, but we're using the loop indexes as coordinates. So the result is what's expected.

The second thing is that the border isn't 75% full, as we might have expected after halving a workload that took about 150% of the frame time. This is because some of the time is spent during the vertical retrace (i.e., outside the visible screen), and because while we dropped the number of tiles, we're still spending time in the loops.

Aren't you glad you live in a time with GHz processors?

Okay, so what can we do to make things run faster? The most obvious thing is to only draw what's changed, and we'll do that eventually - that will drop the time spent on frames where nothing changes to nearly zero - but let's first look at some smaller optimizations and what their effect is.

Optimizing a Bit

Looking at our tile drawing code, we have some bits that are repeated many times, so they're the most obvious place to start optimizing.

            ld a, (de)    ; Read pixels from data
            ld (hl), a    ; Write to screen
            inc de        ; Increment de and hl..
            inc hl
            ld a, (de)    ; And repeat
            ld (hl), a
            inc de
            add hl, bc    ; Add in bc to move to the next line in screen (and one byte back)    

We're reading from (de), writing to (hl) and then incrementing both. There's a single instruction that does something like this - LDI. Unfortunately, LDI reads from (hl) and writes to (de), and also decrements BC. If we swap the registers around, we'll find that we can't add to de - only hl - so we'd need to swap the registers around again. We'd also need to store hl somewhere while we borrow it... and all of this takes time.

(Addendum: as Ped7g on specnext discord points out, the inner loop here just needs to be LDI; LDI; DEC E; DEC E; INC D, which would be a lot faster - but we won't be using that here).
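For the curious, here is a rough sketch of what that LDI-based variant could look like for the first eight pixel lines. It is not the code used in this tutorial, and it rests on a few assumptions: the registers are swapped so that hl points at the tile data and de at the screen, bc holds nothing we care about (each LDI decrements it), and e doesn't wrap past 0xff during the two LDIs (otherwise the carry into d isn't undone by the two DEC E instructions).

        ; hl = tile data (source), de = screen address (destination)
        DUP 7
            ldi             ; copy (hl) to (de), then inc hl, inc de, dec bc
            ldi
            dec e           ; step back the two bytes we just wrote...
            dec e
            inc d           ; ...and down one pixel line (+256 bytes)
        EDUP
            ldi             ; last line: no need to move any further
            ldi

Using nominal Z80 timings and ignoring memory contention, one line costs 16+16+4+4+4 = 44 T-states this way, compared to 7+7+6+6+7+7+6+11 = 57 T-states for the loop body shown earlier.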

One option would be to abuse the stack and draw the tiles bottom up, which would actually free up a register. That also comes with its own pile of hazards, though; if an interrupt hits while our stack pointer isn't where the interrupt handler expects it to be, things will blow up, and disabling interrupts will cause us to miss the interrupt if it happens in the meantime.

But for now, as a demonstration, let's do the one small optimization we can, and eliminate the extra stuff we do in the last iteration:

        DUP 7
            ld a, (de)    ; Read pixels from data
            ld (hl), a    ; Write to screen
            inc de        ; Increment de and hl..
            inc hl
            ld a, (de)    ; And repeat
            ld (hl), a
            inc de
            add hl, bc    ; Add in bc to move to the next line in screen (and one byte back)
        EDUP
            ld a, (de)    ; Read pixels from data
            ld (hl), a    ; Write to screen
            inc de        ; Increment de and hl..
            inc hl
            ld a, (de)    ; And repeat
            ld (hl), a
            inc de

        ld bc, 65536 + 255 - 256 * 8 + 32 ; Move to the next block of 8 pixels
        add hl, bc

Here we've reduced the DUP from 8 to 7, and copied the code to do the final iteration, only without the last add. We've also added "+ 255" to the "ld bc" line, so that the follow-up add does the skipped add's work as well. We can do a similar thing to the second DUP section:

        ld bc, 255        ; And repeat the above process
        DUP 7
            ld a, (de)
            ld (hl), a
            inc de
            inc hl
            ld a, (de)
            ld (hl), a
            inc de
            add hl, bc            
        EDUP
            ld a, (de)
            ld (hl), a
            inc de
            inc hl
            ld a, (de)
            ld (hl), a    

Here the copied section loses the two instructions at the end that we don't need. These changes may feel insignificant (and in the grand scheme of things, they are), but the result can still be seen:

I've overlaid the screen captures of two frames here; the darker red shows the old time taken. Chipping away like this may mean the difference between running at 50Hz and running at 25Hz!
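As a sanity check on the constant used after the first half of the tile, here is the arithmetic written out as comments. The 0xf91f value is just the same expression evaluated by hand.

        ; Offsets applied to hl while copying the first eight pixel lines:
        ;   7 full iterations: 7 * (1 + 255) = 1792   (inc hl, then add hl, bc)
        ;   final iteration:   inc hl only   =    1
        ;                              total = 1793
        ; We want to land 32 bytes after where we started (the next character
        ; row), so we need to add 32 - 1793 = -1761, which as a 16-bit value
        ; is 65536 - 1761 = 63775 = 0xf91f.
        ld bc, 65536 + 255 - 256 * 8 + 32   ; evaluates to 63775
        add hl, bc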

Let's do the higher level optimization of not drawing tiles unless we need to. Let's say the high bit of a tile in the map says it's dirty. This leaves 7 bits for the tile number, which limits our tile types to 128 instead of 256, but I think that's acceptable.

Let's add a couple of functions to mark things as dirty:

; Marks all tiles as dirty and requiring redraw
; no inputs, destroys hl, a, b
dirtymap:
        ld hl, map
        ld b, 64
dmloop:
        ld a, (hl)
        or 0x80
        ld (hl), a
        inc hl
        dec b
        jr nz, dmloop
        ret

The first function loops through the whole map and ORs the top bit on for each entry. We don't use ADD because some of the tiles may already be marked as dirty, and adding 0x80 to those would clear the bit instead of setting it. Note that OR only operates on register A.
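For comparison, the Z80 can also set a bit directly in memory through (hl), and DJNZ combines the "dec b" and "jr nz" pair. The following is an alternative sketch, not the version this tutorial uses; the labels dirtymap_set and dmsloop are made up for the example.

; Marks all tiles as dirty and requiring redraw
; no inputs, destroys hl, b (a is left untouched)
dirtymap_set:
        ld hl, map
        ld b, 64
dmsloop:
        set 7, (hl)         ; set the top bit in place
        inc hl
        djnz dmsloop        ; decrement b, loop until it reaches zero
        ret

SET 7, (hl) takes 15 T-states against the 21 of the load, OR and store sequence, so it's a small win, though for a function we call once it hardly matters.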

; Marks one tile as dirty
; hl=tile index, destroys hl, bc, a
dirtytile:
        ld bc, map
        add hl, bc
        ld a, (hl)
        or 0x80
        ld (hl), a
        ret    

We'll also need a way to mark single tiles as dirty, so here we go. We'll probably replace this function call with a macro later on, as it's so short that using a function call seems wasteful.
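A guess at what such a macro might look like, assuming sjasmplus-style MACRO/ENDM syntax and a tile index that is known at assembly time. The macro name DIRTYTILE is made up for this sketch.

        MACRO DIRTYTILE index
            ld hl, map + index  ; address is computed by the assembler
            ld a, (hl)
            or 0x80
            ld (hl), a
        ENDM

        ; usage, marking the tile at row 1, column 6:
        DIRTYTILE 8*1+6

Each use expands to 7 bytes instead of the 6 bytes of "ld hl" plus "call", but it skips the call, the return and the 16-bit add. It also no longer destroys bc.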

The main loop also changes, naturally, to do the actual work of skipping the draws:

        call dirtymap
mainloop:
        ld hl, 8*1+6
        call dirtytile
        ld hl, 8*2+6
        call dirtytile

We call dirtymap before starting the main loop to get all of our tiles drawn (once), and mark a couple of tiles as dirty in the loop to cause at least something to be drawn every time.

maploop:
        push hl
        push bc
        ld a, (hl)
        bit 7, a
        jr z, skipdraw
        and 0x7f ; clear top bit
        ld (hl), a
        ld l, a
        ld h, 0
        call drawtile
skipdraw:        

Inside the maploop, after reading the tile from the map we do a bit check using BIT, and if the bit is zero, we skip the draw. If we don't jump, we clear the top bit using AND (which, again, only works with register A). Other ways to clear the top bit include adding 128 to it (which works here because we know the bit is set, so the addition overflows it away), which would take just as much space and time, or using the RES instruction, which would actually be slower.

After the bit is cleared we store it back to the map so we don't need to do it on the next frame, and finally get to calling drawtile.
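For reference, here are the three ways of clearing the top bit mentioned above, side by side, with nominal Z80 sizes and timings:

        and 0x7f            ; 2 bytes, 7 T-states
        add a, 0x80         ; 2 bytes, 7 T-states (only valid when bit 7 is known to be set)
        res 7, a            ; 2 bytes, 8 T-states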

The result of only drawing the two dirty tiles each frame is pretty clear..

..although you have to remember that time is also spent before the top border starts, so we're actually using a bit more time than it looks.

The slightly optimized version of the source can be downloaded here.

Next up we'll do something more useful. Honest.

Any comments etc. can be emailed to me.