There's no way around it, sprites are pretty necessary when doing just about anything, so I tackled this problem. First, here's the sprite I'm using:
Uh, let's zoom a bit. Too bad there's no standard way to disable the bilinear filtering. Maybe in a future browser:
So yeah, I'm not a graphics designer. Anyway. The colors are black, white, and anything else. Everything that's black or white becomes the "sprite data" or "or mask" and everything that's anything else becomes "sprite mask" or "and mask".
Why do we need the two masks? Since the screen is a bitmap with every byte representing 8 pixels, we'll want to preserve any pixels that our sprite is not covering. We'll have to do a read-modify-write operation, where we first take the current pixels off the screen, then AND our sprite mask to cause all pixels that are on the sprite to become zero, and then OR our sprite data on top of this to turn the pixels back on where we want.
For any kind of speed, the 16x16 sprite has to be stored as eight 24x16 sprites, one copy for each sideways shifting. Remember, the z80 doesn't have a barrel shifter, so doing the shifting on the fly would be rather slow. If the sprite is animated (say, a walking animation), those animation frames can sometimes be written to these shifted copies, saving space, and you get the animation playing for free to boot.
Iteration 1, in pure C, managed to render one sprite in a frame.
Iteration 2, where I massaged the sprite data to be more streamlined - every two bytes of data contains one byte of mask and one byte of pixel data - managed three sprites in a frame. Still way too slow, so I went down to assembly.
/* shift 3*/
0b11111111, 0b00000000, 0b00000011, 0b11111100, 0b11111111, 0b00000000,
0b11111100, 0b00000011, 0b00000000, 0b00000011, 0b11111111, 0b00000000,
0b11111000, 0b00000100, 0b00000000, 0b00000000, 0b01111111, 0b10000000,
0b11110000, 0b00001000, 0b00000000, 0b10000000, 0b00111111, 0b01000000,
0b11110000, 0b00001001, 0b00000000, 0b01000000, 0b00111111, 0b01000000,
0b11100000, 0b00010000, 0b00000000, 0b10000000, 0b00011111, 0b00100000,
0b11100000, 0b00010000, 0b00000000, 0b00000000, 0b00011111, 0b00100000,
0b11100000, 0b00010000, 0b00000000, 0b00000000, 0b00011111, 0b00100000,
0b11100000, 0b00010000, 0b00000000, 0b00000000, 0b00011111, 0b00100000,
0b11100000, 0b00010000, 0b00000000, 0b00000000, 0b00011111, 0b10100000,
0b11100000, 0b00010000, 0b00000000, 0b00000000, 0b00011111, 0b00100000,
0b11110000, 0b00001000, 0b00000000, 0b00000000, 0b00111111, 0b01000000,
0b11110000, 0b00001000, 0b00000000, 0b00000001, 0b00111111, 0b01000000,
0b11111000, 0b00000100, 0b00000000, 0b00001000, 0b01111111, 0b10000000,
0b11111100, 0b00000011, 0b00000000, 0b00000011, 0b11111111, 0b00000000,
0b11111111, 0b00000000, 0b00000011, 0b11111100, 0b11111111, 0b00000000,
Oh yeah, the SDCC compiler (like most small device compilers) supports the 0b00000000 literals. My PNG to sprite data converter could easily be changed to outputting hex constants. Or to draw ascii art. Or to do political commentary on twitter. But I digress.
Iteration 3, with inner loop (which renders one span of the sprite) in assembly got up to 6 sprites in a frame.
Iteration 4 unrolled the whole drawing routine, and had it completely in assembler, managed 18 sprites in a frame. Here's code for one span (repeats 16 times):
ld e, 2 (iy) ; fetch screen y offset from table
ld d, 3 (iy)
ex de,hl
add hl, bc ; add x offset
ex de,hl
ld a, (de) ; load pixel from screen
and (hl) ; and the mask
inc hl ; increment data ptr
or (hl) ; or the pixel data
ld (de),a ; put modified pixels to screen
inc hl ; increment data ptr
inc de ; increment screen ptr
ld a, (de) ; rinse
and (hl)
inc hl
or (hl)
ld (de),a
inc hl
inc de
ld a, (de) ; repeat
and (hl)
inc hl
or (hl)
ld (de),a
inc hl
At this point I stepped back and pondered if there was some other approach I could take. Any further optimizations on the current approach would be very incremental. So I wrote a code generator that takes in the sprite data, and outputs assembler functions that have the sprite data baked in. Armed with the knowledge of what exactly is in the sprite data, the code generator could take some shortcuts, and for all writes where all of the pixels get overwritten it can skip the whole read-modify-write operation and just write out the sprite data.
The result was 26 sprites in a frame. On average. Since the sprite data is different for every shifted version, the shortcuts also vary - some shifted versions are faster than others.
ld l, 2 (iy) ; fetch screen y offset
ld h, 3 (iy)
add hl, bc ; add x offset
ld a, (hl) ; load pixel from screen
and #252 ; and with hardcoded value
or #3 ; or with hardcoded value
ld (hl), a ; plot pixel to screen
inc hl ; increment screen ptr
ld (hl), #3 ; write hardcoded value (mask was 0)
; skip third byte completely since it doesn't change screen
There are two other drawbacks to this approach. First, it takes a lot of space. The data-based approach took 546 bytes for the rendering routine and 768 bytes for the sprite, and if I wanted to render several different kinds of sprites (gasp!), each of these would take 768 bytes more.
The "compiled sprite" took 3645 bytes. And it can handle only this one specific 16x16 sprite. I did consider the possibility of doing some kind of crazy self-modifying-code hack where you first prime the function with the sprite data and then render several times, which should still be faster... but that would mean removing all of the shortcuts, which would make the routine slower and bigger.
Iteration 5 of the data based approach abused the stack mechanism of the z80 and got up to 20 sprites per frame. Also swapped a couple of registers to get rid of some instructions. Code size dropped to 470 bytes.
ld l, 0 (iy) ; fetch y offset from table
ld h, 1 (iy)
add hl, bc ; add x offset
pop de ; get mask and pixel data using stack
ld a, (hl) ; load pixel from screen
and e ; and the mask
or d ; or the pixel data
ld (hl),a ; put modified pixels to screen
inc hl
pop de ; rinse
ld a, (hl)
and e
or d
ld (hl),a
inc hl
pop de ; repeat
ld a, (hl)
and e
or d
ld (hl),a
The sprites per frame values are from a non-accurate emulator that doesn't handle the ULA contention timings (where ULA monopolizes memory and causes z80 to delay), so the actual sprite counts are lower. The relative speedups are still real. The drawing routines also simply draw the sprites, and there's no clearing covered here, so the routines as is are not too useful, but they are a foundation.
All of the code is available on github.