2019-10-04

Self-Modifying Code

When you read about programming in the 80s, there's often a discussion of how common assembly was.  You'll hear about all sorts of mystical tricks and tips that programmers back then used.  You might also have heard reference to "self-modifying code".

When I was younger, "self-modifying code" always sounded so advanced.  It was something that only the very brightest people did, and they surely must've done it to prevent people from disassembling their code.

The reality is much more mundane.  I'll show you a real example of self-modifying code from a piece of commercial software from the 80s.  I'll walk you through it, and explain why it was written this way.

The code in question is from an Apple IIgs game (the CPU is a 65816, which is a 16-bit variant of the classic 6502).

00/0e4e: a9 00 60    lda #$6000     ; A = $6000
00/0e51: 8d 6d 0e    sta $0e6d      ; mem[$0e6d] = A
...
00/0e65: e2 30       sep #$30       ; 8-bit registers
00/0e67: a2 00       ldx #$00       ; X = 0
00/0e69: bd 00 0a    lda $0a00, x   ; A = mem[$0a00 + X]
00/0e6c: 8d ff ff    sta $ffff      ; mem[$ffff] = A  !!!
00/0e6f: ee 6d 0e    inc $0e6d      ; mem[$0e6d]++
00/0e72: d0 03       bne $00/0e77   ; is result != 0?
00/0e74: ee 6e 0e    inc $0e6e      ; no, mem[$0e6e]++
00/0e77: e8          inx            ; X++
00/0e78: d0 ef       bne $00/0e69   ; is result != 0?

That's a chunk of real self-modifying code, so let's go through it.  Notice the references to memory at addresses $0e6d and $0e6e.  If you look at the address labels on the left, you'll see those addresses point to the operand of the instruction I've marked with "!!!".  By modifying the values at those addresses, we can make that "sta" instruction store to any 16-bit address we like.  That's the only part of this code that gets modified.  Mundane, right?

I put pseudocode over on the right so you can see what each instruction does even if you don't know 65816 assembly.  Basically the code does this:

uint8_t *ptr = &memAt6000;
...
   for (x = 0; x < 0x100; x++) {
     *ptr++ = memAtA00[x];
   }

The observant among you will wonder why we go through all this hassle when we could instead just write sta $6000, x (which is basically the same as ptr[x] = memAtA00[x]).

There are several reasons.  One is that the result isn't quite identical: in the original, when the loop finishes, the pointer points to the end of the array, not the beginning.  That matters because this is actually an inner loop that gets run multiple times by an outer loop, which leads us to reason two: because it runs multiple times, the destination grows beyond the 256 bytes an 8-bit index register can cover.
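
To make the structure concrete, here's a rough C sketch.  Only the inner loop corresponds to the disassembly above; the outer loop, the chunk count, and the idea that the $0a00 buffer gets refilled each pass are my assumptions about the surrounding code.

#include <stdint.h>

/* Sketch: copy some number of 256-byte chunks to an ever-advancing
 * destination.  Only the inner loop matches the disassembly above. */
void copyChunks(uint8_t *dst, const uint8_t *src, int numChunks) {
    uint8_t *ptr = dst;                  /* the self-modified operand */
    for (int chunk = 0; chunk < numChunks; chunk++) {
        /* (hypothetical) refill the $0a00 buffer here */
        for (int x = 0; x < 0x100; x++) {
            *ptr++ = src[x];             /* the patched sta */
        }
        /* ptr now points just past the last byte written, which is
         * exactly the state the original code preserves between passes */
    }
}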

Why not just use a variable elsewhere in memory to store the pointer then?  Surely this doesn't need to be self-modifying.

Well, actually it does.  It has to do with the addressing modes available, or not available as it were.

You see, the 65816 only allows indirect addressing through the direct page (known as the zero page on the 6502).  Absolute addressing doesn't offer any indirect modes.  Needless to say, the direct page gets a bit crowded, and programmers have just decided that it's far more elegant to use self-modifying code to "fake" indirect absolute addressing, which is why this technique is rampant on the Apple IIgs.

It's also less confusing in practice.  The direct page can move around, so you either keep track of it by hand (setting it to the appropriate place upon entering a routine, adding mental and real overhead) or you leave it in a single place the entire time (limiting you to at most 128 two-byte pointers for your entire program).  It becomes clear that this particular style of self-modifying code is an elegant solution to the problem.

This isn't the only use-case of self-modifying code, of course, but it is one of the most common, especially on machines with limited addressing modes.

2019-09-24

A Word Search Generator

A while ago I needed to generate a daily word search using a word list that was unique for each day.  I didn't find anything online, so I invented my own algorithm.  It is designed to encourage words to overlap as much as possible.

The setup for the algorithm is very simple and a bit boring, so I'll just cover the basics here.  The two things we need before we begin are a list of words to put in the word search and the dimensions of the letter grid.  For my implementation, the dimensions and a file containing a whitespace-separated list of words were passed in as arguments.

The word list also needs one other, non-obvious thing.  You need to make sure that none of the words in the word list are substrings of any other words in the word list.  Be sure to check for reverse matches as well.  For example, "times" and "emit" would be very confusing to search for.
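
A minimal sketch of that check in C (the helper name and the fixed-size reversal buffer are mine):

#include <stdbool.h>
#include <string.h>

/* Returns true if either word appears inside the other, forwards or
 * reversed.  Such a pair would be ambiguous to search for. */
static bool conflicts(const char *a, const char *b) {
    char rev[64];
    size_t len = strlen(b);
    if (len >= sizeof(rev))
        return true;                         /* too long; just reject it */
    for (size_t i = 0; i < len; i++)
        rev[i] = b[len - 1 - i];             /* reverse of b */
    rev[len] = '\0';
    return strstr(a, b) || strstr(b, a) ||   /* substring either way...   */
           strstr(a, rev) || strstr(rev, a); /* ...and against the reverse */
}

Run every pair of words through a check like this and drop one of the two whenever it returns true.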

Then before we hit the main algorithm, we sort the word list by length, with the longest words first.  We then loop through all the words in the word list and attempt to insert them into the grid.  If we succeed, we mark the word as used so we can later print it out as part of the clue list.

for (word : words) {
  if (addWord(grid, word)) {
    word.wasUsed = true;
  }
}
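
The sort-by-length step mentioned above might look like this in C, assuming the word list is just an array of C strings:

#include <stdlib.h>
#include <string.h>

/* Comparator for "longest words first". */
static int byLengthDesc(const void *a, const void *b) {
    size_t la = strlen(*(const char *const *)a);
    size_t lb = strlen(*(const char *const *)b);
    return (la < lb) - (la > lb);   /* longer strings sort earlier */
}

/* usage: qsort(words, wordCount, sizeof(char *), byLengthDesc); */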

To add a word to the grid, we first loop through every cell in the grid.  We check to see if the cell is blank or if it matches the first letter of our given word.  If either of these is true, we measure the score for that position.  Then we move on to the next cell.

addWord(grid: Grid, word: Word): bool {
  bestScore = 0;    // shared with measure()
  matches.clear();  // shared with measure(); don't reuse the last word's matches
  for (y := 0..grid.height) {
    for (x := 0..grid.width) {
      if (grid.cell(x, y) == Blank || grid.cell(x, y) == word[0]) {
        measure(grid, word, x, y);
      }
    }
  }
  // random best fit goes here
  ...
}

To measure the score, we try to place the word in the grid at that location in all directions.  We then calculate a score based on the number of intersections for a given direction.

measure(grid: Grid, word: Word, x: int, y: int) {
  for (dy : [-1, 0, 1]) {
    for (dx : [-1, 0, 1]) {
      if (dx == 0 && dy == 0) {  // no direction
        continue;
      }
      ex := x + (word.length - 1) * dx;  // position of the last letter
      ey := y + (word.length - 1) * dy;
      if (ex < 0 || ex >= grid.width ||
          ey < 0 || ey >= grid.height) {  // word doesn't fit
        continue;
      }
      score := 1;
      for (i := 0..word.length) {
        cell := grid.cell(x + i * dx, y + i * dy);  // next cell
        if (cell == word[i]) {  // matches word
          score++;
        } else if (cell != Blank) {  // word doesn't fit here at all
          score = 0;
          break;
        }
      }
      if (score == 0) {  // blocked by another word, skip this direction
        continue;
      }
      if (score >= bestScore) {  // are we among the best matches?
        if (score > bestScore) {
          bestScore = score;
          matches.clear();  // new tier of best scoring matches
        }
        matches.add(Match(x, y, dx, dy));
        if (dx != 0 && dy != 0) {  // we want diagonals, so add twice
          matches.add(Match(x, y, dx, dy));
        }
      }
    }
  }
}

This is the core of the algorithm, so we should look at it a little closer.  We loop through all the directions, where dx and dy each take one of the values -1, 0, 1.  If both dx and dy are 0, that doesn't point in any direction, so we skip that case.  We calculate the end position of the word for the current direction and check to see if it fits on the grid.  If not, we skip that direction.

Next we calculate the score for that direction.  Encountering a cell already containing a matching letter is worth 1 point.  This encourages intersecting words.  If we encounter a cell with a different letter, we reset the score to zero and move on since the word doesn't actually fit there.

If the score is among the best scores, we add our current orientation to the list of matches.

What we end up with is a list of all the highest scoring orientations.  Given the choice between a horizontal or vertical match and a diagonal match, we would prefer the diagonal match.  Thus, we've added any diagonal matches to the high-score list twice.

Note that we haven't actually placed the word on the grid yet; we've just calculated a list of the best places to put it.

After we have tested every cell in the grid, we have a list of possible places to put the word.  If the list is empty, the word doesn't fit anywhere in the grid, so we abandon it and move on to the next.

If we do have a list of possibilities, we pick a random one, place the word into the grid, and move on to the next word.
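
Here's a sketch of that final step in C, standing in for the elided end of addWord above.  The Match layout, the flat char grid with '\0' meaning blank, and the plain rand() call are all assumptions of mine:

#include <stdbool.h>
#include <stdlib.h>

typedef struct { int x, y, dx, dy; } Match;   /* assumed layout */

/* Pick one of the equally good placements at random and write the
 * word into the grid. */
static bool placeWord(char *grid, int width, const char *word, int wordLen,
                      const Match *matches, int matchCount) {
    if (matchCount == 0)
        return false;                         /* word fits nowhere */
    Match m = matches[rand() % matchCount];   /* random best-scoring spot */
    for (int i = 0; i < wordLen; i++)
        grid[(m.y + i * m.dy) * width + (m.x + i * m.dx)] = word[i];
    return true;
}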

Finally, after we have added all possible words to the grid, we loop through the grid and fill in any blanks with random letters.
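
That last step is tiny; a sketch using the same assumed grid representation as above:

#include <stdlib.h>

/* Replace every remaining blank with a random lowercase letter. */
static void fillBlanks(char *grid, int width, int height) {
    for (int i = 0; i < width * height; i++)
        if (grid[i] == '\0')
            grid[i] = 'a' + rand() % 26;
}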

Then you can output the grid and marked words.

One possible improvement would be the inclusion of red herrings.  When filling in the blanks with random letters, you could use letters from the word list to make "almost" matches.  It's also trivial to make the grid a random shape.  Just take the grid and, before you do anything else, fill the cells outside your desired shape with a special character; since it matches no letter and isn't blank, no word will ever be placed there.  Then replace the special character with a blank on the final grid.

2019-02-15

Oldschool planar graphics and why they were a good idea

Oldschool planar graphics

If you ever work with old hardware, like writing an NES emulator or reverse engineering an old Amiga game, you will inevitably come across planar graphics modes.

These days, with loads of memory and huge color depths, graphics are stored sequentially.  It makes the most logical sense.  The red, green, and blue values for each pixel are just stored in an array, in some predefined order (RGB, BGR, whatever).  There may even be an alpha channel.

Video cards of yesteryear didn't have access to that much RAM, so direct color, requiring 3 or 4 bytes of RAM per pixel, was out of the question.

Those old video cards used a palette-based system instead.  There's a palette with the red, green, and blue values of each color.  Then the image data itself is just an array of indices into the palette.  With 256 colors, we can use a single byte per pixel.
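
As a tiny illustration of the idea in C (the struct and function names are mine):

#include <stdint.h>

typedef struct { uint8_t r, g, b; } Color;

/* Resolve an 8-bit indexed image to RGB through a 256-entry palette.
 * The image itself stores only the palette indices. */
static void resolve(const uint8_t *indices, const Color palette[256],
                    Color *out, int pixelCount) {
    for (int i = 0; i < pixelCount; i++)
        out[i] = palette[indices[i]];
}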

This is all very straightforward so far.  However, even 256 colors was extravagant for some older hardware.

What if you only had 16 colors?

With 16 colors, we only need 4 bits per pixel for the indices.  That means a single byte holds 2 pixels.  This works out rather well, since the width of each video mode is always going to be an even number.  With 16 colors, images take up exactly half the space they would with 256 colors.  The only thing you'd really need to worry about is whether the low nibble (4 bits) is drawn first or the high nibble, and that of course is up to the hardware to decide.  Regardless, the sequential palette-based system still works.
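
Here's roughly what addressing a pixel in a packed 4-bit-per-pixel buffer looks like in C.  This assumes the low nibble holds the even-numbered pixel; swap the cases if the hardware does the opposite:

#include <stdint.h>

/* Read the palette index of pixel i from a packed 4bpp buffer. */
static uint8_t getPixel4(const uint8_t *buf, int i) {
    uint8_t b = buf[i / 2];
    return (i & 1) ? (b >> 4) : (b & 0x0f);
}

/* Write the palette index of pixel i without disturbing its neighbor. */
static void setPixel4(uint8_t *buf, int i, uint8_t color) {
    uint8_t *b = &buf[i / 2];
    if (i & 1)
        *b = (*b & 0x0f) | ((color & 0x0f) << 4);   /* replace high nibble */
    else
        *b = (*b & 0xf0) | (color & 0x0f);          /* replace low nibble  */
}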

32-colors - the problem child!

With 32 colors, things get difficult.  We technically only need 5 bits per pixel for the indices, but you cannot really cut up an array of bytes into 5-bit chunks easily.  You could pad it, so that each byte contains the index, but now you're wasting 3 bits per pixel.  That's a lot of space to be wasting.

You could try to cram it all together, so the bottom 5 bits of the first byte hold the first pixel, the top 3 bits of that byte hold the bottom 3 bits of the second pixel, the bottom 2 bits of the second byte hold the top 2 bits of the second pixel, and so on.  This is a disaster.  Addressing individual pixels is now a total nightmare.
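
To see why, here's what just reading a pixel from a fully packed 5bpp buffer looks like in C (assuming least-significant-bit-first packing, and ignoring the buffer-end edge case):

#include <stdint.h>

/* Read pixel i from a fully packed 5bpp buffer.  The 5 bits can
 * straddle a byte boundary, so we have to fetch 16 bits and shift. */
static uint8_t getPixel5(const uint8_t *buf, int i) {
    int bit   = i * 5;        /* absolute bit offset of the pixel */
    int byte  = bit / 8;
    int shift = bit % 8;
    uint16_t pair = buf[byte] | (uint16_t)(buf[byte + 1] << 8);
    return (pair >> shift) & 0x1f;
}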

So our options are waste space, or make addressing suck.

Planes to the rescue!

This is where planar graphics come in to save the day.  Instead of trying to pack all the bits of the indices together, we spread them out across planes. 

A plane is a 1 bit per pixel array.  So any byte of a plane controls 8 pixels.  For 32-colors, we need 5 bits per pixel, so we need 5 planes.

The first plane contains the lowest bits of all pixels.  The second plane contains the second-lowest bits of all pixels, and so on.  So to modify pixel 3 on the screen (counting from 0), you would modify bit 3 of the first byte of every plane.

It's a bit more complicated than sequential storage, but it's still easy to address a specific pixel (byte = position / 8, bit = position & 7), and it saves a ton of RAM.  Imagine a 320x200 screen with 32 colors.  If we had padded each pixel out to a byte, it would require 64,000 bytes.  With planes, we only need 320 * 200 / 8 = 8,000 bytes per plane, which, with 5 planes, comes to just 40,000 bytes.
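
Putting that together, here's a sketch of writing one pixel across the planes in C (the bit ordering within a byte is an assumption; real hardware like the Amiga has its own conventions):

#include <stdint.h>

#define WIDTH      320
#define HEIGHT     200
#define PLANES     5
#define PLANE_SIZE (WIDTH * HEIGHT / 8)   /* 8,000 bytes per plane */

/* Set the pixel at (x, y) to a 5-bit palette index by writing one bit
 * into each plane. */
static void setPixel(uint8_t planes[PLANES][PLANE_SIZE],
                     int x, int y, uint8_t color) {
    int pos  = y * WIDTH + x;
    int byte = pos / 8;
    int bit  = pos & 7;
    for (int p = 0; p < PLANES; p++) {
        if (color & (1 << p))
            planes[p][byte] |=  (uint8_t)(1 << bit);   /* set this plane's bit   */
        else
            planes[p][byte] &= (uint8_t)~(1 << bit);   /* clear this plane's bit */
    }
}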

Another reason for planes

While planes were super useful for odd bit-length palette sizes like 32 colors, they were also used for expansion reasons.  Let's say we made a system with 16 colors, but intended to upgrade the system later on to 32 colors.

If we made the 16-color system use planar graphics instead of storing the indices in the low and high nibble of each byte, then going to 32-colors only requires adding a new plane somewhere in RAM. The older 16-color games would still work, they just wouldn't ever reference the 16 newer colors in the palette (assuming the new plane was initialized to 0 by the system).

With systems like the Amiga, software could even specify how many planes it wanted to use, and the hardware addressing didn't even have to change to accommodate it.