Code++: Understanding larger disassembly

This past weekend I was disassembling something and it struck me how few people knew how to do what I was doing.

So I figured I would quickly document the process of taking a chunk of disassembled code and putting it into a format that's easier to understand.

The disassembly I'm going to use for my examples is the same disassembly I was working on this weekend. It is the music playing routines from some Apple IIgs software.

There is some knowledge that you should have before attempting to parse disassembly on any platform.

You should be familiar with the assembly language for your target platform.

Since this code ran on an Apple IIgs, that means you should know 65816 assembly. If you know 6502 assembly, you're more than halfway there. The 65816 is a 16-bit successor to the 6502, and is backwards compatible. There's just a few more opcodes and addressing modes, as well as some additional flags for switching between 8-bit and 16-bit registers.

You should understand how hardware is used on your target platform.

For example, if you're disassembling software that draws to the screen, you should know how the various video modes work and how video memory is laid out. Since we're disassembling a music player, we should know how the sound card works.

I'm not going to cover how the IIgs sound card works here, if you care, the Apple IIgs Technical Reference covers it.

Locate the code you're interested in.

This is one of the more difficult things to do. You can't just disassemble an entire executable and hope to parse it all at once, it's far too big. Instead you need to use some techniques to narrow down your search.

For example, if you're trying to figure out a file format for some file the program uses, you should search the disassembly for OS file calls and then work your way backwards from those calls.

If you're trying to find out how a program does some special graphical effect, then maybe check for code attached to the video blank interrupt (if it has one, again, know your platform).

For our example, we want to find the routine that's tied to the sound interrupt. This interrupt is fired whenever one of the soundcard's channels stops playing a note. This is the natural place where you would advance the song and set up the soundcard to play the next note. (Again, know your platform).

Searching through the disassembly I find the following:

00/3b84: a9 5c 98    lda  #$985c
00/3b87: 8f 2c 00 e1 sta  $e1/002c   ; Sound interrupt
00/3b8b: a9 3c 00    lda  #$003c
00/3b8e: 8f 2e 00 e1 sta  $e1/002e

This installs the sound interrupt routine. It actually stores 5c 98 3c 00 into that memory location. This is machine code that translates into the assembly "jmp $00/3c98". So we now know that $3c98 in bank $00 is the sound interrupt routine that we want to understand.

The code next to where the sound interrupt is installed also loads the music into a specific location in RAM, loads all the sample data into the sound RAM, and sets some variables. You would want to disassemble this routine as well as it paints a clearer picture of what's happening with everything sound-related. I won't do that here, but this is how I was able to better understand the sound interrupt routine.

Isolate the code you care about, once you've found it.

So now we know the function we care about, we're going to isolate it and any small functions that the code calls. This helps get rid of noise.

I've got the code we care about uploaded here. It's 528 lines of assembly. That's a pretty healthy sized routine.

One thing I'll point out before we get into it. There are two small routines at $3cd4 and $3cd9 that seem to be redundant and return the value in a table located at $00:0000 (which is a total red flag that something is horribly wrong). That's not correct. The function that sets up the sound interrupt routine also replaces those zeroed addresses with calculated addresses into the music data.

That's another lesson. Assembly programmers will modify program code instead of setting a variable, all just to save a memory access later on.

So how do we begin? Do we just step through the assembly and see what it's doing? Well, that's certainly an option, and for smaller routines, I have no problem doing this. However, with longer routines, you'll want to do what you can to see the structure of the routine.

Visualize the structure of the routine.

I mean that literally. The best way I've found to do this is to break the assembly up into what's known as "basic blocks". A basic block is a chunk of contiguous assembly with only a single entry point (at the head), and exit point (at the tail). You then draw connections between chunks.

There is software that will automatically do this. It's the first step in a higher-level decompiler. However, that software is either specific to an architecture, or expensive, or both. Luckily, it's rather easy to do it by hand.

I like to use GraphViz for this. It's a simple command-line tool that takes a text file and generates a directed graph from it.

Here's the first two basic blocks from our assembly plotted:

digraph {
  node[shape="record" fontname="helvetica"];
  edge[fontname="helvetica" fontsize=9];
  m3c98 [label="$3c98|$3c9c"];
  m3c98 -> m3c9d;
  m3c9d [label="$3c9d|$3ca0"];
  m3c9d -> m3c9d [label="bmi"];
  m3c9d -> m3ca2;  
}

We start with a node called "m3c98". This is the very first address in the routine. It's prefixed with "m" just because node names cannot start with a number. The node has a label that has 2 records, these records are the entry and exit addresses of the basic block. This is immediately followed by a connection from this basic block to the next basic block.

You can see how I've also added a label to one of the edges. If a jump between basic blocks is caused by a branch instruction, I'll put a label on it. Unlabeled edges are implicit connections (m3c98 will automatically go to m3c9d without any branching).

Doing it this way is quick and easy, you can build the graph progressively as you go through the assembly looking for branches. If you jump backwards into the middle of a basic block, it's trivial to split that basic block into two blocks and reassign the edges coming from original block to come from the new block instead.

The end result is this graph:

Full function graph

You can view the full size graph here.

This graph is useful, but it's a bit noisy. Our next step is to clean up the graph. We do this in several ways. The first is to remove any useless blocks. It's very common to find a branch that leads directly to a jump. This creates a useless block. So you can go through the graph, looking for any blocks that do this, and remove them. Replace any edges pointing to that node to the target node of the removed block.

Another common thing, at least with this CPU architecture, is there are branches that lead to a basic block that does nothing but change between 8 bit and 16 bit mode, and then jump to another block. Since we don't really care about this too much (the disassembler cared more), we can treat these as empty blocks and remove them as well.

Finally, you'll find nodes that do nothing but clean up the stack and then RET. Instead of jumping to these blocks, we should go through and replace them with unique RET blocks. That way you don't have all of these exit points all routed to couple of blocks, looking like part of the structure.

After you make these cleanups, you'll have a graph that looks like this:

Cleaned up graph

See the full size graph here.

This is much cleaner and easier to follow.

Analyze the structure of the routine.

There are patterns in this graph that correspond to programming structures that you are familiar with.

For example, this pattern is an IF statement.

IF statement

At the end of the block at $3e11, there is some comparison done. The bpl tells us that we skipped a block if the result of some comparison was positive.

This can be structured as follows:

// at the end of the $3e11 block
if (result < 0) {
  // do $3e24 block
}
// do $3e27 block

The IF/ELSE statement is also easy to spot:

IF/ELSE statement

The beq tells us that if some result is zero, then we do the $40db block, otherwise we do the $40a9 block.

This can be structured as follows:

// at the end of the $40a5 block
if (result == 0) {
  // do $40db block
} else {
  // do $40a9 block
}
// do $40e7 block

Finally, loops (for loops, while loops, do/while loops) look like this:

WHILE loop

This is a case where a block points back at a previous block. That's how you know it's a loop. This particular one looks like a WHILE loop because the comparison happens in $3ec7 (i.e. first). The bcs tells us there was most likely a comparison between two variables.

The structure then would be:

// $3ec7 block
while (something < somethingelse) {
  // 3ed3 block
}
// 3eda block

Using these simple patterns, it becomes rather easy to see the structure of the routine.

Convert disassembly into pseudocode

Now we begin the process of converting the disassembly into pseudocode. This is a step-by-step process, and we use the basic block graph to help us organize the large structures.

For example, when you get to a basic block that has multiple other blocks from below pointing back up to it, you know it's the top of a loop. You should open up a new block and mark it. As you go along, you'll discover what type of loop it is and you can go back and clarify.

When you get to an IF/ELSE statement, you can scan through the tree and see where it will converge (of if it will converge).

You may end up with criss-crossed blocks, which cannot be represented cleanly without GOTOs. This is especially true of older programs that were likely written in assembly in the first place. I would recommend using GOTOs in your pseudocode, and only after you grasp what is happening, refactor it into a structured routine.. duplicating code if necessary.

As you practice, you'll get better at spotting patterns in the structure. Without much thought at all, I can see that the block at $3d67 is probably the same as the block at $4013. They're likely both the bottom of the same FOR loop, and they're likely just duplicated because assembly programmers would do that. At the very least, block $3d67 is a "continue" statement if block $4013 is indeed the bottom of a FOR loop.

The way I convert disassembly into pseudocode, is at the top of the file I keep a list of the globals.

They start out looking something like:

uint8_t d_4356;  // 4356

Then as I figure out what d_4356 is, based on usage, I'll go rename it:

uint8_t numInst;  // 4356

Whenever I come across a new memory address, I'll search for it, if it exists, I use its name, if it doesn't, I add it to the list.

For the opcodes themselves, I tend to build them out piece by piece.

Let's look at the beginning of the assembly.

It starts with a spinlock on the soundCtl. I'll look up what the high-bit is and note it in a comment.

while (soundCtl & 0x80) {}  // wait for DOC to finish

Then it takes the result of soundCtl and ANDs it with $9f and stores it back into soundCtl. Again, I'll look up what those bits do and comment it.

soundCtl &= 0x9f;  // DOC mode, no autoinc

and so on. I'll have temporary variables for the accumulator and x and y indices, which will eventually get replaced with more complicated expressions to make them easier to follow.

And that's it. The final pseudocode (basically C) for the sound interrupt routine is here.

Code++

2016-12-13

Understanding larger disassembly