TQ4HAM: Parsing on Three-Phase Power (LLO Archive)

Created 2025-12-12, last modified 2025-12-12. Visibility: public

Part of my archive of Layover Linux Official posts on Tumblr.


2025-11-12

As a human, when you read source code, you use a lot of visual processing, which is very parallel. This is why syntax highlighting and minimaps can be helpful to developers: we usually take in the shape of things before we start registering the specifics.

A picture of Hello Kitty on a unicorn. The picture starts out too pixelated to understand as anything but "colorful", but the whole image is visible in pixelated form. Over time, the pixelation clears up and the image is visible in higher and higher resolution.

A program that needs to read source code - whether an interpreter, compiler, linter, whatever - has no such luxury. There's no quick perception of the vibes. A CPU can only pay attention to a tiny slice of data at a time. It can do that very quickly, but it's more like a laser beam scanning over your code character by character.

The same picture of Hello Kitty on a unicorn. This time, it starts out blank white, and loads in full resolution from top to bottom over time. So halfway through the animation, the top half of the image is visible in perfect clarity, and the bottom half is entirely missing.

So, mechanically turning some programming language source code into some in-memory datastructures (a process called "parsing") is a challenge: you want to be fast, and all you have to work with is being able to perceive a character at a time, more or less.

It turns out, this work is best broken into phases, each of which can be simple and operate with the efficiency of a factory assembly line. My current parser is in 3 phases:

  1. Lexer
  2. Recursive descent
  3. Object construction

My old parser had phases 2 and 3 combined, and I'm going to explain why that's bad. Or at least, it was for my circumstances.

Lexing

A lexer does an extremely simple linear scan of your source code and turns it into chunks called Tokens. Each token has a type, but the types are pretty simple. Let's look at how some real Prone source code breaks down into Tokens.

[3, 'four']

Turns into:

(start) (end)  type
0       1      left square bracket
1       2      number
2       3      comma
3       4      whitespace
4       10     string
10      11     right square bracket

It's really not complicated, and that's the point. A lexer never needs to backtrack and undo a mistake. It doesn't need to hold complicated state. It does one scan over your source code and it's done. This means later stages are still seeing the world through a straw - it's the CPU, after all - but you can see a whole token at a time through that straw! Maybe a few in a row.
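Here's a minimal sketch of such a lexer in Python - illustrative only, not Prone's actual implementation - that produces exactly the token table above. Note there's no backtracking anywhere: one forward scan, one token appended per step.

```python
# Hypothetical single-pass lexer sketch. Token type names match the table
# above; everything else (function name, tuple shape) is made up for
# illustration.

def lex(src):
    tokens = []
    i = 0
    while i < len(src):
        start = i
        c = src[i]
        if c == "[":
            i += 1
            kind = "left square bracket"
        elif c == "]":
            i += 1
            kind = "right square bracket"
        elif c == ",":
            i += 1
            kind = "comma"
        elif c.isdigit():
            while i < len(src) and src[i].isdigit():
                i += 1
            kind = "number"
        elif c.isspace():
            while i < len(src) and src[i].isspace():
                i += 1
            kind = "whitespace"
        elif c == "'":
            i += 1
            while src[i] != "'":  # sketch: assumes the string is terminated
                i += 1
            i += 1  # consume the closing quote
            kind = "string"
        else:
            raise SyntaxError(f"unexpected character {c!r} at {start}")
        tokens.append((start, i, kind))
    return tokens

print(lex("[3, 'four']"))
```

Each token records where it starts and ends in the source, so later phases never have to look at raw characters again.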

Recursive descent

Recursive descent is more intense. While there are other types of parser designs, recursive descent is nice for being flexible and easy to understand, relatively speaking. The whole idea is that, most of the time, there are multiple different directions you could go. Like if we've got this far reading some source code:

x = ...
    ^

Well, what comes next could be anything! It could be a list. It could be a number, or a string, depending on what token (or tokens) come next.

So we try something, knowing it might be wrong. We try to read it as a number, and see - oops, that isn't right. So we undo, and try another possibility. Is it a string? Nope. Eventually we try "list" and it works - but only once that whole list has been confirmed valid through the same trial-and-error process.

Recursive descent is simple and flexible because you just list out all the things that might happen at every fork in the road, and your parser... tries 'em. Eventually, you get a pass-or-fail grade for the whole piece of source code. Obviously, just knowing your source code is valid doesn't get you a whole lot on its own, but the idea is that you also do a little bit of note-taking as you go, so if the recursive descent succeeds, you have a full notebook at the end of the journey.
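A minimal Python sketch of the idea, with made-up names and token shapes (not Prone's): each alternative either succeeds and returns a new position, or fails and we simply don't advance - the only undo work is throwing away the notes taken on the failed attempt.

```python
# Hypothetical recursive descent sketch. Tokens are (kind, text) pairs,
# with whitespace tokens assumed already filtered out. "notes" is the
# notebook for a later construction phase; the opcode names are invented.

def parse_number(tokens, pos, notes):
    if pos < len(tokens) and tokens[pos][0] == "number":
        notes.append(("push_number", tokens[pos][1]))
        return pos + 1
    return None

def parse_string(tokens, pos, notes):
    if pos < len(tokens) and tokens[pos][0] == "string":
        notes.append(("push_string", tokens[pos][1]))
        return pos + 1
    return None

def parse_list(tokens, pos, notes):
    if pos >= len(tokens) or tokens[pos][0] != "lbracket":
        return None
    pos += 1
    count = 0
    while True:
        result = parse_value(tokens, pos, notes)
        if result is None:
            break
        pos = result
        count += 1
        if pos < len(tokens) and tokens[pos][0] == "comma":
            pos += 1
        else:
            break
    if pos < len(tokens) and tokens[pos][0] == "rbracket":
        notes.append(("make_list", count))
        return pos + 1
    return None

def parse_value(tokens, pos, notes):
    # The fork in the road: try each alternative in turn.
    for alternative in (parse_number, parse_string, parse_list):
        saved = len(notes)
        result = alternative(tokens, pos, notes)
        if result is not None:
            return result
        del notes[saved:]  # cheap undo: discard notes from the failed path
    return None

tokens = [("lbracket", "["), ("number", "3"), ("comma", ","),
          ("string", "four"), ("rbracket", "]")]
notes = []
end = parse_value(tokens, 0, notes)
# Pass-or-fail grade: success means the whole token list was consumed.
print(end == len(tokens), notes)
```

The undo button here is nearly free: failure just means the position doesn't advance and a slice of the notes list gets truncated. No objects were built, so nothing needs tearing down.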

Object construction

The whole point of parsing is to turn all your source code into some data structure in memory. And in theory, you should be able to do it with the notebook from your recursive descent quest. So what's in the book?

Well, in my case, that book is a list of instructions for a virtual stack machine. And the instructions look like this:
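As a purely hypothetical sketch - the opcode names here are invented, not Prone's actual instruction set - a notebook for `[3, 'four']` might read something like:

```python
# Invented opcodes for illustration only.
notes = [
    ("push_number", "3"),     # build a number from the token text "3"
    ("push_string", "four"),  # build a string from the string token's body
    ("make_list", 2),         # pop two values, push a list holding them
]
```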

This is... basically a program unto itself! A tiny temporary program that, by having access to the original token list, and the source code it's based on, can build objects and fuse them together until you get a final result. The program should end with one item on the stack. And that, folks, is your final parse result.

The cool thing is, aside from some asserts to defend against bugs in the parser itself, there is no error checking code in the object construction phase. We already know we have a correct piece of source code, and a notebook detailing exactly the steps to turn it into an object in memory. We are no longer in "I don't know" territory. Recursive descent took that bullet for us. Now we know.
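As a minimal sketch of what such a construction phase can look like - again with invented opcode names, not Prone's real ones - here's a tiny stack machine that replays the notebook. Notice there's no error handling at all, only asserts guarding against bugs in the parser itself:

```python
# Hypothetical construction-phase sketch: replay the notebook on a stack.
# By this point the source is known valid, so there are no failure paths.

def construct(notes):
    stack = []
    for op, arg in notes:
        if op == "push_number":
            stack.append(int(arg))
        elif op == "push_string":
            stack.append(arg)
        elif op == "make_list":
            assert len(stack) >= arg, "parser bug: not enough values on stack"
            items = stack[-arg:]   # pop the list's elements...
            del stack[-arg:]
            stack.append(items)    # ...and push the assembled list
    assert len(stack) == 1, "parser bug: program should end with one item"
    return stack[0]

print(construct([("push_number", "3"),
                 ("push_string", "four"),
                 ("make_list", 2)]))
# → [3, 'four']
```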

So, how did this work with two phases, then?

Badly!

At the "maybe it's a list" part, it would allocate a VVec in memory just in case, and then if it turns out not to be a list we're looking at, we have to deallocate the VVec. This type of stuff would happen all the time. Soooo much wasted work. It was also harder to think about:

maybe it's a block?

As the parser writer, you have to pay so much attention to which resources need cleanup at different failure points, and it's constantly changing.
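For contrast, here's a sketch of the old two-phase shape in hypothetical Python, trimmed down to just numbers and lists, with a plain list standing in for VVec. Objects get built during recursive descent, so every failure path has to tear them down:

```python
# Hypothetical two-phase sketch: descent and construction woven together.
# On success, each function returns (new_pos, built_object); on failure,
# it must clean up everything it built before returning None.

def parse_number_2p(tokens, pos):
    if pos < len(tokens) and tokens[pos][0] == "number":
        return pos + 1, int(tokens[pos][1])  # object built immediately
    return None

def parse_list_2p(tokens, pos):
    if pos >= len(tokens) or tokens[pos][0] != "lbracket":
        return None
    items = []  # allocated "just in case" this turns out to be a list
    pos += 1
    while True:
        result = parse_value_2p(tokens, pos)
        if result is None:
            break
        pos, value = result
        items.append(value)
        if pos < len(tokens) and tokens[pos][0] == "comma":
            pos += 1
        else:
            break
    if pos < len(tokens) and tokens[pos][0] == "rbracket":
        return pos + 1, items
    items.clear()  # not a list after all: discard everything we built
    return None    # (without a GC, you'd have to free each element too)

def parse_value_2p(tokens, pos):
    for alternative in (parse_number_2p, parse_list_2p):
        result = alternative(tokens, pos)
        if result is not None:
            return result
    return None
```

Even in this toy version, the cleanup burden shows: the `items.clear()` branch exists purely so a failed guess doesn't leak work, and every new alternative you add to the grammar adds more of these branches.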

Basically, recursive descent wants a cheap undo button, as close to instant as possible. Object construction wants to never undo or have failure logic ever. If you try to weave the two together, you get a godawful worst-of-both-worlds combo: recursive descent with the slowest undo button ever, and object construction with the error handling logic of the Hellraiser dimension.

In part 3, I'll talk about the wrepr functions, and why they make repr so much faster than it used to be.