The Quest for Half a Milli - Fast Atoms (LLO Archive)

Created 2025-12-12, last modified 2025-12-12. Visibility: public

Part of my archive of Layover Linux Official posts on Tumblr.


2025-11-11

Recently, I've set my mind to something. I'd already had a goal to keep the total execution time of the test suite under 1ms on my own hardware. But because optimization is its own mindset (and can get in the way of feature work, or vice versa), I've decided not to optimize until I go over 1ms, and then make everything faster until I get below 0.5ms.

I've already gone from over 1ms to 0.54ms, and the end is in sight, but it will take serious work. I want to talk, in installments, about what I've done so far, and what I think will push me over the edge.

Fast Atoms:

Atoms are short strings in Prone, which are stored and deduplicated in a lookup table, so that they can be passed around (and compared for equality with each other) as just ID numbers. Traditionally, the ID zero was reserved for the #NULL atom, and as you define more, they're given monotonically increasing IDs. One, two, three, four...
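The traditional scheme can be sketched roughly like this (the names and table shape here are mine, not Prone's actual internals):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the traditional scheme: a global table of
 * strings, where an atom's ID is simply its insertion index. */
#define MAX_ATOMS 1024
static const char *atom_table[MAX_ATOMS] = { "NULL" }; /* ID 0 = #NULL */
static uint32_t atom_count = 1;

static uint32_t atom_intern(const char *text) {
    for (uint32_t i = 0; i < atom_count; i++)  /* dedup: reuse old IDs */
        if (strcmp(atom_table[i], text) == 0)
            return i;
    atom_table[atom_count] = text;  /* a real version would copy the text */
    return atom_count++;            /* new monotonically increasing ID */
}
```

Note that every use of an atom's text, even a constant one, has to go through `atom_intern` at runtime, and the ID you get back depends on definition order.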

This is a good start, but it has some limitations: converting between an atom's text and its ID always requires a trip through the lookup table, and because IDs depend on definition order, you can never safely hardcode an AtomID into your source code.

There's a clever solution here, for anyone either creative or experienced enough to see it. But it requires looking at the DV format.

// ============================================================================
//                              DYNAMIC VALUES
// ============================================================================
// DV is a tagged pointer type. Since malloc'd pointers are 16-byte aligned, we
// know that the lowest four bits of every "legit" pointer should be 0. This
// lets us store a very small amount of metadata in those four bits, and also,
// store lots of non-pointer values inline.
//
// The overall structure is as follows:
//
//                                                   [     typecode  ]
// vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv mmmmmmmmmmmmmmmm ssssssssssss tttt
//  32 bits of value data            16 bits meta        12b sub   tag
//
// "Inline" values are stored with TAG=0. We have plenty of other bits to
// clarify the actual type in this case. For pointer values, we only have 4
// bits to work with, so we need to treat them as precious.

Dynamic Values are 64-bit numbers that can be interpreted in multiple different ways, but contain enough information to tell you which way is correct. They can store a pointer (with 4 bits of type information) or 48 bits of inline data (with 12 bits of type information).
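The packing described by that comment block can be sketched with a few bit-twiddling helpers (the function names here are my own illustration, not Prone's actual API):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers matching the layout sketched above:
 * low 4 bits = tag (0 for inline values), next 12 bits = subtype,
 * top 48 bits = inline value data. */
typedef uint64_t DV;

static DV dv_make_inline(uint16_t subtype, uint64_t data48) {
    return (data48 << 16)                     /* 48 bits of value data */
         | ((uint64_t)(subtype & 0xFFF) << 4) /* 12-bit subtype        */
         | 0x0;                               /* tag 0 = inline        */
}
static int      dv_is_inline(DV v) { return (v & 0xF) == 0; }
static uint16_t dv_subtype(DV v)   { return (uint16_t)((v >> 4) & 0xFFF); }
static uint64_t dv_data(DV v)      { return v >> 16; }
```

A pointer value, by contrast, would be stored as-is with a nonzero tag in those low four bits that alignment guarantees are free.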

Historically, AtomIDs were 32-bit numbers that fit in the left half of a DV with room to spare. But let's think about the full 48 bits we have to play with. It's 6 bytes. That's 6 letters of ASCII (or any 6 bytes of UTF-8, which is ASCII-compatible), if you wanted.

This has a few immediate benefits. Not only is it very fast to convert between numeric and text forms of an Atom with no lookup table, you can even do it at compile time! You can finally hardcode an AtomID into your source code, and know it'll work right!
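A minimal sketch of that packing, assuming first-character-in-lowest-byte order (the actual byte order Prone uses may differ):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: pack up to 6 ASCII bytes into a 48-bit integer.
 * Because this is pure arithmetic on the bytes, the compiler can fold a
 * constant string down to a constant ID -- no runtime table lookup. */
static uint64_t atom_pack(const char *text) {
    uint64_t id = 0;
    for (int i = 0; i < 6 && text[i]; i++)
        id |= (uint64_t)(unsigned char)text[i] << (8 * i);
    return id;  /* always fits in 48 bits */
}

static void atom_unpack(uint64_t id, char out[7]) {
    for (int i = 0; i < 6; i++)
        out[i] = (char)((id >> (8 * i)) & 0xFF);
    out[6] = '\0';
}
```

The round trip is lossless, and the same string always packs to the same number, no matter when (or whether) it was "defined."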

There's also an obvious downside. 6 characters is very short. It's an extremely punishing limitation, and considering the language relies on atoms for things like variable and function names, that limitation would permeate the whole language.

So, instead, we use a hybrid approach. Bring back the lookup table, but only use it for atoms longer than 6 characters. Atoms that live in the lookup table are marked by having the high bit of the 48-bit field set, which still leaves 47 bits of headroom for monotonically increasing IDs. If the high bit is zero, the value is interpreted as text packed into a 48-bit number. (This works because ASCII bytes never have their high bit set, so no packed text can collide with a lookup ID.)
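The discriminating rule is one bit test (again, a hypothetical sketch with my own names):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the hybrid rule: bit 47 set means "index into
 * the lookup table"; bit 47 clear means "text packed directly in the ID". */
#define ATOM_LOOKUP_BIT (UINT64_C(1) << 47)

static int atom_is_fast(uint64_t id) { return (id & ATOM_LOOKUP_BIT) == 0; }

static uint64_t atom_from_index(uint64_t table_index) {
    return table_index | ATOM_LOOKUP_BIT;  /* 47 bits of index headroom */
}
static uint64_t atom_to_index(uint64_t id) { return id & ~ATOM_LOOKUP_BIT; }
```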

As you might imagine, this creates a huge incentive to make the atoms built into the language, like #DISPATCH_ERR, short enough to fit in fast mode. So now it's #DERR, and #PARSE_ERROR becomes #PERR. They're less self-documenting, but there are only a few of them to know, and the speedup from avoiding a dv_atom() function call is worth it.

In the less frequent cases where we still need to go to the lookup table, I optimized that path too. I already had an index of offsets, but that wasn't good enough. I also needed an index of lengths, because string comparison is expensive: when checking if two strings are equal, if you know their lengths, you can check if they're equal length FIRST, and rule out any obvious mismatches before slowly checking character by character. So now I zip through the length index first, and only if I find a string length match do I look up the string's starting offset in the offset index. And because the length is known, I can use memcmp() instead of strcmp(), which is another speedup.

Final trick: because I'm zipping through potentially a lot of the length index looking for matches, the main limitation is memory bandwidth: the CPU can work faster than memory can keep up. To solve this, I made lengths an 8-bit number instead of a full size_t. This means we can check 64 string lengths per cache line. The tradeoff is that the longest length we can represent is 255, so the longest atom texts we can store are 255 characters. Which is fiiiine, because atoms are supposed to be for relatively short #TEXT_TAGS_LIKE_THIS.
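Putting the length index, offset index, and memcmp() together, the search looks roughly like this (the table layout and names are my own sketch of the technique, not Prone's actual structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: scan a dense array of 8-bit lengths first
 * (64 per 64-byte cache line), and only touch the offset index and
 * the actual string bytes when a length matches. */
typedef struct {
    const char     *text;  /* all atom texts, concatenated          */
    const uint32_t *offs;  /* offset index: start of each atom text */
    const uint8_t  *lens;  /* length index: 1 byte each, max 255    */
    size_t          count;
} AtomTable;

/* returns the atom's table index, or -1 if absent */
static long atom_find(const AtomTable *t, const char *s, uint8_t len) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->lens[i] != len) continue;  /* cheap reject, no memory chase */
        if (memcmp(t->text + t->offs[i], s, len) == 0)
            return (long)i;               /* lengths known: memcmp, not strcmp */
    }
    return -1;
}
```

The hot loop only reads the `lens` array, which streams through cache very efficiently; the pointer chase into `offs` and `text` happens only on the rare length hit.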

Next time I'll talk about the parser updates. But don't forget the small string optimization (SSO) trick we just used here for Atoms... it's going to come back later in unexpected ways.