Despite still being in recovery from toxic mold exposure[^1], other health issues that will probably require surgery at some point, and being in the process of moving... I really should post about the tech I make on the tech blog I write. Which means the _Quest for Half a Milli_ series is back!

In fact, this is the penultimate chapter: the next one, about reference counting optimizations, will be the last. I really have enjoyed writing this series (and the principal work behind it), but it'll be nice to get to do other stuff in the Prone space. I look forward to getting to talk about the platform abstraction work, for example, which inches me a bit closer to being able to run an interactive Prone interpreter on my ESP32 board, and also let me fix a performance regression.

None of that is today's topic, though. We're here to talk about Dynamic Values (DVs), and how I made things faster by changing what their bits mean.
## Swinging both ways

It's easy to say "DVs are tagged pointers", but that's half a lie. _Some_ are pointers that are tagged with what type of object they point to. The tag lives in the bottom four bits.

```c
// [ 60 bits of pointer goodness ][ 4 bits of tag ]
// [                64 bits total!                ]
DV my_dv = ...;
uint8_t tag = my_dv & 0b1111;
```

This relies on every valid pointer _already_ having the bottom four bits be zero. Since we know that, we can safely store data in those bits, as long as we remember to clean that data up before actually using a DV as a pointer.

```c
// The bottom four bits are, indeed, zero.
String *str = malloc(sizeof(String));
assert((str & 0b1111) == 0);

// This means we can make a DV (Dynamic Value) like:
DV *tagged_ptr_to_str = (DV)str | DV_TC_STRING;
assert((str & 0xF) == DV_TC_STRING);

// Now, code can tell that this DV points to a String.
// This means we can make decisions based on the type
// of a pointer-y object in memory.
assert(dv_typecode(tagged_ptr_to_str) == DV_TC_STRING);
assert(dv_typecode(tagged_ptr_to_str) != DV_TC_LIST);

// We can cheaply zero out the tag bits again with a mask.
String *from_dv = POINTER_MASK & tagged_ptr_to_str;
assert(from_dv == str);
```

Like magic! Of course, it depends on part of the C standard that requires `malloc` to hand you memory that is (at least) 16 byte aligned. Think of it like rounding to the nearest hundred in decimal. If I give you some number that's guaranteed to be a multiple of 100, you always know the bottom two digits are zero. All this "alignment" talk is really just the binary version of the same idea, where it's multiples of 16, which gives you 4 **b**inary dig**its** (bits) of... you wanna field this one, Tim?

![[free_real_estate.png]]
> "It's Free Real Estate" - Tim Heidecker

Thanks, buddy.

**But not all DVs are pointers at all.** If the tag bits are all zero, that's treated as a special signal that we're actually talking about some kind of "inline" value. A DV is 64 of God's own bits, there's room. We still need to spend some of those bits on saying what type the value is, for example, is it a 16-bit signed integer, an 8-bit unsigned, etc. So the standard is, for inline values, the entire bottom 16 bits of the DV is type information.

```c
// Sure, we can fit a byte in there. Fuhgettabout it.
uint8_t my_u8 = 117;
DV inline_value = (my_u8 << 16) | DV_TC_U8;

// We can tell what type the DV is, and unwrap it again.
assert(dv_typecode(inline_value) == DV_TC_U8)
assert((inline_value >> 16) == 117)
```

Of course, if you remember, inline DVs need the bottom four bits to be zero, so all the typecodes for inline types end in "0000" on a binary level. This means we only have 12 bits to tell all the different kinds of inline values apart, but honestly, there's not that many types to worry about. 12 bits is more than enough. The real squeeze is for pointer types, which only get 4 bits of type info, and have to give up one of those precious 16 slots to indicate inline values.

At this point, I can show you how that `dv_typecode` function works, and it should probably make sense.

```c
uint16_t dv_typecode(DV dv) {
    bool is_pointer_mode = (dv & 0xF) != 0;
    if (is_pointer_mode) {
	    // Only the lowest 4 bits are type information.
	    // We need to zero out the other stuff!
		return dv & 0xF;
    } else {
		// Value mode, so the low *16* bits are type info.
	    return dv & 0xFFFF;
    }
}
```

The real version was a bit more compact than this reader-friendly reenactment, but honestly, in the eyes of a half-decent optimizing compiler that can turn simple `if` statements into branchless conditional moves:

![[same_picture.png]]
> "They're the same picture." - Pam, from The Office. US, obviously.

So, pros and cons. On the pro side, this scheme ends up mostly using fast instructions like shifts, ANDs and ORs, which is important because we're type-checking, wrapping, and unwrapping DVs _constantly_ as some of the most fundamental operations of the Prone interpreter.

It's not without cons, though! 15 pointer types is a tight budget, like "eating nothing but lentils this year" tight. And even though `cmov` is faster than branching instructions, they're not free - we do end up spending a lot of time in the `cmov` instructions of `dv_typecode` in practice, and that's always going to be the case as long as type checks have to navigate this two-mode system.

All of this, I could live with, but there was a final straw. Remember how the pointer tagging only works with pointers you get from `malloc`?
## We need to point at things we didn't get from `malloc`

Dang it! That's specifically the thing we can't handle in this design!

I actually teased this in the previous chapter of TQ4HAM:

> Now I pass these around as tagged pointers to the C function itself, saving an indirection on every native function call. This is the woman in the red dress. Do you see why yet? It's the subject of the next post, but you'll need time for your mind to... align to it.

Alignment puns. Truly the height of comedy.

So here's the problem. Most of the Prone standard library (at least right now) are functions written in C, and these will always be how the base layer of the interpreter works. That means we need some way to pass these functions around as objects that the Prone interpreter can understand: DVs.

This seems at first like it should work. We have pointers, we have a scheme for tagging them with a type (`DV_TC_NATIVE`), match made in heaven, right? Except that the starting addresses of these functions don't necessarily live at clean little multiples of 16. It's wherever your C compiler decided to put them. And telling your compiler to strictly align your functions to 16 bytes is... messy, and vendor-specific, and I never got it to quite work, but it probably would have been fragile if I did. Sometimes your pointer just ends in a 3 or something. That's life.

For awhile, I had a workaround, where I would allocate a struct object on the heap, which contained a pointer to the C function.

```c
// Again, simplified reenactment.
typedef struct {
  RC refcount;
  PtrToCode code;
} NativeFunction;

NativeFunction *add = malloc(sizeof(NativeFunction));
*add = { .rc = 1, .code = &op_add };
DV add_dv = add | DV_TC_NATIVE;
```

This... worked. But it was expensive and annoying. The C functions themselves have what's called a "static lifetime" - they exist the whole time the program is running. But these wrapper objects have to be allocated on the heap, and have reference counting, and when you want to actually use the C code they're talking about, you have to:

1. Untag the pointer to the wrapper.
2. Fetch the data at that pointer to find `.code`.
3. Jump to `.code`, which will probably be a second fetch.

Modern CPUs are okay at doing Step 1 really fast, but they _hate_ stuff like Step 2. It's like when your dad gets you a gift card for Christmas, but it's in five layers of nested boxes. Hardware engineers are actual wizards, and if your CPU can predict what memory you need to fetch from RAM, you will be rewarded with incredible execution speed. "Pointer chasing" - having to go to Point A to find out where Point B is - makes this magic impossible. When I talked about saving an indirection? This was that indirection.

For a fun bonus, generating the standard library as a dictionary-like object meant creating a _lot_ of these wrapper objects. And in the test suite, we create and tear down the standard library multiple times. Real-world use cases would generally only feel this setup cost once, but they'd probably feel that indirection cost _more_ than the test suite does. It's miserable either way.

So getting rid of these wrapper objects was great, and I'll explain how I did it shortly, but being able to store non-aligned pointers unlocked one more key performance boost: stack allocation.
### A side tangent about stacks and heaps

Nearly all programming since the 1980s has been _structured programming_, either in C or a higher level language. This sounds fancy, but mostly, it means that we define software as a bunch of nested side quests.

> In order to run my web server program, we first initialize, then we run a connection loop, then we clean up.
> 
> In order to initialize, first we parse the command line arguments, and then we bind a socket, then we print a log line saying we're ready and serving on whatever port.
> 
> In order to parse the command line arguments...

You get it. This mindset is accepted now, but it was very controversial when C was new and everyone was used to writing raw assembly.

The cool thing about structured programming is it lets you keep track of all this nested work with a very fast data structure: the Stack. While you're running some function, you extend the stack a little further to make room for the temporary data that function needs. When that function ends, you shrink the stack back again. This is about the fastest way you can possibly allocate memory dynamically - there's a register in your CPU for the active end of the stack, and allocating and deallocating are just making that register bigger or smaller with single-cycle math.

The other kind of dynamic memory is the Heap. When I talk about `malloc`, this is the kind of memory being managed. The heap is a lot slower, because it has to keep track of where there's room for new objects. Heap allocation is one of the worst bottlenecks of the Prone test suite. Instead of a single instruction, you're running a complicated function full of heuristics and bookkeeping.

The heap is more flexible (because it lets objects outlive the functions that created them), and sometimes you need that flexibility. But when you _know_ you don't need that flexibility for an allocation, you can use the stack, and not pay the `malloc` speed tax.

The thing about stack allocation is that it's _another_ thing that happens outside the protective grace of guaranteed 16 byte alignment, just like the function placement issue from earlier. You might end up being okay by coincidence, especially because every C type has its own alignment requirements (so some types will always be 16 byte aligned or more, even on the stack), but it's flimsy, and can be architecture-specific, and you just... you really shouldn't do this.

So, if you're wondering why I kept mentioning in previous posts that all objects in Prone _had_ to be heap-allocated instead of stack-allocated, now you know why. We couldn't safely tag pointers to stack-allocated objects because they might not be aligned right. Now, we don't have to care about pointer alignment, so we can avoid a costly `malloc` for temporary objects in some performance-sensitive functions![^2]
## What's the catch?

Okay, this is going to seem a little gross at first. On a 64-bit system, pointers are 64 bits wide. It would be easy to assume that all 64 of those bits are sacred, because they might contain important information you can't afford to corrupt.

![[wellyes.png]]
> "Well yes, but actually no." - Hugh Grant (misquoted)

It turns out that, in practice, most hardware actually only cares about the lower 48 bits, and even though there's new hardware pushing that number a little higher, we'll probably always have 8 bits at the very top to work with.

This sounds odd, but it's a practical decision. 64 bits is overkill for address space, by a lot. We're talking 36,893,488 _terabytes_ of RAM. It's easy to think "hey, number go up before, maybe we'll have supercomputers like that someday" - but supercomputers don't work like that. They're actually clusters of smaller computers. Still very impressive computers, especially the high-performance networking between them, but you would never need that much RAM in a single node, because you hit other single-node bottlenecks way before that.

In light of this, lots of programming languages take advantage of these high bits to store type information, exactly the way I'm looking to do. This means a hardware vendor couldn't start using those high bits without breaking those languages, many of which have deep-pocketed investors like Google, Microsoft, or Meta. This means, no matter how small-time _my_ language is, I get to take shelter in the shadow of giants.

![[you_mess_with_one_of_us.png]]
> "You mess with one of us, you mess with all of us!" - a brave, unnamed New Yorker from Spider-Man (2002), carrying a murder weapon from Clue (1985)

In fact, some architectures have special support for doing these type of shenanigans more efficiently (by masking out tag bits in hardware without even needing an extra instruction for it), like ARM64's "Top Byte Ignore." So although repurposing high bits feels sketchy at first, this method of pointer tagging is about as officially sanctioned as you could need it to be.

I did run some experiments while trying to figure out whether to use 8 or 16 bits of tag info, and whether to put it at the top or the bottom. The 8-bit options were extra-safe and allow more room for inline data (`IText` in particular gets to be 7 chars long instead of 6), so I ended up committing to that, but there was still the question of whether to store the tag bits at the top or the bottom (in the encoding sense).

```c
// [ lower 56 bits of pointer data ][ 8 bit tag ]
// [ 8 bit tag ][ lower 56 bits of pointer data ]
//
// Same difference, really. Just different instructions
// to pack or unpack them. We're losing the unused top
// 8 bits of pointer data either way.
uint8_t low_8_get_tag(DV dv) { return dv & 0xFF; }
uint64_t low_8_get_val(DV dv) { return dv >> 8; }
uint8_t high_8_get_tag(DV dv) { return dv >> 56; }
uint64_t high_8_get_val(DV dv) { return dv & 0xFFFFFFFFFFFFFF; }
```

I was only able to test on an x64 machine, where I wasn't surprised to see that `LOW_8` was the most consistently fast option, but I was surprised how close the race was, and how many iterations I had to benchmark to tease out a difference! x64 is really pretty good at truncating registers, for historical reasons, so it makes sense that grabbing the bottom byte of a register is cheap. On ARM64, I suspect that the ideal choice is `HIGH_8` to take advantage of Top Byte Ignore. But all of these are quite cheap.

![[dv_perf_graph.png]]
> Time to run the Prone test suite with a crazy high number of in-process repetitions. Lower scores are better. The `LOW_8` option doesn't win every round, but the high options have consistency issues, and `LOW_8` beats `LOW_16` in almost every round.

This isn't the most methodologically rigorous benchmark ever, but it was good enough for what I needed.

A few final notes:

* Every option did consistently better than the legacy format, which did so poorly in comparison that I ruled it out before I even started making graphs in LibreOffice Calc. Not that I could have stuck with that format anyways, given its allergy to alignment issues!
* We now have a shared 256-slot space for possible typecodes. This gives pointer types way more breathing room than the old 15-slot limitation, even when sharing the space with the inline types.
* I'll talk about this more in the final TQ4HAM post, but this enabled a really nice optimization for determining by typecode whether an object is refcounted or not.

I don't know when the edit for the final post will be finished, since I'm stretched so thin in my life right now. That said, I'm looking forward to sharing it! It'll be ready when it's ready, I suppose.

-----------------------------

[^1]: Actually, on this note. I was getting firsthand and secondhand exposure to the fungal nightmare realm from inside the unsealed walls of my apartment for almost a year, and in that entire time, I didn't turn into a raging transphobe. So I don't want to hear any excuses for J. K. Rowling.

[^2]: Only where we can make lifetime guarantees, of course. I've made some stack-related API decisions to help catch human error here, but that's for the final episode.