Performance Has a Harsh Mistress (LLO Archive)
Created 2025-11-30, last modified 2025-11-30
Part of my archive of Layover Linux Official posts on Tumblr.
2025-06-17
Performance has a harsh mistress, and her name is Dynamism. But we'll get back to that. Have you met Occam's Genie yet?
No? Oh. Well. He's not much of a conversationalist, but quite a useful friend. Genies have a reputation for being maliciously compliant, finding the most spiteful way to grant a wish. My boy Occam is different. Describe your requirements, and he'll give you the simplest thing that fulfills them. It's kind of his whole schtick.
So imagine you're tasked by your morally grey employers to pollute the urban landscape with an ad for the latest Marvel movie at a bus stop, and you have the good sense to subcontract Occam's Genie to do it for you. You do have to be specific in your requirements - "include the name of the movie, don't just announce there's a new one. It should be in color." And so on, so forth. The Genie will never try to subvert you, just... his job is bare minimum complexity, he's very dutiful about it, so you need to be dutiful about stating what can't be optimized away.
Pretty soon, you have a lovely still image of ad creative printed on plain paper. You might need to ask for glossy, but overall, you're feeling quite proud of yourself, until your boss pops her head in and says: "Oh yeah and it needs 3D embossing around the characters and objects." Suddenly, you have a more complex production process for a more complex output. You need dies or other tooling to stamp a light 3D effect onto the paper. You need to key the embossing and the print so that you don't get alignment issues. The frame you put the ad into may need to have a bit of an air gap to allow room for the new ridges and rises. You ask this of the Genie, and he complies, but it costs more.
The boss reappears with a new requirement: "The exploding Big Ben in the background? We need the hands on the clock to actually tell time." You call your Genie, and he quotes you a new price, to account for the machinery involved to embed an analog clock into your display.
Turns out it needs to display the live, accurate time in London, and the company doesn't trust technicians to calibrate an analog clock to a distant time zone. So now your display needs to have digital innards behind the mechanical hands, to synchronize with GPS to get the current UTC time, do some conversions in the logic to get London time, and be able to self-calibrate (maybe with encoders) to point the hands appropriately. It's starting to matter how we power the display, so you'll need a battery pack or to hook into the power supply at the bus stop, whichever is simpler.
The order comes down from on high: we need animations. The ocean waves endlessly crashing down, the helicopter hovering with its blades spinning. Now we really do have to resort to the entire display being a flatscreen panel, with no moving hardware parts, and rendering Big Ben's hands in software (along with everything else). What internal chipsets are going to be appropriate for the job without being overkill?
"Oh and we need the protagonist's eyes to track anyone looking at the display." Guess we're putting a camera on there! And OpenCV! At minimum.
Obviously, every time we raise the bar on requirements, it forces us to reconsider the simplest technology that still fits the requirements. But I want to say something more specific than that. From a production and operation costs perspective, the measurement of this complexity is "how much runtime dynamism is needed?"
Even embossing is about looking different from different angles (conditional behavior!); it's just simple enough that you can still solve the problem with an inanimate object. But as the amount of dynamic behavior you need goes up, it defines the complexity floor of your solution space. It doesn't affect the complexity ceiling - you could, of course, solve the simple poster problem with a gaming PC rig, and it would work, it would just mark you as insane. But the complexity floor will always be defined by how much runtime conditional behavior you need.
The rubber hits the code
Software engineering is where this rule is on clearest display. Let's look at the following source code:
#include <stddef.h>

// Defined elsewhere. Has some kinda panic logic.
void report_fatal_error(const char *msg);

typedef struct {
    size_t len;
    int buf[];
} IntArray;

// Nice safe API so out-of-bounds accesses will be caught and reported
int get_item(IntArray *arr, size_t i) {
    if (i >= arr->len)
        report_fatal_error("Bounds check failed! We're all gonna die!");
    return arr->buf[i];
}

int sum(IntArray *arr) {
    int total = 0;
    for (size_t i = 0; i < arr->len; i++)
        total += get_item(arr, i);
    return total;
}
The get_item function is, in theory, pretty nice - you could have a policy to always use this safe API, and it would start to look weird and wrong to see any raw accesses to arr->buf[i]. Then you can just not worry about bounds errors!
But in the sum function, we can statically reason something: we're never going to have a bounds violation here. We're iterating from 0 to arr->len - 1. Every time we call get_item from sum, we'll be in a valid range. So we'd hope and expect our compiler to be smart enough to notice this too. On a modern compiler, we will probably get the correct simplest possible machine code output: get_item is inlined, the compiler has the context to say that the report_fatal_error branch never happens, and that branch gets completely eliminated. You should get the same binary as if you wrote:
total += arr->buf[i];
Because to your optimizing compiler (a bit of an Occam's Genie itself), shaving away the bounds check does not affect the correctness of the program. We don't need it (here). So we leave it out. And the magic of the compiler is that it can automate these decisions, so we can use safe zero-cost abstractions all throughout our source code, and the compiler will unwrap whatever it safely can. If you plug my sample into Compiler Explorer, you'll find that on low optimization levels, the compiler will translate your source very literally and sum will still call get_item, but with optimization turned on to -O2 or higher, sum's assembly code is totally self-contained and does not check boundaries. That tiny little dynamic behavior was optimized away, and it's less work that has to happen at runtime.
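To make "same binary" concrete, here's the hand-written C equivalent of what the optimizer effectively reduces sum to after inlining and dead-branch elimination. This is my own sketch, not actual compiler output, and make_array is a hypothetical helper for building test inputs:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t len;
    int buf[]; // flexible array member
} IntArray;

// What the optimizer effectively reduces sum() to once get_item
// is inlined and the provably-dead bounds branch is deleted.
// (Hand-written equivalent, not real compiler output.)
int sum_as_optimized(const IntArray *arr) {
    int total = 0;
    for (size_t i = 0; i < arr->len; i++)
        total += arr->buf[i]; // no bounds check left
    return total;
}

// Hypothetical demo helper: heap-allocate an IntArray holding
// n copies of `fill`.
IntArray *make_array(size_t n, int fill) {
    IntArray *arr = malloc(sizeof(IntArray) + n * sizeof(int));
    arr->len = n;
    for (size_t i = 0; i < n; i++)
        arr->buf[i] = fill;
    return arr;
}
```

The interesting part is that you never had to write this unsafe-looking version yourself; the compiler derived it from the safe one.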
In fact, let's add one more function and see its assembly.
int sumfour() {
    int numbers[] = {1, 2, 3, 4};
    int total = 0;
    for (size_t i = 0; i < 4; i++)
        total += numbers[i];
    return total;
}

/* Assembly (x64, -O2):
sumfour():
        mov eax, 10
        ret
*/
I don't think you have to be an assembly guru to notice that this function is clearly just return 10. There's no for loop. There are no stack allocations. It's just return 10. The compiler was smart enough to do the math at compile time. And I want to emphasize - if we didn't know the contents of numbers ahead of time, if that was a runtime-provided value, we would need to give different answers for different inputs (dynamism), so we would need a lot more instructions to do runtime labor.
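For contrast, here's what the same loop looks like when the contents only arrive at runtime. This sumn is a hypothetical illustration: the compiler can't fold it to a constant, because different inputs demand different answers, so it has to emit a real loop:

```c
#include <stddef.h>

// Runtime version: numbers and n are only known when the function
// is called, so no "mov eax, <answer>" is possible. The compiler
// must emit instructions that do the summation every call.
int sumn(const int *numbers, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += numbers[i];
    return total;
}
```

Same source-level loop as sumfour, wildly different machine code, purely because of where the data's dynamism lives.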
Side note: a lot of modern software is highly dynamic on purpose. One build of
nginxcan serve lots of people with different configurations, even if nobody uses all the available features. There's a lot of "if" statements that, on a given webserver, always go one specific way. This is why I like to argue that the branch predictors of modern CPUs are a very lite, constrained type of JIT compilation. The hardware is using runtime statistics to let us keep our dynamic little "if" statements, but execute them at the speed of certainties. As long as the prediction is correct, this is a huge speedup. But mispredicts are expensive, and programs start with no prediction data, so static knowledge is still more powerful than runtime heuristics when you have the knowledge.
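The classic way to feel this is the sorted-vs-shuffled benchmark: the function below does identical arithmetic either way, but the branch inside it is nearly free on sorted input (all "no" then all "yes") and mispredicts constantly on shuffled input. A minimal sketch of the experiment, timing harness omitted:

```c
#include <stddef.h>
#include <stdlib.h>

// Branch-predictor demo: sum only the elements >= 128.
// The result is independent of element order, but the branch's
// predictability (and so this function's speed) is not.
long sum_over_threshold(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= 128) // the branch under test
            total += data[i];
    return total;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

// Sorting first changes nothing about the answer, only about how
// often the branch predictor guesses right.
void sort_ints(int *data, size_t n) {
    qsort(data, n, sizeof(int), cmp_int);
}
```

Wrap each call in your favorite timer over a few million elements and the sorted run wins by a wide margin, despite doing exactly the same work on paper.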
But the rabbit hole goes deeper
Is a ROM chip static? The contents can't change. That seems pretty static to me.
But wait, you can't access all the contents at once. You have to send a memory address on some input pins, and get the result on some output pins. ROM is really doing this:
static const char DATA[] = { /* ... */ };

char get_byte(size_t address) {
    return DATA[address];
}
And that's conditional behavior, isn't it? The output pins depend on the input pins, even if the results are deterministic. In fact, the performance bottleneck of ROM chips is how fast they can respond to a change of the input pins, producing the correct voltages on the output pins. When you cut out all the fat, all that's left is the cost of... dynamism.
It really is turtles all the way down.
Where this connects to my recent work
I'm working on a programming language called Prone. It makes a lot of design decisions that try to push computation into the compile step really aggressively. One of those decisions is the emphasis on value types. You'll notice in my sumfour example, those crazy optimizations were only possible because the data was in a format that the compiler could recognize as entirely static. We know the types, the contents, everything. Value types work like that! They allow more stuff to be recognized by the compiler as non-dynamic/hardcode-able, so you can eliminate entire deep call stacks by working them through at compile time and just baking the results into the runtime version of the program.
Functional programmers will recognize this as "the power of pure functions." I don't technically disagree, but I find "purity of data" a more useful framework than "purity of functions." That's worth a blog post on its own.
This is a particularly impressive feature for asset pipelines. You can load a bunch of images at compile time, do whatever tweaks and modifications you want, producing grids of raw RGB data. The compiler will automatically hardcode that RGB data into your final binary, meaning that when you need to send textures to the GPU at runtime, you're just accessing memory locations in the read-only parts of the binary. No file accesses at runtime, no JPEG decompression library, no image manipulation functions, just array access into stuff that's already memory mapped by the OS (because that's how running a program works). Can you tell that video games are a primary use case I'm holding in mind for this language?
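In C terms, the end state looks like committing the output of xxd -i into your source: the pixels are just bytes in the binary's read-only data, and "loading" a texture means returning a pointer. A toy sketch with a hypothetical hand-written 2x2 texture standing in for real pipeline output:

```c
#include <stddef.h>

// Pretend an asset pipeline already decoded, tweaked, and flattened
// a 2x2 RGB image at compile time. The result lives in the binary's
// read-only data; no decoder, no file I/O, nothing to do at runtime.
static const unsigned char TEXTURE_2X2_RGB[] = {
    255, 0,   0,     0, 255, 0,   // red,  green
    0,   0,   255,   255, 255, 0, // blue, yellow
};

// "Loading" the texture is just handing back a pointer into .rodata,
// ready to be passed to a GPU upload call.
const unsigned char *texture_data(size_t *out_len) {
    *out_len = sizeof TEXTURE_2X2_RGB;
    return TEXTURE_2X2_RGB;
}
```

The whole image-processing toolchain existed only at build time; the runtime program never knows it was there.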
My goal is to create the most aggressive Occam's Genie compiler possible. One which removes all runtime dynamism except that which is logically required, automatically, producing a program which is as static as provably possible. If I used objects and private fields, this would be a lot harder, as these features make the compiler a bit blinder to what can be optimized.
Let's look at a bit of Prone implementation code:
// args is terminated by a TERM sentinel value.
Ast op_plus(Ast *args) {
    // Exactly one argument: + acts as the identity.
    if (!AST_IS_TERM(args[0]) && AST_IS_TERM(args[1]))
        return args[0];
    // Exactly two u8 arguments: fold them into one u8.
    if (AST_IS_U8(args[0]) && AST_IS_U8(args[1]) && AST_IS_TERM(args[2]))
        return AST_U8(AST_AS_U8(args[0]) + AST_AS_U8(args[1]));
    // Anything else is an unsupported combination.
    return AST_ATOM("DISPATCH_ERR");
}
This is the current source code for the + operator, which can also be used as op::plus(arg1, arg2, ... argN). You'll notice that it has to handle any number of arguments, which may be any type (and only a few combinations can result in sensible output). So this code is making a bunch of late-game decisions about whatever random junk you feed into it, and as a consequence, the + operator can be slow as it figures out its multiple dispatch.
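To make the sentinel convention concrete, here's a self-contained toy version of the same pattern. These are my own stand-in types, not Prone's actual definitions: a fat tagged union on the dynamic side, concrete uint8_t arithmetic on the static side.

```c
#include <stdint.h>

// Toy stand-ins for Prone's Ast machinery (not the real definitions):
// a tagged union wide enough to hold any value, plus a TERM sentinel
// tag that marks the end of the argument array.
typedef enum { TOY_TERM, TOY_U8, TOY_ATOM } ToyTag;

typedef struct {
    ToyTag tag;
    uint8_t u8;
    const char *atom;
} ToyAst;

#define TOY_IS_TERM(a) ((a).tag == TOY_TERM)
#define TOY_IS_U8(a)   ((a).tag == TOY_U8)

// Same shape as op_plus: dynamic dispatch on the outside, concrete
// uint8_t addition buried at the center.
ToyAst toy_plus(const ToyAst *args) {
    if (!TOY_IS_TERM(args[0]) && TOY_IS_TERM(args[1]))
        return args[0]; // one argument: + is the identity
    if (TOY_IS_U8(args[0]) && TOY_IS_U8(args[1]) && TOY_IS_TERM(args[2]))
        return (ToyAst){ TOY_U8, (uint8_t)(args[0].u8 + args[1].u8), 0 };
    return (ToyAst){ TOY_ATOM, 0, "DISPATCH_ERR" }; // unsupported combo
}
```

Notice how much of the function is spent discovering what it was handed; only one line actually adds numbers. That discovery work is exactly what compile-time simplification gets to delete.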
While early versions of Prone will use the execution engine at runtime, even then, builds are going to simplify the AST tree before producing C code. So if your Prone source was something like this:
print(@runtime, u8(1) + u8(2))
You would still get a transpiled C source like this:
#include "prone.h"
int main(int argc, char **argv) {
    evaluate(prone_stdlib(argc, argv),
             PRONE_FNCALL("print", PRONE_IDENT("@runtime"), 3));
}
And not like this:
#include "prone.h"
int main(int argc, char **argv) {
    evaluate(prone_stdlib(argc, argv), PRONE_BLOCK(
        PRONE_STMT_EXPR(
            PRONE_FNCALL(
                "print",
                PRONE_IDENT("@runtime"),
                PRONE_FNCALL("op_plus", AST_U8(1), AST_U8(2))
            )
        )
    ));
}
Even on a trivial example like this, you can see the difference that comes from simplifying the tree as much as possible during compilation. And like, this is still running with the interpreter engine doing all sorts of dynamic stuff! "AST" objects are 64 bits wide, which is a crazy way to box up 8 bit numbers if you can know in advance that they're 8 bit. But we still saved the runtime cost of parsing the Prone source, calling op_plus, etc.
This is about what I expect from the Prone 1.0 milestone - that at runtime, we still use all the fully dynamic machinery of the Prone execution engine (including the fat, generic AST instances), but on a pre-digested version of the program state. Where it gets really exciting is Prone 2.0. You might notice that the whole high-level call stack is dynamic machinery, but at the leaves of the call tree we drop into manual C with concrete types. The op_plus builtin is an example of this "dynamic interface to static code" pattern. Prone 2 code generation will be all about noticing opportunities to turn small chunks of dynamic logic into new native functions, and continuing to eat away at that border between the Prone execution engine and raw C until we've made things as native as possible. So, producing C like this:
#include <stdio.h>
int main(int argc, char **argv) {
    printf("%d\n", 3);
}
No evaluate function. No #include "prone.h". Truly boiled down to the simplest little C program that fulfills the behavior we expect. This is going to be a lot of hard work, which is why I want a whole major version to achieve this level of optimization after getting the core language working in a more naive way with Prone 1.
If this works out, I can give the world - and myself - a language that feels like something from the Python/Ruby/Perl family of scripting languages (garbage collection and all), but which compiles to very efficient machine code that's often suitable for embedded, or audio processing, or web servers, or PC games.

