You might already know all this, but I'll recap just in case:

 - `rsync` is an incredible piece of software (that has been mostly stable and feature-complete for years), which allows you to make files over yonder match the files you have over here, without copying the data that already exists over yonder. It would be difficult to count how many people trust `rsync` for backups and other tasks.
 - It was originally created by Andrew Tridgell, who commonly goes by "Tridge," who's been an open source hero for many people for many years.
 - `rsync` has been maintained by a few people over time. Recently, the person who had been maintaining it wanted to step down, and asked Tridge to come back and take over the project again. Tridge agreed.
 - Tridge didn't really want to be in charge of `rsync` again, as he's getting into the retirement phase of his life, and like all open source projects of any note these days, the issue tracker was getting hammered with a flood of LLM content (including many vuln reports that are high effort to verify). So, he decided to use LLMs to help with some of the stuff he saw as worthy long-term investment in the project, for example, rewriting the test suite. There's a very Thanos-like rationale to it, in a way: "I used the AI to fight the AI." Or alternatively, if you can't beat 'em, join 'em.
 - The new code had problems that were obscured by the test suite rewrite.
 - That's quite bad, but ethics issues of using LLMs aside[^1], these problems could have been caught by more adventurous early-adopter users, had the new version of `rsync` been published as a "release candidate", or even a major version bump. If you've heard of semantic versioning (AKA, semver), that exists to help in situations like this[^2], to manage the blast radius of software updates intuitively.
 - Tridge released the new broken versions of `rsync` as minor versions, which is the semver way to say "no biggie, this won't break nothin'." Downstreams like Linux distros had no obvious reason not to "take him at his word" about these being non-breaking changes.
 - People's backups broke. They were, quite reasonably, upsetti spaghetti when this came to light, both for ethical and stability concerns, and the feeling that these new hazards had been snuck in quietly.
 - Some people are trying to offer alternative, pre-slop versions of `rsync` under different names, although we can expect these efforts to make it into package managers unevenly and slowly. This is probably the long-term future of the project, but in the short term, it's a mess.
 - Some people, mostly _not_ the affected downstream users but trolls who just like drama and excuses to bully people (lot of chan presence, from what I've heard), have apparently spammed the `rsync` issue tracker with horrible messages and images, to enjoy making Tridge's life a living hell.
 - Tridge responded in a [blog post](https://medium.com/@tridge60/rsync-and-outrage-d9849599e5a0), which... didn't really help. It doesn't have much to rebut the people who warned about the things that can go wrong with AI even if you're a careful seasoned pro, which isn't really counterargued with "you don't understand, I'm a seasoned pro and I was _so_ careful!" It also doesn't really stop the trolls, who will lap up almost any acknowledgement as fuel. To my read, the moral ends up being "you are not immune to slopaganda", even if that wasn't the intended message of the author.

So yeah, that's all a big freakin' mess, huh? And for people trying to use a minimum of LLM-contributed code[^3], especially across multiple different package managers, in the short-term chaos of trying to figure out which package managers have good `rsync` vs bad `rsync`... the best option feels to be getting away from any `rsync` at all for a while, until things settle down and slop-free forks are widely available across distros.

It me. I'm people.

So now I have to take a deep breath and explain what _I_ was using it for: this website. Specifically, these days my website is statically generated by a Python script I wrote myself. As I've said before, you don't need an off-the-shelf SSG these days, it's not hard to make your own. But that gets generated locally on my own laptop (Katarina, running on Arch packages), where I can locally preview and do local quality control, before finally publishing to the real server ([a VPS in Germany](/blog/fuller-stack/0027-hallo-welt) called Alune, running on Debian packages) where you all can read it.

Do you see it? I have files, and they're on my computer, and I want to update Alune to have the same files and contents, without sending a bunch of data to Germany for things that were already up-to-date.

Yeah. I was using `rsync` for the copy stage.

So, I needed a different way to copy the files. `scp` would work in a pinch, but it's not a great long-term infrastructure choice, because that copies everything every time. Someone suggested `unison`, but that's really for syncing in both directions, and it'd take some finagling to get it to only update in a Katarina->Alune direction, and I'd never fully trust I got the flags right (plus, `unison` does some persistent state tracking that I've found to be messy). But there *is* an option here that ended up making a lot of sense: `git`.

Even if you're not a programmer, you have probably heard of `git`, or at least GitHub, a code hosting site named after `git`. It's a version control system, which - to oversimplify it - means that it stores a history of snapshots, not just the latest version. Nearly all software development is done in `git`, and my website is already developed in a `git` repo (the generator, not generated contents, or the Obsidian vault where most of the generator input data comes from). So why wasn't I doing this before?
## The downsides

These all ended up solvable, but, okay hear me out first.

Git can handle large files, but it doesn't like them. One big reason is that it stores the whole history. For large text files, this isn't the end of the world - those files change slowly, and `git` is smart enough to be able to store (or at least transfer) those as a series of differences between versions, instead of `MassiveFileVersion1, MassiveFileVersion2...` all being totally separate copies, taking up tons of room despite having most of their content in common. But for binary files like images and audio, which are usually already compressed, that trick rarely finds much in common between versions, so you do end up effectively storing each version as if it were a new file.

I also have files on my website that I consider semi-sensitive, due to their personal nature. Not so much that I'm stressed about them passing in cleartext through Cloudflare - I'm trying to get away from those goons for other reasons, I don't think they care about a dear diary or two of mine - and anyone who knows the URL can see them, but I wouldn't want the names, let alone the contents, to just be publicly readable. Think, "unlisted on YouTube" privacy semantics here. These days, `git` usually entails having some service like Sourcehut (love 'em!) be a central hub for you to push data from one place, and pull it from the other, which gets a little hairy for sensitive info.

The sensitivity problem gets even worse in combo with the full history model. Here's something that genuinely happens sometimes to the nerds in your life: a secret credential ends up accidentally being committed into a repository's history. If you want to erase it, now you have to rewrite a bunch of history, which breaks a bunch of tooling assumptions, because you just hopped over to a parallel timeline where the secret was never revealed. But you also have to scrub all the places the secret might have ever shown up, which is a nightmare. If at all possible, the best thing you can do is give up on that secret, change your password or whatever, and just accept that the original is forever publicly compromised. Not something that makes sense for a diary entry.

I never rigorously examined these things, really. It's just a situation where a lot of usually-correct rules of thumb said "nope, bad idea, do it a different way."
## Actually looking closer

Rsyncageddon finally got me to reexamine which of those rules actually mattered, and I'm glad I did. First of all, you don't technically need a central server - `git` is first and foremost a _decentralized_ version control system. I could have a repo that I just directly sync from Katarina to Alune, with no GitHub or Sourcehut or such, and that solves a lot of problems in one fell swoop.

Large files? Eh, who cares, it's a totally private repo and the history doesn't really matter - I could truncate the history any time I want, if it gets big enough to be a problem.

Sensitive files? Well it's still only on Katarina and Alune, and transferred over `ssh` the same way `scp` would.

Sensitive files being impossible to erase from the history? Again, the repo's properly private and the history is disposable - the reason history is normally so sacred is because that makes it possible to collaborate between multiple developers, which is irrelevant to me, since I'm just abusing `git` as a transport system.
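For what it's worth, "truncating the history" could look something like the sketch below: start an orphan branch from the current files, commit once, and swap it in for the old branch. This is a dry run that only builds and prints the `git` commands. The branch name `main`, the remote name `alune`, the temporary branch name, and the function name are all assumptions for illustration, not details of the real setup.

```python
import shlex


def truncate_history_commands(repo="site", branch="main", remote="alune"):
    """Build (but don't run) git commands that squash a repo down to one commit."""
    git = ["git", "-C", repo]
    return [
        # New branch with no parent commits; the files on disk are left alone.
        git + ["checkout", "--orphan", "fresh-start"],
        git + ["add", "-A"],
        git + ["commit", "-m", "Fresh start: history truncated"],
        # Rename the new branch over the old one (-M forces the rename).
        git + ["branch", "-M", branch],
        # Overwrite the remote branch with the single-commit history.
        git + ["push", "--force", remote, branch],
    ]


if __name__ == "__main__":
    for cmd in truncate_history_commands():
        print(shlex.join(cmd))
```

Two caveats with this approach: the old commits aren't gone right away, since git keeps unreachable objects around until garbage collection prunes them, and any other checkout of the repo would need a reset rather than a plain pull afterwards, because its history no longer lines up with the rewritten branch.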

This ends up working great, because `git` has enough information to figure out what files are really changed _totally locally_ before ever connecting to Alune and sending the latest commits. It even means I can review and sanity check the changes to the generated site contents, if I want.
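That local review step could be scripted with `git status --porcelain`, which prints one line per changed file: a two-character status code, a space, then the path. Here's a rough sketch, where the function names and the sample paths are made up for illustration. Renames and oddly-named paths, which git formats specially, aren't handled.

```python
import subprocess


def parse_porcelain(output):
    """Turn `git status --porcelain` text into (status, path) pairs."""
    changes = []
    for line in output.splitlines():
        if not line:
            continue
        # Columns 0-1 are the status code, column 2 is a space, the rest is the path.
        changes.append((line[:2], line[3:]))
    return changes


def pending_changes(repo="site"):
    """Ask git what changed in the working tree; nothing touches the network."""
    result = subprocess.run(
        ["git", "-C", repo, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)


if __name__ == "__main__":
    # Hypothetical output after a regen: one edit, one new page, one deletion.
    sample = (
        " M content/index.html\n"
        "?? content/blog/new-post.html\n"
        " D content/old-page.html\n"
    )
    for status, path in parse_porcelain(sample):
        print(repr(status), path)
```

Untracked files show up as `??`, so a brand new page is easy to spot before it gets committed and pushed.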
## Making it happen

First of all, I worked out most of the kinks locally. I _had_ been generating all the site data to a directory called... well... `site`. So in the `mm4cc-web` repo checked out on Katarina, there would be `site/index.html` and `site/static/dillo.png`, etc. When I needed to regenerate the site, for local testing or final publishing (they generate slightly different contents, so cleanup is necessary) I'd delete the `site` directory and remake it with `uv run main.py`. This would never end up accidentally committed into the generator repo, thanks to `site` being listed in `.gitignore`.

This needed to change if I was going to have a long-lived git repo for generated content, so I needed to figure out where it would live. I decided to shift everything down and have a `site/content` dir, so instead of `site/index.html`, I'd be generating `site/content/index.html`. That meant `site/content` would be safe to delete, and `site` would never be deleted, which means the internal state management dir `site/.git` would always be left in peace.
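The cleanup half of that could be as small as the sketch below. It's a guess at the shape of the thing, not my actual generator code: the function name is made up, and the subdirectory name is a parameter so the helper only ever touches the generated output and never `site` itself.

```python
import shutil
import tempfile
from pathlib import Path


def clean_output_dir(site_dir, content_name="content"):
    """Wipe and recreate only the generated-content subdirectory of site_dir."""
    content = Path(site_dir) / content_name
    if content.exists():
        shutil.rmtree(content)  # drop stale generated files
    content.mkdir(parents=True)  # also creates site_dir on a first run
    return content


if __name__ == "__main__":
    # Demo in a throwaway directory, with a fake .git standing in for the real one.
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "site"
        (site / ".git").mkdir(parents=True)
        (site / "content").mkdir()
        (site / "content" / "stale.html").write_text("old build")

        content = clean_output_dir(site)
        print(sorted(p.name for p in site.iterdir()))  # ['.git', 'content']
        print(list(content.iterdir()))  # []
```

Since `site` is never removed, anything that holds a handle on it (the `.git` directory inside, or a mount pointed at it) survives a clean regen.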

That opened up an opportunity for improving my local hosting story anyways. You see, normally I bind-mount `site` into the docker container that's running `nginx` to host my site. That bind mount can get confused and unhappy if `site` itself is deleted, which I was doing, which meant I'd have to restart the container any time I regenerated the site contents "the fully clean way". Eugh. I decided to keep the mount itself on `site` (which binds to `/usr/share/nginx/html` in the container filesystem), but change the `nginx.conf` so that it would look in `/usr/share/nginx/html/content` for the site data from now on. This means technically the nginx container can see the `.git` internals at `/usr/share/nginx/html/.git`, but it's not serving anything out of there, so it doesn't matter. Now I'm never deleting the mounted directory, just a thing inside it, so the bind mount doesn't break, even on a clean regen of the data!

Honestly, most of the other local effort was just making sure all the different pieces of my Python script were putting files in the new correct place. After getting that all working, I turned `site` into a git repo of its own, and created an initial commit with all the generated data inside it.

On Alune, I shut everything down and pulled the latest versions of the `nginx` and `cloudflare/cloudflared` containers (might as well, while I'm here and need to take down containers anyways). I pulled the latest version of the `mm4cc-web` repo (mainly to get the updated `nginx.conf`), and here's where I remembered I needed to be a little smart:

There are broadly two kinds of git repositories, bare and checked-out. Most of the time you interact with a checked-out repository, where you have all the files sitting around and editable. A bare repository doesn't have those, but that restriction means that you can _push to it directly_, which git refuses by default for the currently checked-out branch of a checked-out repo. That's why bare repos are used for hosting, like that's how GitHub and Sourcehut work[^4].

What I really needed was both. I needed a bare repo I could push to directly from Katarina (which I set up at `~/site.git`), which kinda acts like a mini-git-server on Alune. And then I needed a checked-out repo in `~/mm4cc-web/site`, which can pull from the local `~/site.git` and update the local checked-out files. Once I set those up, I turned the freshly-updated containers back on, and it just worked.
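If you wanted to reproduce that two-repo arrangement, one plausible sequence is sketched below as a dry run: it only lists which command would run on which machine. I'm assuming an SSH host alias and git remote both named `alune`, a branch called `main`, and that the Alune commands run from the home directory. Those are illustration choices, not a record of exactly what I typed.

```python
import shlex


def setup_steps(ssh_host="alune", branch="main"):
    """List (machine, argv) pairs for the one-time setup; nothing is executed."""
    return [
        # Bare repo: no working files, so it can be pushed to directly.
        ("alune", ["git", "init", "--bare", f"--initial-branch={branch}", "site.git"]),
        # Point the local site repo at it. "host:path" is git's scp-style URL,
        # with the path taken relative to the remote home directory.
        ("katarina", ["git", "-C", "site", "remote", "add", ssh_host, f"{ssh_host}:site.git"]),
        ("katarina", ["git", "-C", "site", "push", "-u", ssh_host, branch]),
        # Checked-out repo for nginx to serve from; it pulls from the bare repo
        # next door. The target directory has to be missing or empty.
        ("alune", ["git", "clone", "site.git", "mm4cc-web/site"]),
    ]


if __name__ == "__main__":
    for machine, cmd in setup_steps():
        print(f"[{machine}] {shlex.join(cmd)}")
```

Pushing before cloning means the checkout starts out with real commits in it, instead of being cloned from an empty repository and having to sort out its branch later.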

So, part of the reason I'm writing this is that I need some new content I can use, to actually test out the publish automation, which should now work something like this:

1. Regen the site on Katarina to `site/content`
2. Create a new commit of the `site` repo
3. Push that commit to `~/site.git` on Alune
4. SSH into Alune and `git pull`, which should pull from `~/site.git`.

That sounds like a lot, and it would be annoying to do manually, but it shouldn't be bad as a series of steps rapidly executed in `make deploy`. This blog post should work as content to test out the automation code, so if you're reading this, I'm betting it worked!
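As a rough picture of what that automation boils down to, here are the four steps above as a Python stand-in for the `make deploy` target. It defaults to a dry run that just prints the commands. The SSH host alias and remote name `alune`, the branch `main`, the commit message, and the function names are assumptions, and the real generator may well need extra arguments for a publish build.

```python
import shlex
import subprocess


def deploy_commands(ssh_host="alune", branch="main", message="Publish site"):
    """Build the argv lists for one publish, in order."""
    site = ["git", "-C", "site"]
    return [
        ["uv", "run", "main.py"],              # 1. regen the site locally
        site + ["add", "-A"],                  # 2. stage every change...
        site + ["commit", "-m", message],      #    ...and commit it
        site + ["push", ssh_host, branch],     # 3. push to the bare repo
        # 4. update the checked-out copy on the server; --ff-only refuses
        #    to merge if the two histories have somehow diverged.
        ["ssh", ssh_host, "git", "-C", "mm4cc-web/site", "pull", "--ff-only"],
    ]


def deploy(dry_run=True):
    """Print each command, and run it too unless this is a dry run."""
    for cmd in deploy_commands():
        print("+", shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failure. Note that `git commit`
            # also exits non-zero when there is nothing new to commit.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    deploy(dry_run=True)
```

Stopping at the first failed step matters here: if the push fails, there's no point asking the server to pull.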

_epilogue: [yes it did](https://git.sr.ht/~maddiem4/mm4cc-web/commit/2e178a8f90da9c0b53d19dce9b3065b1f15b4144) :)_

[^1]: Only Rusties kids will remember this ;)

[^2]: Semver massively predates the LLM fad, and is not specific to agentic coding. We've been shipping bugs and breaking changes on the back of human error for a long time.

[^3]: Which is very hard to get to zero now that the Linux kernel includes slop. I do eventually plan to get off Linux for that reason, but that's obviously going to involve process and planning, similar to getting away from Cloudflare for my hosting. I am so tired of migrating things.

[^4]: Technically those are probably doing more clever custom stuff, but a little indie hosting situation would literally just be bare git repos that you push to over SSH.