Debugging when everything is on fire
Using AI to debug critical system failures
Have I mentioned that I am really lazy?
A few weeks ago I posted this note:
Got Claude Code to debug a memory leak that was causing my system to crash within a few minutes of the leak starting, by recreating the conditions for the leak, running Claude with --dangerously-skip-permissions, and telling it that it had a few minutes to figure out what was going on before the system died
It actually managed to figure it out by inspecting the jobs that were causing the system crash in real time. Here’s what it found:
So here’s the infinite loop:
Root runs: concurrently npm:test:ui npm:test:server npm:test:plugin
This executes: cd plugin && npm test
Plugin is empty (no package.json)
npm searches up, finds ROOT package.json
Runs ROOT’s test script again: concurrently npm:test:ui npm:test:server npm:test:plugin
This creates ANOTHER cd plugin && npm test
Infinite recursion!
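For the curious, here’s roughly what that setup looks like as a repro. This is my reconstruction from the transcript above, not the actual repo: the test:ui and test:server commands are guesses, and the only parts that matter are the concurrently script and the plugin script. (Don’t actually run npm test in it unless you want to recreate the crash.)

```sh
# Reconstructed repro of the loop (test:ui / test:server are guesses).
mkdir -p repro/plugin && cd repro
cat > package.json <<'EOF'
{
  "scripts": {
    "test": "concurrently npm:test:ui npm:test:server npm:test:plugin",
    "test:ui": "vitest run ui",
    "test:server": "vitest run server",
    "test:plugin": "cd plugin && npm test"
  }
}
EOF
# plugin/ has no package.json, so `npm test` inside plugin/ walks up,
# finds the root package.json again, and re-runs the root test script.
# Every pass spawns three more npm processes, and RAM evaporates.
```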
Took maybe 3 runs (and by ‘run’ I mean ‘system crash’ lmao). Each time the system crashed I would reload a nori claude agent with the previous transcript and go ‘system crashed, pick up where you left off’ and I’d restart the leak. Shockingly effective, end to end the whole debug took ~15 minutes.
My current day-to-day operating procedure is:
Figure out what I need to do
See if an AI can do it
If the AI can do it, tell the AI to do it
If the AI can’t do it, figure out how to make it so that the AI can do it
Repeat
My laptop is a piece of shit and will sometimes crash, freeze, or otherwise become unresponsive. It seemingly does this for all sorts of creative reasons, or no reason at all. The most common culprits are OOMs caused by:
Running too many vitests at once
Running too many Claude Code instances at once
Claude Code doing something stupid like calling the same tool a million times by accident
Doing too many builds at once
Chrome doing something dumb
I suppose this is inevitable when you have the equivalent of five to seven software engineers running your computer at the same time…the machine only has so many gigs of RAM.
Anyway, whenever my computer crashed I would spin up a nori claude instance and ask it to figure out and profile what had gone wrong. Most of the time it did a decent job: it would look through journalctl and so on. But at some point it is fundamentally limited, because the “problem” it is trying to debug is transient and is not currently happening. The best time to figure out why your system is crashing is to be present while it is crashing, but that is really hard. First, you have to get lucky; second, your system has to remain responsive enough that you can still interact with it through your normal GUI.
Fast forward a bit, and I realized that if I could recreate the conditions for a given OOM, I could spin up a coding agent while the OOM was happening and have the model debug it live. And that worked! In real time, the nori coding agent ran through a suite of system commands, figured out which jobs were causing problems, and then even started to dig into explicit function calls (Python and Node processes can both be inspected at the function-call level by sideloaded processes).
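To give a flavor of what that digging looks like, here are the kinds of commands an agent can run while the machine is melting down. These are illustrative, not a transcript of what the agent actually ran; py-spy is one way to get live Python stacks, and a Node process opens its inspector when it receives SIGUSR1.

```sh
# Illustrative live-inspection commands (not the agent's actual transcript)
ps aux --sort=-%mem | head -n 20               # which processes are eating memory right now
ps -eo pid,ppid,pmem,args --sort=-pmem | head  # same, with parent PIDs to spot runaway spawning
py-spy dump --pid <python-pid>                 # stack dump of a live Python process (needs py-spy)
kill -USR1 <node-pid>                          # ask a Node process to open its debug inspector
journalctl -k --since "10 min ago"             # recent kernel messages, e.g. OOM-killer activity
```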
Besides being extremely cool, this seemed like something that, with a few tweaks, could become a legitimately useful SRE tool. The basic idea: any time certain system vitals cross a threshold, spin up a coding agent and have it debug what is going on as aggressively as possible, with all logs streamed to a third-party server (in addition to being stored on disk); a rough sketch of this loop follows the list below. This basic abstraction would solve two huge problems:
Most of the time it is very hard to figure out why exactly a machine went down. This tool would effectively act as an airplane black box: a last record of what was going on, focused specifically on debugging the failure as it happened. That’s a massive speedup in figuring out system-breaking issues.
Most of the time there are interventions someone could take that would keep the system from going down at all, if only a human were around while the crash was happening. For example, if I see that I’m about to OOM from vitest, I can just kill a bunch of the processes that are spiking memory and prevent the crash that way.
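Here’s a rough sketch of that loop in shell, just to make the shape concrete. It is not nori-premortem’s actual implementation; the threshold, prompt, and log path are placeholders, and the agent invocation assumes Claude Code’s non-interactive print mode.

```sh
# Rough sketch of the premortem idea (NOT nori-premortem's internals).
THRESHOLD_MB=1024            # placeholder threshold
LOG=/var/log/premortem.log   # placeholder log path

while sleep 5; do
  avail_mb=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
  if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
    # Vitals breached: launch a coding agent and let it debug as aggressively
    # as possible while the machine is still alive, logging everything it prints.
    claude --dangerously-skip-permissions -p "Available memory is down to ${avail_mb}MB and this machine is about to die. Figure out what is consuming memory and why. You have a few minutes." | tee -a "$LOG"
    break
  fi
done
```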
TLDR: I built a nifty tool that I’m calling Nori Premortem. It’s a very simple CLI tool: it takes a config, sets up a watcher process on your system vitals, and if any threshold is ever breached it spins up a coding agent tasked with figuring out wtf is going on. Nori Premortem is hooked into the broader nori observability platform by default (which we recently branded as nori watchtower), but you can bring your own server if you like; you’ll just have to reverse engineer the API.
You can find nori-premortem on GitHub (https://github.com/tilework-tech/nori-premortem) and on npm (https://www.npmjs.com/package/nori-premortem).
By the way, a little plug! My company Tilework has shifted to focusing fully on coding agents and their myriad applications, because they are so fun and weird and they can power all sorts of interesting and cool things. We have 40-odd years of developer ecosystem tooling: IDEs, CI/CD, version control, and so on. AI coding agents are new. Fundamentally new. And I think people who are building with these tools need a new ecosystem built up around them. That ecosystem is what we’re building. We organically ended up there because we kept building tools that we needed ourselves! Our motto is something like ‘if an AI can’t do it, figure out why until the AI can do it’, and it’s been working really well for us. I’ll be writing about Tilework more, so I added a little ‘Tilework’ tab on my home screen. If you’re interested in seeing what we’re building, drop us a line or check out our GitHub. More to come.


