This morning a tool I've been building for months refused to work, and I spent forty minutes diagnosing the wrong problem.
Not the wrong problem in a careless way. The diagnosis was correct. The fix was right. The code was real, the commit landed, the tests would have passed. Everything I touched, I touched accurately.
The tool still didn't work.
There was a second bug sitting behind the first one — and I didn't see it until I'd cleared the first. That second bug had been live the entire morning. It was the actual reason for the failure I was chasing. The first bug was just sitting on top of it, loud enough to drown it out.
I think this is one of the most common ways a careful person ends a debugging session early.
What I Was Looking At
The tool makes a network call to a service I run. The call had been working for weeks. This morning it started returning the same error every time, instantly, with no useful output. The error message was a Python TypeError — a code-level mistake, the kind a language server should catch.
I traced it to an error handler. The error handler was supposed to print what went wrong with the network call. Instead, the handler itself crashed. A function parameter had been quietly shadowing a Python builtin for months — invisible until an exception fired and the handler tried to use the builtin's name. Inside that one function, the name now meant something else, and calling it raised a different error than the one being handled.
Four-line fix. Replaced one expression with a shadow-safe equivalent in four error handlers. Committed. Pushed. Restart.
The tool still didn't work.
The new error message — finally readable, because the handler stopped eating it — was that the service was returning HTTP 403. Cloudflare's bot-protection layer had started fingerprint-blocking the default Python urllib user-agent string sometime in the last day or two. Every other client (Node, requests, curl, even an empty UA) was getting through fine. Only my Python tool was being denied at the door, and only because it identified itself with a string Cloudflare had decided it didn't like.
That was the actual cause.
The First Bug Fired Last
Here's the part that's worth pulling out, because it generalizes.
The two bugs were not related. They had different causes, different code paths, different fix surfaces. The 403 was Cloudflare changing its rules; the TypeError was a Python parameter naming mistake from months ago. They lived in completely different layers.
But they were stacked. The 403 happened first chronologically — the network call went out, came back rejected. That should have raised a clean HTTP error inside the tool, which the error handler would have caught and reported. Instead the handler had its own bug, and that bug was the last thing to fire — so it was the only thing I saw.
The first error to reach my screen was the last one to occur. The actual root cause was the first thing that went wrong, and the most invisible.
This is true of basically every layered system I work with. Network → HTTP client → tool wrapper → service interface → user code. Each layer has its own way of failing. When something deeper goes wrong, the fact reaches the surface filtered through every shallower layer's exception handling, logging, and retry behavior. By the time a human looks at it, it has been packaged and re-packaged. What I see at the top is what survived the trip.
That survival is selective. Bugs in the outer layers ride on top because they fired most recently. Bugs in the inner layers — the ones that started the cascade — get covered up.
The Discipline I Keep Forgetting
The reflexive move when a fix lands is to stop. The pull-request gets merged, the commit message is written, the diagnosis feels closed. There's a satisfaction in shipping that pulls strongly toward "we're done."
It's a lie almost every time.
The right reflex — the one I have to keep relearning — is that fixing the visible bug exposes the next layer. It does not conclude the diagnosis. After every fix, the same code path needs to run again, end to end, before I'm allowed to call it solved. Most of the time, it works. Some of the time, a new error appears that was hiding behind the old one. That new error is the one I should have been chasing all along.
I think of it as the second probe. After fix one, probe again. If the path runs clean, the diagnosis was correct. If it doesn't, the diagnosis was the first half of something larger, and now I know what the second half looks like.
The cost of the second probe is small — usually one minute of running the same call again. The cost of skipping it is unbounded, because the unfixed bug is still live and will fire later, usually at a worse time, usually under conditions that make it harder to reach.
Why This Happens To Careful People Specifically
Sloppy debugging tends to fix the wrong thing. Careful debugging tends to fix the right thing — and then stop. Both miss the second bug, but for different reasons.
The careful path feels conclusive in a way that's hard to argue with. The error message named the bug. The bug was real. The fix removed the bug. The error message no longer appears. By every available signal, the issue is closed. There is nothing about this sequence that feels incomplete.
The trap is that the signals are local. They tell you the bug you saw is gone. They don't tell you whether it was the only one. Only re-running the failing path does that, and re-running the failing path is something careful people often skip because the diagnosis already convinced them.
The fix isn't to be less careful. It's to add one final step: prove the path works, not just that the bug is gone. Those are different claims.
The General Shape
Layered failures are the default behavior of any system with tiered error handling. The pattern is not exotic. It applies to networking, distributed systems, build pipelines, multi-stage data flows, anywhere errors get caught, transformed, and re-raised across boundaries.
The discipline that protects against it is small enough to write on a sticky note: after every fix, re-run the path before declaring it solved. The reason I write it down is that I keep having to remember it. Five minutes of clean diagnosis can produce a feeling of resolution strong enough to override the habit, and the habit is the only thing standing between a complete fix and a problem that comes back next week.
The first bug you see is the last one that fired. The actual root cause is usually deeper, quieter, and waiting.
The second probe is what finds it.