The checkpoint pattern you describe is exactly right. I've been dealing with this as well. Instead of vibe coding, it's vibe system engineering, and I don't care for it. So I thought about it and came up with a framework to describe and reason about different pipelines. I based it on the types of LLM failures I was seeing in my own pipeline (omissions, incorrect output, or output inconsistent with what already exists).
I wanted something I could use to objectively decide whether one test (or gate, as I call them) is better than another, and how the gates work together as a system.
My personal tool encodes a workflow made of stages and gates, where the gates enforce the handoff between stages. Once I did this, first-pass approval went from ~73% to over 90%, just from adding structured checks at stage boundaries.
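To make the stages-and-gates idea concrete, here is a minimal sketch in Python. It is not the actual tool: every name in it (`Stage`, `GateFailure`, `require_fields`, `consistent_with`, `run_pipeline`, `draft_plan`) is hypothetical, and the two gates are just stand-ins for the omission and inconsistency failure types mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Artifact = Dict[str, object]
# A gate inspects a stage's output and returns a list of problems (empty means pass).
Gate = Callable[[Artifact], List[str]]


class GateFailure(Exception):
    """Raised when a stage's output does not clear its gates."""

    def __init__(self, stage: str, problems: List[str]):
        super().__init__(f"{stage}: {'; '.join(problems)}")
        self.stage = stage
        self.problems = problems


@dataclass
class Stage:
    name: str
    run: Callable[[Artifact], Artifact]
    gates: List[Gate] = field(default_factory=list)


def require_fields(*names: str) -> Gate:
    """Omission gate (hypothetical): every named field must be present and non-empty."""
    def gate(artifact: Artifact) -> List[str]:
        return [f"missing field: {n}" for n in names if not artifact.get(n)]
    return gate


def consistent_with(known_names: set) -> Gate:
    """Consistency gate (hypothetical): output may only reference names that already exist."""
    def gate(artifact: Artifact) -> List[str]:
        unknown = sorted(set(artifact.get("references", [])) - set(known_names))
        return [f"unknown reference: {n}" for n in unknown]
    return gate


def run_pipeline(stages: List[Stage], artifact: Artifact) -> Artifact:
    """Run each stage, then its gates; stop at the first stage whose gates report problems."""
    for stage in stages:
        artifact = stage.run(artifact)
        problems = [p for gate in stage.gates for p in gate(artifact)]
        if problems:
            raise GateFailure(stage.name, problems)
    return artifact


def draft_plan(artifact: Artifact) -> Artifact:
    # Stand-in for an LLM call; a real stage would prompt a model here.
    return {**artifact, "plan": "add retry to fetch_user", "references": ["fetch_user"]}


stages = [
    Stage("plan", draft_plan,
          [require_fields("plan"), consistent_with({"fetch_user", "save_user"})]),
]
result = run_pipeline(stages, {"task": "make fetches resilient"})
```

The point of the shape is that a gate is just a function from output to a list of problems, so two gates can be compared on the same outputs, and a stage boundary is the only place where a handoff can be blocked.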
My hope is that we can have a common vocabulary to talk about this, so I wrote up the data and the framework that fell out of it: https://michael.roth.rocks/research/trust-topology/