← Field notes
CTO Agentic AI · production reliability

What the demo doesn't tell you

The agent demo got a standing ovation. Eleven weeks later it was quietly turned off. A CTO on the gap between a thing that works on stage and a thing you can trust on Tuesday.

We spoke with Marcus, a CTO who shipped an AI agent to a standing ovation and turned it off eleven weeks later — about the gap between a thing that works on stage and a thing you can trust on Tuesday.

Q Field Notes asksWas it a bad demo? Did you get sold?

No — and I want to be clear, because the easy version of this story is that someone fell for a slick pitch. The demo was genuinely good. An agent read a customer's support ticket, pulled the account history, reasoned about the problem, resolved it — end to end, no human — while the room watched. When it finished, the board applauded. I'm not a sentimental man and I felt the future arrive in my chest. I shipped it.

Eleven weeks later I turned it off. Not in a crisis, not after a disaster that made the news — quietly, the way you put down a project everyone can see isn't working but no one wants to name. The reason is the whole essay: it worked about ninety percent of the time, and the other ten percent was unpredictable, occasionally catastrophic, and impossible to see coming.

90%of tickets handled perfectly
10%wrong, with total confidence
11 wksfrom ovation to switched off
Q Field Notes asksWhy did the demo mislead you?

Because of what a demo is, structurally. It's the happy path, under good lighting, with data someone cleaned the night before. The median case at its best. And the median case was never the question. You don't lose sleep over the ticket the agent nails. You lose sleep over the one where the account data was stale, the customer asked something sideways, and the agent — sounding exactly as confident as it did on stage — did something wrong with total conviction.

A demo tells you the agent can be right. It doesn't tell you the agent won't be wrong. Those are different facts — and the second is the only one that matters in production.

My ten percent wasn't random noise to shrug off. It was the long tail — the weird tickets, the edge accounts, the malformed inputs that arrive at three in the morning when no human is watching. A demo has no long tail, because you don't demo the long tail. But production is the long tail. Production is nothing but edge cases wearing trench coats, lined up, waiting their turn. My mistake — and it was mine — was treating the agent as a feature I could ship, rather than a system I had to operate.

Q Field Notes asksSo how did you rebuild it?

I stopped talking about "the agent" almost entirely. The model — the clever part the demo was selling — turned out to be the smallest, most interchangeable piece. What I'd been missing was everything around it. That's where Databricks came in, and the unglamorous truth is the work that fixed it had almost nothing to do with intelligence and almost everything to do with discipline.

Databricks
The vendor in the room · Databricks

The rebuild treated the agent like any other production system: grounded in governed data, wrapped in an evaluation harness that scored every change against a real test set of the ugly tickets, and fully observable — every decision traced back to the data it stood on. The model barely changed. The system around it changed completely, and that was the whole difference.

01 · GroundReal, current dataanswers anchored in governed data with lineage, not improvised
02 · EvaluateA test set of the ugly cases"is the new version better?" becomes a number, not a vibe
03 · ObserveTrace every decisionsee exactly where the reasoning turned, and fix that

The worst answers had come from the agent improvising on thin or stale data — it was reasoning over a half-remembered version of the truth. An agent is only ever as trustworthy as the data under its feet, and mine had been quietly lying to it. The eval harness sounds boring and is completely essential: I could change something and know whether it helped, before a customer found out for me. And observability meant that when it got something wrong — it still does, everything does — I could see where, instead of shrugging at a black box. The ten percent didn't vanish. It shrank, became visible, and stopped being catastrophic. That's the most you can honestly ask of any system that makes decisions.

Q Field Notes asksWho wins with agents, then?

The version I run now never got a standing ovation. It shipped quietly, to a narrow slice of tickets, with a human in the loop on anything it wasn't sure about, and it earned more responsibility one boring, measured week at a time. By every demo metric it's less impressive than the thing the board applauded. It's also still on.

floor > ceiling
The demo shows you the ceiling. The grounding, evals and observability build the floor. And you live on the floor.

The companies that win with agents this decade won't have the cleverest model — models are becoming a commodity, everyone will have a good one. The winners will be the ones who did the unglamorous work of turning a clever thing into a dependable one.