Why most agentic pilots don't scale
The shape of a pilot that worked
Pick a small team. Pick a use case where the user is on the team. Run the pilot for four weeks with engineers in the loop. Ship something that works. The team is proud, the demo is good, the slide deck almost writes itself.
This is the easy 20 percent. The pilot worked because the engineers were the users and the engineers were also the operators. When the engineers leave the room and the pilot becomes a rollout, three things break.
Failure mode 1: the platform substrate was not real
The pilot ran in a sandbox account with hardcoded API keys, a side cluster, and a Streamlit interface. The architecture diagram showed Lambda and S3 because that's what fits on a slide; the actual code ran on a laptop and a Vercel preview.
On rollout, the platform team asks the basic questions. Where do the secrets live? How does the service authenticate to the data source? Where do the logs go? Who pages when it breaks? Nothing in the pilot answered these questions because the pilot did not have to.
Fix it during the pilot, not after. Build the pilot on the same landing zone as everything else. Use the same identity, the same observability, the same paging path. The pilot is slower to ship; the rollout is faster because there is no architectural translation.
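As a minimal sketch of what "no hardcoded API keys" can look like in a pilot, the service below reads its settings from the environment and refuses to start if any are missing. The variable names (AGENT_API_KEY, AGENT_LOG_ENDPOINT, AGENT_PAGER_ROUTE) and the ServiceConfig shape are hypothetical; the assumption is that the platform's secret store injects values as environment variables, which may not match your landing zone.

```python
import os
from dataclasses import dataclass


class MissingConfigError(RuntimeError):
    """Raised at startup when a required setting is absent."""


@dataclass(frozen=True)
class ServiceConfig:
    api_key: str
    log_endpoint: str
    pager_route: str


# Hypothetical names: map each config field to the environment variable
# the platform is assumed to inject from its secret store.
REQUIRED_VARS = {
    "api_key": "AGENT_API_KEY",
    "log_endpoint": "AGENT_LOG_ENDPOINT",
    "pager_route": "AGENT_PAGER_ROUTE",
}


def load_config(env=None) -> ServiceConfig:
    """Build the config from the environment, failing fast on gaps."""
    env = os.environ if env is None else env
    missing = sorted(name for name in REQUIRED_VARS.values() if not env.get(name))
    if missing:
        # Fail at startup rather than at the first request.
        raise MissingConfigError("missing required settings: " + ", ".join(missing))
    return ServiceConfig(**{field: env[name] for field, name in REQUIRED_VARS.items()})
```

Failing at startup means the platform team's basic questions (where the secrets live, where the logs go, who gets paged) have to be answered before the pilot runs at all, not at rollout.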
Failure mode 2: there was no quality signal
The pilot was evaluated by the engineers eyeballing the output. The output looked plausible most of the time, occasionally weird, never disastrous. The team called it a success.
On rollout, the user base is no longer the engineering team. The user base does not know the system well enough to spot weird output. Quality regressions stack up silently. By the time someone notices, whatever dashboard the pilot had has gone stale, because no one was watching it.
Fix it with a written evaluation set. Twenty real inputs, twenty expected outputs (or expected output classes, when the answer is open-ended), and a CI step that runs the evaluation on every change. A quality regression then fails the build the same way a broken unit test would, on the same boring cadence.
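A written evaluation set with a CI gate can be very small. The sketch below assumes the open-ended case, where each input maps to an expected output class. EvalCase, run_eval, exit_code, and the 0.9 threshold are illustrative names and values, not a standard harness; system stands in for the pilot's entry point and classify for whatever maps raw output to a class.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected_class: str


@dataclass(frozen=True)
class EvalReport:
    pass_rate: float
    failures: List[Tuple[str, str, str]]  # (prompt, expected, got)
    ok: bool


def run_eval(
    cases: List[EvalCase],
    system: Callable[[str], str],
    classify: Callable[[str], str],
    min_pass_rate: float = 0.9,  # illustrative threshold
) -> EvalReport:
    """Run every case through the system and compare output classes."""
    if not cases:
        raise ValueError("evaluation set is empty")
    failures = []
    for case in cases:
        got = classify(system(case.prompt))
        if got != case.expected_class:
            failures.append((case.prompt, case.expected_class, got))
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return EvalReport(pass_rate, failures, pass_rate >= min_pass_rate)


def exit_code(report: EvalReport) -> int:
    """Non-zero when the gate fails, so a CI step can fail the build."""
    return 0 if report.ok else 1
```

A CI step would load the twenty cases from a file in the repository, call run_eval against the current build, print the failures, and pass exit_code(report) to sys.exit.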
"The pilot worked because the engineers were the users and the engineers were also the operators. When they leave the room, three things break."
Sebastiaan van Parijs / Founder
Failure mode 3: no one wrote down the failure mode
Every AI feature has a class of failure that is unique to it. RAG over policy documents fails when the chunking strategy misses the relevant clause. Agentic workflows fail when the model picks the wrong tool. Summarisation fails when the input is mostly noise.
The pilot did not write down its failure class. So on rollout, when a user hits the failure, no one knows whether it is a known issue, a regression, or a new problem. The team triages from first principles every time. By month three the team is exhausted.
Fix it during the pilot by naming the failure class. One paragraph in the runbook: this is what we expect to break, this is what we do when it does, this is the fallback. Future failures slot into the existing runbook or trigger a runbook update. Either way, the team is not starting from zero.
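The runbook entry itself can live next to the code, so that triage starts with a lookup instead of first principles. The following is a sketch under assumptions: the FailureClass fields mirror the three parts of the paragraph above (what breaks, what to do, the fallback), and the entry names and wording are invented examples, not entries from a real runbook.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FailureClass:
    name: str
    symptom: str   # what we expect to break
    response: str  # what we do when it does
    fallback: str  # what the user gets in the meantime


# Hypothetical entries for illustration only.
RUNBOOK: Dict[str, FailureClass] = {
    "wrong_tool_selected": FailureClass(
        name="wrong_tool_selected",
        symptom="The agent calls a tool that does not match the request.",
        response="Add the input to the evaluation set; review tool descriptions.",
        fallback="Route the request to the manual queue.",
    ),
    "missed_clause": FailureClass(
        name="missed_clause",
        symptom="Retrieval returns chunks that omit the relevant clause.",
        response="Inspect chunk boundaries for the source document.",
        fallback="Show the full source document link instead of an answer.",
    ),
}


def triage(observed: str, runbook: Dict[str, FailureClass] = RUNBOOK) -> Optional[FailureClass]:
    """Return the known failure class, or None when the runbook needs a new entry."""
    return runbook.get(observed)
```

A None result is the signal to write a new entry: either the failure matches a named class and the response is already written down, or it triggers a runbook update.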
What a pilot that scales looks like
Same platform substrate as the rest of the services. Identity, secrets, observability, paging from day one. A written evaluation set with a CI gate. A named failure class with a written runbook entry. Cost attribution on the same dashboard as the team's other services.
It is slower to ship the first pilot this way. The second pilot is faster. The fifth pilot is much faster. By the time the team is shipping pilots monthly, the substrate is the company's AI platform, the evaluation harness is shared, and the runbooks compose.
This is the boring shape. It does not demo as well. It is what scales.