Where AI-built apps break in production
An AI agent can take an app from nothing to a working demo in an afternoon. Routes render, the schema looks reasonable, the happy path works. The trouble is that the demo is the first 70%. The last 30% — the part that decides whether the app survives a real customer — is where agent-generated code quietly falls apart, and it falls apart in the same places every time.
The trust gap is real and getting wider
Adoption of AI coding tools is near-universal: roughly 84% of developers now use or plan to use them. Trust is not keeping pace. In the same surveys, only about 29% of developers say they trust the output, down from around 40% in 2024. The gap isn't irrational. Veracode's 2025 research found that roughly 45% of AI-generated code samples introduced an OWASP Top 10 vulnerability, and a large share of AI-authored changes need debugging after they reach production. Gartner has gone further, projecting that more than 40% of agentic-AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.
None of this means agents write bad code. It means they write plausible code — code that compiles, passes the tests it also wrote, and reads fine in review — while missing the invariants that only matter under production load, hostile input, and money changing hands. Here is where that shows up.
1. Auth boundaries
Auth is the single most dangerous surface for generated code, because the failure is invisible until someone exploits it. An agent will happily produce a route that looks authenticated (it reads a token, it has a middleware) but skips the scope check on one admin endpoint because the prompt didn't name it. Token rotation gets implemented as "issue a new token" without invalidating the old one. The same pattern produces session fixation, missing audience checks, and JWKS endpoints whose keys never rotate. Each is a few lines, each passes a smoke test, each is a breach waiting for traffic.
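To make two of those failures concrete, here is a minimal sketch of a per-route scope check and of rotation that retires the old token. Everything in it is an assumption made for illustration: the in-memory ACTIVE_TOKENS store, the admin:users scope name, and the delete_user route are hypothetical, and a real service would back this with a database and signed tokens.

```python
import secrets

# Hypothetical in-memory token store; a real app would use a database or KV store.
ACTIVE_TOKENS: dict[str, dict] = {}  # token -> {"user": str, "scopes": set[str]}


def issue_token(user: str, scopes: set[str]) -> str:
    """Create an opaque random token and register it as active."""
    token = secrets.token_urlsafe(32)
    ACTIVE_TOKENS[token] = {"user": user, "scopes": set(scopes)}
    return token


def rotate_token(old_token: str) -> str:
    """Issue a replacement AND invalidate the old token in the same step."""
    session = ACTIVE_TOKENS.pop(old_token, None)  # the old token stops working here
    if session is None:
        raise PermissionError("unknown or already-rotated token")
    return issue_token(session["user"], session["scopes"])


def require_scope(token: str, scope: str) -> dict:
    """Authenticate the token, then authorize the specific scope."""
    session = ACTIVE_TOKENS.get(token)
    if session is None:
        raise PermissionError("invalid token")
    if scope not in session["scopes"]:
        raise PermissionError(f"missing scope: {scope}")
    return session


def delete_user(token: str, user_id: str) -> str:
    """Hypothetical admin route: being authenticated is not enough to call it."""
    require_scope(token, "admin:users")  # the check a generated route tends to omit
    return f"deleted {user_id}"
```

A token holding only a read scope is rejected by delete_user even though it is valid, and a token passed through rotate_token is rejected afterwards. Those two behaviours are exactly what a smoke test on the happy path never exercises.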
2. Stripe webhooks and payment side-effects
Payments are where a subtle bug costs real money. The canonical failures: a webhook handler that doesn't verify the Stripe signature, so anyone can POST a fake invoice.payment_succeeded event; a handler that isn't idempotent, so a retried event grants two months of access or refunds twice; no handling for out-of-order delivery; no reconciliation when the webhook is missed entirely. An agent writes the checkout.session.completed branch because that's the demo. The retry, the duplicate, the dispute, the partial refund: these are the cases that actually occur in production, and they are the ones it skips.
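A sketch of the first two safeguards, signature verification and idempotency, might look like the following. It reimplements Stripe's documented v1 scheme (HMAC-SHA256 over the timestamp, a dot, and the raw body) using only the standard library so that it runs on its own. In a real handler the official library's stripe.Webhook.construct_event would normally do this step. The PROCESSED_EVENTS set and ACCESS_MONTHS dict are hypothetical stand-ins for durable storage, and the 300-second tolerance is an assumption.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # assumption: reject events whose timestamp is too far from now


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            now: Optional[float] = None) -> bool:
    """Check a 't=<ts>,v1=<hex>' header against the raw request body."""
    parts = [item.strip().split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > TOLERANCE_SECONDS:
        return False  # stale timestamp: possible replay
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


# Hypothetical stores; production needs a durable table with a unique key on event id.
PROCESSED_EVENTS: set[str] = set()
ACCESS_MONTHS: dict[str, int] = {}


def handle_webhook(payload: bytes, header: str, secret: str,
                   now: Optional[float] = None) -> int:
    """Return an HTTP-style status code for one webhook delivery."""
    if not verify_stripe_signature(payload, header, secret, now):
        return 400
    event = json.loads(payload)
    if event["id"] in PROCESSED_EVENTS:
        return 200  # duplicate delivery: acknowledge it, apply nothing
    if event["type"] == "invoice.payment_succeeded":
        customer = event["data"]["object"]["customer"]
        ACCESS_MONTHS[customer] = ACCESS_MONTHS.get(customer, 0) + 1
    PROCESSED_EVENTS.add(event["id"])
    return 200
```

Delivering the same signed event twice leaves the customer with one month of access, not two, and a forged or stale signature is rejected before the body is parsed. Out-of-order delivery and reconciliation are deliberately left out of this sketch.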
3. Transactional email and deliverability
"Send the email" is one line. Making email arrive is not. Generated code rarely separates the render step from the delivery step, so a template error takes down the signup flow. It rarely handles provider failure, retries, or bounces. SPF, DKIM, and DMARC are infrastructure the agent can't see, so password-reset mail lands in spam and users quietly can't get in. The code "works" in the demo because the developer's own inbox is forgiving.
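The render/deliver split can be sketched as two functions: one that only builds the message and can fail loudly in tests, and one that only sends and retries. The names here (RenderedEmail, TransientSendError, the send callable) are hypothetical and not any provider's API; the retry count and backoff values are assumptions.

```python
import time
from dataclasses import dataclass
from string import Template
from typing import Callable


@dataclass(frozen=True)
class RenderedEmail:
    to: str
    subject: str
    body: str


def render_reset_email(to: str, reset_url: str) -> RenderedEmail:
    """Pure rendering step: raises on template errors and sends nothing."""
    body = Template("Reset your password: $reset_url").substitute(reset_url=reset_url)
    return RenderedEmail(to=to, subject="Reset your password", body=body)


class TransientSendError(Exception):
    """Hypothetical error a provider client raises for retryable failures."""


def deliver(email: RenderedEmail,
            send: Callable[[RenderedEmail], None],
            attempts: int = 3,
            base_delay: float = 0.5,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send with exponential backoff; report failure instead of raising."""
    for attempt in range(attempts):
        try:
            send(email)
            return True
        except TransientSendError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1.0s, ...
    return False  # the caller can queue the message rather than fail the signup
```

Because deliver returns False rather than raising, a provider outage degrades to "email queued for later" instead of a broken signup flow. Bounce handling and SPF/DKIM/DMARC live outside application code and are not covered here.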
4. Audit logs
Audit trails are the feature nobody asks for until an enterprise customer's security review demands one — or until you need to answer "who changed this, and when." Agents don't add append-only, structured event logs unprompted, and bolting one on after the fact means threading it through code that was never designed to emit events. By then the events you most needed are the ones you never recorded.
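As a rough illustration of what "append-only and structured" can mean in code, the sketch below exposes only a record method and read-only copies of the entries. The hash chain linking each entry to the previous one is an extra technique added here for tamper evidence; the article does not prescribe it. The class name, field names, and in-memory list are all assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(entry_without_hash: dict) -> str:
    """Stable SHA-256 over the entry's fields (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(entry_without_hash, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Hypothetical in-memory log: no update or delete methods are offered."""
    _entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, target: str,
               at: Optional[str] = None) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        entry = {
            "actor": actor,
            "action": action,
            "target": target,
            "at": at or datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
        }
        entry["hash"] = _digest(entry)  # computed before the "hash" key exists
        self._entries.append(entry)
        return dict(entry)

    def entries(self) -> tuple:
        """Return copies so callers cannot edit history in place."""
        return tuple(dict(e) for e in self._entries)

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS_HASH
        for e in self._entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev or _digest(body) != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

This answers "who changed this, and when" only for actions that call record, which is why the emit points need to exist from the start rather than being threaded in later.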
5. Multi-tenant isolation
The moment an app serves more than one customer, isolation becomes the highest-stakes invariant in the system — and the subtlest. The places agents get it wrong are exactly the places that turn into a breach: a query that forgets the tenant predicate, a shared cache key, a token minted for one tenant accepted by another, custom-domain routing that leaks across boundaries. The code looks identical to the single-tenant version that worked. It just shares data it shouldn't.
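One way to make the tenant predicate hard to forget is to route every query through a wrapper that is bound to a single tenant. The sketch below uses an in-memory SQLite database; the invoices table, the TenantDB class, and the cache_key helper are hypothetical names chosen for the example.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database holding rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, amount INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO invoices (id, tenant_id, amount) VALUES (?, ?, ?)",
        [(1, "acme", 100), (2, "globex", 250), (3, "acme", 75)],
    )
    return conn


class TenantDB:
    """Bound to one tenant; every query it runs carries the tenant predicate."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def invoices(self) -> list:
        return self._conn.execute(
            "SELECT id, amount FROM invoices WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        ).fetchall()

    def get_invoice(self, invoice_id: int) -> Optional[tuple]:
        # Looking up by id alone would leak another tenant's row.
        return self._conn.execute(
            "SELECT id, amount FROM invoices WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, invoice_id),
        ).fetchone()


def cache_key(tenant_id: str, name: str) -> str:
    """Namespace cache entries per tenant so keys cannot collide across customers."""
    return f"tenant:{tenant_id}:{name}"
```

With this shape, TenantDB(conn, "acme").get_invoice(2) returns None even though invoice 2 exists, because it belongs to another tenant. Handlers that never receive the raw connection cannot write the unscoped query in the first place.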
6. Migrations and deploys
The last mile breaks quietly too. An agent generates a migration that's correct in isolation but isn't reversible and isn't ordered against the schema already in production. Deploys go out with no check that the database has the columns the new code reads. There's no gate between "the agent finished" and "this is live," so the first signal that something is wrong is an error in front of a user.
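The missing gate can be as small as a script that compares the columns the new release reads with the columns the database actually has, and refuses to proceed on a mismatch. This is a sketch against SQLite using its PRAGMA table_info statement; the REQUIRED_COLUMNS mapping and function names are hypothetical, and it does not address reversibility or migration ordering.

```python
import sqlite3

# Hypothetical: the columns the new release's code reads, per table.
REQUIRED_COLUMNS = {"users": {"id", "email", "tenant_id"}}


def missing_columns(conn: sqlite3.Connection, required: dict) -> dict:
    """Return {table: {missing column names}} for anything the schema lacks."""
    missing = {}
    for table, columns in required.items():
        if not table.isidentifier():
            raise ValueError(f"unexpected table name: {table!r}")
        # Each row is (cid, name, type, notnull, dflt_value, pk); an unknown
        # table yields no rows, so all of its columns count as missing.
        present = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
        gap = set(columns) - present
        if gap:
            missing[table] = gap
    return missing


def deploy_gate(conn: sqlite3.Connection, required: dict) -> None:
    """Raise before traffic is shifted if the schema is behind the code."""
    gap = missing_columns(conn, required)
    if gap:
        raise RuntimeError(f"schema not ready, refusing to deploy: {gap}")
```

Run as a pre-deploy step, a failure here shows up in the pipeline log rather than as an error in front of a user.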
Why agents get this wrong — structurally
It's tempting to read the list above as "the agent made mistakes." It's more accurate to say the agent had no way not to. Three structural reasons:
- No contract to satisfy. The agent generates against a prompt, not against an interface that specifies idempotency, required scopes, or which events must be emitted. There's nothing for it — or you — to check the output against.
- Training rewards plausibility. Models are extraordinarily good at producing code that looks like correct code. The happy path is heavily represented in training data; the dispute-webhook-retry edge case is not.
- No runtime feedback at authoring time. The agent never sees the retried webhook, the cross-tenant token, the bounced email. It can't learn from production it never touches.
This is why "use a better model" only goes so far. Better models write more plausible code faster. The missing piece isn't intelligence — it's a specification the output is held to.
What "production-ready" actually requires
Strip away the marketing, and "production-ready" means a small set of unglamorous properties: auth with enforced scopes and real token rotation; payment handlers that are idempotent and signature-verified; email with a render/deliver split and retry; an append-only audit trail; tenant isolation that's enforced, not hoped for; migrations that are reversible and ordered; and a check that runs before traffic, not after. These don't depend on the app's features. They're the same every time, which is exactly why rebuilding them per project, by hand or by prompt, is both wasteful and dangerous.
The structural fix: a contract the output is held to
If the problem is the absence of a specification, the fix is to give the agent one. That's the bet behind module contracts: ship the dangerous 30% as modules with a typed contract — declared permissions, typed hooks, the events that must be emitted, an approval gate on migrations and payment side-effects — pinned to a version and backed by tests. The agent reads the contract before it composes, and a check verifies the result before anything deploys. Customization happens through the hooks, so your specific logic doesn't fork you off the upgrade path.
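To show the shape of the idea, here is a deliberately small sketch of a contract and the check that runs against an agent's composition. It is an illustration only: ModuleContract, Composition, and check_composition are hypothetical names, not the API of any particular product, and a real contract would also carry typed hooks and tests.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModuleContract:
    """Hypothetical contract: what a module permits, requires, and gates."""
    name: str
    version: str                  # pinned, so upgrades are explicit
    permissions: frozenset        # permissions the module declares
    required_events: frozenset    # events a composition must emit
    gated_actions: frozenset      # actions that need explicit approval


@dataclass
class Composition:
    """What the agent produced, plus the approvals a human has granted."""
    used_permissions: set
    emitted_events: set
    actions: set
    approvals: set = field(default_factory=set)


def check_composition(contract: ModuleContract, comp: Composition) -> list:
    """Return a list of violations; an empty list means the check passes."""
    problems = []
    for perm in sorted(comp.used_permissions - contract.permissions):
        problems.append(f"undeclared permission: {perm}")
    for event in sorted(contract.required_events - comp.emitted_events):
        problems.append(f"missing required event: {event}")
    for action in sorted((comp.actions & contract.gated_actions) - comp.approvals):
        problems.append(f"unapproved gated action: {action}")
    return problems
```

The point of the sketch is that the output is now compared against something. A composition that uses an undeclared permission, skips a required event, or performs a refund without approval fails before deploy rather than in production.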
That's the approach we take with microservices.sh on Cloudflare Workers: verified auth, payments, email, audit, and multi-tenant dispatch as inspectable modules an agent can compose, rather than code it improvises from scratch on every project. The agent still does the interesting work: your features, the custom 70%. It just stops re-deriving the parts that break.
The first 70% was never the hard part. Treat the last 30% as a specification problem, not an intelligence problem, and AI-built apps stop breaking in the same six places.