The frustrating problem
Last month a teammate showed me a slick "agentic workflow" that was supposed to triage incoming bug reports. It worked beautifully in his terminal. Then we shipped it. By the end of week one we had duplicate Jira tickets, half-finished Slack threads, and one very angry PM whose calendar had been blocked out by an LLM that confused "schedule" with "create".
Sound familiar? You're not alone. There's a growing realization across teams that the moment you put an agent loop into production, you stop having an AI problem and start having a workflow orchestration problem. The same problems we solved for CI/CD pipelines a decade ago show up again — flaky steps, missing idempotency, no retries, no observability — except now the steps are non-deterministic.
I've spent the last few months debugging this exact class of failure across three different projects. Here's what I learned about why agentic workflows fall apart in production and how to actually fix them.
The root cause: treating LLM calls like function calls
Most broken agent systems I've inspected share one architectural mistake. The author treats agent.run() like a regular function. Input goes in, output comes out, move on.
But an agent step is closer to a remote API call to a flaky third party than it is to a function. It can:
- Time out halfway through
- Return slightly different output for the same input
- Decide to call a tool you didn't expect
- Hallucinate a parameter that crashes the next step
- Cost real money on every retry
When any of those happen inside a naive while not done: loop, you get cascading failures. The agent retries, partial side effects accumulate, and you end up with three duplicate tickets and a passive-aggressive Slack DM.
The fix is to stop thinking of it as "AI" and start thinking of it as a distributed workflow with non-deterministic steps. The good news is we already know how to build those.
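To make "distributed workflow with non-deterministic steps" concrete, here is a minimal sketch of running one agent step the way you would call a flaky third-party API: a bounded retry budget, backoff between attempts, and a validation check on the output before anything downstream sees it. The names `run_step`, `StepFailed`, and the `validate` callback are hypothetical, not from any particular agent framework.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StepFailed(Exception):
    """Raised when a step exhausts its retry budget (hypothetical name)."""


def run_step(
    step: Callable[[], T],
    validate: Callable[[T], bool],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run a non-deterministic step with a bounded number of attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            if validate(result):
                return result
            # The call succeeded but the output is unusable: treat as a failure.
            last_error = ValueError(f"attempt {attempt}: output failed validation")
        except TimeoutError as exc:
            # Only timeouts are retried here; other exceptions propagate.
            last_error = exc
        if attempt < max_attempts:
            # Exponential backoff with jitter between attempts.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
    raise StepFailed(f"gave up after {max_attempts} attempts") from last_error
```

The retry budget is what keeps a bad step from looping forever and burning money. Retrying is only safe when the step's side effects are idempotent, which is why that comes first in the steps that follow.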
Step 1: Make every step idempotent
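This is the single biggest fix. If a step runs twice, the second run must not create a second side effect. CI/CD pipelines solved this years ago with deterministic artifact hashes: the same inputs produce the same key, so a rerun finds the existing artifact instead of producing a duplicate.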
For an agent that creates tickets, that means deriving a stable key from the input before the LLM call, then deduping on it:

    import hashlib
    import json

    def idempotency_key(event: dict) -> str:
        # Hash the meaningful fields, not the whole event (timestamps will differ)
        payload = json.dumps({
            'source_id': event['source_id'],
            'kind': event['kind'],
        }, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def create_ticket_once(event, store):
        key = idempotency_key(event)
        if store.exists(key):
            # Already handled: return the prior result instead of re-doing the work
            return store.get(key)
        ticket = ticket_api.create(...)
        store.put(key, ticket)
        return ticket

Note what's not in the LLM's hands here: the dedup decision. The agent can decide what to put in the ticket, but the "have we done this already" question belongs to deterministic code. That separation is the whole game.
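One caveat with a check-then-create pattern is that two workers can both pass the "exists" check before either one writes. A possible way to close that gap is to let the database decide who goes first. The sketch below uses SQLite's `INSERT OR IGNORE` on a primary key as the claim; the table name `handled` and the helpers `open_store`, `claim`, and `create_once` are hypothetical, and a real deployment would more likely use a unique constraint in Postgres or an atomic set-if-absent in Redis.

```python
import sqlite3
from typing import Callable, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS handled (key TEXT PRIMARY KEY, ticket_id TEXT)"
    )
    return conn


def claim(conn: sqlite3.Connection, key: str) -> bool:
    """Return True only for the first caller to claim this key."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute("INSERT OR IGNORE INTO handled (key) VALUES (?)", (key,))
    # rowcount is 1 when the row was inserted, 0 when the key already existed
    return cur.rowcount == 1


def create_once(
    conn: sqlite3.Connection, key: str, create: Callable[[], str]
) -> Optional[str]:
    if not claim(conn, key):
        row = conn.execute(
            "SELECT ticket_id FROM handled WHERE key = ?", (key,)
        ).fetchone()
        # May be None if the first caller claimed the key but has not finished
        return row[0]
    ticket_id = create()
    with conn:
        conn.execute(
            "UPDATE handled SET ticket_id = ? WHERE key = ?", (ticket_id, key)
        )
    return ticket_id
```

Claiming first trades one risk for another: if the process dies between the claim and the create, the key is taken but no ticket exists. Some kind of sweep for stale claims would be needed to cover that case.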
Step 2: Persist state between steps
In-memory agent state dies the moment your process crashes — and your process will crash. The same way a CI job checkpoints between stages, your agent loop needs a durable store of where it is.
I've had good luck modeling each agent run as a state machine persisted to Postgres or Redis. Even a simple JSON blob keyed by run ID is enough to start:

    from dataclasses import dataclass, asdict
    from enum import Enum

    class Stage(str, Enum):
        TRIAGED = 'triaged'
        TICKET_CREATED = 'ticket_created'
        NOTIFIED = 'notified'
        DONE = 'done'

    @dataclass
    class RunState:
        run_id: str
        stage: Stage
        input: dict  # the original event; advance() passes it to create_ticket_once
        ticket_id: str | None = None

    def advance(run_id, store):
        state = store.load(run_id)
        if state.stage == Stage.TRIAGED:
            ticket = create_ticket_once(state.input, store)
            state.ticket_id = ticket.id
            state.stage = Stage.TICKET_CREATED
            store.save(run_id, asdict(state))  # Save BEFORE next step
        # ... and so on per stage

The critical line is the save before you move on. If the notification step crashes, the next attempt picks up at TICKET_CREATED and skips the duplicate ticket creation entirely.
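To show the resume behavior end to end, here is a small self-contained sketch that uses a JSON file per run ID as the durable store. `JsonFileStore`, its directory layout, and the `create_ticket` and `notify` callbacks are all assumptions made for illustration; the stage names are plain strings here to keep the JSON round trip simple.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class RunState:
    run_id: str
    stage: str
    ticket_id: Optional[str] = None


class JsonFileStore:
    """One JSON file per run ID (hypothetical layout)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, state: RunState) -> None:
        target = self.root / f"{state.run_id}.json"
        # Write to a temp file, then rename, so a crash never leaves a half-written file
        fd, tmp = tempfile.mkstemp(dir=str(self.root))
        with os.fdopen(fd, "w") as f:
            json.dump(asdict(state), f)
        os.replace(tmp, target)

    def load(self, run_id: str) -> RunState:
        with open(self.root / f"{run_id}.json") as f:
            return RunState(**json.load(f))


def advance(
    run_id: str,
    store: JsonFileStore,
    create_ticket: Callable[[], str],
    notify: Callable[[str], None],
) -> RunState:
    state = store.load(run_id)
    if state.stage == "triaged":
        state.ticket_id = create_ticket()
        state.stage = "ticket_created"
        store.save(state)  # checkpoint before the next side effect
    if state.stage == "ticket_created":
        notify(state.ticket_id)
        state.stage = "notified"
        store.save(state)
    return state
```

If `notify` raises on the first call, calling `advance` again with the same run ID loads the `ticket_created` checkpoint and goes straight to the notification. There is still a small window between creating the ticket and saving the checkpoint, which is the gap the idempotency key from Step 1 is meant to cover.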
Step 3: Bound the agent's autonomy with explicit guardrails
Giving an agent unbounded tool access is how you end up with blocked-out calendars and rogue refunds. Wrap every tool call in code that enforces preconditions:

    class ToolGuardError(Exception):
        """Raised when a tool call fails a precondition check."""

    def tool_block_calendar(start, end, attendees):
        # Reject anything outside reasonable bounds before the API call happens
        duration_hours = (end - start).total_seconds() / 3600
        if duration_hours <= 0:
            raise ToolGuardError('refusing to block an empty or negative range')
        if duration_hours > 2:
            raise ToolGuardError('refusing to block >2h')
        if len(attendees) > 5:
            raise ToolGuardError('refusing to mass-invite')
        return calendar_api.block(start, end, attendees)

These guards aren't there to make the LLM smarter. They're there because the LLM's output is fundamentally untrusted input, and you wouldn't let untrusted input near a production API without validation either.
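Per-tool checks like the one above handle bad values. Two other failure modes from earlier, calling a tool you didn't expect and hallucinating a parameter, can be caught one level up, at the point where the model's proposed call is dispatched. The following is a rough sketch, assuming the model's tool call arrives as a dict with `name` and `args` keys (that shape, and the `dispatch` function, are assumptions). It reuses the `ToolGuardError` name and relies on the standard library's `inspect.signature(...).bind(...)` to reject unknown or missing arguments.

```python
import inspect
from typing import Any, Callable


class ToolGuardError(Exception):
    """Raised when a proposed tool call is rejected before execution."""


def dispatch(call: dict[str, Any], registry: dict[str, Callable[..., Any]]) -> Any:
    name = call.get("name")
    args = call.get("args", {})
    # Allowlist: only tools that were explicitly registered can run
    if not isinstance(name, str) or name not in registry:
        raise ToolGuardError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ToolGuardError("args must be a mapping of parameter names to values")
    tool = registry[name]
    try:
        # bind() raises TypeError on unexpected or missing parameters
        inspect.signature(tool).bind(**args)
    except TypeError as exc:
        raise ToolGuardError(f"bad arguments for {name}: {exc}") from exc
    return tool(**args)
```

This only checks that the argument names line up with the function signature. Value-level checks, such as the two-hour limit on calendar blocks, still belong inside each tool.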
Step 4: Add the observability you'd add for any pipeline
If you can't answer "what did the agent do on Tuesday at 3:47pm and why," you can't debug it. The minimum I now ship with every agent system:
- Structured logs per step (run_id, stage, tool called, inputs, outputs, latency)
- A traces/spans view — OpenTelemetry works fine for this
- Cost tracking per run (tokens in, tokens out, dollars)
- A simple dashboard of success/failure/timeout rates per stage
I haven't found a single "perfect" agent observability tool yet, but rolling your own with OTel and a logs aggregator gets you 80% of the way there in an afternoon.
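As a starting point for the first and third items on that list, here is a stdlib-only sketch of a per-step structured log with cost tracking. The `step_log` context manager, the field names, and especially the prices in `PRICE_PER_1K` are placeholders I made up for illustration; swap in your provider's real rates. The same wrapper could open an OpenTelemetry span instead of, or in addition to, writing a log line.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("agent.steps")

# Hypothetical prices in dollars per 1,000 tokens; replace with real rates
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def cost_usd(tokens_in: int, tokens_out: int) -> float:
    dollars = (
        tokens_in / 1000 * PRICE_PER_1K["input"]
        + tokens_out / 1000 * PRICE_PER_1K["output"]
    )
    return round(dollars, 6)


@contextmanager
def step_log(run_id, stage, tool=None):
    """Emit one JSON log line per step, whether it succeeds or fails."""
    record = {"run_id": run_id, "stage": stage, "tool": tool, "status": "ok"}
    start = time.perf_counter()
    try:
        # The caller can add tokens_in, tokens_out, inputs, and outputs to the record
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        record["cost_usd"] = cost_usd(
            record.get("tokens_in", 0), record.get("tokens_out", 0)
        )
        logger.info(json.dumps(record, default=str))
```

Wrapping each stage in `with step_log(run_id, stage, tool="...") as rec:` and filling in `rec["tokens_in"]` and `rec["tokens_out"]` after the model call gives you one queryable line per step, which is enough to build the success, failure, and timeout rates per stage.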
Prevention: design for the failure modes you'll actually hit
If you take one thing from this: stop treating agentic workflows as a new category of software. They're workflows. Apply the boring lessons.
- Assume every step will fail eventually and design for resumability
- Make side effects idempotent at the code layer, not the prompt layer
- Put hard guardrails on tools — the LLM is a planner, not an authority
- Log everything, measure everything, alert on the SLOs that matter
- Start with a deterministic skeleton and let the agent fill in only the genuinely ambiguous decisions
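The last bullet is the easiest to hand-wave, so here is one way a "deterministic skeleton" might look for bug triage. In this sketch, code validates the input and decides the routing, and the model is asked only for a severity label, which is then checked against an enum. The `triage` function, the `Severity` values, the queue names, and the `classify` callback standing in for the LLM call are all hypothetical.

```python
from enum import Enum
from typing import Callable


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def triage(report: dict, classify: Callable[[str], str]) -> dict:
    # Deterministic: required fields are checked in code, not in the prompt
    title = report["title"].strip()
    if not title:
        raise ValueError("report has no title")

    # Ambiguous: the model picks a severity, and nothing else
    raw = classify(title).strip().lower()
    try:
        severity = Severity(raw)
    except ValueError:
        # Anything outside the enum falls back to a safe default
        severity = Severity.MEDIUM

    # Deterministic: routing follows from the validated label
    queue = "oncall" if severity is Severity.HIGH else "backlog"
    return {"title": title, "severity": severity.value, "queue": queue}
```

Because the model's answer is reduced to one of three known values before anything acts on it, a strange completion can at worst mislabel a ticket. It cannot invent a new queue or skip the validation.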
The boring infrastructure work is what makes agents safe to run in production. The exciting LLM bits are maybe 10% of the system. Once your team accepts that, the failure rate drops fast — and the "agentic workflow" stops feeling like a mystery and starts looking a lot like the CI/CD pipelines you already know how to operate.
