playbook · 14 min read

How to Build Your First AI Agent for Business Operations

A practical playbook for shipping your first production AI agent: scoping, model selection, evals, guardrails, observability, and handoff.

Most teams build a demo. Demos look impressive on a screen and die two weeks later. A production AI agent quietly takes real work off your team every day and keeps doing it.

This playbook is for operators and engineering leads shipping their first production AI agent. It's the pattern we use on AI Agent Sprint engagements.

Rule one: pick one workflow

The single biggest failure mode is scope. "An agent for support" is not a workflow. "Classify every inbound support ticket by intent and draft a first response from the knowledge base, with a human-review gate" is a workflow.

Pick a workflow with these properties:

  • Recurring: the same shape of work, many times a week
  • Observable: you can tell right away when a response is wrong
  • Bounded: inputs and outputs are clear
  • Valuable: saving 10 minutes per instance × 200 instances/week is real money

If you can't describe the workflow in a paragraph with specific inputs and outputs, go back and narrow it.
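One way to force that clarity is to write the paragraph down as a small spec. A minimal sketch below, in Python; the field names and the example workflow are illustrative, not a required format.

```python
from dataclasses import dataclass

@dataclass
class WorkflowSpec:
    """A one-paragraph workflow, written down so scope stays fixed."""
    name: str
    trigger: str        # what starts an instance of the work
    inputs: list[str]   # what the agent receives
    outputs: list[str]  # what the agent must produce
    human_gate: str     # where a person reviews before anything ships

ticket_triage = WorkflowSpec(
    name="Support ticket triage",
    trigger="New ticket arrives in the helpdesk",
    inputs=["ticket subject", "ticket body", "customer tier"],
    outputs=["intent label", "drafted first response with KB citations"],
    human_gate="Agent queues a draft; a support rep approves or edits before send",
)
```

If any field is hard to fill in, that is usually the sign the workflow is still too broad.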

Rule two: define success before you build

Pick a small set of metrics and commit to them:

  • Quality: accuracy on a golden set, measured per release
  • Speed: p50/p95 latency, end to end
  • Cost: dollars per task at steady state
  • Coverage: percentage of the workflow the agent can handle without escalation
  • Human load: time per review, reviews per day

If you can't measure it, you won't be able to defend it to leadership, and you won't know when to expand.
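It helps to write the targets down where the eval harness (rule three) can check them on every release. A minimal sketch, assuming per-workflow thresholds you set with the domain owner; the numbers below are placeholders, not recommendations.

```python
# Illustrative release gate: every threshold is owned by the team, not by this code.
TARGETS = {
    "accuracy_on_golden_set": 0.90,  # quality
    "p95_latency_seconds":    8.0,   # speed
    "cost_per_task_usd":      0.05,  # cost
    "coverage_rate":          0.70,  # handled without escalation
    "avg_review_minutes":     2.0,   # human load
}

def release_ok(measured: dict[str, float]) -> bool:
    """Pass only if every measured metric meets its target."""
    return (
        measured["accuracy_on_golden_set"] >= TARGETS["accuracy_on_golden_set"]
        and measured["p95_latency_seconds"] <= TARGETS["p95_latency_seconds"]
        and measured["cost_per_task_usd"] <= TARGETS["cost_per_task_usd"]
        and measured["coverage_rate"] >= TARGETS["coverage_rate"]
        and measured["avg_review_minutes"] <= TARGETS["avg_review_minutes"]
    )
```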

Rule three: build an eval harness on day one

Before you ship a single token to production, build:

  • A golden test set: 20 to 50 realistic examples with expected outputs, curated with the domain owner
  • An automated evaluator: runs your golden set on every prompt change and scores quality
  • A regression trigger: a CI-style gate so no prompt or model change ships without passing the eval

This is the single biggest predictor of whether your agent survives its second month. Without evals, every change becomes a guess.
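The first version of the harness can be a single script: load the golden set, call the agent, score each case, and fail the build if the score drops. A minimal sketch, assuming a `run_agent(inputs)` entry point you already have and an exact-match scorer; swap in a rubric or LLM-as-judge scorer where exact match is too strict.

```python
import json
import sys

def run_agent(inputs: dict) -> str:
    """Your agent entry point. Assumed to exist; returns the agent's output."""
    raise NotImplementedError

def score(expected: str, actual: str) -> float:
    """Simplest possible scorer: exact match. Replace with a rubric or judge."""
    return 1.0 if expected.strip() == actual.strip() else 0.0

def main(golden_path: str = "golden_set.jsonl", threshold: float = 0.9) -> None:
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]
    scores = [score(c["expected"], run_agent(c["inputs"])) for c in cases]
    accuracy = sum(scores) / len(scores)
    print(f"golden-set accuracy: {accuracy:.2%} over {len(cases)} cases")
    # CI-style regression gate: a non-zero exit code blocks the deploy.
    sys.exit(0 if accuracy >= threshold else 1)

if __name__ == "__main__":
    main()
```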

Rule four: pick the right model, not your favorite

For the workflow you picked, run a short shootout on your actual data:

| Model class | When it wins |
| --- | --- |
| GPT-class | Broad reasoning, tool use, Assistants API fit |
| Claude (Sonnet / Opus) | Long context, nuance, careful drafting |
| Claude Haiku | Classification and structured output at low cost |
| Azure OpenAI | You need Microsoft-native data residency |
| Open models | High volume + cost sensitive + willing to host |

Cost matters. A $0.30-per-task agent at 5,000 tasks a week eats $78K/year. A $0.01-per-task agent at the same volume is $2.6K/year. Model choice is a business decision.
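The arithmetic is worth encoding once so every candidate in the shootout is compared on the same basis. A minimal sketch using the figures from the paragraph above; the per-task prices are illustrative, not quotes.

```python
def annual_cost(cost_per_task_usd: float, tasks_per_week: int) -> float:
    """Steady-state annual spend for one workflow."""
    return cost_per_task_usd * tasks_per_week * 52

# Example figures from the text: $0.30/task vs $0.01/task at 5,000 tasks/week.
for price in (0.30, 0.01):
    print(f"${price:.2f}/task -> ${annual_cost(price, 5_000):,.0f}/year")
# $0.30/task -> $78,000/year
# $0.01/task -> $2,600/year
```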

Rule five: design the system, not the prompt

A production agent is not "a prompt". It's a system that usually includes:

  • Input validation: reject malformed inputs before they touch the model
  • Retrieval: pull the right context from your knowledge base, DB, or docs
  • Prompt: system, user, and few-shot, with strict output schema
  • Structured output: JSON schema, not free-form text
  • Tool calls: read from or write to your systems
  • Guardrails: output validation, safety checks, PII scrubbing
  • Human-in-the-loop: approval gates where they matter
  • Observability: log every trace, track cost, quality, latency
  • Evaluation harness: run before every deploy

Skip any of these and you'll feel it within weeks.
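To show how a few of those layers fit together, here's a hedged sketch of the output side: a strict schema, validation before anything leaves the agent, and an escalation path when validation or confidence fails. The schema, the threshold, and the `call_model` placeholder are assumptions standing in for your own model client.

```python
import json
from dataclasses import dataclass

ALLOWED_INTENTS = {"billing", "bug", "how_to", "account", "other"}

@dataclass
class TriageResult:
    intent: str
    draft_reply: str
    confidence: float

def call_model(prompt: str) -> str:
    """Placeholder for your model client; must return a JSON string."""
    raise NotImplementedError

def parse_and_validate(raw: str) -> TriageResult | None:
    """Guardrail: reject anything that doesn't match the schema."""
    try:
        data = json.loads(raw)
        result = TriageResult(
            intent=data["intent"],
            draft_reply=data["draft_reply"],
            confidence=float(data["confidence"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    if result.intent not in ALLOWED_INTENTS or not 0.0 <= result.confidence <= 1.0:
        return None
    return result

def triage(prompt: str) -> TriageResult:
    result = parse_and_validate(call_model(prompt))
    if result is None or result.confidence < 0.7:
        # Human-in-the-loop: anything malformed or low-confidence escalates.
        raise RuntimeError("escalate to human review")
    return result
```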

Rule six: ship to a small, real audience first

Production means real users. Real users will do things your golden set didn't cover. That's the point.

Rollout pattern that works:

  1. Internal alpha: one or two champions who want the agent and will give fast feedback
  2. Limited beta: the owning team, with clear rollback plan
  3. Team-wide: full team, with the champion as the support layer
  4. Cross-team: only after metrics hold

Don't skip stages. Each stage surfaces a different class of bug.

Rule seven: make it easy to fix

Things you'll need to change weekly for the first month:

  • Prompts (always)
  • Retrieval (usually)
  • Escalation thresholds (often)
  • Structured output schema (sometimes)

Build deployment so any of these can change without a full release: prompt files in a config store, a re-runnable eval, a CI pipeline that ships on green. If changing a prompt requires a full engineering cycle, you'll fall behind your golden set.
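One lightweight way to get there is to treat prompts as versioned config rather than code: load them at call time so an owner can edit the prompt, rerun the eval, and ship on green without touching the codebase. The paths and naming scheme below are an assumption, not a required layout.

```python
from pathlib import Path

PROMPT_DIR = Path("config/prompts")  # illustrative; could be a config store or DB

def load_prompt(name: str, version: str = "latest") -> str:
    """Prompts live outside the application code so they can change without a release."""
    return (PROMPT_DIR / f"{name}.{version}.txt").read_text()

system_prompt = load_prompt("ticket_triage.system")
# Deploy flow: edit the file, rerun the eval harness, merge only on green.
```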

Rule eight: instrument everything

At minimum, log for every agent call:

  • Inputs (scrubbed for PII where needed)
  • Retrieved context (with sources)
  • Prompt + model + parameters
  • Output + any post-processing
  • Tool calls and their results
  • Latency breakdown
  • Cost estimate
  • Eval score (if this was a golden example)
  • Human feedback (thumbs up/down, rewrite)

Use a purpose-built LLM observability tool (Arize, Langfuse, LangSmith) or roll your own on top of your APM. Do not ship without this.
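Whatever tool you choose, the record itself is simple. A minimal sketch of one trace as a structured log line; the field names are assumptions, not any vendor's schema.

```python
import json
import time
import uuid

def log_trace(inputs, context_sources, model, params, output, tool_calls,
              latency_ms, cost_usd, eval_score=None, feedback=None):
    """Emit one structured record per agent call and ship it to your log pipeline."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "inputs": inputs,                  # scrub PII before logging where needed
        "context_sources": context_sources,
        "model": model,
        "params": params,
        "output": output,
        "tool_calls": tool_calls,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "eval_score": eval_score,          # only set for golden examples
        "human_feedback": feedback,        # thumbs up/down, rewrite
    }
    print(json.dumps(record, default=str))
```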

Rule nine: separate the agent from the workflow

The agent is the LLM-powered brain. The workflow is everything around it: intake, routing, approval, storage, notifications. Keep them separate.

Why? Because you'll replace models over time. The workflow stays. If your workflow logic is tangled into a single prompt, every model change is a rewrite. If your workflow logic lives in code and calls the agent as a function, a model swap is an afternoon.
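In code, that separation can be as plain as a function boundary: the workflow owns intake, approval, and storage, and calls the agent through one narrow interface it can re-point at a new model. A minimal sketch; the `draft_reply` signature and the `queue` and `store` helpers are illustrative, not a prescribed interface.

```python
def draft_reply(ticket: dict) -> dict:
    """The agent: the only place that knows which model and prompt are in use."""
    ...  # retrieval + prompt + model call + output validation live here

def handle_ticket(ticket: dict, queue, store) -> None:
    """The workflow: intake, approval gate, storage, notifications."""
    result = draft_reply(ticket)          # a model swap changes only this call's internals
    queue.add_for_review(ticket, result)  # human-in-the-loop approval
    store.save(ticket["id"], result)      # audit trail
```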

Rule ten: plan the handoff from day one

Who owns this agent in six months? They need:

  • A runbook ("it's returning errors, here's the playbook")
  • An eval set they understand
  • An observability dashboard
  • A rollback procedure
  • A cost alert threshold
  • A clear escalation path for model upgrades

If the only person who understands the agent is the consultant who built it, it will silently degrade until someone decides to "rebuild it properly".

A realistic timeline

An honest first-agent-in-production timeline looks like:

  • Week 0: workflow scoping and success metrics (one week, before the build)
  • Weeks 1 to 2: system design, retrieval, evals, first working version
  • Week 3: internal alpha, guardrails, observability wired up
  • Week 4: limited beta, first round of prompt and eval iteration
  • Weeks 5 to 6: team-wide rollout, runbook, handoff

That's the shape of our AI Agent Sprint.

Common failure modes

  • Building a chat UI when the workflow doesn't need one
  • Skipping the eval harness ("we'll add it later", you won't)
  • Picking the model based on blog posts instead of your data
  • No observability in production
  • Designing for the happy path only
  • No plan for who maintains it after launch
  • Scoping too broadly ("an agent for operations")


Next step

Have the workflow but not the team to build it? 20 minutes on the phone will tell us both whether an AI Agent Sprint fits. Find my AI opportunity.

AI agents · LLM engineering · playbook

Want the plan, not just the playbook?

20 minutes on the phone is often enough to know whether an assessment, sprint, or executive review fits. Nothing to prepare.

Book a call · See services