AI Agent Observability: What to Log and Why You'll Regret Not Doing It

Introduction

Most AI agent projects ship without observability. The first production incident makes it clear why this was a mistake. By then, the system has been live for weeks, and you have no way to understand what it's been doing.

This post covers the minimum observability stack we install for every agent we ship to production: what fields to log, what alerts matter (and which ones become noise), how to instrument cost and quality drift without a six-figure tool bill, and the operational patterns that turn observability data into actual improvements.

If you're building AI agents and haven't thought about observability yet, this is what to set up before you go to production — not after.

Why observability is non-negotiable for agents

Traditional software is largely deterministic. Given the same input and state, you get the same output. Bugs reproduce. Logs tell you what happened.

AI agents are not deterministic. The same input can produce different outputs across runs. Quality varies based on model behavior, context, prompt details, and time-of-day model load. Failure modes are subtle: the agent didn't error, it just made a worse decision than usual. None of this is debuggable without rich observability.

Specific things that go wrong in production agents that traditional observability doesn't catch:

Quality drift: Model behavior changes slightly over time. Aggregate accuracy stays stable; specific use cases degrade.
Cost drift: Average token count per request grows as users discover edge cases. Bills surprise the team.
Tool call patterns shift: Agent starts calling tools in different sequences. Latency increases. Customer-visible behavior changes.
Edge case explosions: Specific input patterns cause the agent to loop, retry excessively, or hallucinate. Affects a small percentage of users but they're very vocal about it.
Silent regressions: Prompt update or model upgrade subtly degrades quality. No errors; just worse outputs.

The 10 fields every decision should log

Per agent decision (per turn or per request), log:

Trace ID: Links the decision to the user session, request, and downstream effects.
Input context: What the agent saw. Truncate if long; redact or log only a sample if it contains PII.
Model version: Exact model + provider used. Critical when models update.
Prompt version: Which prompt template was used. Critical when prompts iterate.
Output: What the agent decided. Structured outputs are easier to query later.
Confidence or score: If available. Useful for drift detection.
Tool calls: Which tools were invoked, with arguments and results.
Latency: End-to-end and per-stage (model call, tool calls, response generation).
Cost: Token counts × pricing, broken down per call. Critical for cost monitoring.
Outcome (if available): Did the decision lead to a successful resolution? Feedback signal.
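
One way to make these ten fields hard to forget is to give them a single record type that every agent code path has to fill in. The sketch below is a minimal illustration, not a reference implementation: `DecisionRecord`, `call_cost_usd`, the `example-model-v1` name, and the prices in `PRICE_PER_1K` are all hypothetical placeholders.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

# Hypothetical per-1K-token prices in USD. Replace with your provider's real rates.
PRICE_PER_1K = {"example-model-v1": {"input": 0.003, "output": 0.015}}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Token counts x pricing for a single model call."""
    price = PRICE_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


@dataclass
class DecisionRecord:
    trace_id: str                      # links to session, request, downstream effects
    input_context: dict[str, Any]      # truncated / redacted before it gets here
    model: str                         # exact model + provider
    prompt_version: str
    output: dict[str, Any]             # structured output is easier to query later
    tool_calls: list[dict[str, Any]]   # name, arguments, result per call
    latency_ms: dict[str, int]         # e.g. {"total": 1200, "model": 900, "tools": 300}
    cost_usd: float
    confidence: Optional[float] = None
    outcome: Optional[dict[str, Any]] = None   # filled in later if feedback arrives
    decision_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize to one JSON line, suitable for a structured log."""
        return json.dumps(asdict(self), sort_keys=True)
```

Making the required fields positional means a code path that forgets the prompt version fails at construction time instead of producing a log line you cannot use later.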

Storage pattern

Write structured logs to Postgres or a dedicated logging service. Trace IDs link decisions across the request flow. Index on trace ID, model version, prompt version, and user ID to support the common queries.

```sql
CREATE TABLE agent_decisions (
  decision_id UUID PRIMARY KEY,
  trace_id UUID NOT NULL,
  user_id UUID,
  tenant_id UUID,
  agent_name TEXT NOT NULL,
  model TEXT NOT NULL,
  prompt_version TEXT NOT NULL,
  input_context JSONB NOT NULL,
  output JSONB NOT NULL,
  confidence FLOAT,
  tool_calls JSONB,
  latency_ms INTEGER NOT NULL,
  cost_usd DECIMAL(10, 6) NOT NULL,
  outcome JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_trace ON agent_decisions(trace_id);
CREATE INDEX idx_user ON agent_decisions(user_id, created_at);
CREATE INDEX idx_model_version ON agent_decisions(model, prompt_version);
```
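
To write into this table from application code, a parameterized insert keeps user-controlled content out of the SQL string. The sketch below only builds the statement and the parameter tuple; it assumes a driver that uses `%s` placeholders (psycopg-style) and that the driver lets Postgres coerce JSON strings into the JSONB columns. Depending on your driver you may need a JSON adapter or an explicit cast. `decision_params` is a hypothetical helper name.

```python
import json
import uuid
from typing import Any

INSERT_SQL = """
INSERT INTO agent_decisions (
    decision_id, trace_id, user_id, tenant_id, agent_name, model,
    prompt_version, input_context, output, confidence, tool_calls,
    latency_ms, cost_usd, outcome
) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
"""


def _json_or_none(value: Any):
    """JSONB columns receive a JSON string, or NULL when the value is absent."""
    return None if value is None else json.dumps(value)


def decision_params(d: dict[str, Any]) -> tuple:
    """Order the values to match the column list in INSERT_SQL."""
    return (
        d.get("decision_id") or str(uuid.uuid4()),
        d["trace_id"],
        d.get("user_id"),
        d.get("tenant_id"),
        d["agent_name"],
        d["model"],
        d["prompt_version"],
        json.dumps(d["input_context"]),
        json.dumps(d["output"]),
        d.get("confidence"),
        _json_or_none(d.get("tool_calls")),
        d["latency_ms"],
        d["cost_usd"],
        _json_or_none(d.get("outcome")),
    )
```

With a DB-API cursor this would be used as `cursor.execute(INSERT_SQL, decision_params(decision))`.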

Alerts that matter (and the noise to avoid)

Most teams over-alert. The 3am page for a single high-latency decision teaches them to ignore alerts. Then they miss the actual issue.

Useful alerts

Cost spike alerts: 10x baseline cost in a 1-hour window. Catches runaway loops or expensive prompt regressions.
Latency regression: p95 latency moves outside expected band for >30 minutes. Catches infrastructure issues or model API slowdowns.
Schema validation failures: Output doesn't match expected JSON schema at elevated rate. Catches model behavior changes.
Tool call error spikes: Downstream integration errors. Catches infrastructure or auth issues.
Confidence distribution shifts: Distribution of confidence scores moves significantly. Catches quality drift before accuracy degrades.
Outcome rate drops: If you track outcomes (resolution rate, customer satisfaction), alert on significant drops.
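
As a rough illustration of the first two alerts, the sketch below checks a cost spike against a trailing baseline and a latency regression that has to persist before it fires. The function names, the 24-hour baseline, and the 5-minute window size are assumptions chosen for the example; only the 10x factor and the 30-minute threshold come from the list above.

```python
def cost_spike(hourly_costs: list[float], factor: float = 10.0,
               baseline_hours: int = 24) -> bool:
    """True if the latest hour costs at least `factor` x the trailing baseline.

    `hourly_costs` is ordered oldest to newest; the last entry is the hour
    being checked and the preceding `baseline_hours` entries form the baseline.
    """
    if len(hourly_costs) < 2:
        return False  # nothing to compare against yet
    history = hourly_costs[-(baseline_hours + 1):-1]
    baseline = sum(history) / len(history)
    return baseline > 0 and hourly_costs[-1] >= factor * baseline


def sustained_latency_regression(window_p95_ms: list[float],
                                 upper_band_ms: float,
                                 min_windows: int = 7) -> bool:
    """True if the last `min_windows` consecutive windows all exceed the band.

    With 5-minute windows, 7 windows is 35 minutes, i.e. more than 30 minutes.
    A single slow window never fires.
    """
    recent = window_p95_ms[-min_windows:]
    return len(recent) == min_windows and all(v > upper_band_ms for v in recent)
```

Both checks work on aggregates (hourly totals, per-window p95), which is what keeps a single slow or expensive decision from paging anyone.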

Alerts that become noise

Per-decision high latency: One slow call is meaningless. Alert on aggregate behavior, not individual decisions.
Alerts on output content: "The agent said something unexpected." You'll get woken up at 3am for content the team should review the next morning.
Cost per individual decision: You want cost trends, not point alerts. Aggregate over time.
Tool call failure alerts on retryable failures: If the retry succeeds, don't alert. Only alert on persistent failures.
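
The last point can be enforced in code by separating "a retry happened" from "all retries failed". In this sketch, `call_with_retries` and `PersistentToolFailure` are hypothetical names, and the linear backoff is only a placeholder for whatever retry policy you already use. Retries are reported through a callback for logging; only the final exception is something to alert on.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PersistentToolFailure(Exception):
    """Raised only after every attempt has failed. This is the alertable event."""


def call_with_retries(
    tool: Callable[[], T],
    attempts: int = 3,
    backoff_s: float = 0.5,
    on_retry: Optional[Callable[[int, Exception], None]] = None,
) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return tool()
        except Exception as exc:  # sketch: narrow this to retryable errors in real code
            last_error = exc
            if on_retry is not None:
                on_retry(attempt, exc)  # record the failed attempt in logs; do not page
            if attempt < attempts:
                time.sleep(backoff_s * attempt)  # simple linear backoff
    raise PersistentToolFailure(
        f"tool failed after {attempts} attempts"
    ) from last_error
```

The retry count is still worth logging per decision: a rising retry rate is a trend for the dashboard even when every call eventually succeeds.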

Dashboards you'll actually use

Three dashboards we install for every production agent system:

Operational dashboard

Request rate, latency p50/p95/p99, error rate, cost per hour, active model versions. Watched by ops team during deploys and incidents.
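
If the dashboard backend does not compute percentiles for you, they are straightforward to derive from logged latencies. A minimal sketch using the standard library; the function names and the `"error"` status value are assumptions for the example.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 over one window of per-decision latencies (needs >= 2 samples)."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def error_rate(statuses: list[str]) -> float:
    """Fraction of decisions in the window whose status is 'error'."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s == "error") / len(statuses)
```

Computing these per fixed window (for example, every 5 minutes) gives the time series the operational dashboard plots.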

Quality dashboard

Output quality scores (from evals or sampled human review), confidence distribution, tool call success rates, outcome rate. Reviewed weekly to catch quality drift.
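
The post does not prescribe how to measure a shift in the confidence distribution. One common option, swapped in here purely as an illustration, is the Population Stability Index (PSI) between a baseline sample and the current sample. The sketch assumes confidence scores lie in [0, 1]; the bin count and the `eps` floor for empty bins are arbitrary choices.

```python
import math


def psi(baseline: list[float], current: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""

    def proportions(scores: list[float]) -> list[float]:
        counts = [0] * bins
        for s in scores:
            # Clamp so that 1.0 (and any out-of-range value) lands in a valid bin.
            idx = max(0, min(int(s * bins), bins - 1))
            counts[idx] += 1
        total = len(scores)
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / total, eps) for c in counts]

    base, cur = proportions(baseline), proportions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

A PSI of zero means the binned distributions are identical. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the right threshold for a given agent is something to calibrate against its own history.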

Cost dashboard

Cost per tenant, cost per request type, cost per model, cost over time. Often shows surprises that nobody anticipated. Useful for capacity planning and cost optimization.
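
The breakdowns on this dashboard are all the same aggregation with a different grouping key. A small sketch, assuming each logged decision is available as a dict with a `cost_usd` field plus whatever grouping fields you log (`tenant_id` and `model` here match the table above; `cost_by` is a hypothetical helper):

```python
from collections import defaultdict
from typing import Any


def cost_by(decisions: list[dict[str, Any]], key: str) -> dict[Any, float]:
    """Total cost grouped by `key`, most expensive group first."""
    totals: dict[Any, float] = defaultdict(float)
    for d in decisions:
        totals[d[key]] += d["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

For example, `cost_by(decisions, "tenant_id")` and `cost_by(decisions, "model")` give two of the four views listed above; sorting by total puts the surprises at the top.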

Connecting observability to evals

Observability data feeds your eval suite. The pattern that works:

Production logs include enough context to replay decisions against current code.
Sample interesting cases (failures, edge cases, low confidence) to your eval test set.
Run eval suite on every prompt change or model update; compare against production baseline.
When eval scores regress, investigate before deploying.

This closes the loop: production teaches you what cases matter, evals catch regressions before they reach production.
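
The sampling step can be sketched as a filter over logged decisions. Everything specific here is an assumption for illustration: the `outcome["success"]` flag, the 0.6 confidence floor, and the small random sample that keeps ordinary cases represented in the eval set.

```python
import random
from typing import Any


def select_eval_candidates(
    decisions: list[dict[str, Any]],
    confidence_floor: float = 0.6,
    random_rate: float = 0.01,
    seed: int = 0,
) -> list[dict[str, Any]]:
    """Pick failures, low-confidence decisions, and a small random sample."""
    rng = random.Random(seed)  # seeded so a given run is reproducible
    picked = []
    for d in decisions:
        outcome = d.get("outcome") or {}
        confidence = d.get("confidence")
        if outcome.get("success") is False:
            reason = "failure"
        elif confidence is not None and confidence < confidence_floor:
            reason = "low_confidence"
        elif rng.random() < random_rate:
            reason = "random_sample"
        else:
            continue
        picked.append({
            "decision_id": d["decision_id"],
            "input_context": d["input_context"],  # what gets replayed
            "reason": reason,
            "expected": None,  # to be filled in by a human reviewer
        })
    return picked
```

The `expected` field is left empty on purpose: a logged production output is not a ground-truth label, so each candidate needs review before it becomes an eval case.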

Observability without a six-figure tool bill

Commercial AI observability tools (LangSmith, Helicone, Arize) work well and cost real money. For smaller deployments or cost-sensitive teams, you can get most of the value from:

Postgres for structured logs. Your application is already on Postgres. Add a decisions table.
Grafana for dashboards. The open-source edition is free and has a built-in PostgreSQL data source. Build the three dashboards above.
OpenTelemetry for traces. Standard, vendor-neutral, integrates with most observability backends.
Custom eval harness. Python script that runs your eval suite against current code and compares to baseline.
Sentry for errors. You're probably already using it.

By our rough estimate, this stack delivers about 80% of the value of commercial AI observability tools at about 5% of the cost. Move to commercial tools when scale or team size justifies it, usually when you have multiple production agents and a dedicated team operating them.

Operationalizing observability

Observability data is only valuable if someone's looking at it. Habits we put in place:

Weekly quality review. Team looks at sampled decisions, flags issues, adds cases to eval set.
Cost review in product reviews. Cost dashboard is part of every product check-in. Surprises get investigated.
Pre-deploy eval gates. Eval suite runs on every code change, prompt change, and model update. Quality regressions block deploys.
Post-incident replay. When something goes wrong, replay the decisions to understand what happened.
Monthly drift review. Look at confidence distributions, outcome rates, cost trends. Catch slow degradations.
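
The pre-deploy gate in this list reduces to comparing candidate eval scores against a stored baseline and failing the build on a regression. A minimal sketch, assuming higher scores are better and that both runs produce a dict of metric name to score; `eval_gate` and the 0.02 tolerance are hypothetical.

```python
def eval_gate(baseline: dict[str, float], candidate: dict[str, float],
              max_drop: float = 0.02) -> list[str]:
    """Return the metrics that regressed; an empty list means the gate passes.

    A metric regresses if it is missing from the candidate run or if its
    score dropped by more than `max_drop` relative to the baseline.
    """
    regressions = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        if cand_score is None or base_score - cand_score > max_drop:
            regressions.append(metric)
    return regressions
```

In CI this would typically end with something like `sys.exit(1 if eval_gate(baseline, candidate) else 0)`, so that a non-empty list blocks the deploy.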

Conclusion

Eval harnesses + production observability + sensible alerts are the difference between an agent that ships and an agent that gets rolled back. Build the boring infrastructure first; the model work is the easy part.

The discipline of logging the right fields and looking at the data regularly is more important than the specific tools. Teams with expensive observability tools and no observability culture miss issues that teams with Postgres + grep catch routinely.

If you're shipping an agent and haven't set up observability yet, do this before you go to production. We help clients install the minimum viable observability stack and the operational practices around it. The setup takes 1-2 weeks; the value pays for itself the first time it catches a production issue.