ULTRATHINK
Solutions
Architecture February 16, 2026

AI Agent Observability in Production: openclaw-logfire

AI agents are taking on real work that moves the P&L. But most teams deploy them blind—no cost governance, no audit trail, no traceable connection between an agent action and a business outcome. We built the missing observability layer. Then we open-sourced it.

Nick Amabile
Founder & CEO

Would you ship a payment system with no logs? An ETL pipeline with no monitoring? A microservice with no traces?

Then why are teams deploying AI agents—autonomous processes making expensive API calls, executing tool actions, and interacting with customers around the clock—with nothing but stdout and hope?

We did this ourselves, briefly. When we integrated OpenClaw into our Axon platform over a weekend, we had four agents running real business workflows: a Chief of Staff coordinating approvals in Slack, a Marketing agent executing ABM campaigns, a Content agent writing brand-aligned copy, a Coding agent shipping features. Real work. Real spend. Real consequences if something goes wrong at 3am.

The Ultrathink Axon™ backend already had full observability—Langfuse traced every LLM call, Logfire instrumented every API endpoint and workflow. But the OpenClaw agent layer was a black box. We could see the inputs and outputs. We couldn’t see what happened in between. We couldn’t answer the most basic operational questions: which agent is burning tokens? Which tool call is failing silently? Is this agent actually creating value, or is it just generating cost?

This isn’t a niche concern. OpenClaw has become the fastest-growing AI agent framework in history—160,000+ GitHub stars, 100,000+ active installations, enterprise adoption accelerating past 30%. But the security reality is sobering. CrowdStrike, Fortune, and VentureBeat are sounding alarms: tens of thousands of exposed instances, nearly 900 malicious skills, a critical RCE vulnerability (CVE-2026-25253), and shadow AI agents appearing on corporate networks without IT’s knowledge. The attack surface is growing faster than most security teams can audit.

That’s not a production system. That’s a liability. Observability isn’t just about cost governance—it’s how you detect when an agent is compromised, when a malicious skill is exfiltrating data, or when an autonomous process exceeds its scope at 3am. So we built the missing layer. Then we open-sourced it.

The trust gap: demos don’t need governance, production systems do

Here’s the uncomfortable truth about most AI agent deployments: they’re demos running against live data. They look impressive. The Slack messages are polished. The tool calls seem smart. But ask the team running them five basic questions and watch the confidence evaporate:

  • “What did this agent cost last month?” — “Uh, we can check the API invoice… maybe $400? Or was that the whole team?”
  • “Why did token usage spike 3x on Tuesday?” — “We’d have to check the logs. Maybe the agent got stuck in a loop?”
  • “What tool calls did the agent make before it sent that email?” — “We don’t really track that granularly.”
  • “Can you prove this agent generated revenue?” — “It definitely helped. We think.”
  • “What happens when the agent hits a rate limit at 2am?” — “It… probably retries?”

This is the Execution Gap applied to operations. The agent works in the demo. But nobody can prove it’s actually creating value. Nobody can govern its costs. Nobody can audit its actions. And when it breaks—which it will—nobody can diagnose the problem faster than “let me grep through some log files.”

If you’re running AI agents on real business workflows—workflows that move the P&L—this gap is not a technical inconvenience. It’s an operational and financial risk.

What AI agent observability actually means in production

AI observability is the ability to understand what your AI agents are doing, why they’re doing it, what it costs, and whether it’s working. It’s not logs. It’s not dashboards. It’s traces, metrics, and governance unified into a system that lets you answer four questions about every agent action:

  1. Cost. How much did this cost, broken down by agent, model, and task type? Are costs trending up or down? Where is caching effective?
  2. Performance. What’s the p95 completion time? Which tool calls are slowest? Where are the bottlenecks?
  3. Auditability. What exactly did the agent do, in what order, with what data? Can you reconstruct the full decision chain?
  4. Governance. Is the agent operating within its budget? Within its permissions? Within its scope?

These aren’t nice-to-haves. They’re the same questions you ask about every production system. The difference is that for web apps and data pipelines, the tooling to answer them has existed for a decade. For AI agents, it barely exists at all.

Most agent frameworks give you log lines. Maybe structured JSON if you’re lucky. That’s not observability—that’s a prayer disguised as engineering.

The answer already exists. Nobody’s using it.

The OTEL standard solved this problem for microservices years ago. Traces, metrics, and logs unified under a single data model—vendor-neutral, standardized, battle-tested. And as of 2025, OTEL published GenAI semantic conventions: a standardized vocabulary specifically for LLM telemetry.

  • Span names like chat anthropic.claude-sonnet-4-5-20250929 that identify every model call
  • Attributes like gen_ai.usage.input_tokens and gen_ai.usage.output_tokens for precise cost attribution
  • Events for prompt and completion content (opt-in, redactable) for full auditability
  • Span hierarchies that nest LLM calls inside tool calls inside agent invocations—the full decision chain, traceable

Almost no agent framework implements them. OpenClaw doesn’t. LangChain has partial support through LangSmith, but it’s proprietary. The standards exist. Nobody’s using them. That’s the gap we closed.
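To make the conventions concrete, here is a minimal sketch of the span name and attribute record they prescribe. The helper names are ours, not part of any SDK; the attribute keys and the "{operation} {model}" naming pattern come from the published GenAI semantic conventions, and real code would attach these via the OTEL API rather than plain objects:

```typescript
// Sketch of the OTEL GenAI span shape. Helper names are illustrative;
// attribute keys follow the published GenAI semantic conventions.
type GenAiAttributes = {
  "gen_ai.system": string;
  "gen_ai.request.model": string;
  "gen_ai.usage.input_tokens": number;
  "gen_ai.usage.output_tokens": number;
};

// Convention: span name is "{operation} {model}".
function genAiSpanName(operation: string, model: string): string {
  return `${operation} ${model}`;
}

function genAiAttributes(
  system: string,
  model: string,
  inputTokens: number,
  outputTokens: number,
): GenAiAttributes {
  return {
    "gen_ai.system": system,
    "gen_ai.request.model": model,
    "gen_ai.usage.input_tokens": inputTokens,
    "gen_ai.usage.output_tokens": outputTokens,
  };
}

const name = genAiSpanName("chat", "anthropic.claude-sonnet-4-5-20250929");
// name === "chat anthropic.claude-sonnet-4-5-20250929"
```

Because the keys are standardized, any OTEL-compatible backend can aggregate token usage across frameworks without custom parsing.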

openclaw-logfire: the questions you can now answer

openclaw-logfire is an OpenClaw plugin that instruments the full agent lifecycle using OTEL GenAI semantic conventions. Zero-config: install via openclaw plugins install @ultrathink-solutions/openclaw-logfire, set LOGFIRE_TOKEN, and every agent invocation becomes fully traceable.

Cost governance

Every LLM completion span carries gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. Query them via SQL in Logfire. Our Chief of Staff averages 1,200 output tokens per invocation. The Content agent averages 3,400. The Marketing agent spikes to 8,000 during ABM research.

We set LiteLLM budget caps knowing the expected burn rate—not guessing. When costs drift, we see it in the histogram before the invoice arrives.
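Turning token counts into an expected burn rate is back-of-envelope arithmetic over the two usage attributes. A sketch, with placeholder per-million-token rates (not any provider's actual pricing):

```typescript
// Estimate cost per invocation from GenAI token attributes.
// Rates are illustrative placeholders (USD per million tokens), not real pricing.
const RATES = { inputPerMillion: 3.0, outputPerMillion: 15.0 };

function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * RATES.inputPerMillion +
      outputTokens * RATES.outputPerMillion) /
    1_000_000
  );
}

// Using the token counts from the trace example in this post:
const cost = estimateCostUsd(3847, 1203);
// ≈ $0.0296 per invocation; multiply by invocations per day for daily burn
```

Multiply the per-invocation figure by observed invocation volume and you have a defensible number to feed into a budget cap instead of a guess.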

Performance tracing

Every agent action becomes a trace tree—agent invocation → context loading → LLM completion → tool calls → response delivery. When the Marketing agent’s ABM workflow takes 45 seconds instead of 15, the trace shows exactly where: the Apollo MCP server is hitting rate limits. Diagnosis in seconds, not hours.

Full auditability

The trace chain shows every tool call, every model interaction, every decision the agent made. When the Content agent publishes a blog post, we trace backward: which knowledge base chunks did it retrieve? What brand guidelines did it reference? What model generated the draft? Every step, attributable.

Automatic secret redaction

OTEL GenAI events capture prompt content for debugging—but openclaw-logfire automatically redacts API keys (sk-*, axn_live_*, Bearer tokens), environment variable references, and optionally PII before data leaves the process. The trace backend never sees sensitive content. You can’t have a debug tool that creates a new attack surface.
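The core of pre-export redaction is pattern substitution before any span event leaves the process. The patterns below are a simplified sketch of the idea, not the plugin's actual rule set:

```typescript
// Simplified sketch of pre-export secret redaction. The real plugin's
// pattern list is more extensive; these regexes are illustrative only.
const SECRET_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9_-]{8,}/g,      // OpenAI/Anthropic-style API keys
  /axn_live_[A-Za-z0-9]+/g,     // Axon live tokens
  /Bearer\s+[A-Za-z0-9._-]+/g,  // Authorization header values
];

function redact(text: string): string {
  return SECRET_PATTERNS.reduce(
    (acc, pattern) => acc.replace(pattern, "[REDACTED]"),
    text,
  );
}

const safe = redact("Authorization: Bearer eyJhbGciOi key=sk-abc123def456");
// Both the bearer token and the API key are replaced with [REDACTED]
```

The important design property is where this runs: in-process, before export, so a compromised or merely curious trace backend never receives the raw values.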

Distributed tracing

When an agent calls an MCP server that’s also instrumented with OTEL, trace context propagates automatically. One trace ID from Slack message to OpenClaw gateway to Axon API to Temporal workflow to Apollo MCP server and back. No log correlation. No timestamp matching. One waterfall view.
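Cross-service propagation like this presumably rides on the W3C Trace Context traceparent header, which OTEL propagates by default. A minimal sketch of building and parsing that header (version 00 and the sampled flag follow the W3C spec; the helper names are ours):

```typescript
// Minimal W3C Trace Context "traceparent" helpers. OTEL propagates this
// header automatically; shown here only to make the mechanism concrete.
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

function buildTraceparent(ctx: TraceContext): string {
  // Format: version-traceId-parentSpanId-flags (version 00, flag 01 = sampled)
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: m[3] === "01" };
}

const header = buildTraceparent({
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
});
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

Every hop that forwards this header joins the same trace, which is what collapses Slack-to-MCP-server round trips into one waterfall view.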

What the trace looks like

Every agent action maps to a trace tree that mirrors actual execution. This is the full decision chain—reconstructable, attributable, auditable:

Agent Invocation (root span)
├── Context Loading
│   ├── Memory Retrieval (Mem0)
│   └── Knowledge Base Query (Qdrant)
├── LLM Completion (gen_ai span)
│   ├── gen_ai.system: "anthropic"
│   ├── gen_ai.request.model: "claude-sonnet-4-5-20250929"
│   ├── gen_ai.usage.input_tokens: 3847
│   ├── gen_ai.usage.output_tokens: 1203
│   └── gen_ai.response.finish_reasons: ["end_turn"]
├── Tool Call: apollo_enrich_company
│   ├── LLM Completion (tool response parsing)
│   └── HTTP Request to MCP Server
├── Tool Call: shared_memory_store
│   └── HTTP Request to Mem0 API
└── Response Delivery
    └── Webhook POST to Slack

The root span captures the full invocation duration and outcome. Child spans capture each phase. LLM completion spans carry the GenAI attributes that power cost dashboards and token histograms. When an agent burns through its budget at 3am, you open Logfire, filter by gen_ai.usage.output_tokens > 5000, and see exactly which invocation, which tool call, and which model generated the runaway completion.
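In Logfire that 3am triage is a SQL filter; the same query over exported span records can be sketched in plain TypeScript (the span shape here is a simplified stand-in for real OTEL span data):

```typescript
// Find runaway completions: spans whose output token count exceeds a budget.
// SpanRecord is a simplified stand-in for exported OTEL span data.
interface SpanRecord {
  name: string;
  attributes: Record<string, string | number>;
}

function runawayCompletions(
  spans: SpanRecord[],
  maxOutputTokens = 5000,
): SpanRecord[] {
  return spans.filter(
    (s) =>
      Number(s.attributes["gen_ai.usage.output_tokens"] ?? 0) > maxOutputTokens,
  );
}

const spans: SpanRecord[] = [
  { name: "chat claude-sonnet", attributes: { "gen_ai.usage.output_tokens": 1203 } },
  { name: "chat claude-sonnet", attributes: { "gen_ai.usage.output_tokens": 8741 } },
];
const suspects = runawayCompletions(spans); // flags only the 8741-token span
```

Because the attribute key is standardized, the same filter works across every agent, model, and framework that emits GenAI-conventional spans.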

That’s the difference between operating a system and hoping a system works.

Five layers of observability, one platform

openclaw-logfire is one layer of the Ultrathink Axon observability stack:

  • LLM layer: Langfuse traces every LLM call with source attribution and cost tracking
  • Agent layer: openclaw-logfire instruments the OpenClaw agent lifecycle with OTEL GenAI spans
  • Infrastructure layer: Logfire instruments FastAPI, Temporal, Redis, Qdrant, and HTTP clients
  • Network layer: Tailscale provides connection logs and access audit trails
  • Cost layer: LiteLLM provides per-agent budget tracking with automatic cutoff

You can’t improve what you can’t measure. You can’t govern what you can’t see. And you can’t trust what you can’t trace.

This is the observability philosophy behind the Modern AI Application Stack—and it’s table stakes for our Outcome Partnership model, where we operate, monitor, and continuously improve production AI systems alongside our clients. When we put skin in the game on outcomes, we need to see every layer. So do you.

Get started

# Install the plugin
openclaw plugins install @ultrathink-solutions/openclaw-logfire

# Set your Logfire token
export LOGFIRE_TOKEN=your-token-here

# Restart OpenClaw — traces start flowing immediately

~400 lines of TypeScript. We kept it minimal because the hard part isn’t the plugin—it’s deciding that observability is non-negotiable before you learn the hard way.

Need observability and governance for your AI agent infrastructure?

Start with a Pathfinder Engagement™: we’ll assess your AI agent stack, design the observability and governance architecture, and deliver a production-ready system on the Ultrathink Axon platform. You own it. No lock-in.

Prove value in 6 weeks

This is a companion post to We Put OpenClaw Into Production in a Weekend. For more on building production-grade AI systems, see The Modern AI Application Stack and AI Agents: Build vs. Buy vs. Partner.

Ready to Close the Execution Gap?

Take the next step from insight to action.

No sales pitches. No buzzwords. Just a straightforward discussion about your challenges.