AI agents are taking on real work that moves the P&L. But most teams deploy them blind—no cost governance, no audit trail, no traceable connection between an agent action and a business outcome. We built the missing observability layer. Then we open-sourced it.
Would you ship a payment system with no logs? An ETL pipeline with no monitoring? A microservice with no traces?
Then why are teams deploying AI agents—autonomous processes making expensive API calls, executing tool actions, and interacting with customers around the clock—with nothing but stdout and hope?
We did this ourselves, briefly. When we integrated OpenClaw into our Axon platform over a weekend, we had four agents running real business workflows: a Chief of Staff coordinating approvals in Slack, a Marketing agent executing ABM campaigns, a Content agent writing brand-aligned copy, a Coding agent shipping features. Real work. Real spend. Real consequences if something goes wrong at 3am.
The Ultrathink Axon™ backend already had full observability—Langfuse traced every LLM call, Logfire instrumented every API endpoint and workflow. But the OpenClaw agent layer was a black box. We could see the inputs and outputs. We couldn’t see what happened in between. We couldn’t answer the most basic operational questions: which agent is burning tokens? Which tool call is failing silently? Is this agent actually creating value, or is it just generating cost?
This isn’t a niche concern. OpenClaw has become the fastest-growing AI agent framework in history—160,000+ GitHub stars, 100,000+ active installations, enterprise adoption accelerating past 30%. But the security reality is sobering. CrowdStrike, Fortune, and VentureBeat are sounding alarms: tens of thousands of exposed instances, nearly 900 malicious skills, a critical RCE vulnerability (CVE-2026-25253), and shadow AI agents appearing on corporate networks without IT’s knowledge. The attack surface is growing faster than most security teams can audit.
That’s not a production system. That’s a liability. Observability isn’t just about cost governance—it’s how you detect when an agent is compromised, when a malicious skill is exfiltrating data, or when an autonomous process exceeds its scope at 3am. So we built the missing layer. Then we open-sourced it.
Here’s the uncomfortable truth about most AI agent deployments: they’re demos running against live data. They look impressive. The Slack messages are polished. The tool calls seem smart. But ask the team running them a few basic operational questions and watch the confidence evaporate.
This is the Execution Gap applied to operations. The agent works in the demo. But nobody can prove it’s actually creating value. Nobody can govern its costs. Nobody can audit its actions. And when it breaks—which it will—nobody can diagnose the problem faster than “let me grep through some log files.”
If you’re running AI agents on real business workflows—workflows that move the P&L—this gap is not a technical inconvenience. It’s an operational and financial risk.
AI observability is the ability to understand what your AI agents are doing, why they’re doing it, what it costs, and whether it’s working. It’s not logs. It’s not dashboards. It’s traces, metrics, and governance unified into a system that lets you answer four questions about every agent action: what did it do, why did it do it, what did it cost, and did it work?
These aren’t nice-to-haves. They’re the same questions you ask about every production system. The difference is that for web apps and data pipelines, the tooling to answer them has existed for a decade. For AI agents, it barely exists at all.
Most agent frameworks give you log lines. Maybe structured JSON if you’re lucky. That’s not observability—that’s a prayer disguised as engineering.
The OTEL standard solved this problem for microservices years ago. Traces, metrics, and logs unified under a single data model—vendor-neutral, standardized, battle-tested. And as of 2025, OTEL published GenAI semantic conventions: a standardized vocabulary specifically for LLM telemetry.
Span names like `chat anthropic.claude-sonnet-4-5-20250929` identify every model call, and the `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` attributes enable precise cost attribution.
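As a rough illustration, the convention boils down to a span name of the form `{operation} {model}` plus a flat attribute record. The attribute keys below follow the published OTEL GenAI semantic conventions; the helper functions themselves are ours, not part of any SDK (actual span creation through an OTEL SDK is omitted):

```typescript
// Build the span name and attribute record a GenAI completion span carries.
// Keys follow the OTEL GenAI semantic conventions; the helpers are illustrative.
type GenAiAttributes = {
  "gen_ai.system": string;
  "gen_ai.request.model": string;
  "gen_ai.usage.input_tokens": number;
  "gen_ai.usage.output_tokens": number;
};

// Per the conventions, completion spans are named "{operation} {model}".
function genAiSpanName(operation: string, model: string): string {
  return `${operation} ${model}`;
}

function genAiAttributes(
  system: string,
  model: string,
  inputTokens: number,
  outputTokens: number
): GenAiAttributes {
  return {
    "gen_ai.system": system,
    "gen_ai.request.model": model,
    "gen_ai.usage.input_tokens": inputTokens,
    "gen_ai.usage.output_tokens": outputTokens,
  };
}

const spanName = genAiSpanName("chat", "anthropic.claude-sonnet-4-5-20250929");
const attrs = genAiAttributes("anthropic", "claude-sonnet-4-5-20250929", 3847, 1203);
console.log(spanName); // chat anthropic.claude-sonnet-4-5-20250929
console.log(attrs["gen_ai.usage.output_tokens"]); // 1203
```

Because the keys are standardized, any OTEL-aware backend can aggregate token usage across vendors without custom parsing.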
`openclaw-logfire` is an OpenClaw plugin that instruments the full agent lifecycle using OTEL GenAI semantic conventions. Zero-config: set `LOGFIRE_TOKEN`, install via `openclaw plugins install @ultrathink-solutions/openclaw-logfire`, and every agent invocation becomes fully traceable.
Every LLM completion span carries `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`. Query them via SQL in Logfire. Our Chief of Staff averages 1,200 output tokens per invocation. The Content agent averages 3,400. The Marketing agent spikes to 8,000 during ABM research. We set LiteLLM budget caps knowing the expected burn rate, not guessing. When costs drift, we see it in the histogram before the invoice arrives.
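Turning those per-invocation averages into an expected burn rate is simple arithmetic. A minimal sketch, where the per-million-token price and the daily invocation counts are placeholder assumptions for illustration, not our actual rates:

```typescript
// Estimate expected daily token burn and cost from per-invocation averages.
// Prices and invocation counts below are placeholder assumptions.
type AgentProfile = {
  name: string;
  avgOutputTokens: number;   // observed in the Logfire token histograms
  invocationsPerDay: number; // assumed workload, plug in your own
};

const PRICE_PER_MILLION_OUTPUT_TOKENS = 15; // USD, placeholder rate

function dailyOutputTokens(p: AgentProfile): number {
  return p.avgOutputTokens * p.invocationsPerDay;
}

function dailyCostUsd(p: AgentProfile): number {
  return (dailyOutputTokens(p) / 1_000_000) * PRICE_PER_MILLION_OUTPUT_TOKENS;
}

const agents: AgentProfile[] = [
  { name: "Chief of Staff", avgOutputTokens: 1200, invocationsPerDay: 200 },
  { name: "Content",        avgOutputTokens: 3400, invocationsPerDay: 50 },
  { name: "Marketing",      avgOutputTokens: 8000, invocationsPerDay: 40 },
];

for (const a of agents) {
  console.log(`${a.name}: ${dailyOutputTokens(a)} output tokens/day, ~$${dailyCostUsd(a).toFixed(2)}`);
}
```

The point is the direction of the workflow: measured averages in, budget caps out, rather than picking a cap and hoping.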
Every agent action becomes a trace tree—agent invocation → context loading → LLM completion → tool calls → response delivery. When the Marketing agent’s ABM workflow takes 45 seconds instead of 15, the trace shows exactly where: the Apollo MCP server is hitting rate limits. Diagnosis in seconds, not hours.
The trace chain shows every tool call, every model interaction, every decision the agent made. When the Content agent publishes a blog post, we trace backward: which knowledge base chunks did it retrieve? What brand guidelines did it reference? What model generated the draft? Every step, attributable.
OTEL GenAI events capture prompt content for debugging, but `openclaw-logfire` automatically redacts API keys (`sk-*`, `axn_live_*`, Bearer tokens), environment variable references, and optionally PII before data leaves the process. The trace backend never sees sensitive content. You can’t have a debug tool that creates a new attack surface.
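A minimal sketch of what pre-export redaction looks like, assuming regex-based scrubbing. The exact patterns `openclaw-logfire` ships are not reproduced here; these four are illustrative:

```typescript
// Scrub secrets from span payloads before they leave the process.
// Illustrative patterns only; a real implementation would be more thorough.
const REDACTION_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9_-]+/g,             // provider API keys
  /axn_live_[A-Za-z0-9_-]+/g,       // Axon live keys
  /Bearer\s+[A-Za-z0-9._~+/=-]+/g,  // HTTP bearer tokens
  /\$\{?[A-Z][A-Z0-9_]*\}?/g,       // environment variable references
];

function redact(text: string): string {
  let out = text;
  for (const pattern of REDACTION_PATTERNS) {
    out = out.replace(pattern, "[REDACTED]");
  }
  return out;
}

console.log(redact("key=sk-abc123"));
// key=[REDACTED]
```

The design point is where this runs: in-process, before export, so the trace backend is never trusted with the original value.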
When an agent calls an MCP server that’s also instrumented with OTEL, trace context propagates automatically. One trace ID from Slack message to OpenClaw gateway to Axon API to Temporal workflow to Apollo MCP server and back. No log correlation. No timestamp matching. One waterfall view.
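The mechanism underneath is the W3C Trace Context `traceparent` header, which each instrumented hop forwards. A sketch of the header format (the format follows the W3C spec; the function names are ours):

```typescript
// Format and parse a W3C traceparent header: version-traceid-spanid-flags.
// This is what lets one trace ID follow a request across every hop.
function makeTraceparent(traceId: string, spanId: string, sampled = true): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): { traceId: string; spanId: string } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(header);
  return m ? { traceId: m[1], spanId: m[2] } : null;
}

// Outgoing hop: attach the current span's context to the downstream request.
const traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
const header = makeTraceparent(traceId, "00f067aa0ba902b7");
console.log(header); // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

// Incoming hop: the downstream service continues the same trace.
console.log(parseTraceparent(header)?.traceId === traceId); // true
```

In practice the OTEL SDK injects and extracts this header for you; the sketch just shows why no log correlation or timestamp matching is needed.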
Every agent action maps to a trace tree that mirrors actual execution. This is the full decision chain—reconstructable, attributable, auditable:
```
Agent Invocation (root span)
├── Context Loading
│   ├── Memory Retrieval (Mem0)
│   └── Knowledge Base Query (Qdrant)
├── LLM Completion (gen_ai span)
│   ├── gen_ai.system: "anthropic"
│   ├── gen_ai.request.model: "claude-sonnet-4-5-20250929"
│   ├── gen_ai.usage.input_tokens: 3847
│   ├── gen_ai.usage.output_tokens: 1203
│   └── gen_ai.response.finish_reasons: ["end_turn"]
├── Tool Call: apollo_enrich_company
│   ├── LLM Completion (tool response parsing)
│   └── HTTP Request to MCP Server
├── Tool Call: shared_memory_store
│   └── HTTP Request to Mem0 API
└── Response Delivery
    └── Webhook POST to Slack
```
The root span captures the full invocation duration and outcome. Child spans capture each phase. LLM completion spans carry the GenAI attributes that power cost dashboards and token histograms. When an agent burns through its budget at 3am, you open Logfire, filter by `gen_ai.usage.output_tokens > 5000`, and see exactly which invocation, which tool call, and which model generated the runaway completion.
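In Logfire that filter is an ordinary SQL query over the trace store. A sketch, assuming a `records` table with a JSON `attributes` column; treat the exact table and column names as assumptions to verify against your backend:

```sql
-- Find runaway completions: which spans produced > 5000 output tokens?
-- Table/column layout is assumed, not taken from Logfire documentation.
SELECT
  span_name,
  start_timestamp,
  attributes->>'gen_ai.request.model'       AS model,
  attributes->>'gen_ai.usage.output_tokens' AS output_tokens
FROM records
WHERE (attributes->>'gen_ai.usage.output_tokens')::int > 5000
ORDER BY start_timestamp DESC;
```

Because the GenAI attribute keys are standardized, the same query works regardless of which model provider generated the spans.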
That’s the difference between operating a system and hoping a system works.
openclaw-logfire is one layer of the Ultrathink Axon observability stack:
You can’t improve what you can’t measure. You can’t govern what you can’t see. And you can’t trust what you can’t trace.
This is the observability philosophy behind the Modern AI Application Stack—and it’s table stakes for our Outcome Partnership model, where we operate, monitor, and continuously improve production AI systems alongside our clients. When we put skin in the game on outcomes, we need to see every layer. So do you.
```bash
# Install the plugin
openclaw plugins install @ultrathink-solutions/openclaw-logfire

# Set your Logfire token
export LOGFIRE_TOKEN=your-token-here

# Restart OpenClaw — traces start flowing immediately
```

`@ultrathink-solutions/openclaw-logfire` is ~400 lines of TypeScript. We kept it minimal because the hard part isn’t the plugin—it’s deciding that observability is non-negotiable before you learn the hard way.
Need observability and governance for your AI agent infrastructure?
Start with a Pathfinder Engagement™: we’ll assess your AI agent stack, design the observability and governance architecture, and deliver a production-ready system on the Ultrathink Axon platform. You own it. No lock-in.
Prove value in 6 weeks.

This is a companion post to We Put OpenClaw Into Production in a Weekend. For more on building production-grade AI systems, see The Modern AI Application Stack and AI Agents: Build vs. Buy vs. Partner.
Take the next step from insight to action.
No sales pitches. No buzzwords. Just a straightforward discussion about your challenges.