Architecture · February 19, 2026

MCP Gateway: Why Your AI Application Needs One

The Model Context Protocol solved the integration problem. Connect your LLM to any service. But connect it to ten services and watch what happens: 150+ tool schemas flood the context window before the model touches a single user request. MCP created a discovery problem. We built the fix—and open-sourced it.

Nick Amabile
Founder & CEO

The Model Context Protocol is having a moment. Every SaaS vendor is shipping an MCP server. Anthropic, OpenAI, Google—every major model provider now supports it. The vision is compelling: a universal protocol that lets LLMs talk to any service through a standardized interface. Slack, HubSpot, Jira, your database, your internal APIs—all accessible through one protocol.

It looks like the promised land. There’s just one problem.

We build production AI systems on the Ultrathink Axon™ platform. Our agents connect to Apollo, HubSpot, Ahrefs, Google Analytics, LinkedIn, SEC EDGAR, Google Docs—and that list grows every month. When we wired up our OpenClaw deployment to these MCP servers, the LLM received over 150 tool schemas dumped into context before it processed a single user message.

The result was predictable. Context windows filled up. Token costs spiked. The model started picking the wrong tools—or worse, hallucinating tool names that didn’t exist. We’d solved the integration problem and created a discovery problem.

The context collapse problem

The MCP spec’s tools/list endpoint returns every tool schema upfront. No pagination. No lazy loading. No concept of “give me the tools I actually need.” That design made sense for a world with one or two MCP servers. It falls apart when you connect five. Or ten. Or twenty.

The math is straightforward

Each tool schema is 500–2,000 tokens (name, description, parameter JSON Schema with nested types, descriptions for each field). Connect a few real-world MCP servers:

  • Ahrefs: 100+ tools across site explorer, keywords, web analytics, rank tracker, brand radar
  • Apollo: 30+ tools for people search, enrichment, sequences, contacts
  • HubSpot: 20+ tools for CRM objects, properties, associations, workflows
  • Google Docs + Sheets: 40+ tools for document operations, spreadsheet editing, Drive management

That’s 190+ tools. At 1,000 tokens per schema on average, you’re burning ~190,000 tokens just on tool definitions—before the system prompt, before the conversation history, before the model does any reasoning. At Claude’s $3-per-million input-token rate, that’s roughly $0.57 per conversation turn. For an always-on agent making 50 calls a day, that’s over $850/month in wasted context.
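As a back-of-envelope check (assuming roughly $3 per million input tokens, a common Claude input rate; substitute your model's actual pricing):

```python
# Back-of-envelope context cost of flat tool loading.
# Assumptions: ~1,000 tokens per schema, $3 per million input tokens.
TOOLS = 190
TOKENS_PER_SCHEMA = 1_000
PRICE_PER_MTOK = 3.00  # USD per million input tokens

schema_tokens = TOOLS * TOKENS_PER_SCHEMA              # tokens per turn
cost_per_turn = schema_tokens / 1_000_000 * PRICE_PER_MTOK
monthly = cost_per_turn * 50 * 30                      # 50 calls/day, 30 days

print(f"{schema_tokens:,} tokens, ${cost_per_turn:.2f}/turn, ${monthly:,.0f}/month")
```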

But cost isn’t even the biggest issue. Tool selection accuracy degrades as tool count increases. When a model sees 190 tools, it has to figure out which one to call based on names and descriptions alone. Similar names collide. Descriptions blur together. The model picks apollo_people_search when it meant apollo_contacts_search. Or it fabricates a tool name entirely because it “felt right” based on the pattern.

The tradeoff enterprises face today: limit your MCP integrations to keep context manageable, or connect everything and accept degraded agent performance. Neither option is acceptable.

What existing solutions get wrong

We looked at every approach in the ecosystem before building our own. None of them solved the actual problem.

  • Simple proxies (mcp-proxy, stdio-to-SSE bridges) solve transport. They convert between stdio and SSE or Streamable HTTP. They don’t reduce context—the LLM still sees every tool from every upstream.
  • Manual tool curation works until your third integration. Maintaining a hardcoded list of which tools to expose defeats the purpose of a protocol designed for interoperability. Every new upstream requires a config change.
  • Multiple separate connections keep each MCP server in its own silo. No unified discovery. No cross-domain search. The LLM can’t say “find a tool that does X” across all available services.
  • Naive filtering hides tools the LLM might need. You reduce context but lose flexibility. The model can’t find what it doesn’t know exists.

The fundamental issue is architectural. Every approach above treats the tool list as a flat, all-or-nothing payload. What we needed was progressive discovery—the ability for the LLM to explore available tools in layers, loading full schemas only for the tools it actually needs.

The pattern: 3 meta-tools instead of 150

Think about how you explore an unfamiliar API. You don’t read all 200 endpoint specs first. You browse the categories. You find the section that looks relevant. You read the spec for the one endpoint you need. Then you call it.

LLMs can work the same way. Instead of dumping every tool schema into context, give the model 3 meta-tools that enable on-demand discovery:

  1. discover_tools — Browse available domains (services), drill into tool groups, or search by keyword. No args returns every domain with tool counts. Passing a domain returns its tools. Passing a query runs a keyword search across all tools.
  2. get_tool_schema — Fetch the full JSON Schema for a specific tool. This is the only moment a complete schema enters context. Includes fuzzy matching: if the model types apollo_people_serch, the gateway suggests apollo_people_search.
  3. execute_tool — Run any discovered tool against its upstream server. The gateway routes the call to the correct upstream, passes through authentication headers, and returns the result.

Here’s the workflow in practice. A user asks: “Find me the contact info for the VP of Engineering at Anthropic.”

Step 1: discover_tools()
→ Domains: apollo (30 tools), hubspot (20), ahrefs (100), google_docs (40)...

Step 2: discover_tools(domain="apollo")
→ Groups: people, organizations, contacts, sequences...

Step 3: discover_tools(domain="apollo", group="people")
→ Tools: apollo_people_search, apollo_people_enrichment,
         apollo_get_person_email, apollo_bulk_people_enrichment

Step 4: get_tool_schema("apollo_people_search")
→ Full JSON Schema with parameters, types, descriptions

Step 5: execute_tool("apollo_people_search",
          {"person_titles": ["VP Engineering"], "q_organization_name": "Anthropic"})
→ Results from Apollo API

The model loaded 1 tool schema instead of 190. It navigated to the right tool through a natural browse-then-drill workflow. No wasted context. No ambiguity. No hallucinated tool names. The same pattern that makes IDE autocomplete and API documentation usable at scale—applied to LLM tool discovery.
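The browse-then-drill workflow above can be sketched as a minimal in-memory implementation. The registry contents and function signatures are illustrative assumptions, not the fastmcp-gateway API:

```python
# Minimal in-memory sketch of the three meta-tools over a tool registry.
# Illustrative only: registry contents and signatures are hypothetical.
REGISTRY = {
    "apollo_people_search": {
        "domain": "apollo", "group": "people",
        "schema": {"type": "object",
                   "properties": {"person_titles": {"type": "array"}}},
    },
    "hubspot_contacts_create": {
        "domain": "hubspot", "group": "contacts",
        "schema": {"type": "object",
                   "properties": {"email": {"type": "string"}}},
    },
}

def discover_tools(domain=None, group=None, query=None):
    if query:  # keyword search across all tools
        return [n for n in REGISTRY if query.lower() in n.lower()]
    if domain and group:  # tools within one group
        return [n for n, t in REGISTRY.items()
                if t["domain"] == domain and t["group"] == group]
    if domain:  # groups within one domain
        return sorted({t["group"] for t in REGISTRY.values()
                       if t["domain"] == domain})
    # No args: every domain with its tool count.
    counts = {}
    for t in REGISTRY.values():
        counts[t["domain"]] = counts.get(t["domain"], 0) + 1
    return counts

def get_tool_schema(name):
    # The only moment a full schema enters context.
    return REGISTRY[name]["schema"]

def execute_tool(name, args):
    # The real gateway routes this to the upstream MCP server; stubbed here.
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return {"tool": name, "args": args}
```

The point of the sketch: each call returns only a thin slice of the registry, so full schemas enter context one at a time instead of all at once.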

Introducing fastmcp-gateway

fastmcp-gateway is our open-source implementation of the progressive discovery pattern. It’s a Python gateway built on FastMCP that aggregates any number of upstream MCP servers behind those 3 meta-tools. Apache 2.0 licensed. No lock-in.

We built it because we hit this exact problem in production. Our Ultrathink Axon™ platform connects enterprise clients to 10+ services through custom MCP servers. Context collapse was degrading agent accuracy. The existing solutions didn’t solve the root cause. So we built one that does.

What you get

  • Automatic domain and group organization. Tools are grouped by naming convention (domain_group_action). No manual categorization needed.
  • Keyword search + fuzzy matching. The LLM can search across all tools by description (“find tools for people search”) and get typo corrections on tool names.
  • Request passthrough authentication. User context flows through to upstream servers. Each user’s auth reaches the right upstream without the gateway storing credentials. Per-domain header overrides for upstreams with different auth schemes.
  • Health endpoints for Kubernetes. Liveness (/healthz) and readiness (/readyz) probes with tool count reporting. Readiness returns 503 until all upstreams are discovered.
  • Fully environment-driven configuration. No config files. Set GATEWAY_UPSTREAMS as JSON, point at your MCP servers, and go.
  • Graceful degradation. If an upstream is unreachable at startup, it’s logged and skipped. If a tool fails during execution, the gateway returns an error response—it doesn’t crash. Partial functionality when some upstreams are down.
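The domain_group_action grouping can be sketched roughly like this; the fallback for names too short to fit the convention is an assumption for illustration, not necessarily what fastmcp-gateway does:

```python
# Sketch of splitting a tool name by the domain_group_action convention:
# domain first, action last, everything between as the group.
def parse_tool_name(name: str):
    parts = name.split("_")
    if len(parts) < 3:
        # Too short for domain_group_action; assume a default group.
        return parts[0], "general", parts[-1]
    domain, *group, action = parts
    return domain, "_".join(group), action

print(parse_tool_name("apollo_people_search"))           # ('apollo', 'people', 'search')
print(parse_tool_name("apollo_bulk_people_enrichment"))  # ('apollo', 'bulk_people', 'enrichment')
```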

Getting started

pip install fastmcp-gateway

# Configure upstreams as JSON
export GATEWAY_UPSTREAMS='{
  "apollo": "http://apollo:8080/mcp",
  "hubspot": "http://hubspot:8080/mcp",
  "ahrefs": "http://ahrefs:8080/mcp"
}'

# Start the gateway
python -m fastmcp_gateway

That’s it. The gateway discovers tools from all upstreams at startup, organizes them into domains and groups, and exposes the 3 meta-tools over Streamable HTTP. Kubernetes deployment examples (Dockerfile + manifests) are included in the repo.
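A deployment script could gate traffic on the readiness endpoint described above. A hedged sketch, assuming the gateway listens on localhost:8000 (adjust for your deployment):

```python
# Poll /readyz before sending traffic; it returns 503 until all
# upstreams are discovered. Host and port are assumptions.
import urllib.error
import urllib.request

def gateway_ready(base="http://localhost:8000"):
    try:
        with urllib.request.urlopen(f"{base}/readyz", timeout=2) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        # Covers connection failures and non-2xx HTTP errors (e.g. 503).
        return False
```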

The numbers

| Metric | Without gateway | With gateway |
| --- | --- | --- |
| Tools in base context | 190+ | 3 |
| Tokens consumed by tool schemas | ~190,000 | ~2,000 |
| Schemas loaded per task | All 190+ | 1–3 (on demand) |
| Context available for reasoning | Constrained | Near-full window |
| New upstream integration | Config change + context budget | Config change only |

The critical number: adding a new upstream MCP server no longer costs context. Without a gateway, every new integration increases the context burden on every conversation. With progressive discovery, the base cost stays at 3 tools regardless of how many upstreams you connect. Your tenth integration is as cheap as your first.
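The scaling claim can be made concrete with a toy calculation, using the article's rough averages (40 tools per upstream, ~1,000 tokens per schema) as assumptions:

```python
# Flat loading grows linearly with upstream count; the gateway's base
# context stays constant at the 3 meta-tool schemas (~2,000 tokens).
def flat_tokens(n_upstreams, tools_each=40, tokens_per_schema=1_000):
    return n_upstreams * tools_each * tokens_per_schema

def gateway_tokens(n_upstreams, meta_schema_tokens=2_000):
    return meta_schema_tokens  # independent of upstream count

for n in (1, 5, 10):
    print(n, flat_tokens(n), gateway_tokens(n))
```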

Why we open-sourced it

This is infrastructure, not competitive advantage. An MCP gateway is plumbing—necessary plumbing, but plumbing. The MCP ecosystem needs shared tooling for production-grade problems, and the progressive discovery pattern should be available to anyone building with MCP at scale.

fastmcp-gateway is one component of the infrastructure that powers the Ultrathink Axon™ platform. We use it internally every day. Our competitive advantage isn’t the gateway—it’s the Outcome Partnership model, the production architecture around it, and the operational expertise to deploy, monitor, and improve AI systems over time. Sharing the gateway builds credibility and helps the ecosystem mature faster.

We’d rather help the MCP ecosystem scale than hoard plumbing.

What this means for production AI

The MCP ecosystem is maturing fast. Six months ago, most developers were connecting one or two MCP servers. Now enterprises are connecting ten, twenty, fifty. The infrastructure assumptions that worked at small scale—load all tools upfront, manage context manually, hope the LLM picks the right one—break at production scale.

This is a pattern we see over and over in AI infrastructure. The demo works beautifully. The pilot impresses stakeholders. Then production requirements—scale, governance, observability, cost control—expose the gap between what the demo handled and what the business actually needs. We call it the Execution Gap, and it shows up everywhere from agent architecture to tool discovery.

What’s next

  • Tool usage analytics. Which tools get discovered most? Which get executed? Which domains are the LLM navigating to first? Usage patterns that help you optimize your MCP server portfolio.
  • Schema caching. Cache frequently-accessed schemas to skip the discovery step for commonly-used tools.
  • Access control. Per-user or per-agent tool visibility. Not every agent needs access to every upstream.

Production AI needs production infrastructure. The MCP protocol gave us a universal integration layer. Progressive discovery gives us a way to use it at scale without drowning in context. That’s the gap fastmcp-gateway closes.

Star the repo. Try it with your MCP servers. Open issues if something breaks. And if you’re building production AI systems and need the full stack—strategy, platform, operations—start the conversation.

This is part of our series on building production-grade AI systems. For more, see OpenClaw in Production, The Modern AI Application Stack, and The AI Execution Gap.
