# Observability
Distributed tracing, Prometheus metrics, and audit logging — the visibility layer across all three pillars.
## Overview
Observability is not a pillar — it is the fabric that connects all three. Every tool call through MCP servers, every skill execution, and every sandbox command passes through a consistent monitoring pipeline that provides complete visibility into what your agents are doing, how they are performing, and what happened in the past.
MCP Gateway has three monitoring dimensions:
- Telemetry (traces) — distributed traces showing the full call chain for each request, pushed to your observability platform via OpenTelemetry
- Metrics — Prometheus-format counters, gauges, and histograms scraped by your monitoring stack
- Logs — structured audit trail of every tool call, stored in PostgreSQL with automatic secret masking
These are complementary: traces show you the story of individual requests, metrics show you aggregate trends, and logs provide a permanent audit trail for compliance and debugging.
## Distributed Tracing
At startup, OpenTelemetry auto-instruments four libraries — no tracing code needed in application logic:
| Library | What It Traces |
|---|---|
| FastAPI | Every HTTP request (method, path, status, duration) |
| SQLAlchemy | Every database query (query text, duration) |
| HTTPX | Every outgoing HTTP call to MCP servers |
| Python logging | Injects trace_id and span_id into every log line |
MCP tool calls receive additional custom tracing via a decorator that adds mcp.server, mcp.tool, and mcp.status attributes to each span. This means you can filter traces in your observability platform by server name or tool name to see how individual tools, skills, and sandbox operations perform.
### Hot-Reloadable Export
The key architectural feature is a ConfigurableSpanExporter that allows swapping the underlying trace exporter at runtime without restarting the application. When an admin changes the telemetry provider in the Settings UI, the new exporter is created and swapped in via a thread-safe lock — zero downtime.
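A minimal sketch of this pattern, assuming exporters expose an `export(spans)` method as OpenTelemetry SDK span exporters do (the class shape and method names here are illustrative):

```python
import threading

class ConfigurableSpanExporter:
    """Delegating exporter whose backend can be hot-swapped at runtime."""

    def __init__(self, exporter):
        self._lock = threading.Lock()
        self._exporter = exporter

    def export(self, spans):
        with self._lock:                  # read the current backend atomically
            exporter = self._exporter
        return exporter.export(spans)     # delegate outside the lock

    def swap(self, new_exporter):
        with self._lock:                  # atomic swap: no restart, no dropped spans
            self._exporter = new_exporter
```

Because the SDK only ever holds a reference to the wrapper, swapping the inner exporter changes where spans go without touching the tracer pipeline.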
### Eight Providers
| Provider | Protocol | Auth Method |
|---|---|---|
| Datadog | gRPC | DD-API-KEY header |
| Azure Application Insights | HTTP | Connection string |
| AWS CloudWatch | gRPC (via collector sidecar) | SigV4 |
| Google Cloud Operations | gRPC | Service account / ADC |
| Grafana Cloud | gRPC | Basic auth |
| New Relic | HTTP | api-key header |
| Splunk | gRPC | X-SF-Token header |
| Custom OTLP | gRPC or HTTP | Configurable headers |
All provider credentials are encrypted at rest using AES-256-GCM. The API never returns actual credential values — only boolean flags indicating whether credentials are configured.
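Encryption at rest with AES-256-GCM can be sketched using the `cryptography` package's AESGCM primitive. The helper names and the nonce-prefix storage layout below are assumptions for illustration, not the gateway's actual format:

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_credential(key: bytes, plaintext: str) -> bytes:
    """Encrypt a provider credential with AES-256-GCM (illustrative helper)."""
    nonce = os.urandom(12)                            # 96-bit nonce, unique per message
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode(), None)
    return nonce + ciphertext                         # store nonce alongside ciphertext

def decrypt_credential(key: bytes, blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()
```

GCM authenticates as well as encrypts, so a tampered blob fails to decrypt rather than silently yielding garbage.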
## Prometheus Metrics
The metrics endpoint (GET /api/v1/metrics) exposes data in Prometheus text format. Three categories:
HTTP metrics — http_requests_total (counter by method, endpoint, status) and http_request_duration_seconds (histogram). URL paths are normalized to replace UUIDs with :id to prevent high-cardinality metrics.
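The UUID normalization step might look like this (the function name is illustrative):

```python
import re

# Matches standard 8-4-4-4-12 hex UUIDs in URL path segments
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def normalize_path(path: str) -> str:
    """Collapse UUID segments to :id so metric labels stay low-cardinality."""
    return UUID_RE.sub(":id", path)
```

Without this step, every distinct resource ID would create a new label value, and Prometheus series counts would grow without bound.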
Business metrics — tool_calls_total and tool_latency_seconds per server and tool, server_status gauge, tokens_total for LLM usage tracking, active_connections per server, and errors_total by category. These metrics span all three pillars — tool calls through MCP servers, skill generation progress, and sandbox command execution.
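Declaring a few of these business metrics with `prometheus_client` might look like the following. Only the metric names come from the text above; the label sets and sample values are assumptions:

```python
from prometheus_client import Counter, Gauge, Histogram, generate_latest

TOOL_CALLS = Counter("tool_calls_total", "MCP tool calls", ["server", "tool", "status"])
TOOL_LATENCY = Histogram("tool_latency_seconds", "Tool call latency", ["server", "tool"])
SERVER_STATUS = Gauge("server_status", "1 if the MCP server is up", ["server"])

# Record one successful call
TOOL_CALLS.labels(server="github", tool="search_issues", status="success").inc()
TOOL_LATENCY.labels(server="github", tool="search_issues").observe(0.25)
SERVER_STATUS.labels(server="github").set(1)
```

`generate_latest()` then renders all registered metrics in the Prometheus text exposition format served by the metrics endpoint.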
Process metrics — CPU time, memory usage, open file descriptors, garbage collector stats, and Python version.
## Audit Logging
Every MCP tool call is recorded in the tool_call_logs table. Before storage, the logger:
- Masks secrets in arguments and responses — API keys, tokens, and passwords are replaced with ***
- Records Prometheus metrics — increments counters and observes latency histograms
- Stores the log in PostgreSQL with the server name, tool name, masked arguments, masked response, latency, and status
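The secret-masking step can be sketched as a recursive walk over the call arguments; the set of sensitive key names below is illustrative, not the gateway's actual list:

```python
SENSITIVE_KEYS = {"api_key", "apikey", "token", "password", "secret", "authorization"}

def mask_secrets(value):
    """Recursively replace values stored under sensitive keys with '***'."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_KEYS else mask_secrets(val)
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [mask_secrets(item) for item in value]
    return value
```

Masking happens before anything touches the database, so raw credentials never reach the audit trail.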
Log retention is configurable (default 90 days). The Settings UI provides a preview of how many logs would be deleted before running cleanup.
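The dry-run preview followed by cleanup can be sketched like this, using SQLite as a stand-in for the PostgreSQL store (the function, table, and column names are assumptions):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def cleanup_logs(conn, retention_days=90, dry_run=True):
    """Count audit logs past the retention window; delete them unless dry_run."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=retention_days)).isoformat()
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM tool_call_logs WHERE created_at < ?", (cutoff,)
    ).fetchone()
    if not dry_run:
        conn.execute("DELETE FROM tool_call_logs WHERE created_at < ?", (cutoff,))
        conn.commit()
    return count
```

Running with `dry_run=True` gives the preview count shown in the Settings UI; the same query drives the actual deletion.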
## Request Correlation
Every HTTP request is assigned a unique X-Request-ID header. OpenTelemetry injects trace_id and span_id into every log line. This means you can click a trace in Datadog and see the associated logs, or search logs by trace ID to find the corresponding distributed trace — across MCP server tool calls, skill operations, and sandbox commands.
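Injecting a correlation ID into every log line can be sketched with a standard-library logging filter and a context variable. In the gateway this is handled by OpenTelemetry's logging instrumentation; the names below are illustrative:

```python
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the current request ID onto every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(request_id)s %(levelname)s %(message)s"))
handler.addFilter(CorrelationFilter())
log = logging.getLogger("gateway")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Middleware would set this once per incoming request:
request_id_var.set(str(uuid.uuid4()))
log.info("tool call completed")
```

Because the ID lives in a context variable, every log line emitted while handling that request carries the same correlation value, even across async tasks.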
## How It All Connects
A single MCP tool call flows through all three monitoring dimensions:
- Trace span created — OpenTelemetry auto-creates spans for the HTTP request, database queries, and outgoing MCP server calls
- Prometheus metrics recorded — request count, latency histogram, tool call counters, server status
- Audit log stored — tool name, masked arguments, masked response, latency, and status in PostgreSQL
- Log line emitted — structured JSON with trace correlation IDs for cross-referencing
Result: from one tool call, you get a trace in your observability platform, Prometheus metrics for dashboards and alerting, an audit log entry for compliance, and a structured log line with trace correlation.
## Key Features
- Auto-instrumented tracing — FastAPI, SQLAlchemy, HTTPX, and Python logging traced with zero application code
- Custom MCP tool spans — tool calls traced with server name, tool name, and status attributes
- Eight telemetry providers — Datadog, Azure, AWS, Google Cloud, Grafana, New Relic, Splunk, and custom OTLP
- Hot-reload configuration — change telemetry provider without restarting the application
- Prometheus metrics — HTTP, business, and process metrics in standard Prometheus format
- Full audit trail — every tool call logged to PostgreSQL with automatic secret masking
- Configurable retention — set how long audit logs are kept, with dry-run preview before cleanup
- Encrypted credentials — all provider secrets encrypted at rest with AES-256-GCM
- Request correlation — trace IDs in logs enable cross-referencing between traces and log entries
- Settings UI — configure telemetry export, metrics endpoint, and log retention from the web interface
## API Reference
- View logs — query and filter the audit log
- Metrics summary — aggregated metrics for dashboards
- Export configuration — configure telemetry export providers
