
Overview

LangSight connects to each configured MCP server on a schedule, runs the MCP initialize handshake followed by tools/list, measures round-trip latency, and classifies the result into one of four status states. Results are stored in ClickHouse for trend analysis and used to drive the Scorecard grades.

Status states

| Status | Meaning |
| --- | --- |
| `up` | Server responded normally; latency within bounds |
| `degraded` | MCP layer is up but the backend is unhealthy: schema drift detected, or a `health_tool` probe failed |
| `down` | MCP server unreachable, timed out, or `initialize`/`tools/list` failed |
| `stale` | No health check has run in the configured interval |

How degraded is set

A server transitions to degraded in two ways:

1. Schema drift: the tool schema hash changes from the stored baseline. This covers:
   • A tool was added or removed
   • A tool’s input schema changed (required parameters, types)
   • A tool description was modified (potential poisoning vector)
   The degraded state prompts you to inspect the drift before deciding whether to accept or reject the change. See Schema Drift for the full diff.
2. Health tool probe failure: if you configure a health_tool (see Backend probe below) and that tool call fails after a successful tools/list, the server is marked degraded. This means the MCP layer is reachable but the backend application it wraps is down or unhealthy.
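The drift check reduces to comparing a canonical hash of the tools/list result against a stored baseline. LangSight's exact canonicalization is not documented here, so this is a minimal sketch (the `schema_hash` name and the SHA-256-over-sorted-JSON scheme are assumptions):

```python
import hashlib
import json

def schema_hash(tools: list) -> str:
    """Hash the tool manifest so that any change (tool added or removed,
    input schema edit, description edit) produces a new digest. Sorting by
    name and serializing with sorted keys makes the hash order-insensitive."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

baseline = schema_hash([
    {"name": "query", "description": "Run a read-only query",
     "inputSchema": {"type": "object", "required": ["sql"]}},
])
# Editing only the description still changes the hash, which is why
# description edits (a potential poisoning vector) register as drift:
drifted = schema_hash([
    {"name": "query", "description": "Run any query",
     "inputSchema": {"type": "object", "required": ["sql"]}},
])
```

Any scheme with these properties works; the key design point is that ordering differences must not produce spurious drift while description edits must.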

How down is set

A server is marked down when:
  • The TCP connection is refused or times out (default: 5 seconds)
  • The MCP initialize call fails
  • tools/list returns an error

The stale state

If the monitor daemon is not running and no manual check has been triggered in the last configured interval, LangSight marks servers as stale. This prevents the dashboard from showing optimistically green servers when checks have silently stopped.
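Staleness is a property of the check's age, not of the server, so it can be decided at read time. A sketch of that demotion rule (the 2x grace factor is an assumption, not documented LangSight behavior):

```python
from datetime import datetime, timedelta, timezone

def effective_status(stored: str, checked_at: datetime, interval_s: int,
                     now=None) -> str:
    """Demote a stored result to 'stale' once it is older than the configured
    check interval (with a 2x grace factor so one missed cycle is tolerated)."""
    now = now or datetime.now(timezone.utc)
    if now - checked_at > timedelta(seconds=2 * interval_s):
        return "stale"
    return stored
```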

One-shot health check

langsight mcp-health
Connects to every configured server, runs one check each, and prints a table:
MCP Server Health  (7 servers)
──────────────────────────────────────────────────────────────────────
Server            Status       Latency   Tools   Schema
snowflake-mcp     ✓ UP         89ms      8       bcf0ec26…
github-mcp        ✓ UP         54ms      12      d2125e3a…
slack-mcp         ✓ UP        142ms      4       a9e31f77…
jira-mcp          ✗ DOWN       —         —       —         timeout after 5s
postgres-mcp      ✓ UP         31ms      5       f4a2b1c9…
filesystem-mcp    ✓ UP         12ms      6       8e3d7f2a…
search-mcp        ⚠ DEGRADED   67ms      5       c4b7d91e…  schema drift detected

6/7 servers healthy
1 server degraded — run: langsight mcp-health --drift search-mcp

Options

| Option | Description |
| --- | --- |
| `--server <name>` | Check a single server only |
| `--json` | Output results as JSON |
| `--drift <name>` | Show the structural diff for a server with drift |
| `--config <path>` | Use a specific `.langsight.yaml` |

JSON output

langsight mcp-health --json | jq '.[] | {name: .server_name, status, latency_ms}'
[
  {
    "server_name": "postgres-mcp",
    "status": "up",
    "latency_ms": 31.4,
    "tools_count": 5,
    "schema_hash": "f4a2b1c9...",
    "schema_drift": false,
    "checked_at": "2026-03-26T08:00:00Z",
    "error": null
  }
]
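The JSON output is easy to post-process in scripts beyond jq. A small sketch that pulls out the servers needing attention (field names match the sample above; the `unhealthy` helper is illustrative):

```python
import json

def unhealthy(results: list) -> list:
    """Names of servers whose latest check is anything other than 'up'."""
    return [r["server_name"] for r in results if r["status"] != "up"]

# Stand-in for: json.loads(subprocess output of `langsight mcp-health --json`)
sample = json.loads("""
[
  {"server_name": "postgres-mcp", "status": "up",   "latency_ms": 31.4},
  {"server_name": "jira-mcp",     "status": "down", "latency_ms": null}
]
""")
```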

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All servers UP |
| 1 | One or more servers DOWN or DEGRADED |
This makes langsight mcp-health useful as a pre-flight check in CI/CD:
# .github/workflows/deploy.yml
- name: MCP health pre-check
  run: langsight mcp-health
  # exits 1 if any server is down → blocks deploy

Continuous monitoring daemon

langsight monitor
Runs health checks on a configurable interval and fires alerts on state changes (UP → DOWN, DOWN → UP, schema drift). Keeps running until stopped.
LangSight Monitor — watching 7 servers
  Check interval:  30s
  Alert channel:   Slack (#mcp-alerts)

14:00:01  snowflake-mcp     ✓ UP       89ms
14:00:01  github-mcp        ✓ UP       54ms
14:00:01  jira-mcp          ✗ DOWN     —
          Alert fired → Slack (#mcp-alerts)

14:00:31  jira-mcp          ✗ DOWN     —     (3 consecutive failures)
...
14:02:31  jira-mcp          ✓ UP       112ms
          Recovery alert → Slack (#mcp-alerts)

Configure the interval

# .langsight.yaml
monitor:
  health_check_interval_seconds: 30   # per-server or global

# Per-server override:
servers:
  - name: postgres-mcp
    transport: streamable_http
    url: https://postgres-mcp.internal.company.com/mcp
    health_check_interval_seconds: 60   # check this one less frequently

Alert on state changes only

By default, langsight monitor fires one alert when a server goes DOWN, not an alert on every failed check. A recovery alert fires when the server comes back UP. This prevents alert storms. To alert on every failure instead:
# .langsight.yaml
monitor:
  alert_on_every_failure: true
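The deduplication amounts to alerting on state transitions rather than on states. A sketch of that rule (function name and message format are illustrative, not LangSight internals):

```python
def alerts_for(prev: dict, current: dict, alert_on_every_failure=False) -> list:
    """One alert per status transition (up -> down, down -> up, etc.).
    With alert_on_every_failure, unchanged failing states also alert."""
    msgs = []
    for name, status in current.items():
        before = prev.get(name)
        if status != before:
            if before is not None:          # skip first-ever observation
                msgs.append(f"{name}: {before} -> {status}")
        elif alert_on_every_failure and status in ("down", "degraded"):
            msgs.append(f"{name}: still {status}")
    return msgs
```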

Health check protocol

This section describes what LangSight does under the hood when it checks a server. You don’t need to know this to use LangSight — it’s here for debugging and for operators writing custom health check integrations.

Stdio transport

  1. Spawn the process with the configured command and args
  2. Send initialize via stdin
  3. Wait for the initialized response (timeout: 5s by default)
  4. Send tools/list
  5. Compare returned schema hash to stored baseline
  6. Kill the process
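The steps above can be sketched as newline-delimited JSON-RPC over a child process. This is a simplified illustration, not LangSight's implementation: a real MCP client also sends the notifications/initialized notification and negotiates capabilities, and timeout handling is omitted here for brevity.

```python
import json
import subprocess
import time

def frame(method: str, msg_id: int, params=None) -> bytes:
    """One newline-delimited JSON-RPC 2.0 request, as used by MCP stdio."""
    msg = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        msg["params"] = params
    return (json.dumps(msg) + "\n").encode("utf-8")

def check_stdio(command: list) -> dict:
    """Steps 1-6 above, compressed: spawn, initialize, tools/list, kill."""
    start = time.monotonic()
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        proc.stdin.write(frame("initialize", 1, {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "health-probe", "version": "0.0.1"},
        }))
        proc.stdin.flush()
        proc.stdout.readline()                    # initialize result
        proc.stdin.write(frame("tools/list", 2))
        proc.stdin.flush()
        reply = json.loads(proc.stdout.readline())
        tools = reply.get("result", {}).get("tools", [])
        return {"status": "up", "tools_count": len(tools),
                "latency_ms": (time.monotonic() - start) * 1000}
    finally:
        proc.kill()                               # step 6: kill the process
```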

StreamableHTTP transport

  1. POST {url} with Content-Type: application/json and the initialize MCP message
  2. Parse the SSE/JSON response stream
  3. Send tools/list in the same session
  4. Compare schema hash
  5. Close the connection

SSE transport (legacy)

  1. GET {url} to open the SSE stream
  2. Send initialize via a POST to the session endpoint
  3. Wait for initialized event
  4. Send tools/list
  5. Compare schema hash
  6. Close the connection

Timeout behavior

The default timeout is 5 seconds per server. If the server does not respond within the timeout, the result is down with error: "timeout after 5s". Configure per-server:
servers:
  - name: slow-mcp
    transport: streamable_http
    url: https://slow-mcp.company.com/mcp
    health_check_timeout_seconds: 15

Backend probe

Some MCP servers can respond to tools/list while the backend application they wrap is down. For example, a DataHub MCP server may serve its tool manifest from a local cache while the DataHub REST API is unreachable. The standard up/down check would show the server as healthy. The health_tool probe solves this. Configure a tool to call as a liveness check — if it fails, LangSight marks the server degraded instead of up.
servers:
  - name: datahub
    transport: streamable_http
    url: https://datahub-mcp.internal.company.com/mcp
    health_tool: search_entities      # tool to call as a backend liveness probe
    health_tool_args:                 # arguments passed to the tool
      query: "test"
      count: 1
    health_check_timeout_seconds: 15  # raise timeout for slower backends
The probe runs after tools/list on every health check cycle. If the tool call raises an error or returns an unexpected response, the result is degraded (not down) — preserving the distinction between “MCP layer unreachable” and “MCP layer up, backend down”. Choose a tool that:
  • Is cheap (a narrow read or search with a small result set)
  • Exercises the critical backend path (database read, API call)
  • Does not mutate state
Omitting health_tool skips the probe entirely and restores the previous behavior.
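Taken together, the rules on this page reduce to a small decision function. A sketch (parameter names are illustrative):

```python
def classify(transport_ok: bool, tools_list_ok: bool, schema_drift: bool,
             probe_configured: bool, probe_ok: bool) -> str:
    """Status rules from this page: transport or tools/list failure is 'down';
    schema drift, or a failed health_tool probe after a successful tools/list,
    is 'degraded'; otherwise 'up'. ('stale' is decided by check age, not here.)"""
    if not transport_ok or not tools_list_ok:
        return "down"
    if schema_drift or (probe_configured and not probe_ok):
        return "degraded"
    return "up"
```

Note that with no health_tool configured, probe_ok is never consulted, matching the "skips the probe entirely" behavior above.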

Data persistence

Health check results are stored in ClickHouse with a 90-day TTL. The schema:
CREATE TABLE health_checks (
    server_name   LowCardinality(String),
    status        LowCardinality(String),    -- up / degraded / down / stale
    latency_ms    Nullable(Float32),
    tools_count   Nullable(UInt16),
    schema_hash   Nullable(String),
    error         Nullable(String),
    checked_at    DateTime64(3, 'UTC')
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(checked_at)
ORDER BY (server_name, checked_at)
TTL checked_at + INTERVAL 90 DAY;
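An uptime percentage can be derived from a window of these rows. A sketch, assuming every non-up status (degraded, down, stale) counts against uptime; LangSight's dashboard may weight degraded checks differently:

```python
def uptime_pct(rows: list) -> float:
    """Share of checks in the window whose status is 'up', as a percentage."""
    if not rows:
        return 0.0
    up = sum(1 for r in rows if r["status"] == "up")
    return round(100.0 * up / len(rows), 2)
```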
Query via the API:
# Latest status for all servers
curl http://localhost:8000/api/health/servers

# History for one server (last 20 checks)
curl "http://localhost:8000/api/health/servers/postgres-mcp/history?limit=20"

# Trigger an immediate check
curl -X POST http://localhost:8000/api/health/check
See Health API for the full endpoint reference.

Dashboard

The MCP Servers page (/servers) shows:
  • A sortable catalog with current status, p99 latency sparkline, uptime %, Last Used, and Last OK? columns
    • Last Used: timestamp of the most recent tool call by any agent (7-day window)
    • Last OK?: whether the most recent tool call completed without error
  • A Needs Attention banner listing any server that is DOWN or DEGRADED
  • A detail panel with four tabs: About, Tools, Health history, and Consumers
  • Tool reliability metrics (calls, errors, p99 latency, success rate) in the Tools tab
The former /health (Tool Health) page redirects to /servers — all tool reliability data is now in the Tools tab. The dashboard pulls from the same ClickHouse data as the CLI.