Overview
LangSight connects to each configured MCP server on a schedule, runs the MCP `initialize` handshake followed by `tools/list`, measures round-trip latency, and classifies the result into one of four status states. Results are stored in ClickHouse for trend analysis and used to drive the Scorecard grades.
Status states
| Status | Icon | Meaning |
|---|---|---|
| `up` | ✓ | Server responded normally — latency within bounds |
| `degraded` | ⚠ | MCP layer is up but the backend is unhealthy — schema drift detected, or a `health_tool` probe failed |
| `down` | ✗ | MCP server unreachable, timed out, or `initialize`/`tools/list` failed |
| `stale` | — | No health check has run in the configured interval |
How degraded is set
A server transitions to degraded in two ways:
Schema drift — the tool schema hash changes from the stored baseline. This covers:
- A tool was added or removed
- A tool’s input schema changed (required parameters, types)
- A tool description was modified (potential poisoning vector)
The `degraded` state prompts you to inspect the drift before deciding whether to accept or reject the change. See Schema Drift for the full diff.
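LangSight's exact hashing scheme is not documented here, but the idea is to hash a canonicalized tool manifest so that any added or removed tool, input-schema change, or description edit changes the digest. A minimal sketch (the function name and manifest shape are assumptions, not LangSight internals):

```python
import hashlib
import json

def schema_hash(tools: list[dict]) -> str:
    """Hash a tool manifest canonically: key order and whitespace must not
    affect the digest, but any change to a tool's name, input schema, or
    description must."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]),
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

baseline = schema_hash([{"name": "search", "description": "Find docs",
                         "inputSchema": {"type": "object"}}])
# Editing only the description still changes the hash, which is why
# description edits are flagged as a potential poisoning vector.
drifted = schema_hash([{"name": "search", "description": "Find docs NOW",
                        "inputSchema": {"type": "object"}}])
assert baseline != drifted
```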
Health tool probe failure — if you configure a health_tool (see Backend probe below), and that tool call fails after a successful tools/list, the server is marked degraded. This means the MCP layer is reachable but the backend application it wraps is down or unhealthy.
How down is set
A server is marked down when:
- The TCP connection is refused or times out (default: 5 seconds)
- The MCP `initialize` call fails
- `tools/list` returns an error
The stale state
If the monitor daemon is not running and no manual check has been triggered in the last configured interval, LangSight marks servers as stale. This prevents the dashboard from showing optimistically green servers when checks have silently stopped.
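The four states can be summarized as a decision function. A sketch of the classification logic described above (the field and function names are illustrative, not LangSight's actual internals):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckResult:
    reachable: bool           # initialize + tools/list succeeded
    drift: bool               # schema hash differs from stored baseline
    probe_ok: Optional[bool]  # health_tool result; None if not configured
    age_s: float              # seconds since this check ran
    interval_s: float         # configured check interval

def classify(r: CheckResult) -> str:
    if r.age_s > r.interval_s:
        return "stale"        # checks have silently stopped
    if not r.reachable:
        return "down"         # refused, timed out, or MCP call failed
    if r.drift or r.probe_ok is False:
        return "degraded"     # MCP layer up, but drift or backend unhealthy
    return "up"
```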
One-shot health check
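Run a one-off check from the command line. The flags are documented below; the server name here is illustrative:

```shell
# check all configured servers once
langsight mcp-health

# check one server and print machine-readable results
langsight mcp-health --server datahub --json
```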
Options
| Option | Description |
|---|---|
| `--server <name>` | Check a single server only |
| `--json` | Output results as JSON |
| `--drift <name>` | Show the structural diff for a server with drift |
| `--config <path>` | Use a specific `.langsight.yaml` |
JSON output
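The exact field names are not specified here; a plausible shape using the status values and metrics described above (illustrative only):

```json
{
  "servers": [
    {
      "name": "datahub",
      "status": "degraded",
      "latency_ms": 182,
      "drift": true,
      "error": null
    }
  ]
}
```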
Exit codes
| Code | Meaning |
|---|---|
| 0 | All servers UP |
| 1 | One or more servers DOWN or DEGRADED |
This makes `langsight mcp-health` useful as a pre-flight check in CI/CD:
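Because the exit code distinguishes a healthy fleet from an unhealthy one, the command can gate a deploy. A sketch of a CI step:

```shell
# exit code 1 means one or more servers are DOWN or DEGRADED
if ! langsight mcp-health; then
  echo "MCP fleet unhealthy; aborting deploy"
  exit 1
fi
```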
Continuous monitoring daemon
Configure the interval
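The key names below are assumptions; check your `.langsight.yaml` reference for the exact spelling:

```yaml
# .langsight.yaml (key names illustrative, not verified)
monitor:
  interval: 60s   # how often each server is checked
```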
Alert on state changes only
By default, `langsight monitor` fires one alert when a server goes DOWN, not an alert on every failed check. A recovery alert fires when the server comes back UP. This prevents alert storms.
To alert on every failure instead:
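The exact key is not shown here; one plausible shape (hypothetical, check the `langsight monitor` reference):

```yaml
# hypothetical key name, not verified against the CLI reference
monitor:
  alerts:
    on_state_change_only: false   # alert on every failed check
```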
Health check protocol
This section describes what LangSight does under the hood when it checks a server. You don’t need to know this to use LangSight — it’s here for debugging and for operators writing custom health check integrations.
Stdio transport
- Spawn the process with the configured `command` and `args`
- Send `initialize` via stdin
- Wait for the `initialized` response (timeout: 5s by default)
- Send `tools/list`
- Compare returned schema hash to stored baseline
- Kill the process
StreamableHTTP transport
- `POST {url}` with `Content-Type: application/json` and the `initialize` MCP message
- Parse the SSE/JSON response stream
- Send `tools/list` in the same session
- Compare schema hash
- Close the connection
SSE transport (legacy)
- `GET {url}` to open the SSE stream
- Send `initialize` via a POST to the session endpoint
- Wait for the `initialized` event
- Send `tools/list`
- Compare schema hash
- Close the connection
Timeout behavior
The default timeout is 5 seconds per server. If the server does not respond within the timeout, the result is `down` with `error: "timeout after 5s"`.
Configure per-server:
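A per-server override might look like this (key names and the server name are assumptions):

```yaml
servers:
  - name: datahub      # illustrative server
    timeout: 15s       # override the 5s default
```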
Backend probe
Some MCP servers can respond to `tools/list` while the backend application they wrap is down. For example, a DataHub MCP server may serve its tool manifest from a local cache while the DataHub REST API is unreachable. The standard up/down check would show the server as healthy.
The `health_tool` probe solves this. Configure a tool to call as a liveness check — if it fails, LangSight marks the server `degraded` instead of `up`.
When a `health_tool` is configured, LangSight calls it after a successful `tools/list` on every health check cycle. If the tool call raises an error or returns an unexpected response, the result is `degraded` (not `down`) — preserving the distinction between “MCP layer unreachable” and “MCP layer up, backend down”.
Choose a tool that:
- Is cheap (a narrow read or search with a small result set)
- Exercises the critical backend path (database read, API call)
- Does not mutate state
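A sketch of configuring the probe, assuming a per-server `health_tool` key (the key shape, tool name, and arguments are illustrative):

```yaml
servers:
  - name: datahub
    health_tool:
      name: search                     # cheap, read-only tool
      args: { query: "ping", limit: 1 }
```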
Omitting `health_tool` skips the probe entirely and restores the previous behavior.
Data persistence
Health check results are stored in ClickHouse with a 90-day TTL.
Dashboard
The MCP Servers page (/servers) shows:
- A sortable catalog with current status, p99 latency sparkline, uptime %, Last Used, and Last OK? columns
- Last Used: timestamp of the most recent tool call by any agent (7-day window)
- Last OK?: whether the most recent tool call completed without error
- A Needs Attention banner listing any server that is `DOWN` or `DEGRADED`
- A detail panel with four tabs: About, Tools, Health history, and Consumers
- Tool reliability metrics (calls, errors, p99 latency, success rate) in the Tools tab
The `/health` (Tool Health) page redirects to `/servers` — all tool reliability data is now in the Tools tab.
The dashboard pulls from the same ClickHouse data as the CLI.
Related
- Schema Drift — what `DEGRADED` means in detail
- Server Scorecard — composite grade across 5 dimensions
- `langsight monitor` — full CLI reference for the daemon
- Health API — REST endpoints