Overview
LangSight connects to each configured MCP server on a schedule, runs the MCP initialize handshake followed by tools/list, measures round-trip latency, and classifies the result into one of four status states. Results are stored in ClickHouse for trend analysis and used to drive the Scorecard grades.
Status states
| Status | Icon | Meaning |
|---|---|---|
| up | ✓ | Server responded normally, with latency within bounds |
| degraded | ⚠ | MCP layer is up but the backend is unhealthy: schema drift detected, or a health_tool probe failed |
| down | ✗ | MCP server unreachable, timed out, or initialize/tools/list failed |
| stale | — | No health check has run in the configured interval |
How degraded is set
A server transitions to degraded in two ways:
Schema drift — the tool schema hash changes from the stored baseline. This covers:
- A tool was added or removed
- A tool’s input schema changed (required parameters, types)
- A tool description was modified (potential poisoning vector)
The degraded state prompts you to inspect the drift before deciding whether to accept or reject the change. See Schema Drift for the full diff.
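The hashing scheme itself is not described on this page. As a rough sketch, assuming the hash is taken over a canonical JSON serialization of each tool's name, description, and input schema, drift detection could look like the following. The function name `schema_hash` and the 8-character truncation are illustrative assumptions, not LangSight's implementation.

```python
import hashlib
import json


def schema_hash(tools: list[dict]) -> str:
    """Hash the parts of a tools/list result that matter for drift detection."""
    canonical = [
        {
            "name": t["name"],
            "description": t.get("description", ""),
            "inputSchema": t.get("inputSchema", {}),
        }
        for t in sorted(tools, key=lambda t: t["name"])
    ]
    # sort_keys and fixed separators make the serialization order-independent.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


baseline = [{"name": "query", "description": "Run a read-only SQL query",
             "inputSchema": {"type": "object", "required": ["sql"]}}]
# Same tool, but the description was edited, so the hash changes.
changed = [dict(baseline[0], description="Run any SQL query")]

print(schema_hash(baseline) == schema_hash(changed))  # False
```

Including the description in the hashed payload is what lets a description-only edit register as drift.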
Health tool probe failure — if you configure a health_tool (see Backend probe below), and that tool call fails after a successful tools/list, the server is marked degraded. This means the MCP layer is reachable but the backend application it wraps is down or unhealthy.
How down is set
A server is marked down when:
- The TCP connection is refused or times out (default: 5 seconds)
- The MCP initialize call fails
- tools/list returns an error
The stale state
If the monitor daemon is not running and no manual check has been triggered in the last configured interval, LangSight marks servers as stale. This prevents the dashboard from showing optimistically green servers when checks have silently stopped.
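The four states can be summarized as a small decision function. This is a sketch of the rules described above, not LangSight's code: the `CheckResult` type, its field names, and the precedence (stale first, then down, then degraded) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CheckResult:
    handshake_ok: bool          # initialize and tools/list both succeeded
    schema_drift: bool = False  # schema hash differs from the stored baseline
    probe_ok: bool = True       # health_tool call succeeded (True if not configured)


def classify(last_check_at: Optional[datetime], result: Optional[CheckResult],
             interval: timedelta, now: datetime) -> str:
    # No check within the configured interval: the stored result cannot be trusted.
    if last_check_at is None or result is None or now - last_check_at > interval:
        return "stale"
    if not result.handshake_ok:
        return "down"
    if result.schema_drift or not result.probe_ok:
        return "degraded"
    return "up"


now = datetime(2026, 3, 26, 8, 0, tzinfo=timezone.utc)
recent = now - timedelta(seconds=10)
print(classify(recent, CheckResult(handshake_ok=True, schema_drift=True),
               timedelta(seconds=30), now))  # degraded
```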
One-shot health check
Run langsight mcp-health to connect to every configured server, run one check each, and print a table:
MCP Server Health (7 servers)
──────────────────────────────────────────────────────────────────────
Server          Status      Latency  Tools  Schema
snowflake-mcp   ✓ UP        89ms     8      bcf0ec26…
github-mcp      ✓ UP        54ms     12     d2125e3a…
slack-mcp       ✓ UP        142ms    4      a9e31f77…
jira-mcp        ✗ DOWN      —        —      —          timeout after 5s
postgres-mcp    ✓ UP        31ms     5      f4a2b1c9…
filesystem-mcp  ✓ UP        12ms     6      8e3d7f2a…
search-mcp      ⚠ DEGRADED  67ms     5      c4b7d91e…  schema drift detected

6/7 servers healthy
1 server degraded — run: langsight mcp-health --drift search-mcp
Options
| Option | Description |
|---|---|
| --server <name> | Check a single server only |
| --json | Output results as JSON |
| --drift <name> | Show the structural diff for a server with drift |
| --config <path> | Use a specific .langsight.yaml |
JSON output
langsight mcp-health --json | jq '.[] | {name: .server_name, status, latency_ms}'
[
  {
    "server_name": "postgres-mcp",
    "status": "up",
    "latency_ms": 31.4,
    "tools_count": 5,
    "schema_hash": "f4a2b1c9...",
    "schema_drift": false,
    "checked_at": "2026-03-26T08:00:00Z",
    "error": null
  }
]
Exit codes
| Code | Meaning |
|---|---|
| 0 | All servers UP |
| 1 | One or more servers DOWN or DEGRADED |
This makes langsight mcp-health useful as a pre-flight check in CI/CD:
# .github/workflows/deploy.yml
- name: MCP health pre-check
  run: langsight mcp-health
  # exits 1 if any server is down or degraded → blocks deploy
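Because exit code 1 covers both DOWN and DEGRADED, a pipeline that should block only on DOWN needs to inspect the JSON output instead. The sketch below assumes the `--json` report has the shape shown under JSON output; the `gate` function and its `tolerate_degraded` flag are hypothetical helpers, not part of the LangSight CLI.

```python
import json
import sys


def gate(report_json: str, tolerate_degraded: bool = True) -> int:
    """Turn `langsight mcp-health --json` output into a process exit code."""
    blocking = {"down"} if tolerate_degraded else {"down", "degraded"}
    failures = [r for r in json.loads(report_json) if r["status"] in blocking]
    for r in failures:
        print(f"{r['server_name']}: {r['status']} ({r.get('error')})", file=sys.stderr)
    return 1 if failures else 0


sample = json.dumps([
    {"server_name": "postgres-mcp", "status": "up", "error": None},
    {"server_name": "search-mcp", "status": "degraded", "error": None},
])
print(gate(sample))                           # 0: degraded is tolerated
print(gate(sample, tolerate_degraded=False))  # 1
```

In a real pipeline the report would come from standard input, for example by piping the CLI output into a script that ends with `sys.exit(gate(sys.stdin.read()))`.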
Continuous monitoring daemon
langsight monitor runs health checks on a configurable interval and fires alerts on state changes (UP → DOWN, DOWN → UP, schema drift). It keeps running until stopped.
LangSight Monitor — watching 7 servers
Check interval: 30s
Alert channel: Slack (#mcp-alerts)
14:00:01 snowflake-mcp ✓ UP 89ms
14:00:01 github-mcp ✓ UP 54ms
14:00:01 jira-mcp ✗ DOWN —
Alert fired → Slack (#mcp-alerts)
14:00:31 jira-mcp ✗ DOWN — (3 consecutive failures)
...
14:02:31 jira-mcp ✓ UP 112ms
Recovery alert → Slack (#mcp-alerts)
# .langsight.yaml
monitor:
  health_check_interval_seconds: 30   # global default

# Per-server override:
servers:
  - name: postgres-mcp
    transport: streamable_http
    url: https://postgres-mcp.internal.company.com/mcp
    health_check_interval_seconds: 60   # check this one less frequently
Alert on state changes only
By default, langsight monitor fires one alert when a server goes DOWN, not an alert on every failed check. A recovery alert fires when the server comes back UP. This prevents alert storms.
To alert on every failure instead:
# .langsight.yaml
monitor:
  alert_on_every_failure: true
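The difference between the two modes comes down to whether the previous status is consulted. The following is a minimal sketch of that logic, assuming a per-server record of the last observed status; the `AlertGate` class is hypothetical, and schema drift alerts are left out for brevity.

```python
from typing import Optional


class AlertGate:
    """Decide whether a check result should produce an alert."""

    def __init__(self, alert_on_every_failure: bool = False) -> None:
        self.alert_on_every_failure = alert_on_every_failure
        self._last: dict[str, str] = {}

    def observe(self, server: str, status: str) -> Optional[str]:
        previous = self._last.get(server)
        self._last[server] = status
        if status == "down":
            # Default mode: alert only on the transition into DOWN.
            if previous != "down" or self.alert_on_every_failure:
                return "down"
            return None
        if previous == "down" and status == "up":
            return "recovery"
        return None


gate = AlertGate()
print([gate.observe("jira-mcp", s) for s in ["up", "down", "down", "up"]])
# [None, 'down', None, 'recovery']
```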
Health check protocol
This section describes what LangSight does under the hood when it checks a server. You don’t need to know this to use LangSight — it’s here for debugging and for operators writing custom health check integrations.
Stdio transport
- Spawn the process with the configured command and args
- Send initialize via stdin
- Wait for the initialize response (timeout: 5s by default)
- Send tools/list
- Compare returned schema hash to stored baseline
- Kill the process
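On the wire, the stdio steps above are newline-delimited JSON-RPC messages. The sketch below shows one way to drive that exchange. It is an illustration rather than LangSight's code: the helper names, the client name, and the protocol version string are assumptions, and the `notifications/initialized` message comes from the MCP lifecycle rather than from the step list. The exchange is exercised against canned replies so it runs without a real server.

```python
import json
from typing import Callable

PROTOCOL_VERSION = "2025-03-26"  # assumption: use the MCP revision your servers speak


def request(msg_id: int, method: str, params: dict) -> str:
    """Serialize one JSON-RPC request as a single line (stdio framing)."""
    return json.dumps({"jsonrpc": "2.0", "id": msg_id,
                       "method": method, "params": params}) + "\n"


def read_response(readline: Callable[[], str], msg_id: int) -> dict:
    """Read lines until the response with the given id arrives."""
    while True:
        line = readline()
        if not line:
            raise RuntimeError("server closed the stream")
        msg = json.loads(line)
        if msg.get("id") == msg_id:
            if "error" in msg:
                raise RuntimeError(f"request {msg_id} failed: {msg['error']}")
            return msg["result"]


def handshake(write: Callable[[str], object], readline: Callable[[], str]) -> list[dict]:
    """Run initialize followed by tools/list and return the advertised tools."""
    write(request(1, "initialize", {
        "protocolVersion": PROTOCOL_VERSION,
        "capabilities": {},
        "clientInfo": {"name": "health-check-sketch", "version": "0.0.1"},
    }))
    read_response(readline, 1)
    # The MCP lifecycle expects this notification before normal requests.
    write(json.dumps({"jsonrpc": "2.0", "method": "notifications/initialized"}) + "\n")
    write(request(2, "tools/list", {}))
    return read_response(readline, 2)["tools"]


# Canned replies stand in for a real server process.
sent: list[str] = []
replies = iter([
    '{"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2025-03-26","capabilities":{},"serverInfo":{"name":"fake","version":"0"}}}\n',
    '{"jsonrpc":"2.0","id":2,"result":{"tools":[{"name":"query","inputSchema":{"type":"object"}}]}}\n',
])
tools = handshake(sent.append, lambda: next(replies))
print([t["name"] for t in tools])  # ['query']
```

With a real server, `write` and `readline` would wrap the stdin and stdout pipes of a `subprocess.Popen` child opened in text mode, flushing stdin after each write.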
StreamableHTTP transport
- POST {url} with Content-Type: application/json and the initialize MCP message
- Parse the SSE/JSON response stream
- Send tools/list in the same session
- Compare schema hash
- Close the connection
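The "SSE/JSON response stream" step exists because a Streamable HTTP server may answer a POST with either a plain JSON body or an event stream. Below is a minimal sketch of handling both, assuming the whole body has already been read into a string. `parse_mcp_response` is a hypothetical helper that skips SSE fields other than `data:`.

```python
import json


def parse_mcp_response(content_type: str, body: str) -> list[dict]:
    """Return the JSON-RPC messages carried by a Streamable HTTP response body."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        parsed = json.loads(body)
        return parsed if isinstance(parsed, list) else [parsed]
    if media_type != "text/event-stream":
        raise ValueError(f"unexpected content type: {content_type}")

    messages: list[dict] = []
    data_lines: list[str] = []
    for line in body.splitlines() + [""]:   # the trailing "" flushes the last event
        if line.startswith("data:"):
            data_lines.append(line[5:].removeprefix(" "))
        elif line == "" and data_lines:     # a blank line ends an event
            messages.append(json.loads("\n".join(data_lines)))
            data_lines = []
    return messages


sse = 'event: message\ndata: {"jsonrpc":"2.0","id":1,"result":{"tools":[]}}\n\n'
print(parse_mcp_response("text/event-stream", sse)[0]["id"])  # 1
```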
SSE transport (legacy)
- GET {url} to open the SSE stream
- Send initialize via a POST to the session endpoint
- Wait for the initialize response on the SSE stream
- Send tools/list
- Compare schema hash
- Close the connection
Timeout behavior
The default timeout is 5 seconds per server. If the server does not respond within the timeout, the result is down with error: "timeout after 5s".
Configure per-server:
servers:
  - name: slow-mcp
    transport: streamable_http
    url: https://slow-mcp.company.com/mcp
    health_check_timeout_seconds: 15
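One way to picture the timeout rule is a wrapper that bounds the whole check and maps a timeout to a down result. This is a sketch assuming an asyncio-based checker; `check_with_timeout` and the result keys are illustrative, with the keys borrowed from the CLI's JSON output.

```python
import asyncio
import time


async def check_with_timeout(check, timeout_seconds: float = 5.0) -> dict:
    """Await `check()` and map a timeout to a down result."""
    started = time.perf_counter()
    try:
        await asyncio.wait_for(check(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return {"status": "down", "latency_ms": None,
                "error": f"timeout after {timeout_seconds:g}s"}
    latency_ms = (time.perf_counter() - started) * 1000
    return {"status": "up", "latency_ms": latency_ms, "error": None}


async def slow_server() -> None:
    await asyncio.sleep(0.2)   # stands in for a server that answers too late


result = asyncio.run(check_with_timeout(slow_server, timeout_seconds=0.05))
print(result["error"])  # timeout after 0.05s
```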
Backend probe
Some MCP servers can respond to tools/list while the backend application they wrap is down. For example, a DataHub MCP server may serve its tool manifest from a local cache while the DataHub REST API is unreachable. The standard up/down check would show the server as healthy.
The health_tool probe solves this. Configure a tool to call as a liveness check — if it fails, LangSight marks the server degraded instead of up.
servers:
  - name: datahub
    transport: streamable_http
    url: https://datahub-mcp.internal.company.com/mcp
    health_tool: search_entities      # tool to call as a backend liveness probe
    health_tool_args:                 # arguments passed to the tool
      query: "test"
      count: 1
    timeout_seconds: 15               # raise timeout for slower backends
The probe runs after tools/list on every health check cycle. If the tool call raises an error or returns an unexpected response, the result is degraded (not down) — preserving the distinction between “MCP layer unreachable” and “MCP layer up, backend down”.
Choose a tool that:
- Is cheap (a narrow read or search with a small result set)
- Exercises the critical backend path (database read, API call)
- Does not mutate state
Omitting health_tool skips the probe entirely, so the status is determined by initialize and tools/list alone.
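In MCP terms, the probe is a tools/call request, and a tool can fail either at the protocol level (a JSON-RPC error) or by returning a result flagged with `isError`. The sketch below maps those outcomes to a status. The helper names are hypothetical, and treating an empty `content` list as "unexpected" is an assumption, since this page does not define what counts as an unexpected response.

```python
from typing import Optional


def probe_request(msg_id: int, tool: str, args: dict) -> dict:
    """Build the tools/call message from health_tool and health_tool_args."""
    return {"jsonrpc": "2.0", "id": msg_id, "method": "tools/call",
            "params": {"name": tool, "arguments": args}}


def status_after_probe(response: Optional[dict]) -> str:
    """Map the probe's JSON-RPC response to a status.

    `response` is None when the call itself failed, for example on a timeout.
    """
    if response is None or "error" in response:
        return "degraded"   # protocol-level failure
    result = response.get("result")
    if not isinstance(result, dict) or result.get("isError"):
        return "degraded"   # the tool ran but reported a failure
    if not result.get("content"):
        return "degraded"   # assumption: an empty result counts as unexpected
    return "up"


ok = {"jsonrpc": "2.0", "id": 3,
      "result": {"content": [{"type": "text", "text": "1 entity"}], "isError": False}}
print(status_after_probe(ok))    # up
print(status_after_probe(None))  # degraded
```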
Data persistence
Health check results are stored in ClickHouse with a 90-day TTL. The schema:
CREATE TABLE health_checks (
    server_name  LowCardinality(String),
    status       LowCardinality(String),  -- up / degraded / down / stale
    latency_ms   Nullable(Float32),
    tools_count  Nullable(UInt16),
    schema_hash  Nullable(String),
    error        Nullable(String),
    checked_at   DateTime64(3, 'UTC')
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(checked_at)
ORDER BY (server_name, checked_at)
TTL checked_at + INTERVAL 90 DAY;
Query via the API:
# Latest status for all servers
curl http://localhost:8000/api/health/servers
# History for one server (last 20 checks)
curl "http://localhost:8000/api/health/servers/postgres-mcp/history?limit=20"
# Trigger an immediate check
curl -X POST http://localhost:8000/api/health/check
See Health API for the full endpoint reference.
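As an illustration of consuming the history endpoint from a script, the sketch below fetches recent checks and computes an uptime percentage. The response shape is an assumption: it is taken to be a JSON list of records with a `status` field, like the CLI's JSON output, so check the Health API reference for the actual schema. Counting only "up" as uptime is likewise a choice made for this example.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen


def fetch_history(base_url: str, server: str, limit: int = 20) -> list[dict]:
    """GET the history endpoint and decode the JSON body."""
    url = f"{base_url}/api/health/servers/{quote(server)}/history?limit={limit}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def uptime_percent(records: list[dict]) -> Optional[float]:
    """Share of checks with status "up"; None when there is no data."""
    if not records:
        return None
    up = sum(1 for r in records if r["status"] == "up")
    return 100.0 * up / len(records)


# Offline example with records shaped like the CLI's JSON output.
sample = [{"status": "up"}, {"status": "up"}, {"status": "down"}, {"status": "degraded"}]
print(uptime_percent(sample))  # 50.0
```

Against a running instance, `uptime_percent(fetch_history("http://localhost:8000", "postgres-mcp"))` would combine the two.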
Dashboard
The MCP Servers page (/servers) shows:
- A sortable catalog with current status, p99 latency sparkline, uptime %, Last Used, and Last OK? columns
  - Last Used: timestamp of the most recent tool call by any agent (7-day window)
  - Last OK?: whether the most recent tool call completed without error
- A Needs Attention banner listing any server that is DOWN or DEGRADED
- A detail panel with four tabs: About, Tools, Health history, and Consumers
- Tool reliability metrics (calls, errors, p99 latency, success rate) in the Tools tab
The former /health (Tool Health) page redirects to /servers — all tool reliability data is now in the Tools tab.
The dashboard pulls from the same ClickHouse data as the CLI.