
Overview

LangSight connects to each configured MCP server on a schedule, runs the MCP initialize handshake followed by tools/list, measures round-trip latency, and classifies the result into one of four status states. Results are stored in ClickHouse for trend analysis and used to drive the Scorecard grades.

Status states

| Status | Meaning |
| --- | --- |
| `up` | Server responded normally; latency within bounds |
| `degraded` | MCP layer is up but the backend is unhealthy: schema drift detected, or a `health_tool` probe failed |
| `down` | MCP server unreachable, timed out, or `initialize`/`tools/list` failed |
| `stale` | No health check has run in the configured interval |

How degraded is set

A server transitions to degraded in two ways:

1. Schema drift: the tool schema hash changes from the stored baseline. This covers:
   • A tool was added or removed
   • A tool’s input schema changed (required parameters, types)
   • A tool description was modified (potential poisoning vector)
   The degraded state prompts you to inspect the drift before deciding whether to accept or reject the change. See Schema Drift for the full diff.
2. Health tool probe failure: if you configure a health_tool (see Backend probe below) and that tool call fails after a successful tools/list, the server is marked degraded. This means the MCP layer is reachable but the backend application it wraps is down or unhealthy.
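The drift check reduces to comparing a canonical hash of the tools/list result against a stored baseline. LangSight's exact canonicalization is not documented here, so this is a minimal sketch (the `schema_hash` name and the SHA-256-over-sorted-JSON scheme are assumptions):

```python
import hashlib
import json

def schema_hash(tools: list) -> str:
    """Hash the tool manifest so that any change (tool added or removed,
    input schema edit, description edit) produces a new digest. Sorting by
    name and serializing with sorted keys makes the hash order-insensitive."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

baseline = schema_hash([
    {"name": "query", "description": "Run a read-only query",
     "inputSchema": {"type": "object", "required": ["sql"]}},
])
# Editing only the description still changes the hash, which is why
# description edits (a potential poisoning vector) register as drift:
drifted = schema_hash([
    {"name": "query", "description": "Run any query",
     "inputSchema": {"type": "object", "required": ["sql"]}},
])
```

Any scheme with these properties works; the key design point is that ordering differences must not produce spurious drift while description edits must.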

How down is set

A server is marked down when:
  • The TCP connection is refused or times out (default: 5 seconds)
  • The MCP initialize call fails
  • tools/list returns an error

The stale state

If the monitor daemon is not running and no manual check has been triggered in the last configured interval, LangSight marks servers as stale. This prevents the dashboard from showing optimistically green servers when checks have silently stopped.
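Staleness is a property of the check's age, not of the server, so it can be decided at read time. A sketch of that demotion rule (the 2x grace factor is an assumption, not documented LangSight behavior):

```python
from datetime import datetime, timedelta, timezone

def effective_status(stored: str, checked_at: datetime, interval_s: int,
                     now=None) -> str:
    """Demote a stored result to 'stale' once it is older than the configured
    check interval (with a 2x grace factor so one missed cycle is tolerated)."""
    now = now or datetime.now(timezone.utc)
    if now - checked_at > timedelta(seconds=2 * interval_s):
        return "stale"
    return stored
```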

One-shot health check

langsight mcp-health
Connects to every configured server, runs one check each, and prints a table:
MCP Server Health  (7 servers)
──────────────────────────────────────────────────────────────────────
Server            Status       Latency   Tools   Schema
snowflake-mcp     ✓ UP         89ms      8       bcf0ec26…
github-mcp        ✓ UP         54ms      12      d2125e3a…
slack-mcp         ✓ UP        142ms      4       a9e31f77…
jira-mcp          ✗ DOWN       —         —       —         timeout after 5s
postgres-mcp      ✓ UP         31ms      5       f4a2b1c9…
filesystem-mcp    ✓ UP         12ms      6       8e3d7f2a…
search-mcp        ⚠ DEGRADED   67ms      5       c4b7d91e…  schema drift detected

6/7 servers healthy
1 server degraded — run: langsight mcp-health --drift search-mcp

Options

| Option | Description |
| --- | --- |
| `--server <name>` | Check a single server only |
| `--json` | Output results as JSON |
| `--drift <name>` | Show the structural diff for a server with drift |
| `--config <path>` | Use a specific `.langsight.yaml` |

JSON output

langsight mcp-health --json | jq '.[] | {name: .server_name, status, latency_ms}'
[
  {
    "server_name": "postgres-mcp",
    "status": "up",
    "latency_ms": 31.4,
    "tools_count": 5,
    "schema_hash": "f4a2b1c9...",
    "schema_drift": false,
    "checked_at": "2026-03-26T08:00:00Z",
    "error": null
  }
]
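The JSON output is easy to post-process in scripts beyond jq. A small sketch that pulls out the servers needing attention (field names match the sample above; the `unhealthy` helper is illustrative):

```python
import json

def unhealthy(results: list) -> list:
    """Names of servers whose latest check is anything other than 'up'."""
    return [r["server_name"] for r in results if r["status"] != "up"]

# Stand-in for: json.loads(subprocess output of `langsight mcp-health --json`)
sample = json.loads("""
[
  {"server_name": "postgres-mcp", "status": "up",   "latency_ms": 31.4},
  {"server_name": "jira-mcp",     "status": "down", "latency_ms": null}
]
""")
```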

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All servers UP |
| 1 | One or more servers DOWN or DEGRADED |
This makes langsight mcp-health useful as a pre-flight check in CI/CD:
# .github/workflows/deploy.yml
- name: MCP health pre-check
  run: langsight mcp-health
  # exits 1 if any server is down → blocks deploy

Continuous monitoring daemon

langsight monitor
Runs health checks on a configurable interval and fires alerts on state changes (UP → DOWN, DOWN → UP, schema drift). Keeps running until stopped.
LangSight Monitor — watching 7 servers
  Check interval:  30s
  Alert channel:   Slack (#mcp-alerts)

14:00:01  snowflake-mcp     ✓ UP       89ms
14:00:01  github-mcp        ✓ UP       54ms
14:00:01  jira-mcp          ✗ DOWN     —
          Alert fired → Slack (#mcp-alerts)

14:00:31  jira-mcp          ✗ DOWN     —     (3 consecutive failures)
...
14:02:31  jira-mcp          ✓ UP       112ms
          Recovery alert → Slack (#mcp-alerts)

Configure the interval

# .langsight.yaml
monitor:
  health_check_interval_seconds: 30   # per-server or global

# Per-server override:
servers:
  - name: postgres-mcp
    transport: streamable_http
    url: https://postgres-mcp.internal.company.com/mcp
    health_check_interval_seconds: 60   # check this one less frequently

Alert on state changes only

By default, langsight monitor fires one alert when a server goes DOWN, not an alert on every failed check. A recovery alert fires when the server comes back UP. This prevents alert storms. To alert on every failure instead:
# .langsight.yaml
monitor:
  alert_on_every_failure: true
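The deduplication amounts to alerting on state transitions rather than on states. A sketch of that rule (function name and message format are illustrative, not LangSight internals):

```python
def alerts_for(prev: dict, current: dict, alert_on_every_failure=False) -> list:
    """One alert per status transition (up -> down, down -> up, etc.).
    With alert_on_every_failure, unchanged failing states also alert."""
    msgs = []
    for name, status in current.items():
        before = prev.get(name)
        if status != before:
            if before is not None:          # skip first-ever observation
                msgs.append(f"{name}: {before} -> {status}")
        elif alert_on_every_failure and status in ("down", "degraded"):
            msgs.append(f"{name}: still {status}")
    return msgs
```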

Health check protocol

This section describes what LangSight does under the hood when it checks a server. You don’t need to know this to use LangSight — it’s here for debugging and for operators writing custom health check integrations.

Stdio transport

  1. Spawn the process with the configured command and args
  2. Send initialize via stdin
  3. Wait for the initialized response (timeout: 5s by default)
  4. Send tools/list
  5. Compare returned schema hash to stored baseline
  6. Kill the process
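The steps above can be sketched as newline-delimited JSON-RPC over a child process. This is a simplified illustration, not LangSight's implementation: a real MCP client also sends the notifications/initialized notification and negotiates capabilities, and timeout handling is omitted here for brevity.

```python
import json
import subprocess
import time

def frame(method: str, msg_id: int, params=None) -> bytes:
    """One newline-delimited JSON-RPC 2.0 request, as used by MCP stdio."""
    msg = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        msg["params"] = params
    return (json.dumps(msg) + "\n").encode("utf-8")

def check_stdio(command: list) -> dict:
    """Steps 1-6 above, compressed: spawn, initialize, tools/list, kill."""
    start = time.monotonic()
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        proc.stdin.write(frame("initialize", 1, {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "health-probe", "version": "0.0.1"},
        }))
        proc.stdin.flush()
        proc.stdout.readline()                    # initialize result
        proc.stdin.write(frame("tools/list", 2))
        proc.stdin.flush()
        reply = json.loads(proc.stdout.readline())
        tools = reply.get("result", {}).get("tools", [])
        return {"status": "up", "tools_count": len(tools),
                "latency_ms": (time.monotonic() - start) * 1000}
    finally:
        proc.kill()                               # step 6: kill the process
```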

StreamableHTTP transport

  1. POST {url} with Content-Type: application/json and the initialize MCP message
  2. Parse the SSE/JSON response stream
  3. Send tools/list in the same session
  4. Compare schema hash
  5. Close the connection

SSE transport (legacy)

  1. GET {url} to open the SSE stream
  2. Send initialize via a POST to the session endpoint
  3. Wait for initialized event
  4. Send tools/list
  5. Compare schema hash
  6. Close the connection

Timeout behavior

The default timeout is 5 seconds per server. If the server does not respond within the timeout, the result is down with error: "timeout after 5s". Configure per-server:
servers:
  - name: slow-mcp
    transport: streamable_http
    url: https://slow-mcp.company.com/mcp
    health_check_timeout_seconds: 15

Backend probe

Some MCP servers can respond to tools/list while the backend application they wrap is down. For example, a DataHub MCP server may serve its tool manifest from a local cache while the DataHub REST API is unreachable. The standard up/down check would show the server as healthy. The health_tool probe solves this. Configure a tool to call as a liveness check — if it fails, LangSight marks the server degraded instead of up.
servers:
  - name: datahub
    transport: streamable_http
    url: https://datahub-mcp.internal.company.com/mcp
    health_tool: search_entities      # tool to call as a backend liveness probe
    health_tool_args:                 # arguments passed to the tool
      query: "test"
      count: 1
    health_check_timeout_seconds: 15  # raise timeout for slower backends
The probe runs after tools/list on every health check cycle. If the tool call raises an error or returns an unexpected response, the result is degraded (not down) — preserving the distinction between “MCP layer unreachable” and “MCP layer up, backend down”. Choose a tool that:
  • Is cheap (a narrow read or search with a small result set)
  • Exercises the critical backend path (database read, API call)
  • Does not mutate state
Omitting health_tool skips the probe entirely and restores the previous behavior.
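Taken together, the rules on this page reduce to a small decision function. A sketch (parameter names are illustrative):

```python
def classify(transport_ok: bool, tools_list_ok: bool, schema_drift: bool,
             probe_configured: bool, probe_ok: bool) -> str:
    """Status rules from this page: transport or tools/list failure is 'down';
    schema drift, or a failed health_tool probe after a successful tools/list,
    is 'degraded'; otherwise 'up'. ('stale' is decided by check age, not here.)"""
    if not transport_ok or not tools_list_ok:
        return "down"
    if schema_drift or (probe_configured and not probe_ok):
        return "degraded"
    return "up"
```

Note that with no health_tool configured, probe_ok is never consulted, matching the "skips the probe entirely" behavior above.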

Data persistence

Health check results are stored in ClickHouse with a 90-day TTL. The schema:
CREATE TABLE health_checks (
    server_name   LowCardinality(String),
    status        LowCardinality(String),    -- up / degraded / down / stale
    latency_ms    Nullable(Float32),
    tools_count   Nullable(UInt16),
    schema_hash   Nullable(String),
    error         Nullable(String),
    checked_at    DateTime64(3, 'UTC')
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(checked_at)
ORDER BY (server_name, checked_at)
TTL checked_at + INTERVAL 90 DAY;
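An uptime percentage can be derived from a window of these rows. A sketch, assuming every non-up status (degraded, down, stale) counts against uptime; LangSight's dashboard may weight degraded checks differently:

```python
def uptime_pct(rows: list) -> float:
    """Share of checks in the window whose status is 'up', as a percentage."""
    if not rows:
        return 0.0
    up = sum(1 for r in rows if r["status"] == "up")
    return round(100.0 * up / len(rows), 2)
```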
Query via the API:
# Latest status for all servers
curl http://localhost:8000/api/health/servers

# History for one server (last 20 checks)
curl "http://localhost:8000/api/health/servers/postgres-mcp/history?limit=20"

# Trigger an immediate check
curl -X POST http://localhost:8000/api/health/check
See Health API for the full endpoint reference.

Dashboard

The MCP Servers page (/servers) shows:
  • A sortable catalog with current status, p99 latency sparkline, uptime %, Last Used, and Last OK? columns
    • Last Used: timestamp of the most recent tool call by any agent (7-day window)
    • Last OK?: whether the most recent tool call completed without error
  • A Needs Attention banner listing any server that is DOWN or DEGRADED
  • A detail panel with four tabs: About, Tools, Health history, and Consumers
  • Tool reliability metrics (calls, errors, p99 latency, success rate) in the Tools tab
The former /health (Tool Health) page redirects to /servers — all tool reliability data is now in the Tools tab. The dashboard pulls from the same ClickHouse data as the CLI.