## What we measured
All numbers below come from load tests on a single MacBook Pro M-series running the full Docker stack locally. A properly provisioned cloud server will be faster.
| Scenario | Spans in DB | Window queried | Sessions list | Session trace |
|---|---|---|---|---|
| Pilot (30k spans) | 30k | 24h | 127ms | ~50ms |
| 1M spans, 24h window | 1M | 24h | 127ms | ~50ms |
| 1M spans, 7-day window | 1M | 7 days | 135ms | ~50ms |
| 1M spans, 30-day window | 1M | 30 days | 530ms | ~50ms |
The sessions list scales with rows in the time window, not total table size. ClickHouse’s MinMax index on started_at prunes partitions before any row is read. The 24h default window stays fast regardless of historical data volume.
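To make the pruning concrete, here is a hedged Python sketch of how such a window predicate might be built. The function name and exact SQL are assumptions, not LangSight's actual code, which lives in the API's ClickHouse layer:

```python
from datetime import datetime, timedelta, timezone

def sessions_window_filter(window: timedelta) -> str:
    """Build the started_at predicate for the sessions list query.

    ClickHouse's MinMax index on started_at uses this predicate to skip
    whole partitions before reading a single row, which is why query time
    tracks the window size rather than total table size.
    """
    cutoff = datetime.now(timezone.utc) - window
    return f"started_at >= toDateTime('{cutoff:%Y-%m-%d %H:%M:%S}')"

# Default 24h dashboard window:
print(sessions_window_filter(timedelta(hours=24)))
```

Widening the window to 30 days simply moves the cutoff back, which is why the 30-day query scans more rows and lands at ~530ms in the table above.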
## Current default settings
### API workers
| Setting | Default | Env var |
|---|---|---|
| Worker processes | 1 | LANGSIGHT_WORKERS |
| Requires Redis when > 1 | yes | LANGSIGHT_REDIS_URL |
Single-worker is correct for pilots and small teams. Rate limiting, SSE broadcasting, and the auth cache are all in-process and work correctly without Redis.
### Postgres connection pool
| Setting | Default | Env var |
|---|---|---|
| Min connections per worker | 2 | LANGSIGHT_PG_POOL_MIN |
| Max connections per worker | 50 | LANGSIGHT_PG_POOL_MAX |
| Postgres max_connections | 300 (docker-compose) | postgres command -c flag |
With 4 workers at the default 50 max per worker: 4 × 50 = 200 connections. The docker-compose sets max_connections=300 to leave headroom.
Rule of thumb: max_connections ≥ LANGSIGHT_WORKERS × LANGSIGHT_PG_POOL_MAX × 1.2
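Plugging in the Tier 2 numbers, the rule of thumb works out as follows (a small sketch; the helper name is ours, not part of LangSight):

```python
def required_max_connections(workers: int, pool_max_per_worker: int,
                             headroom: float = 1.2) -> int:
    """Postgres max_connections must cover every worker's pool at full
    size, plus ~20% headroom for migrations, psql sessions, and monitoring."""
    return round(workers * pool_max_per_worker * headroom)

# 4 workers x 50 connections each, with 20% headroom:
print(required_max_connections(4, 50))  # 240 -- the compose default of 300 clears this
```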
### ClickHouse
| Setting | Default | Where |
|---|---|---|
| Container memory | 4G | docker-compose.yml → ClickHouse deploy.resources.limits.memory |
| Per-query memory cap (session trace) | 500 MB | hardcoded in clickhouse.py |
| Concurrent span inserts per worker | 20 ÷ LANGSIGHT_WORKERS | traces.py _INSERT_SEM_LIMIT |
| Span retention TTL | 90 days | ClickHouse TTL clause in table DDL |
The concurrent insert semaphore divides the budget evenly across workers so total ClickHouse insert pressure stays constant regardless of worker count.
### Redis (multi-worker only)
| What Redis provides | Fallback (no Redis) |
|---|---|
| Cross-worker rate limiting | Per-worker limit (weaker) |
| Cross-worker SSE broadcasting | Only clients on same worker see events |
| Distributed login lockout | Per-worker only |
## Scaling tiers
### Tier 1 — Pilot / small team (default)
Up to ~50 concurrent dashboard users, ~20 simultaneous agent runs.
```shell
# .env — no changes needed from defaults
LANGSIGHT_WORKERS=1
# LANGSIGHT_REDIS_URL not set
```

```
# docker-compose.yml — defaults
clickhouse: memory: 4G
postgres:   memory: 1G
api:        memory: 512M
```
When to move to Tier 2: sessions list > 500ms, or you see SSE events missing on some dashboard tabs.
### Tier 2 — Team deployment (Redis + 4 workers)
Up to ~200 concurrent users, ~100 simultaneous agent runs.
```shell
# .env
LANGSIGHT_WORKERS=4
REDIS_PASSWORD=<generate with: python3 -c "import secrets; print(secrets.token_hex(32))">
LANGSIGHT_REDIS_URL=redis://:${REDIS_PASSWORD}@redis:6379
```

```shell
# Start with the Redis profile
docker compose --profile redis up -d
```
Redis enables:
- Rate limiter shared across all 4 workers (consistent limits)
- SSE broadcaster via Redis pub/sub (all workers deliver live events to all clients)
- Login lockout persists across restarts
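For intuition, the shared rate limiter can be sketched as a fixed-window counter. In this self-contained example a plain dict stands in for Redis; with Redis, the commented INCR/EXPIRE calls make the count shared across all workers:

```python
import time

_counts: dict[str, int] = {}  # stand-in for Redis; one shared instance in production

def allow_request(ip: str, limit: int = 10_000, window_s: int = 60) -> bool:
    """Fixed-window limiter: every worker incrementing the same key
    enforces one consistent limit per IP per window."""
    window = int(time.time()) // window_s
    key = f"ratelimit:{ip}:{window}"
    _counts[key] = _counts.get(key, 0) + 1   # Redis: INCR key, then EXPIRE key window_s
    return _counts[key] <= limit
```

Without Redis, the same logic runs against a per-process dict, which is the weaker per-worker fallback noted earlier.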
When to move to Tier 3: ClickHouse query time > 1s, or you’re ingesting > 500k spans/day.
### Tier 3 — High-volume (dedicated ClickHouse)
1M spans/day, > 500 concurrent users.
The bottleneck at this scale is ClickHouse, not the API. Options:
Option A — Scale vertically: Give ClickHouse more memory and CPU.
```yaml
# docker-compose.yml
clickhouse:
  deploy:
    resources:
      limits:
        memory: 16G   # was 4G
        cpus: "8"     # was 2
```
Option B — ClickHouse Cloud or dedicated server: Point LANGSIGHT_CLICKHOUSE_URL at an external ClickHouse instance. The API never connects to ClickHouse directly from user requests — all writes are async and reads have per-query memory caps.
```shell
# .env
LANGSIGHT_CLICKHOUSE_URL=https://your-instance.clickhouse.cloud:8443
LANGSIGHT_CLICKHOUSE_USERNAME=default
LANGSIGHT_CLICKHOUSE_PASSWORD=<password>
LANGSIGHT_CLICKHOUSE_DATABASE=langsight
```
Option C — Add a materialized view for sessions: At 10M+ spans, even with time-window pruning the GROUP BY aggregation for the sessions list becomes slow. A pre-aggregated AggregatingMergeTree view would keep the sessions list under 100ms regardless of volume. This is planned for a future release.
## API
```shell
# Number of Uvicorn worker processes. Requires Redis when > 1.
LANGSIGHT_WORKERS=4

# Postgres pool size per worker. Total connections = WORKERS × this value.
LANGSIGHT_PG_POOL_MAX=20  # reduce from the default 50 when running 4+ workers

# Redis URL (required for WORKERS > 1)
LANGSIGHT_REDIS_URL=redis://:password@redis:6379
```
## ClickHouse
```yaml
# docker-compose.yml
services:
  clickhouse:
    deploy:
      resources:
        limits:
          memory: 4G   # minimum for production; increase for high volume
          cpus: "2"
```

Span data retention defaults to 90 days, set in the ClickHouse TTL clause. Reduce it to lower storage costs:

```sql
ALTER TABLE mcp_tool_calls MODIFY TTL started_at + INTERVAL 30 DAY
```
## Postgres
```yaml
# docker-compose.yml
services:
  postgres:
    command: postgres -c max_connections=300 -c shared_buffers=256MB
    deploy:
      resources:
        limits:
          memory: 1G
```
## Load test results summary
These numbers come from k6 tests against a local Docker stack (MacBook Pro M-series, all containers on one machine). Cloud deployments will perform better due to dedicated resources.
### Dashboard reads (100 concurrent users)
| Configuration | p95 latency | Error rate |
|---|---|---|
| 1 worker, no Redis | 359ms | 0% |
| 4 workers + Redis (tuned) | ~200ms | 0% |
Sessions list, health check, and lineage all stay under 400ms at 100 VUs with the default 24h window.
### Span ingestion (100 concurrent agent flushes)
| Configuration | p95 latency | Error rate | Throughput |
|---|---|---|---|
| 1 worker, semaphore=20 | 229ms | 0.54% (cold ramp only) | 274 spans/s |
| 4 workers, semaphore=5/worker | 229ms | <0.5% | ~1,000 spans/s |
The semaphore caps concurrent ClickHouse inserts at 20 total (regardless of worker count) to prevent memory exhaustion.
Running 100 concurrent dashboard users and 100 simultaneous agent flushes from the same IP saturates single-node ClickHouse. In real deployments these loads are staggered: agents flush after runs complete (bursts, not continuous) and dashboard users poll at 30–300s intervals.
## Bottleneck decision tree
```
Sessions list slow? (> 500ms)
└─ Check: how many spans in the time window?
   > 500k → increase ClickHouse memory or reduce TTL
   < 500k → check if session_health_tags has many unmerged parts:
            SELECT count() FROM session_health_tags   (should be < 100k)

Span ingestion returning 429?
└─ Rate limit: 10,000 req/min per IP
   Multiple agent hosts? → each has its own limit, no issue
   Single host? → distribute agents across IPs or increase the limit in traces.py

Span ingestion returning 503?
└─ ClickHouse OOM: increase container memory from 4G → 8G

SSE live view missing events?
└─ LANGSIGHT_WORKERS > 1 without Redis → add LANGSIGHT_REDIS_URL

Dashboard login lockout not working after restart?
└─ Expected: lockout state is now Postgres-backed and survives restarts
   If still resetting → check that the login_failures table exists in Postgres
```