Benchmarking Volar-Based Language Servers


Reliable performance requires repeatable measurements. This guide explains how to benchmark a Volar-powered LSP—whether for regression testing, capacity planning, or comparing implementation choices. We cover synthetic microbenchmarks, realistic end-to-end scenarios, metrics to track, and tooling suggestions.

Benchmarking Goals

  1. Latency: response time for common requests (diagnostics, completion, hover, definition).
  2. Throughput: how many requests per second the server can handle under load.
  3. Resource usage: CPU, memory, and IO characteristics.
  4. Stability: behavior under long-running or adverse conditions (massive workspaces, schema outages).

Benchmarking Environments

| Environment | Use Case |
| --- | --- |
| Local machine | Quick iteration, microbenchmarks |
| Dedicated CI runner | Repeatable, no background processes |
| Containerized setup | Shareable configuration, reproducible |

Ensure you run benchmarks on idle machines to reduce noise (disable background indexing, close unrelated apps).

Metrics to Capture

  • Latency (ms): time from request start to response.
  • P99 latency: worst-case tail latency for each request type.
  • CPU usage (%): average and peaks.
  • Memory usage (MB): total process memory, heap snapshots.
  • Garbage Collection stats: number/duration of GC cycles.
  • Cache hit rate: schema cache hits, TypeScript project reuse, etc.
  • Errors per minute: schema fetch failures, exceptions.
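
Most of these numbers can be derived from a simple per-request latency log. Below is a minimal sketch of turning recorded latencies into mean/P95/P99 summaries using a nearest-rank percentile; the helper names are illustrative, not part of any tool.

// Turn a list of per-request latencies (in ms) into the summary numbers above.
// Nearest-rank percentile; helper names are illustrative.
function percentile(sortedMs: number[], p: number): number {
  const rank = Math.ceil(p * sortedMs.length);
  return sortedMs[Math.min(sortedMs.length, Math.max(1, rank)) - 1];
}

function summarize(latenciesMs: number[]) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const meanMs = sorted.reduce((sum, ms) => sum + ms, 0) / sorted.length;
  return {
    samples: sorted.length,
    meanMs,
    p95Ms: percentile(sorted, 0.95),
    p99Ms: percentile(sorted, 0.99),
  };
}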

Tooling

1. LSP Bench Harness

Use lsp-bench or similar tools to replay LSP traces:

npx @vscode/lsp-bench run \
  --server "node dist/server.js" \
  --trace ./fixtures/sample.trace.json \
  --out ./bench/results.json

  • Prepare trace files capturing representative user sessions (diagnostics, completion, hover, rename).
  • lsp-bench reports per-method latency, throughput, and timeouts.

2. Custom Harness

Write a Node script to issue requests in sequence or parallel:

import type { MessageConnection } from 'vscode-jsonrpc/node';
import type { Position } from 'vscode-languageserver-protocol';

// `connection` is a client connection to the server under test, for example
// created with vscode-jsonrpc over the spawned server process's stdio.
async function benchmarkCompletion(
  connection: MessageConnection,
  docUri: string,
  positions: Position[],
) {
  const latencies: number[] = [];
  for (const pos of positions) {
    const start = performance.now();
    await connection.sendRequest('textDocument/completion', { textDocument: { uri: docUri }, position: pos });
    latencies.push(performance.now() - start);
  }
  return latencies;
}

  • Use performance.now() or process.hrtime.bigint() for high-resolution timing.
  • Run requests concurrently to test throughput (e.g., Promise.all on multiple files), as in the sketch below.
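
To measure throughput rather than single-request latency, the same harness can issue batches of requests in parallel. A sketch, assuming the imports and the benchmarkCompletion helper from the block above:

// Fire completion batches for several documents at once and derive a rough
// requests-per-second figure from the wall-clock time of the whole batch.
async function benchmarkThroughput(
  connection: MessageConnection,
  docs: { uri: string; positions: Position[] }[],
) {
  const start = performance.now();
  const perDoc = await Promise.all(
    docs.map((doc) => benchmarkCompletion(connection, doc.uri, doc.positions)),
  );
  const elapsedMs = performance.now() - start;
  const totalRequests = perDoc.reduce((sum, latencies) => sum + latencies.length, 0);
  return { totalRequests, elapsedMs, requestsPerSecond: totalRequests / (elapsedMs / 1000) };
}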

3. Profilers

  • node --prof to generate V8 CPU profiles; use node --prof-process to inspect.
  • node --inspect for interactive CPU/memory profiling.

4. OS Metrics

  • macOS: Instruments, Activity Monitor.
  • Linux: perf, htop, pidstat.
  • Windows: Performance Monitor.

Benchmark Scenarios

  1. Cold start: time from process launch to first diagnostic result on a large workspace (see the sketch after this list).
  2. Hot completion: repeated textDocument/completion calls while typing in large files.
  3. Full diagnostics: workspace/diagnostic on a monorepo.
  4. Schema outage: simulate offline schema servers to ensure fallback paths are fast.
  5. Take Over Mode: measure TS+Volar integration performance with .ts/.vue editing.
  6. Concurrent editing: multiple documents edited simultaneously, triggering overlapping diagnostics.
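
For the cold-start scenario, a rough client-side measurement can spawn the server, run the LSP handshake, open a document, and stop the clock at the first textDocument/publishDiagnostics notification. This is only a sketch: the server command and --stdio flag are placeholders for your setup, the empty capabilities and missing initializationOptions will usually need filling in for a Volar-based server, and it assumes the server pushes diagnostics rather than relying solely on the pull model.

import { spawn } from 'node:child_process';
import {
  createMessageConnection,
  StreamMessageReader,
  StreamMessageWriter,
} from 'vscode-jsonrpc/node';

// Spawn the server, perform the LSP handshake, open one document, and stop
// the clock at the first publishDiagnostics notification. The server command,
// the --stdio flag, and the empty capabilities are placeholders.
async function measureColdStart(rootUri: string, docUri: string, text: string) {
  const start = performance.now();
  const server = spawn('node', ['dist/server.js', '--stdio'], { stdio: ['pipe', 'pipe', 'inherit'] });
  const connection = createMessageConnection(
    new StreamMessageReader(server.stdout!),
    new StreamMessageWriter(server.stdin!),
  );

  const firstDiagnostics = new Promise<void>((resolve) => {
    connection.onNotification('textDocument/publishDiagnostics', () => resolve());
  });
  connection.listen();

  await connection.sendRequest('initialize', { processId: process.pid, rootUri, capabilities: {} });
  connection.sendNotification('initialized', {});
  connection.sendNotification('textDocument/didOpen', {
    textDocument: { uri: docUri, languageId: 'vue', version: 1, text },
  });

  await firstDiagnostics;
  const elapsedMs = performance.now() - start;
  server.kill();
  connection.dispose();
  return elapsedMs;
}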

Benchmark Workflow

  1. Prepare fixtures: real-world repositories or synthetic workloads (e.g., 1,000 .vue files).
  2. Record a baseline: run benchmarks on the main branch to capture baseline metrics.
  3. Introduce changes: branch with your modifications.
  4. Rerun benchmarks: identical environment, compare results.
  5. Analyze: highlight regressions/improvements with absolute numbers and percentages.
  6. Automate: integrate into CI to catch regressions before merging.

Reporting

Include:

  • Method (e.g., textDocument/completion).
  • Sample size: number of requests.
  • Mean/P95/P99 latency.
  • CPU/memory usage snapshots.
  • Notes on environment (machine specs, Node version).
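
One way to keep reports uniform is a small record per method; the field names here are illustrative, not part of any tool's output.

// One row of a benchmark report; field names are illustrative.
interface BenchmarkReportEntry {
  method: string;      // e.g. 'textDocument/completion'
  samples: number;     // number of requests measured
  meanMs: number;
  p95Ms: number;
  p99Ms: number;
  environment: string; // machine specs, Node version, OS
}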

Example summary:

| Scenario | Baseline P95 | New P95 | Delta |
| --- | --- | --- | --- |
| Completion (component props) | 42ms | 38ms | -9.5% |
| Hover | 31ms | 34ms | +9.7% |
| Cold diagnostics (monorepo) | 3.2s | 2.8s | -12.5% |

Tips & Best Practices

  1. Isolate background noise: disable incremental builds, watchers, or other processes during runs.
  2. Warm caches: run a warm-up cycle before recording metrics to simulate typical usage.
  3. Measure tails: averages hide spikes; always record P95/P99.
  4. Instrument in code: log per-request durations (e.g., around collectDiagnostics, as in the sketch after this list) to correlate with client-side measurements.
  5. Version your traces: keep benchmark traces under version control so future runs stay consistent.
  6. Automate regression detection: fail CI if key metrics regress beyond a threshold (e.g., P95 > baseline + 20%).
  7. Consider user hardware: run benchmarks on lower-spec machines to mimic real users (e.g., 4-core laptops).
  8. Monitor GC: long GC pauses can inflate tail latency; capture heap snapshots and consider tuning Node's GC flags if needed.
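
For tip 4, a small timing wrapper around existing handlers is usually enough; collectDiagnostics is only an example target, so wrap whichever functions you want to observe.

// Wrap an async handler so every call logs its duration with a label.
function timed<Args extends unknown[], R>(
  label: string,
  fn: (...args: Args) => Promise<R>,
): (...args: Args) => Promise<R> {
  return async (...args: Args) => {
    const start = performance.now();
    try {
      return await fn(...args);
    } finally {
      // Log to stderr so the output stays off the stdio channel carrying LSP messages.
      console.error(`[bench] ${label} ${(performance.now() - start).toFixed(1)}ms`);
    }
  };
}

// Usage (hypothetical handler): const collectDiagnostics = timed('collectDiagnostics', collectDiagnosticsImpl);

Logging to stderr keeps the measurements out of the stdio channel that the language server uses for protocol messages.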

Sample CI Step

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run build
      - run: npx @vscode/lsp-bench run --server "node dist/server.js" --trace ./bench/trace.json --out bench/results.json
      - run: node scripts/compare-benchmarks.js bench/results.json bench/baseline.json

The comparison script should exit non-zero if metrics exceed allowed regression thresholds.
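
A minimal sketch of what scripts/compare-benchmarks.js could look like, assuming both JSON files map method names to objects with a p95Ms field; adapt the shape to whatever your harness actually writes.

import { readFileSync } from 'node:fs';

// Compare per-method P95 latencies against a baseline and exit non-zero when
// any method regresses by more than the allowed threshold.
// Assumption: both files map method names to objects with a p95Ms field.
const ALLOWED_REGRESSION = 0.2; // 20%

const [resultsPath, baselinePath] = process.argv.slice(2);
const results = JSON.parse(readFileSync(resultsPath, 'utf8'));
const baseline = JSON.parse(readFileSync(baselinePath, 'utf8'));

let failed = false;
for (const [method, base] of Object.entries(baseline)) {
  const current = results[method];
  if (!current) continue;
  const delta = (current.p95Ms - base.p95Ms) / base.p95Ms;
  console.log(`${method}: ${base.p95Ms}ms -> ${current.p95Ms}ms (${(delta * 100).toFixed(1)}%)`);
  if (delta > ALLOWED_REGRESSION) failed = true;
}

process.exit(failed ? 1 : 0);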

Conclusion

Benchmarking isn't a one-off task; embed it into your development lifecycle to catch regressions early and keep the Volar experience snappy. By combining realistic traces, automated harnesses, and clear reporting, you can confidently evolve your language server while preserving top-tier performance.