# Benchmarking Volar-Based Language Servers
Reliable performance requires repeatable measurements. This guide explains how to benchmark a Volar-powered language server—whether for regression testing, capacity planning, or comparing implementation choices. We cover synthetic microbenchmarks, realistic end-to-end scenarios, metrics to track, and tooling suggestions.
## Benchmarking Goals
- Latency – response time for common requests (diagnostics, completion, hover, definition).
- Throughput – how many requests per second the server can handle under load.
- Resource usage – CPU, memory, and IO characteristics.
- Stability – behavior under long-running or adverse conditions (massive workspaces, schema outages).
## Benchmarking Environments
| Environment | Use Case |
|---|---|
| Local machine | Quick iteration, microbenchmarks |
| Dedicated CI runner | Repeatable, no background processes |
| Containerized setup | Shareable configuration, reproducible |
Ensure you run benchmarks on idle machines to reduce noise (disable background indexing, close unrelated apps).
## Metrics to Capture
- Latency (ms): time from request start to response.
- P99 latency: tail latency (99th percentile) for each request type; see the sketch after this list.
- CPU usage (%): average and peaks.
- Memory usage (MB): total process memory, heap snapshots.
- Garbage Collection stats: number/duration of GC cycles.
- Cache hit rate: schema cache hits, TypeScript project reuse, etc.
- Errors per minute: schema fetch failures, exceptions.
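Percentiles can be derived from the raw latency samples your harness records. A minimal sketch (the `percentile` helper and the sample values are illustrative, not part of any particular tool):

```ts
// Nearest-rank percentile (p between 0 and 1) over latency samples in ms.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[index];
}

const latencies = [12, 15, 14, 90, 13, 16, 14, 200, 15, 13];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(`mean: ${mean.toFixed(1)}ms, p95: ${percentile(latencies, 0.95)}ms, p99: ${percentile(latencies, 0.99)}ms`);
```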
## Tooling

### 1. LSP Bench Harness

Use `lsp-bench` or similar tools to replay LSP traces:

```bash
npx @vscode/lsp-bench run \
  --server "node dist/server.js" \
  --trace ./fixtures/sample.trace.json \
  --out ./bench/results.json
```

- Prepare trace files capturing representative user sessions (diagnostics, completion, hover, rename).
- `lsp-bench` reports per-method latency, throughput, and timeouts.
### 2. Custom Harness
Write a Node script to issue requests in sequence or parallel:
```ts
import { MessageConnection } from 'vscode-jsonrpc';
import { Position } from 'vscode-languageserver-protocol';

// Measure completion latency for a series of cursor positions in one document.
// `connection` is a client-side JSON-RPC connection to the server (see below).
async function benchmarkCompletion(connection: MessageConnection, docUri: string, positions: Position[]): Promise<number[]> {
  const latencies: number[] = [];
  for (const pos of positions) {
    const start = performance.now();
    await connection.sendRequest('textDocument/completion', { textDocument: { uri: docUri }, position: pos });
    latencies.push(performance.now() - start);
  }
  return latencies;
}
```
- Use `performance.now()` or `process.hrtime.bigint()` for high-resolution timing.
- Run requests concurrently to test throughput (e.g., `Promise.all` on multiple files), as sketched below.
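If you are not replaying recorded traces, the harness can spawn the server itself and talk to it over stdio with `vscode-jsonrpc`. A minimal sketch, assuming a stdio-capable build at `dist/server.js`; a complete harness would also perform the `initialize`/`initialized` handshake before sending requests:

```ts
import { spawn } from 'node:child_process';
import {
  createMessageConnection,
  StreamMessageReader,
  StreamMessageWriter,
} from 'vscode-jsonrpc/node';
import { Position } from 'vscode-languageserver-protocol';

// Launch the server over stdio; the path and --stdio flag depend on your build.
const server = spawn('node', ['dist/server.js', '--stdio']);
const connection = createMessageConnection(
  new StreamMessageReader(server.stdout!),
  new StreamMessageWriter(server.stdin!)
);
connection.listen();

// Throughput check: issue completion requests for several documents at once
// and compare total wall-clock time against the sum of sequential latencies.
async function benchmarkThroughput(docUris: string[], position: Position): Promise<number> {
  const start = performance.now();
  await Promise.all(
    docUris.map((uri) =>
      connection.sendRequest('textDocument/completion', {
        textDocument: { uri },
        position,
      })
    )
  );
  return performance.now() - start;
}
```

How much the concurrent run beats the sequential total tells you how well the server overlaps request handling.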
### 3. Profilers

- `node --prof` to generate V8 CPU profiles; use `node --prof-process` to inspect them (example below).
- `node --inspect` for interactive CPU/memory profiling (e.g., via Chrome DevTools).
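For example, to capture and inspect a CPU profile of the server process (the isolate log file name is generated by V8 and varies per run):

```bash
# Start the server under the V8 sampling profiler and drive it with your harness,
# then post-process the generated isolate log into a readable report.
node --prof dist/server.js
node --prof-process isolate-0x*-v8.log > profile.txt
```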
### 4. OS Metrics

- macOS: Instruments, Activity Monitor.
- Linux: `perf`, `htop`, `pidstat`.
- Windows: Performance Monitor.
## Benchmark Scenarios

- Cold start: time from process launch to the first diagnostic result on a large workspace (see the sketch after this list).
- Hot completion: repeated `textDocument/completion` calls while typing in large files.
- Full diagnostics: `workspace/diagnostic` on a monorepo.
- Schema outage: simulate offline schema servers to ensure fallback paths are fast.
- Take Over Mode: measure TS + Volar integration performance with `.ts`/`.vue` editing.
- Concurrent editing: multiple documents edited simultaneously, triggering overlapping diagnostics.
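The cold-start scenario can be scripted directly against the protocol. A sketch, reusing the `connection` from the custom harness above and assuming the server pushes `textDocument/publishDiagnostics`; URIs and file contents are placeholders:

```ts
import { MessageConnection } from 'vscode-jsonrpc';

// Time from the initialize handshake to the first publishDiagnostics
// notification for a freshly opened document.
async function measureColdStart(connection: MessageConnection, docUri: string, text: string): Promise<number> {
  const start = performance.now();
  const firstDiagnostics = new Promise<number>((resolve) => {
    connection.onNotification('textDocument/publishDiagnostics', () => {
      resolve(performance.now() - start);
    });
  });
  await connection.sendRequest('initialize', {
    processId: process.pid,
    rootUri: 'file:///path/to/fixture-workspace', // placeholder workspace
    capabilities: {},
  });
  await connection.sendNotification('initialized', {});
  await connection.sendNotification('textDocument/didOpen', {
    textDocument: { uri: docUri, languageId: 'vue', version: 1, text },
  });
  return firstDiagnostics;
}
```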
## Benchmark Workflow

- Prepare fixtures – real-world repositories or synthetic workloads (e.g., 1,000 `.vue` files).
- Record baseline – run benchmarks on the `main` branch to capture baseline metrics.
- Introduce changes – branch with your modifications.
- Rerun benchmarks – identical environment, compare results.
- Analyze – highlight regressions/improvements with absolute numbers and percentages.
- Automate – integrate into CI to catch regressions before merging.
## Reporting

Include:

- Method: e.g., `textDocument/completion`.
- Sample size: number of requests.
- Mean/P95/P99 latency.
- CPU/memory usage snapshots.
- Notes on environment (machine specs, Node version).
Example summary:
| Scenario | Baseline P95 | New P95 | Delta |
|---|---|---|---|
| Completion (component props) | 42ms | 38ms | -9.5% |
| Hover | 31ms | 34ms | +9.7% |
| Cold diagnostics (monorepo) | 3.2s | 2.8s | -12.5% |
## Tips & Best Practices
- Isolate background noise – disable incremental builds, watchers, or other processes during runs.
- Warm caches – run a warm-up cycle before recording metrics to simulate typical usage.
- Measure tails – averages hide spikes; always record P95/P99.
- Instrument in code – log per-request durations (e.g., around `collectDiagnostics`) to correlate with client-side measurements; see the sketch after this list.
- Version your traces – keep benchmark traces under version control so future runs stay consistent.
- Automate regression detection – fail CI if key metrics regress beyond a threshold (e.g., P95 > baseline + 20%).
- Consider user hardware – run benchmarks on lower-spec machines to mimic real users (e.g., 4-core laptops).
- Monitor GC – long GC pauses can inflate tail latency; capture heap snapshots and consider tuning Node’s GC flags if needed.
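For the in-code instrumentation tip, a small timing wrapper is usually enough. The wrapper below and its log format are illustrative, and `collectDiagnostics` stands in for whatever handler you want to measure:

```ts
// Wrap an async handler and log its duration to stderr (stdout stays free for
// the LSP protocol when the server runs over stdio).
function timed<A extends unknown[], R>(name: string, fn: (...args: A) => Promise<R>): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = performance.now();
    try {
      return await fn(...args);
    } finally {
      console.error(`[bench] ${name} took ${(performance.now() - start).toFixed(1)}ms`);
    }
  };
}

// Hypothetical usage: wrap the handler once at startup.
// const collectDiagnostics = timed('collectDiagnostics', rawCollectDiagnostics);
```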
## Sample CI Step

```yaml
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run build
      - run: npx @vscode/lsp-bench run --server "node dist/server.js" --trace ./bench/trace.json --out bench/results.json
      - run: node scripts/compare-benchmarks.js bench/results.json bench/baseline.json
```
The comparison script should exit non-zero if metrics exceed allowed regression thresholds.
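A sketch of such a script, assuming the results files are JSON arrays of `{ method, p95 }` entries (adjust the field names to whatever your harness actually emits; shown in TypeScript for consistency with the harness snippets, while the CI step above calls a compiled `.js` build):

```ts
import { readFileSync } from 'node:fs';

// Assumed result shape; align with your harness's actual output format.
interface BenchResult { method: string; p95: number; }

const [currentPath, baselinePath] = process.argv.slice(2);
const current: BenchResult[] = JSON.parse(readFileSync(currentPath, 'utf8'));
const baseline: BenchResult[] = JSON.parse(readFileSync(baselinePath, 'utf8'));

const MAX_REGRESSION = 1.2; // fail if P95 grows more than 20% over baseline
let failed = false;

for (const entry of current) {
  const base = baseline.find((b) => b.method === entry.method);
  if (!base) continue; // no baseline yet for this method
  const ratio = entry.p95 / base.p95;
  console.log(`${entry.method}: ${base.p95}ms -> ${entry.p95}ms (${((ratio - 1) * 100).toFixed(1)}%)`);
  if (ratio > MAX_REGRESSION) {
    console.error(`Regression: ${entry.method} exceeded the allowed threshold`);
    failed = true;
  }
}

process.exit(failed ? 1 : 0);
```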
## Conclusion
Benchmarking isn’t a one-off task—embed it into your development lifecycle to catch regressions early and keep the Volar experience snappy. By combining realistic traces, automated harnesses, and clear reporting, you can confidently evolve your language server while preserving top-tier performance.