Evaluating Model Performance
Prerequisites
Quick Start
Run the full benchmark pipeline (output-length estimation → performance sweep → CSV):
This runs all 9 subsets from RedHatAI/speculator_benchmarks and produces perf_results_<timestamp>/perf_results.csv.
For acceptance rates only (skips the sweep):
See examples/evaluate/ for end-to-end examples that launch a vLLM server and run the pipeline.
Options
Both throughput and sweep share the same options:
--target URL vLLM server endpoint (required)
--dataset DATASET HF dataset ID or local dir (default: RedHatAI/speculator_benchmarks)
--subsets LIST Comma-separated subset names (default: all 9)
--output-dir DIR Output directory (default: perf_results_TIMESTAMP)
--max-concurrency N Max concurrent requests (default: 128)
--max-requests N Max requests per sweep point (default: 200)
--gen-len-rate N Request rate for gen-len estimation (default: 128)
--gen-kwargs JSON Generation kwargs, e.g. '{"temperature":0.6}'
--data-column-mapper JSON Column mapping for guidellm (default: '{"text_column":"prompt"}')
Visualization
# Compare multiple versions
python plot.py compare \
--source "No Spec=nospec/perf_results.csv" \
--source "DFlash=dflash/perf_results.csv" \
--metric latency --output-dir ./plots
# Pairwise speedup (blue = faster, red = regression)
python plot.py speedup \
--baseline "No Spec=nospec/perf_results.csv" \
--target "DFlash=dflash/perf_results.csv" \
--metric latency --title "Qwen3-8B" --output-dir ./plots
Both accept CSVs or raw GuideLLM sweep JSONs. Available metrics: latency, itl, ttft, output_tps.