Skip to content

Evaluating Model Performance

Prerequisites

cd scripts/evaluate
pip install -r requirements.txt

Quick Start

Run the full benchmark pipeline (output-length estimation → performance sweep → CSV):

python evaluate.py sweep --target http://localhost:8000/v1

This runs all 9 subsets from RedHatAI/speculator_benchmarks and produces perf_results_<timestamp>/perf_results.csv.

For acceptance rates only (skips the sweep):

python evaluate.py throughput --target http://localhost:8000/v1

See examples/evaluate/ for end-to-end examples that launch a vLLM server and run the pipeline.

Options

Both throughput and sweep share the same options:

  --target URL               vLLM server endpoint (required)
  --dataset DATASET          HF dataset ID or local dir (default: RedHatAI/speculator_benchmarks)
  --subsets LIST             Comma-separated subset names (default: all 9)
  --output-dir DIR           Output directory (default: perf_results_TIMESTAMP)
  --max-concurrency N        Max concurrent requests (default: 128)
  --max-requests N           Max requests per sweep point (default: 200)
  --gen-len-rate N           Request rate for gen-len estimation (default: 128)
  --gen-kwargs JSON          Generation kwargs, e.g. '{"temperature":0.6}'
  --data-column-mapper JSON  Column mapping for guidellm (default: '{"text_column":"prompt"}')

Visualization

# Compare multiple versions
python plot.py compare \
    --source "No Spec=nospec/perf_results.csv" \
    --source "DFlash=dflash/perf_results.csv" \
    --metric latency --output-dir ./plots

# Pairwise speedup (blue = faster, red = regression)
python plot.py speedup \
    --baseline "No Spec=nospec/perf_results.csv" \
    --target "DFlash=dflash/perf_results.csv" \
    --metric latency --title "Qwen3-8B" --output-dir ./plots

Both accept CSVs or raw GuideLLM sweep JSONs. Available metrics: latency, itl, ttft, output_tps.