Response Regeneration
This tutorial walks you through regenerating assistant responses in an existing dataset using a target model served by vLLM. The resulting dataset pairs the original user prompts with freshly generated responses, which can be used as training data for speculator models with improved alignment to your target model.
Overview
Time required: ~10 mins on 2x H100 GPUs (for 1K samples)
Prerequisites:
- Python 3.10+
- CUDA-capable GPU(s)
vllminstalled (uv pip install "vllm>=0.19.1")
Step 1: Run the Pipeline
The simplest way to regenerate responses is using the run_all.sh script, which handles starting a vLLM server, running the regeneration, and stopping the server.
./scripts/response_regeneration/run_all.sh \
--model "meta-llama/Llama-3.3-70B-Instruct" \
--dataset magpie \
--limit 1000
This will:
- Start a vLLM server with the specified model
- Extract prompts from the dataset and generate new responses
- Save results to a JSONL file (e.g.,
magpie_Llama-3.3-70B-Instruct.jsonl) - Stop the server
Multi-GPU Configurations
For larger models, use data parallelism and/or tensor parallelism:
# Llama 3.3 70B on 8 GPUs (4 data-parallel replicas with TP=2)
./scripts/response_regeneration/run_all.sh \
--model "meta-llama/Llama-3.3-70B-Instruct" \
--dp-size 4 --tp-size 2 \
--dataset magpie
# Select specific GPUs
./scripts/response_regeneration/run_all.sh \
--model "Qwen/Qwen2.5-72B-Instruct" \
--gpus 0,1,2,4 --tp-size 4 \
--dataset magpie
Step 2: Verify the Output
The output is a JSONL file where each line contains the original prompt and the generated response in a conversations format compatible with fine-tuning:
{
"id": "sample_0",
"conversations": [
{"from": "human", "value": "What is the capital of France?"},
{"from": "gpt", "value": "The capital of France is Paris."}
],
"metadata": {
"idx": 0,
"finish_reason": "stop",
"latency_s": 1.234,
"usage": {...},
"endpoint": "http://127.0.0.1:8000/v1/chat/completions"
}
}
Check that the output looks correct:
# Count completed rows
wc -l magpie_Llama-3.3-70B-Instruct.jsonl
# Inspect first row
head -1 magpie_Llama-3.3-70B-Instruct.jsonl | python -m json.tool
Step 3: Use the Data for Training
The output JSONL can be passed directly to prepare_data.py for speculator training:
python scripts/prepare_data.py \
--model meta-llama/Llama-3.3-70B-Instruct \
--data ./magpie_Llama-3.3-70B-Instruct.jsonl \
--output ./output \
--seq-length 8192
Advanced: Manual Control
If you prefer to manage the vLLM server yourself (e.g., to reuse a server across multiple runs), you can run the regeneration script directly:
# 1. Start vLLM server
vllm serve "meta-llama/Llama-3.3-70B-Instruct" \
--data-parallel-size 4 --tensor-parallel-size 2 \
--port 8000
# 2. Run regeneration (model auto-detected from server)
python scripts/response_regeneration/script.py \
--dataset magpie \
--limit 1000
# 3. Stop server when done (Ctrl+C)
Resuming Interrupted Processing
If processing is interrupted, use the --resume flag to skip already-processed rows:
python scripts/response_regeneration/script.py \
--dataset magpie \
--outfile magpie_Llama-3.3-70B-Instruct.jsonl \
--resume
Keeping the Server Running
Use --keep-server with run_all.sh to leave the vLLM server running after processing, useful when running multiple regeneration jobs:
# First run - start server and keep it
./scripts/response_regeneration/run_all.sh \
--model "Qwen/Qwen2.5-72B-Instruct" \
--dataset magpie --keep-server
# Second run - use the already-running server directly
python scripts/response_regeneration/script.py --dataset ultrachat
Next Steps
After regenerating your dataset:
- Train a speculator - See Train Eagle-3 Online or Train Eagle-3 Offline
- Evaluate performance - See Evaluating Performance
- Deploy to production - See Serve in vLLM
For the full list of arguments for both scripts, see the response_regeneration CLI reference.