prepare_data.py
Prepares data for speculator training by:
- Applying chat template and tokenizing each sample
- Producing a loss/assistant mask for each sample
- Recording token frequency statistics
The output is a processed dataset ready for online training or offline hidden states generation.
Basic Usage
python scripts/prepare_data.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--data sharegpt \
--output ./training_data \
--max-samples 5000
Arguments
Model Arguments
--model(str, required) HuggingFace model ID or local path for the target model.
Example: meta-llama/Llama-3.1-8B-Instruct
Data Arguments
--data(str, required, repeatable) Path to training data. Can be a HuggingFace dataset name or local path. Use multiple times to specify multiple datasets.
Example: --data sharegpt --data ./custom_data.jsonl
-
--seq-length(int, default:8192) Maximum sequence length for each sample. Longer samples will be truncated. -
--max-samples(int, default:None) Maximum number of samples to process. IfNone, processes all samples. -
--token-freq-path(str, default:{output}/token_freq.pt) Path to save token frequency distribution. Defaults totoken_freq.ptin the output directory. -
--assistant-pattern(str, default:None) Custom regex pattern for matching assistant responses. If not provided, auto-detected from chat template. -
--turn-dropout(flag) Enable turn dropout: randomly keeps first N consecutive turns per conversation for data augmentation. -
--minimum-valid-tokens(int, default:None) Drop samples whose loss mask contains fewer than this many trainable tokens.
Output Arguments
-
--output(str, required) Directory to save the processed dataset. -
--overwrite(flag) Forcibly rerun preprocessing and overwrite existing content in output directory.
Processing Arguments
-
--seed(int, default:0) Random seed for reproducibility. Must match the seed used in other scripts. -
--num-preprocessing-workers(int, default:8) Number of CPU processes for dataset preprocessing.