Custom dataset implementation for JSON and CSV data sources.
This module provides a CustomDataset class for loading and processing
local JSON and CSV files for text generation fine-tuning. It supports
flexible data formats and custom preprocessing pipelines for
user-provided datasets.
Classes:
- CustomDataset – Child text generation class for custom local datasets; supports loading CSV and JSON files
CustomDataset(
    dataset_args: DatasetArguments,
    split: str,
    processor: Processor,
)
Bases: TextGenerationDataset
Child text generation class for custom local datasets. Supports loading
from CSV and JSON files.
Parameters:
- dataset_args (DatasetArguments) – configuration settings for dataset loading
- split (str) – split from dataset to load, for instance test or train[:5%]. Can also be set to None to load all the splits
- processor (Processor) – processor or tokenizer to use on dataset
Source code in src/llmcompressor/transformers/data/base.py
def __init__(
    self,
    dataset_args: DatasetArguments,
    split: str,
    processor: Processor,
):
    self.dataset_args = dataset_args
    self.split = split
    self.processor = processor

    # get tokenizer
    self.tokenizer = getattr(self.processor, "tokenizer", self.processor)

    if self.tokenizer is not None:
        # fill in pad token
        if not self.tokenizer.pad_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # configure sequence length
        max_seq_length = dataset_args.max_seq_length
        if max_seq_length is not None:
            if max_seq_length > self.tokenizer.model_max_length:
                logger.warning(
                    f"The max_seq_length passed ({max_seq_length}) is larger "
                    f"than maximum length for model "
                    f"({self.tokenizer.model_max_length}). "
                    f"Using max_seq_length={self.tokenizer.model_max_length}."
                )
            self.max_seq_length = min(
                max_seq_length, self.tokenizer.model_max_length
            )
        else:
            self.max_seq_length = self.tokenizer.model_max_length

        # configure padding
        self.padding = (
            False
            if self.dataset_args.concatenate_data
            else "max_length"
            if self.dataset_args.pad_to_max_length
            else False
        )
    else:
        self.max_seq_length = None
        self.padding = False
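The constructor above makes three decisions: which object acts as the tokenizer, what the effective sequence length is, and which padding mode applies. The following is a minimal standalone sketch of that logic, not the library's code. The helper names (`resolve_tokenizer`, `resolve_max_seq_length`, `resolve_padding`) are hypothetical, and the tokenizer and processor are `SimpleNamespace` stand-ins with assumed values rather than real Hugging Face objects.

```python
from types import SimpleNamespace
from typing import Optional, Union


def resolve_tokenizer(processor):
    """Return processor.tokenizer if it exists, otherwise the processor itself."""
    return getattr(processor, "tokenizer", processor)


def resolve_max_seq_length(requested: Optional[int], model_max_length: int) -> int:
    """Clamp the requested length to the model maximum; default to the maximum."""
    if requested is None:
        return model_max_length
    return min(requested, model_max_length)


def resolve_padding(concatenate_data: bool, pad_to_max_length: bool) -> Union[bool, str]:
    """Concatenation disables padding; otherwise pad only when requested."""
    if concatenate_data:
        return False
    return "max_length" if pad_to_max_length else False


# Stand-in objects (assumptions for illustration, not real library classes).
tokenizer = SimpleNamespace(pad_token=None, eos_token="</s>", model_max_length=2048)
processor = SimpleNamespace(tokenizer=tokenizer)

resolved = resolve_tokenizer(processor)

# Fall back to the end-of-sequence token when no pad token is set.
if not resolved.pad_token:
    resolved.pad_token = resolved.eos_token

# A request larger than the model maximum is clamped down to the maximum.
effective_length = resolve_max_seq_length(4096, resolved.model_max_length)
padding_mode = resolve_padding(concatenate_data=False, pad_to_max_length=True)
print(resolved.pad_token, effective_length, padding_mode)
```

Note that `concatenate_data` takes precedence: when it is set, padding is disabled regardless of `pad_to_max_length`.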