vllm_omni.tokenizers.mammoth_moda2_tokenizer ¶
PAT_STR module-attribute ¶
PAT_STR = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
QWEN_SPECIAL_TOKENS module-attribute ¶
QWEN_SPECIAL_TOKENS = (
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>",
"<tool_call>",
"</tool_call>",
"<|fim_prefix|>",
"<|fim_middle|>",
"<|fim_suffix|>",
"<|fim_pad|>",
"<|repo_name|>",
"<|file_sep|>",
)
VOCAB_FILES_NAMES module-attribute ¶
VOCAB_FILES_NAMES = {
"vocab_file": "mammothu.tiktoken",
"special_tokens_file": "mammothu_vision_tokens.txt",
}
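The `.tiktoken` extension suggests the vocabulary file uses the plain-text layout common to tiktoken-style tokenizers, with one `<base64 token> <rank>` pair per line. This page does not confirm that layout, so treat it as an assumption. Under that assumption, a minimal loader might look like the sketch below; `load_tiktoken_bpe_text` is a hypothetical helper, not part of this module.

```python
import base64


def load_tiktoken_bpe_text(contents):
    """Parse tiktoken-style vocab text into a bytes -> rank mapping.

    Assumption: every non-empty line is "<base64-encoded token> <integer rank>".
    """
    ranks = {}
    for line in contents.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Tiny in-memory sample used in place of the real "mammothu.tiktoken" file.
sample = "aGVsbG8= 0\nIHdvcmxk 1\n"
ranks = load_tiktoken_bpe_text(sample)
```

Keeping the keys as `bytes` matters because a BPE vocabulary can contain fragments that are not valid UTF-8 on their own.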
MammothUTokenizer ¶
Bases: PreTrainedTokenizer
MammothU tokenizer.
gen_image_placeholder_token instance-attribute ¶
special_tokens instance-attribute ¶
visual_tokens instance-attribute ¶
visual_tokens_ids instance-attribute ¶
add_special_tokens ¶
Adds special tokens to the tokenizer and updates the special tokens mapping. Only tokens that are already present in `special_tokens_set` are added.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| special_tokens_dict | `dict[str, str \| AddedToken]` | Dictionary of special tokens to add. The key is the token type and the value is the token to add. | required |
Returns:
| Type | Description |
|---|---|
| int | Number of tokens added to the vocabulary. |
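The filtering rule described above can be illustrated with a standalone sketch. The function name, its extra arguments, and the choice to skip unknown tokens silently are all assumptions made for illustration; the real method may log, raise, or count differently.

```python
def add_known_special_tokens(special_tokens_dict, special_tokens_set, special_tokens_map):
    """Register only tokens whose surface form is already in special_tokens_set.

    Hypothetical helper: returns how many entries were accepted.
    """
    added = 0
    for token_type, token in special_tokens_dict.items():
        surface = str(token)  # also covers AddedToken-like objects that define __str__
        if surface not in special_tokens_set:
            continue  # assumption: unknown surface forms are ignored
        special_tokens_map[token_type] = surface
        added += 1
    return added


known = {"<|vision_start|>", "<|vision_end|>"}
mapping = {}
count = add_known_special_tokens(
    {"bos_token": "<|vision_start|>", "eos_token": "<|not_in_vocab|>"},
    known,
    mapping,
)
```

In this sketch only `bos_token` is registered, because `<|not_in_vocab|>` is not part of the known set.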
bytes_to_str ¶
Convert byte tokens to string representation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| byte_tokens | dict | A dictionary whose keys are `bytes` objects and whose values are integers, or a single `bytes` object. | required |
Returns:
| Type | Description |
|---|---|
| str | If the input is a dictionary, returns a new dictionary with the `bytes` keys converted to strings. |
| str | If the input is a single `bytes` object, returns its string representation. |
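The two input shapes can be handled with a single type check. The sketch below assumes UTF-8 decoding with replacement of invalid bytes; the decoding actually used by `bytes_to_str` is not stated on this page, and `bytes_to_str_sketch` is a hypothetical name.

```python
def bytes_to_str_sketch(byte_tokens, encoding="utf-8"):
    """Convert bytes keys (or a single bytes object) to str.

    Assumption: invalid byte sequences are replaced rather than raising.
    """
    if isinstance(byte_tokens, dict):
        # Build a new dictionary; the input mapping is left untouched.
        return {
            key.decode(encoding, errors="replace"): value
            for key, value in byte_tokens.items()
        }
    return byte_tokens.decode(encoding, errors="replace")


vocab_as_text = bytes_to_str_sketch({b"hi": 0, b" there": 1})
single = bytes_to_str_sketch(b"caf\xc3\xa9")
```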
convert_tokens_to_ids ¶
convert_tokens_to_string ¶
Converts a sequence of tokens into a single string.
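Because `tokenize` returns a mix of `bytes` and `str` tokens, joining them has to buffer consecutive byte tokens before decoding, otherwise a multi-byte character split across two tokens would be corrupted. The following is a sketch of that idea, assuming UTF-8 and that `str` tokens are special tokens to be copied verbatim; `tokens_to_string` is a hypothetical name and not the method's actual implementation.

```python
def tokens_to_string(tokens):
    """Join a list of bytes/str tokens into one string."""
    buffer = b""   # pending byte tokens not yet decoded
    pieces = []
    for token in tokens:
        if isinstance(token, bytes):
            buffer += token
            continue
        # A str token ends the current byte run: flush it first.
        if buffer:
            pieces.append(buffer.decode("utf-8", errors="replace"))
            buffer = b""
        pieces.append(token)
    if buffer:
        pieces.append(buffer.decode("utf-8", errors="replace"))
    return "".join(pieces)


# The two-byte character is split across two byte tokens on purpose.
text = tokens_to_string([b"caf", b"\xc3\xa9", "<|vision_start|>", b" ok"])
```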
save_vocabulary ¶
tokenize ¶
tokenize(
text: str,
allowed_special: set | str = "all",
disallowed_special: Collection | str = (),
**kwargs,
) -> list[bytes | str]
Converts a string into a sequence of tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str` | The sequence to be encoded. | required |
| allowed_special | `Literal["all"]` or `set` | The surface forms of the tokens to be encoded as special tokens in regular text. Defaults to "all". | 'all' |
| disallowed_special | `Literal["all"]` or `Collection` | The surface forms of the tokens that must not appear in regular text and that trigger an error if they do. Defaults to an empty tuple. | () |
| kwargs | additional keyword arguments, *optional* | Passed to the underlying model-specific encode method. | {} |
Returns:
| Type | Description |
|---|---|
| `list[bytes \| str]` | The sequence of tokens produced from `text`. |
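The interplay of `allowed_special` and `disallowed_special` can be sketched with the standard library alone. The helper below only performs the special-token split that would precede BPE encoding, and it assumes tiktoken-like semantics ("all" for `disallowed_special` meaning every special token that is not allowed). `split_on_special` and its `special_tokens` argument are hypothetical and are not part of `MammothUTokenizer`.

```python
import re


def split_on_special(text, special_tokens, allowed_special="all", disallowed_special=()):
    """Split text into ordinary chunks and allowed special-token surface forms."""
    specials = set(special_tokens)
    allowed = specials if allowed_special == "all" else set(allowed_special) & specials
    if disallowed_special == "all":
        disallowed = specials - allowed
    else:
        disallowed = set(disallowed_special)

    # Disallowed surface forms in the input are treated as errors.
    for token in disallowed:
        if token in text:
            raise ValueError(f"disallowed special token in text: {token!r}")

    if not allowed:
        return [text] if text else []
    # Longest-first alternation so a longer surface form wins over a shorter prefix.
    ordered = sorted(allowed, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    return [piece for piece in re.split(pattern, text) if piece]


specials = ("<|vision_start|>", "<|vision_end|>")
pieces = split_on_special("a<|vision_start|>b<|vision_end|>", specials)
```

In a full tokenizer the ordinary chunks would then be BPE-encoded into `bytes` tokens, while the special tokens would stay as `str`, which matches the `list[bytes | str]` return type.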