GLM-OCR Usage Guide¶
This guide describe how to run GLM-OCR with MTP support in vllm.
Installing vllm¶
# install the nightly build of vLLM for GLM-4.7
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
# install transformers from source
uv pip install git+https://github.com/huggingface/transformers.git
Running GLM-OCR with MTP¶
GLM-OCR model include built-in Multi-Token Prediction (MTP) layers that can be used for speculative decoding to accelerate generation throughput.
Add the --speculative-config flags to server command to enable MTP speculative decoding:
vllm serve zai-org/GLM-OCR \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1
CURL Usage¶
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-OCR",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
}
},
{
"type": "text",
"text": "Text Recognition:"
}
]
}
],
"max_tokens": 2048,
"temperature": 0.0
}'
OpenAI SDK Client Usage¶
import time
from openai import OpenAI
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
timeout=3600
)
messages = [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
}
},
{
"type": "text",
"text": "Text Recognition:"
}
]
}
]
start = time.time()
response = client.chat.completions.create(
model="zai-org/GLM-OCR",
messages=messages,
max_tokens=2048,
temperature=0.0
)
print(f"Response costs: {time.time() - start:.2f}s")
print("Generated text:")
print(response.choices[0].message.content)
Notes¶
- Transformers Version: This model requires
transformers >= 5.0.0for optimal compatibility.