GLM-GA Usage Guide

This guide describes how to run GLM-GA for image and video understanding with vLLM.

GLM-GA is a dense vision-language model (~10B parameters) based on the GLM-4.6V-Flash architecture. It uses dedicated GlmgaImageProcessor and GlmgaVideoProcessor sub-processors. The key difference from GLM-4.6V is in video processing: GLM-GA samples at a fixed 2 fps and supports up to 640 frames, enabling long-video understanding.

Installing vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/huggingface/transformers.git    # Installed from main branch

Running GLM-GA on a single H100/H200

export VLLM_VIDEO_LOADER_BACKEND=glm4_6v
vllm serve zai-org/GLM-GA \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --allowed-local-media-path / \
     --mm-processor-cache-type shm
  • By default, vLLM uses 90% of GPU memory; set --gpu-memory-utilization=0.95 to leave more room for the KV cache.
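
Loading the model can take a while, so a client script may want to wait until the server is ready before sending requests. The following is a minimal sketch that polls vLLM's /health endpoint; the function name wait_for_server and its timeout and interval defaults are illustrative assumptions, and the URL assumes the default port 8000 used in the client examples in this guide.

```python
import time
import urllib.request


def wait_for_server(
    health_url: str = "http://localhost:8000/health",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll the health endpoint until it answers 200 or the timeout expires.

    Hypothetical helper: the name and default values are assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors all land here
            # while the server is still starting up.
            pass
        time.sleep(interval_s)
    return False
```

Calling wait_for_server() before creating the OpenAI client avoids connection errors during startup; a False return means the server did not become healthy within the timeout.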

Client Usage

Image Understanding

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-GA",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "Describe the image."}
        ]
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
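
When the image is a local file rather than a public URL, one option is to embed it in the request as a base64 data URL. The sketch below shows one way to build such a message; the helper names image_to_data_url and image_message are hypothetical, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, prompt: str) -> dict:
    """Build a user message with the same shape as the example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dict can be passed as the single element of messages in client.chat.completions.create. Base64 inflates the payload by roughly a third, so this suits individual images better than large batches.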

Video Understanding

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-GA",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
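
Because the server above is started with --allowed-local-media-path /, it can also read videos from its own filesystem through file:// URLs. The sketch below converts a path into such a URL; local_video_url is a hypothetical helper name, and the path must exist on the machine running the server, which is not necessarily the client machine.

```python
from pathlib import Path


def local_video_url(path: str) -> str:
    """Turn a filesystem path into an absolute file:// URL (hypothetical helper)."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


def video_message(path: str, prompt: str) -> dict:
    """Build a user message with the same shape as the example above."""
    return {
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": local_video_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

Unlike a data URL, this keeps the request small, since only the path is sent and the server reads the file itself.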

Video Processing Details

GLM-GA uses a dedicated GlmgaVideoProcessor whose video handling differs from that of GLM-4.6V:

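| Feature            | GLM-4.6V                      | GLM-GA                     |
|--------------------|-------------------------------|----------------------------|
| FPS                | Dynamic (3/1/0.5 by duration) | Fixed 2 fps                |
| Max frames         | 640                           | 640                        |
| Max pixels (video) | 47M                           | 87M                        |
| Frame upsampling   | Duration-based                | math.floor aligned with HF |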
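
As a rough illustration of what a fixed 2 fps rate and a 640-frame cap imply, the sketch below estimates how many frames a video of a given length would yield. It assumes the count is floor(duration × fps) clamped to the cap; the actual GlmgaVideoProcessor may apply further rules (for example minimum frame counts or temporal grouping) that are not described here, and the function names are hypothetical.

```python
import math

FPS = 2.0         # fixed sampling rate from the table above
MAX_FRAMES = 640  # frame cap from the table above


def estimate_sampled_frames(
    duration_s: float, fps: float = FPS, max_frames: int = MAX_FRAMES
) -> int:
    """Estimate the sampled frame count (assumption: floor, then clamp to the cap)."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return min(math.floor(duration_s * fps), max_frames)


def max_duration_at_full_rate(fps: float = FPS, max_frames: int = MAX_FRAMES) -> float:
    """Longest video, in seconds, that fits under the cap at the full sampling rate."""
    return max_frames / fps
```

Under these assumptions a 60-second clip yields 120 frames, and the cap is reached at 320 seconds (a little over five minutes); longer videos would have to be represented by at most 640 frames.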

Troubleshooting

  • Long-context memory: At 128K context, tune --max-num-batched-tokens and --gpu-memory-utilization to prevent OOM.
  • Video loading errors: Ensure OpenCV or PyAV is installed for video decoding.
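
To check the video-decoding dependency quickly, the sketch below reports whether the OpenCV (cv2) and PyAV (av) modules are importable. Run it in the same environment as the vLLM server; available_video_decoders is a hypothetical helper name.

```python
import importlib.util


def available_video_decoders() -> dict[str, bool]:
    """Report whether the cv2 (OpenCV) and av (PyAV) modules can be found."""
    # find_spec returns None for a top-level module that is not installed.
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("cv2", "av")
    }


if __name__ == "__main__":
    for module, present in available_video_decoders().items():
        print(f"{module}: {'installed' if present else 'missing'}")
```

If both entries are reported missing, installing one of the two packages into the server's virtual environment should be the first thing to try.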