LFM2.5 Usage Guide¶

LFM2.5 is Liquid AI's family of small, efficient open-weight models built on the LFM2 hybrid backbone, which interleaves short-range gated convolution blocks with grouped-query attention. Because only the attention layers hold a KV cache, the cache stays small and decode stays fast, so LFM2.5 models remain cheap to serve, down to edge / on-device GPUs. The family spans dense chat models (230M, 350M, 1.2B), a mixture-of-experts model (8B-A1B), reasoning and Japanese variants, base checkpoints, and vision-language models (VL 450M / 1.6B), all served through vLLM's OpenAI-compatible API.

All LFM2.5 language and vision-language models are supported natively by vLLM (the Lfm2ForCausalLM, Lfm2MoeForCausalLM, and Lfm2VlForConditionalGeneration architectures). Tool calling is handled by the built-in lfm2 tool parser, and <think> reasoning by the qwen3 reasoning parser.

Supported Models¶

Dense Chat Models¶

Model	Parameters	Min NVIDIA GPU (BF16)	Context	Tools	HuggingFace
LFM2.5 230M	230M	1× (any)	32K	✓	LiquidAI/LFM2.5-230M
LFM2.5 350M	350M	1× (any)	32K	✓	LiquidAI/LFM2.5-350M
LFM2.5 1.2B Instruct	1.2B	1× (8 GB+)	32K	✓	LiquidAI/LFM2.5-1.2B-Instruct
LFM2.5 1.2B Thinking	1.2B	1× (8 GB+)	32K	✓ (+reasoning)	LiquidAI/LFM2.5-1.2B-Thinking
LFM2.5 1.2B JP	1.2B	1× (8 GB+)	32K	–	LiquidAI/LFM2.5-1.2B-JP
LFM2.5 1.2B JP (202606)	1.2B	1× (8 GB+)	32K	✓	LiquidAI/LFM2.5-1.2B-JP-202606
LFM2.5 1.2B Base	1.2B	1× (8 GB+)	32K	– (completions)	LiquidAI/LFM2.5-1.2B-Base

Mixture-of-Experts (MoE) Model¶

Model	Total / Active Params	Min NVIDIA GPU (BF16)	Context	HuggingFace
LFM2.5 8B-A1B	8B / ~1B active	1× (24 GB+)	128K	LiquidAI/LFM2.5-8B-A1B

The 8B-A1B keeps every expert resident in VRAM, so size the GPU for the full ~8B of weights even though only ~1B is active per token. It supports <think> reasoning and tool calling.
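As a rough sizing check, BF16 stores two bytes per parameter, so resident weight memory scales with the total parameter count rather than the active count. The sketch below is back-of-the-envelope arithmetic only: the helper name `weight_gib` is illustrative, and it ignores the KV cache, convolution state, activations, and CUDA graph overhead, all of which need headroom on top.

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (BF16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


# All ~8B parameters stay resident, even though only ~1B are active per token.
total_gib = weight_gib(8e9)   # full expert set held in VRAM
active_gib = weight_gib(1e9)  # what the active parameters alone would occupy

print(f"resident weights: ~{total_gib:.1f} GiB")
print(f"active per token: ~{active_gib:.1f} GiB")
```

Under these assumptions the weights alone come to roughly 15 GiB, which is consistent with the 24 GB+ recommendation in the table.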

Vision-Language Models¶

Model	Parameters	Min NVIDIA GPU (BF16)	Context	HuggingFace
LFM2.5 VL 450M	450M	1× (any)	32K	LiquidAI/LFM2.5-VL-450M
LFM2.5 VL 1.6B	1.6B	1× (8 GB+)	32K	LiquidAI/LFM2.5-VL-1.6B

The VL models pair the LFM2 hybrid language backbone with a SigLIP2 vision encoder (Lfm2VlForConditionalGeneration).

Installing vLLM¶

LFM2.5's dense, MoE, and VL architectures (Lfm2ForCausalLM, Lfm2MoeForCausalLM, Lfm2VlForConditionalGeneration) run on vLLM 0.23.0, the version pinned by min_vllm_version.

pip (NVIDIA CUDA)¶

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Docker¶

docker pull vllm/vllm-openai:latest        # CUDA 12.x
docker pull vllm/vllm-openai:latest-cu130  # CUDA 13.0 (Blackwell)

Running LFM2.5¶

Quick Start (Single GPU)¶

vllm serve LiquidAI/LFM2.5-1.2B-Instruct

Cap the context to fit a smaller GPU. Models support up to 32K (128K on 8B-A1B), so pass a value below that limit, for example 8K:

vllm serve LiquidAI/LFM2.5-1.2B-Instruct --max-model-len 8192

Multi-GPU¶

Every LFM2.5 model fits on a single GPU (the 8B-A1B MoE needs 24 GB or more), so tensor parallelism is not needed for capacity and gives no meaningful speedup: at this model size there is too little work per GPU for the split to pay off. You can still try it on a multi-GPU node, but expect no throughput gain:

vllm serve LiquidAI/LFM2.5-8B-A1B --tensor-parallel-size 2

Docker Deployment¶

docker run -itd --name lfm2.5 \
    --ipc=host --network host --shm-size 16G --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
        --model LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 --port 8000

Recommended Sampling¶

LFM2.5 uses per-model sampling presets from the model cards. top_k, min_p, and repetition_penalty are vLLM extra sampling params — pass them via extra_body on the OpenAI client (or top-level in a raw /v1/... request).

Model	temperature	top_k	min_p	repetition_penalty
LFM2.5 230M	0.1	50	–	1.05
LFM2.5 350M	0.1	50	–	1.05
LFM2.5 1.2B Instruct	0.1	50	–	1.05
LFM2.5 1.2B Thinking	0.05	50	–	1.05
LFM2.5 1.2B JP	0.3	–	0.15	1.05
LFM2.5 1.2B JP (202606)	0.1	50	–	1.05
LFM2.5 1.2B Base	0.3	–	0.15	1.05
LFM2.5 8B-A1B	0.2	80	–	1.05
LFM2.5 VL 450M / 1.6B	0.1	–	0.15	1.05

⚠️ Do not bake these into vllm serve — they are per-request client defaults, not server flags. Leave max_tokens unset — capping it truncates the reasoning models' chain-of-thought.
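One way to keep these presets on the client side is a small lookup that splits each row into the standard `temperature` argument and the vLLM extras that go in `extra_body`. This is a minimal sketch: the values are copied from the table above, while the names `PRESETS` and `sampling_kwargs` are illustrative and not part of vLLM or the OpenAI SDK.

```python
# Presets copied from the table above; "–" cells are simply omitted.
_TOP_K = {"temperature": 0.1, "top_k": 50, "repetition_penalty": 1.05}
_MIN_P = {"temperature": 0.3, "min_p": 0.15, "repetition_penalty": 1.05}

PRESETS = {
    "LiquidAI/LFM2.5-230M": dict(_TOP_K),
    "LiquidAI/LFM2.5-350M": dict(_TOP_K),
    "LiquidAI/LFM2.5-1.2B-Instruct": dict(_TOP_K),
    "LiquidAI/LFM2.5-1.2B-Thinking": {**_TOP_K, "temperature": 0.05},
    "LiquidAI/LFM2.5-1.2B-JP": dict(_MIN_P),
    "LiquidAI/LFM2.5-1.2B-JP-202606": dict(_TOP_K),
    "LiquidAI/LFM2.5-1.2B-Base": dict(_MIN_P),
    "LiquidAI/LFM2.5-8B-A1B": {"temperature": 0.2, "top_k": 80, "repetition_penalty": 1.05},
    "LiquidAI/LFM2.5-VL-450M": {**_MIN_P, "temperature": 0.1},
    "LiquidAI/LFM2.5-VL-1.6B": {**_MIN_P, "temperature": 0.1},
}

# Parameters the OpenAI client accepts directly; everything else is a vLLM extra.
OPENAI_NATIVE = {"temperature"}


def sampling_kwargs(model: str) -> dict:
    """Return keyword arguments for the OpenAI client, with vLLM extras in extra_body."""
    preset = PRESETS[model]
    kwargs = {k: v for k, v in preset.items() if k in OPENAI_NATIVE}
    kwargs["extra_body"] = {k: v for k, v in preset.items() if k not in OPENAI_NATIVE}
    return kwargs  # max_tokens is deliberately left unset


print(sampling_kwargs("LiquidAI/LFM2.5-8B-A1B"))
```

The result can be splatted into a request, for example `client.chat.completions.create(model=m, messages=msgs, **sampling_kwargs(m))`.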

Text Generation¶

Online Serving (OpenAI SDK)¶

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is C. elegans? Answer in one sentence."}],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
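For clients that do not use the OpenAI SDK, the same request can be sent as raw JSON. In that case the vLLM extras sit at the top level of the body rather than under `extra_body`. The sketch below uses only the standard library and assumes the server from the quick start is listening on `localhost:8000`; `build_chat_request` is an illustrative helper name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        # vLLM extras go at the top level of a raw request (no extra_body wrapper).
        "top_k": 50,
        "repetition_penalty": 1.05,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "LiquidAI/LFM2.5-1.2B-Instruct",
    "What is C. elegans? Answer in one sentence.",
)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```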

Base Model (completions)¶

LFM2.5-1.2B-Base is not instruction-tuned and has no chat template — use the completions endpoint:

resp = client.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Base",
    prompt="The three laws of thermodynamics are:",
    temperature=0.3,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(resp.choices[0].text)
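Because a base model only continues text, any task framing has to live in the prompt itself. The sketch below shows one common approach, a few-shot prompt that ends on an open answer slot. The `Q:` / `A:` layout and the helper name `few_shot_prompt` are assumptions made for illustration, not a format prescribed by the model card.

```python
def few_shot_prompt(examples, query):
    """Format (question, answer) pairs followed by an open-ended final question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {query}\nA:")  # leave the last answer for the model to complete
    return "\n\n".join(blocks)


prompt = few_shot_prompt(
    [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ],
    "What is the capital of Italy?",
)
print(prompt)
```

The resulting string can be passed as `prompt=` to `client.completions.create`; adding a `stop` sequence such as `"\n\n"` is one way to keep the model from continuing into further invented questions.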

Reasoning Mode¶

LFM2.5-8B-A1B and LFM2.5-1.2B-Thinking emit an explicit <think>…</think> chain-of-thought. Launch with the qwen3 reasoning parser to split it into a separate reasoning_content field:

vllm serve LiquidAI/LFM2.5-1.2B-Thinking \
  --reasoning-parser qwen3

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Thinking",
    messages=[{"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its speed in km/h?"}],
    temperature=0.05,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
msg = response.choices[0].message
print("reasoning:", msg.reasoning_content)
print("answer:", msg.content)

ℹ️ These models open the <think> channel for non-trivial problems; a trivial prompt may be answered directly, in which case reasoning_content is empty. That's expected behavior.
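If the server was started without `--reasoning-parser`, the `<think>` block arrives inline in `content`. A client-side fallback can separate it, as in the minimal sketch below. This is not how the qwen3 parser is implemented; it is an illustrative helper (`split_think` is a made-up name) that assumes a single, properly closed `<think>…</think>` block and does not handle truncated output.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str):
    """Return (reasoning, answer); reasoning is None when no <think> block is present."""
    match = _THINK.search(text)
    if match is None:
        # Trivial prompts may be answered directly, with no reasoning channel.
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


print(split_think("<think>60 / 0.75 = 80</think>\nThe speed is 80 km/h."))
print(split_think("Paris."))
```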

Function Calling / Tool Use¶

LFM2.5 emits Pythonic tool calls wrapped in <|tool_call_start|>…<|tool_call_end|>. The built-in lfm2 tool parser converts these into standard OpenAI tool_calls:

vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser lfm2

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.tool_calls)
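The returned `tool_calls` still have to be executed by the client and fed back to the model. The sketch below shows that step on plain dictionaries in the standard OpenAI tool-call shape, so it runs without a server. The `get_weather` body, the `REGISTRY` mapping, the `run_tool_calls` helper, and the `call_0` id are all hypothetical stand-ins.

```python
import json


def get_weather(location: str) -> dict:
    """Hypothetical stand-in; a real tool would query a weather service."""
    return {"location": location, "forecast": "sunny", "temp_c": 21}


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=REGISTRY):
    """Execute OpenAI-style tool calls (as dicts) and return `tool` role messages."""
    results = []
    for call in tool_calls:
        fn = registry[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


# Shape of one tool call after converting the SDK object to a dict; the id is made up.
example = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Paris"}'},
}]
print(run_tool_calls(example))
```

To finish the turn, append the assistant message that carried the tool calls, then these `tool` messages, to `messages` and call `client.chat.completions.create` again with the same `tools` list.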

Reasoning and tool calling can be combined for the reasoning models:

vllm serve LiquidAI/LFM2.5-8B-A1B \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser lfm2

Image Understanding (VL)¶

The VL models accept image + text turns through the standard chat API:

vllm serve LiquidAI/LFM2.5-VL-1.6B

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
    temperature=0.1,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)

Allow more than one image per request with --limit-mm-per-prompt '{"image": 4}'.
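Local files can be sent through the same `image_url` content part by encoding them as a base64 data URL. The helper below is a minimal sketch using only the standard library; the name `image_part` is illustrative, and it infers the MIME type from the file extension rather than inspecting the file contents.

```python
import base64
import mimetypes


def image_part(path: str) -> dict:
    """Build an image_url content part from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


# Usage inside a chat message (paths are placeholders):
# content = [image_part("a.jpg"), image_part("b.jpg"),
#            {"type": "text", "text": "What differs between these images?"}]
```

Sending more than one part per request requires the `--limit-mm-per-prompt` setting described above.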

Benchmarking¶

Disable prefix caching for consistent measurements:

vllm serve LiquidAI/LFM2.5-8B-A1B \
  --no-enable-prefix-caching

vllm bench serve \
  --model LiquidAI/LFM2.5-8B-A1B \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 64 \
  --ignore-eos
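With `--ignore-eos`, every request generates exactly `--random-output-len` tokens, so the total output is known in advance and throughput reduces to a division by wall-clock time. The sketch below only illustrates that arithmetic; the function names are illustrative and the 40-second duration is a placeholder, not a measured result.

```python
def expected_output_tokens(num_prompts: int, output_len: int) -> int:
    """Total generated tokens when every request runs to its full output length."""
    return num_prompts * output_len


def output_throughput(num_prompts: int, output_len: int, duration_s: float) -> float:
    """Output tokens per second over the whole run."""
    return expected_output_tokens(num_prompts, output_len) / duration_s


# 64 prompts x 1000 output tokens, matching the command above.
total = expected_output_tokens(64, 1000)

# Placeholder duration purely to show the calculation; substitute your own measurement.
placeholder_duration_s = 40.0
print(total, output_throughput(64, 1000, placeholder_duration_s))
```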

Speed-of-Light Tuning¶

The defaults are already strong for LFM2.5: every model fits on a single GPU (so tensor parallelism gives no speedup) and the standard serve path is well optimized. A few flags give a small extra gain:

Flag	Effect	Notes
`VLLM_USE_OINK_OPS=1`	~+2.6% on B200	Routes RMSNorm to the Blackwell `oink` kernels (bundled in the `vllm-openai` image); output identical. Auto-enabled on Blackwell by these recipes; inert (native fallback) elsewhere.
`--optimization-level 3`	~+2% (all GPUs)	More aggressive compilation. Trades a longer one-time startup/compile for steady-state throughput — opt-in.
`VLLM_USE_FASTOKENS=1`	lower TTFT on tokenization-bound loads	Swaps the HF fast-tokenizer Rust BPE backend for the `fastokens` shim (`pip install fastokens`). Helps high-QPS / long-prompt workloads; the gain is in tokenization latency, so it doesn't show up in steady-state decode throughput.

# -O3 (opt-in); on Blackwell, VLLM_USE_OINK_OPS=1 is applied for you by the recipe
vllm serve LiquidAI/LFM2.5-1.2B-Instruct --optimization-level 3

# tokenization-bound serving (after `pip install fastokens`)
VLLM_USE_FASTOKENS=1 vllm serve LiquidAI/LFM2.5-1.2B-Instruct

These knobs make little difference for single-GPU LFM2.5, but experiment if you like: VLLM_SSM_CONV_STATE_LAYOUT (SD vs DS), --mamba-backend (triton vs flashinfer), --mamba-cache-mode (none vs all), --mm-processor-cache-type (lru vs shm).

Coming: the Lfm2VL encoder CUDA graph (vllm-project/vllm#44930, ~10–20% lower e2e latency at low batch) is not in 0.23.0. It will be added to this guide once it ships in a stable release.

Server Flags Reference¶

Flag	Description	When
`--reasoning-parser qwen3`	Split `<think>…</think>` into `reasoning_content`	8B-A1B, 1.2B-Thinking
`--tool-call-parser lfm2`	Surface Pythonic tool calls as `tool_calls`	tool-capable models
`--enable-auto-tool-choice`	Auto-detect tool calls in output	with `--tool-call-parser`
`--max-model-len N`	Cap context (up to 32K; 128K on 8B-A1B)	small GPUs / fixed workload
`--limit-mm-per-prompt '{"image": N}'`	Max images per request	VL models
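The flags in this table compose by model capability, which can be captured in a small command builder. This is a sketch under the assumption that you launch `vllm serve` from a script; `serve_command` and its keyword names are illustrative, and only flags listed in the table are emitted.

```python
import json
import shlex


def serve_command(model, reasoning=False, tools=False, max_model_len=None, max_images=None):
    """Assemble a `vllm serve` argument list from the flags in the reference table."""
    args = ["vllm", "serve", model]
    if reasoning:  # 8B-A1B and 1.2B-Thinking
        args += ["--reasoning-parser", "qwen3"]
    if tools:  # tool-capable models; both flags are needed together
        args += ["--enable-auto-tool-choice", "--tool-call-parser", "lfm2"]
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    if max_images is not None:  # VL models
        args += ["--limit-mm-per-prompt", json.dumps({"image": max_images})]
    return args


cmd = serve_command("LiquidAI/LFM2.5-8B-A1B", reasoning=True, tools=True)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting problems with the JSON argument.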

Modal runs this recipe on cloud GPUs with a single command — no infrastructure to manage. The deployment script is lfm25-modal.py in this directory: it serves an LFM2.5 model with vLLM behind an OpenAI-compatible endpoint, with the model and GPU selectable via environment variables.

Deploy¶

pip install modal
modal setup                  # one-time: authenticate with Modal
modal deploy lfm25-modal.py  # serves LiquidAI/LFM2.5-1.2B-Instruct on an L4 by default

Test¶

modal run lfm25-modal.py

Pick a model / GPU¶

MODEL=LiquidAI/LFM2.5-8B-A1B GPU=H100 modal run lfm25-modal.py

LFM2.5's small footprint means even a budget GPU is plenty — the 1.2B dense checkpoint runs across NVIDIA T4, L4, A10G, L40S, A100 (40/80 GB), H100, H200, and B200; the 8B-A1B MoE and the VL models run on H100, H200, and B200. Size up to a 24 GB+ GPU (L4 / A10G or larger) for the 8B-A1B MoE, which keeps all ~8B of experts resident in VRAM.
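The deployment script itself is not reproduced here. As an illustration only, the sketch below shows one way `MODEL` and `GPU` environment variables could be resolved, falling back to the defaults mentioned above (LFM2.5-1.2B-Instruct on an L4). The function name `resolve_config` is hypothetical and the actual `lfm25-modal.py` may work differently.

```python
import os

DEFAULT_MODEL = "LiquidAI/LFM2.5-1.2B-Instruct"
DEFAULT_GPU = "L4"


def resolve_config(env=None):
    """Return (model, gpu) from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return env.get("MODEL", DEFAULT_MODEL), env.get("GPU", DEFAULT_GPU)


# Equivalent to: MODEL=LiquidAI/LFM2.5-8B-A1B GPU=H100 modal run lfm25-modal.py
print(resolve_config({"MODEL": "LiquidAI/LFM2.5-8B-A1B", "GPU": "H100"}))
```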