LFM2.5 Usage Guide¶
LFM2.5 is Liquid AI's family of small, efficient open-weight models built on the LFM2 hybrid backbone — short-range gated convolution blocks interleaved with grouped-query attention. The hybrid design keeps the KV cache small and decode fast, so LFM2.5 models punch above their weight while remaining cheap to serve, down to edge / on-device GPUs. The family spans dense chat models (230M, 350M, 1.2B), a mixture-of-experts model (8B-A1B), reasoning and Japanese variants, base checkpoints, and vision-language models (VL 450M / 1.6B) — all served through vLLM's OpenAI-compatible API.
All LFM2.5 language and vision-language models are supported natively by vLLM (the
Lfm2ForCausalLM, Lfm2MoeForCausalLM, and Lfm2VlForConditionalGeneration architectures). Tool
calling is handled by the built-in lfm2 tool parser, and <think> reasoning by the qwen3
reasoning parser.
Supported Models¶
Dense Chat Models¶
| Model | Parameters | Min NVIDIA GPU (BF16) | Context | Tools | HuggingFace |
|---|---|---|---|---|---|
| LFM2.5 230M | 230M | 1× (any) | 32K | ✓ | LiquidAI/LFM2.5-230M |
| LFM2.5 350M | 350M | 1× (any) | 32K | ✓ | LiquidAI/LFM2.5-350M |
| LFM2.5 1.2B Instruct | 1.2B | 1× (8 GB+) | 32K | ✓ | LiquidAI/LFM2.5-1.2B-Instruct |
| LFM2.5 1.2B Thinking | 1.2B | 1× (8 GB+) | 32K | ✓ (+reasoning) | LiquidAI/LFM2.5-1.2B-Thinking |
| LFM2.5 1.2B JP | 1.2B | 1× (8 GB+) | 32K | – | LiquidAI/LFM2.5-1.2B-JP |
| LFM2.5 1.2B JP (202606) | 1.2B | 1× (8 GB+) | 32K | ✓ | LiquidAI/LFM2.5-1.2B-JP-202606 |
| LFM2.5 1.2B Base | 1.2B | 1× (8 GB+) | 32K | – (completions) | LiquidAI/LFM2.5-1.2B-Base |
Mixture-of-Experts (MoE) Model¶
| Model | Total / Active Params | Min NVIDIA GPU (BF16) | Context | HuggingFace |
|---|---|---|---|---|
| LFM2.5 8B-A1B | 8B / ~1B active | 1× (24 GB+) | 128K | LiquidAI/LFM2.5-8B-A1B |
The 8B-A1B keeps every expert resident in VRAM, so size the GPU for the full ~8B of weights even
though only ~1B is active per token. It supports <think> reasoning and tool calling.
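As a rough check on that sizing advice, the weight footprint can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: it assumes exactly 8e9 and 1e9 parameters (the guide gives approximate figures) and 2 bytes per BF16 parameter, and it ignores KV cache, activations, and runtime overhead.

```python
def bf16_weight_gib(total_params: float) -> float:
    """Approximate VRAM for weights alone, at 2 bytes per BF16 parameter."""
    bytes_per_param = 2
    return total_params * bytes_per_param / 2**30

# 8B-A1B: all ~8B parameters stay resident, not just the ~1B active per token
full = bf16_weight_gib(8e9)         # assumed round figure for "~8B"
active_only = bf16_weight_gib(1e9)  # assumed round figure for "~1B"
print(f"all experts: {full:.1f} GiB, active only: {active_only:.1f} GiB")
```

The full set of weights comes to roughly 15 GiB, which is why a 24 GB card is the practical floor once cache and overhead are added.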
Vision-Language Models¶
| Model | Parameters | Min NVIDIA GPU (BF16) | Context | HuggingFace |
|---|---|---|---|---|
| LFM2.5 VL 450M | 450M | 1× (any) | 32K | LiquidAI/LFM2.5-VL-450M |
| LFM2.5 VL 1.6B | 1.6B | 1× (8 GB+) | 32K | LiquidAI/LFM2.5-VL-1.6B |
The VL models pair the LFM2 hybrid language backbone with a SigLIP2 vision encoder
(Lfm2VlForConditionalGeneration).
Installing vLLM¶
LFM2.5's dense, MoE, and VL architectures (Lfm2ForCausalLM, Lfm2MoeForCausalLM,
Lfm2VlForConditionalGeneration) require vLLM 0.23.0 or later, the release this recipe pins as min_vllm_version.
pip (NVIDIA CUDA)¶
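After installing with pip (for example `pip install "vllm>=0.23.0"`), a quick check can confirm that the environment meets the minimum named above. The sketch below is illustrative: `parse_version` and `meets_minimum` are hypothetical helpers, and the parser only reads the leading numeric release, so a pre-release such as `0.23.0rc1` is treated the same as `0.23.0`.

```python
import re
from importlib import metadata

MIN_VLLM = (0, 23, 0)  # minimum release named in this guide

def parse_version(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '0.23.0+cu130' -> (0, 23, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())

def meets_minimum(version: str, minimum: tuple = MIN_VLLM) -> bool:
    """True when the installed release is at least the required one."""
    return parse_version(version) >= minimum

try:
    installed = metadata.version("vllm")
except metadata.PackageNotFoundError:
    print("vllm is not installed")
else:
    print(installed, "ok" if meets_minimum(installed) else "too old for LFM2.5")
```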
Docker¶
docker pull vllm/vllm-openai:latest # CUDA 12.x
docker pull vllm/vllm-openai:latest-cu130 # CUDA 13.0 (Blackwell)
Running LFM2.5¶
Quick Start (Single GPU)¶
Cap the context to fit a smaller GPU (models support up to 32K; 128K on 8B-A1B):
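One way to keep the launch command reproducible is to assemble it in Python. The sketch below uses only flags named in this guide; `serve_command` is a hypothetical helper and 8192 is an arbitrary illustrative cap, not a recommended value.

```python
import shlex
from typing import List, Optional

def serve_command(model: str, max_model_len: Optional[int] = None, port: int = 8000) -> List[str]:
    """Assemble a `vllm serve` argv from flags named in this guide."""
    argv = ["vllm", "serve", model, "--port", str(port)]
    if max_model_len is not None:
        # Cap the context window so the KV cache fits a smaller GPU
        argv += ["--max-model-len", str(max_model_len)]
    return argv

cmd = serve_command("LiquidAI/LFM2.5-1.2B-Instruct", max_model_len=8192)
print(shlex.join(cmd))
```

Paste the printed line into a shell, or pass `cmd` to `subprocess.run(cmd, check=True)` to launch the server from Python.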
Multi-GPU¶
Every LFM2.5 model fits on a single GPU (the 8B-A1B MoE needs 24 GB or more), so tensor parallelism gives no meaningful speedup: the models are small enough that sharding them does not help. You can still try it on a multi-GPU node, but expect no throughput gain.
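For anyone who wants to run that experiment, a minimal sketch of the launch command follows. It relies on vLLM's `--tensor-parallel-size` flag; `tp_serve_argv` is a hypothetical helper and the default of 2 GPUs is only an example.

```python
import shlex
from typing import List

def tp_serve_argv(model: str, tensor_parallel_size: int = 2) -> List[str]:
    """Assemble a `vllm serve` argv that shards the model across GPUs."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]

tp_cmd = tp_serve_argv("LiquidAI/LFM2.5-8B-A1B")
print(shlex.join(tp_cmd))
```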
Docker Deployment¶
docker run -itd --name lfm2.5 \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 --port 8000
Recommended Sampling¶
LFM2.5 uses per-model sampling presets from the model cards. top_k, min_p, and
repetition_penalty are vLLM extra sampling params — pass them via extra_body on the OpenAI
client (or top-level in a raw /v1/... request).
| Model | temperature | top_k | min_p | repetition_penalty |
|---|---|---|---|---|
| LFM2.5 230M | 0.1 | 50 | – | 1.05 |
| LFM2.5 350M | 0.1 | 50 | – | 1.05 |
| LFM2.5 1.2B Instruct | 0.1 | 50 | – | 1.05 |
| LFM2.5 1.2B Thinking | 0.05 | 50 | – | 1.05 |
| LFM2.5 1.2B JP | 0.3 | – | 0.15 | 1.05 |
| LFM2.5 1.2B JP (202606) | 0.1 | 50 | – | 1.05 |
| LFM2.5 1.2B Base | 0.3 | – | 0.15 | 1.05 |
| LFM2.5 8B-A1B | 0.2 | 80 | – | 1.05 |
| LFM2.5 VL 450M / 1.6B | 0.1 | – | 0.15 | 1.05 |
⚠️ Do not bake these into vllm serve: they are per-request client defaults, not server flags. Leave max_tokens unset; capping it truncates the reasoning models' chain-of-thought.
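Since the presets live on the client side, it can help to keep them in one lookup. The sketch below copies the values from the table above, keyed by the HuggingFace IDs from the model tables; `SAMPLING_PRESETS` and `sampling_kwargs` are hypothetical names, and a key is simply omitted wherever the table shows "–".

```python
from typing import Any, Dict

_TOP_K_50 = {"temperature": 0.1, "top_k": 50, "repetition_penalty": 1.05}
_MIN_P_03 = {"temperature": 0.3, "min_p": 0.15, "repetition_penalty": 1.05}
_VL = {"temperature": 0.1, "min_p": 0.15, "repetition_penalty": 1.05}

SAMPLING_PRESETS: Dict[str, Dict[str, float]] = {
    "LiquidAI/LFM2.5-230M": _TOP_K_50,
    "LiquidAI/LFM2.5-350M": _TOP_K_50,
    "LiquidAI/LFM2.5-1.2B-Instruct": _TOP_K_50,
    "LiquidAI/LFM2.5-1.2B-Thinking": {"temperature": 0.05, "top_k": 50, "repetition_penalty": 1.05},
    "LiquidAI/LFM2.5-1.2B-JP": _MIN_P_03,
    "LiquidAI/LFM2.5-1.2B-JP-202606": _TOP_K_50,
    "LiquidAI/LFM2.5-1.2B-Base": _MIN_P_03,
    "LiquidAI/LFM2.5-8B-A1B": {"temperature": 0.2, "top_k": 80, "repetition_penalty": 1.05},
    "LiquidAI/LFM2.5-VL-450M": _VL,
    "LiquidAI/LFM2.5-VL-1.6B": _VL,
}

def sampling_kwargs(model: str) -> Dict[str, Any]:
    """Split a preset into the OpenAI-native temperature and vLLM extras for extra_body."""
    preset = dict(SAMPLING_PRESETS[model])  # copy so the shared preset is not mutated
    kwargs: Dict[str, Any] = {"model": model, "temperature": preset.pop("temperature")}
    kwargs["extra_body"] = preset  # top_k / min_p / repetition_penalty
    return kwargs

print(sampling_kwargs("LiquidAI/LFM2.5-1.2B-JP"))
```

The result can be unpacked into a request, for example `client.chat.completions.create(messages=messages, **sampling_kwargs(model))`. It deliberately leaves max_tokens unset.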
Text Generation¶
Online Serving (OpenAI SDK)¶
from openai import OpenAI

# vLLM ignores the API key unless the server was started with one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is C. elegans? Answer in one sentence."}],
    temperature=0.1,
    # vLLM-specific sampling params travel in extra_body
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
Base Model (completions)¶
LFM2.5-1.2B-Base is not instruction-tuned and has no chat template — use the completions
endpoint:
# Reuses the client created in the previous example
resp = client.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Base",
    prompt="The three laws of thermodynamics are:",
    temperature=0.3,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(resp.choices[0].text)
Reasoning Mode¶
LFM2.5-8B-A1B and LFM2.5-1.2B-Thinking emit an explicit <think>…</think> chain-of-thought.
Launch the server with --reasoning-parser qwen3 to split it into a separate reasoning_content field, then query it as usual:
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Thinking",
    messages=[{"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its speed in km/h?"}],
    temperature=0.05,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
msg = response.choices[0].message
print("reasoning:", msg.reasoning_content)  # the <think> block, split out by the parser
print("answer:", msg.content)
ℹ️ These models open the <think> channel for non-trivial problems; a trivial prompt may be answered directly, in which case reasoning_content is empty. That's expected behavior.
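Because the reasoning field can be empty or absent, client code is safer if it does not assume the field is always a string. A minimal sketch, using stand-in objects shaped like `response.choices[0].message` rather than a live response; `split_reasoning` is a hypothetical helper:

```python
from types import SimpleNamespace

def split_reasoning(message):
    """Return (reasoning, answer), treating a missing or empty field as ''."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning.strip(), answer.strip()

# Stand-ins for real messages: one with a reasoning trace, one answered directly
with_thinking = SimpleNamespace(reasoning_content="60 km / 0.75 h = 80", content="80 km/h")
direct = SimpleNamespace(content="Paris")

print(split_reasoning(with_thinking))
print(split_reasoning(direct))
```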
Function Calling / Tool Use¶
LFM2.5 emits Pythonic tool calls wrapped in <|tool_call_start|>…<|tool_call_end|>. When the server is launched with --enable-auto-tool-choice --tool-call-parser lfm2, the built-in
lfm2 tool parser converts these into standard OpenAI tool_calls:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.tool_calls)
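Once tool_calls come back, the client still has to run the function and return the result. The sketch below shows one way to dispatch them, following the usual OpenAI convention that arguments arrive as a JSON string. The `get_weather` body, the `REGISTRY` mapping, and the stand-in call object are all hypothetical and only illustrate the shape of the loop.

```python
import json
from types import SimpleNamespace

def get_weather(location: str) -> dict:
    """Hypothetical local implementation; returns canned data for illustration."""
    return {"location": location, "forecast": "sunny", "temp_c": 21}

REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Execute each requested call and return OpenAI-style `tool` messages."""
    messages = []
    for call in tool_calls or []:
        fn = REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments are a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(fn(**args)),
        })
    return messages

# Stand-in shaped like response.choices[0].message.tool_calls[0]
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages)
```

To get a final answer, append the assistant message and these tool messages to the conversation and call `client.chat.completions.create` again.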
Reasoning and tool calling can be combined for the reasoning models:
vllm serve LiquidAI/LFM2.5-8B-A1B \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser lfm2
Image Understanding (VL)¶
The VL models accept image + text turns through the standard chat API:
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            # image part first, then the question about it
            {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
    temperature=0.1,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
Allow more than one image per request with --limit-mm-per-prompt '{"image": 4}'.
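For local files or several images in one turn, the user message can be assembled programmatically. The sketch below assumes the server accepts base64 data URLs in image_url parts, which is worth confirming against the vLLM version in use. `image_part` and `vision_message` are hypothetical helpers, and the byte string stands in for real image data.

```python
import base64

def image_part(data: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url part using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}

def vision_message(images, question, limit=4):
    """Build one user turn; `limit` should match the server's --limit-mm-per-prompt."""
    if len(images) > limit:
        raise ValueError(f"{len(images)} images exceeds the per-prompt limit of {limit}")
    parts = [image_part(img) for img in images]
    parts.append({"type": "text", "text": question})
    return {"role": "user", "content": parts}

# Placeholder bytes; in practice read them with open(path, "rb").read()
vl_msg = vision_message([b"\xff\xd8placeholder"], "What is in this image?")
print(len(vl_msg["content"]), "content parts")
```

Checking the count on the client side turns an over-limit request into an immediate local error instead of a rejected API call.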
Benchmarking¶
Disable prefix caching for consistent measurements:
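# Terminal 1: start the server with prefix caching disabled
vllm serve LiquidAI/LFM2.5-8B-A1B \
  --no-enable-prefix-caching

# Terminal 2: once the server is up, drive it with random prompts
vllm bench serve \
  --model LiquidAI/LFM2.5-8B-A1B \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 64 \
  --ignore-eos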
Speed-of-Light Tuning¶
The defaults are already strong for LFM2.5: each model fits on a single GPU (so tensor parallelism gives no speedup) and the standard serve path is well optimized. A few flags give a small extra gain:
| Flag | Effect | Notes |
|---|---|---|
| VLLM_USE_OINK_OPS=1 | ~+2.6% on B200 | Routes RMSNorm to the Blackwell oink kernels (bundled in the vllm-openai image); output identical. Auto-enabled on Blackwell by these recipes; inert (native fallback) elsewhere. |
| --optimization-level 3 | ~+2% (all GPUs) | More aggressive compilation. Trades a longer one-time startup/compile for steady-state throughput (opt-in). |
| VLLM_USE_FASTOKENS=1 | lower TTFT on tokenization-bound loads | Swaps the HF fast-tokenizer Rust BPE backend for the fastokens shim (pip install fastokens). Helps high-QPS / long-prompt workloads; the gain is in tokenization latency, so it doesn't show up in steady-state decode throughput. |
# -O3 (opt-in); on Blackwell, VLLM_USE_OINK_OPS=1 is applied for you by the recipe
vllm serve LiquidAI/LFM2.5-1.2B-Instruct --optimization-level 3
# tokenization-bound serving (after `pip install fastokens`)
VLLM_USE_FASTOKENS=1 vllm serve LiquidAI/LFM2.5-1.2B-Instruct
These knobs make little difference for single-GPU LFM2.5, but experiment if you like:
VLLM_SSM_CONV_STATE_LAYOUT (SD vs DS), --mamba-backend (triton vs flashinfer),
--mamba-cache-mode (none vs all), --mm-processor-cache-type (lru vs shm).
Coming: the Lfm2VL encoder CUDA graph (vllm-project/vllm#44930, ~10–20% lower e2e latency at low batch) is not in 0.23.0 — it will be added once it ships in a stable release.
Server Flags Reference¶
| Flag | Description | When |
|---|---|---|
| --reasoning-parser qwen3 | Split <think>…</think> into reasoning_content | 8B-A1B, 1.2B-Thinking |
| --tool-call-parser lfm2 | Surface Pythonic tool calls as tool_calls | tool-capable models |
| --enable-auto-tool-choice | Auto-detect tool calls in output | with --tool-call-parser |
| --max-model-len N | Cap context (up to 32K; 128K on 8B-A1B) | small GPUs / fixed workload |
| --limit-mm-per-prompt '{"image": N}' | Max images per request | VL models |
Deploy on Modal¶
Modal runs this recipe on cloud GPUs with a single command — no
infrastructure to manage. The deployment script is lfm25-modal.py in this
directory: it serves an LFM2.5 model with vLLM behind an OpenAI-compatible endpoint, with the
model and GPU selectable via environment variables.
Deploy¶
pip install modal
modal setup # one-time: authenticate with Modal
modal deploy lfm25-modal.py # serves LiquidAI/LFM2.5-1.2B-Instruct on an L4 by default
Test¶
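A smoke test against the deployed endpoint could look like the sketch below, which uses only the standard library. The URL is a placeholder to be replaced with the one Modal reports for the deployment, and the sketch assumes the endpoint exposes the same /v1/chat/completions route as a local vLLM server and accepts a dummy key; adjust both if the deployment script adds authentication. `chat_request` is a hypothetical helper that builds the request without sending it.

```python
import json
import urllib.request

def chat_request(base_url, model, prompt, api_key="EMPTY"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

# Placeholder URL: substitute the one reported for your deployment
req = chat_request("https://YOUR-DEPLOYMENT.modal.run", "LiquidAI/LFM2.5-1.2B-Instruct", "Say hello.")
print(req.get_method(), req.full_url)

# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       body = json.load(resp)
#   print(body["choices"][0]["message"]["content"])
```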
Pick a model / GPU¶
LFM2.5's small footprint means even a budget GPU is plenty — the 1.2B dense checkpoint runs across NVIDIA T4, L4, A10G, L40S, A100 (40/80 GB), H100, H200, and B200; the 8B-A1B MoE and the VL models run on H100, H200, and B200. Size up to a 24 GB+ GPU (L4 / A10G or larger) for the 8B-A1B MoE, which keeps all ~8B of experts resident in VRAM.