
You are viewing the latest developer preview docs. The latest stable release is v0.18.0.

vllm-ascend - Home

Performance and Debug

  • Performance Benchmark
  • Optimization and Tuning
  • Service Profiling Guide
  • MSProbe Debugging Guide

By the vllm-ascend team

© Copyright 2025, vllm-ascend team.