Qwen3-235B-A22B#

1 Introduction#

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support. Qwen3-235B-A22B is the largest MoE variant, featuring 235B total parameters with 22B activated per token.

This document will demonstrate the main validation steps for Qwen3-235B-A22B in the vLLM-Ascend environment, including supported features, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.

The Qwen3-235B-A22B model is first supported in v0.8.4rc2. This document is validated and written based on vLLM-Ascend v0.21.0. All v0.21.0 and later versions can run stably. To use the latest features, it is recommended to use the latest release candidate or official version.

2 Supported Features#

Please refer to the Supported Features List for the model support matrix.

Please refer to the Feature Guide for feature configuration information.

3 Prerequisites#

3.1 Model Weight#

The following model variants are available. It is recommended to download the model weight to a shared directory accessible to all nodes.

BF16 Version:

Model

Hardware Requirement

Download

Qwen3-235B-A22B (BF16)

1 Atlas 800I A3 (64G × 16), 1 Atlas 800I A2 (64G × 8)

Download

Quantized Version (Pre-converted):

Model

Quantization

Hardware Requirement

Download

Qwen3-235B-A22B-W8A8

W8A8

1 Atlas 800I A3 (64G × 16), 1 Atlas 800I A2 (64G × 8)

Download

These are the recommended numbers of cards, which can be adjusted according to the actual situation.

3.2 Model Quantization#

Install msmodelslim:

# 1. Clone the msmodelslim repository.
git clone https://gitcode.com/Ascend/msmodelslim.git

# 2. Enter the msmodelslim directory and run the installation script.
cd msmodelslim
bash install.sh

# The following message indicates that msmodelslim has been installed successfully.
Successfully installed msmodelslim-{version}

Run quantization:

cd examples/Qwen3-MOE
# Run the following command to quantize the model.
python3 quant_qwen_moe_w8a8.py --model_path /path/to/your/Qwen3-235B-A22B \
    --save_path /path/to/your/Qwen3-235B-A22B-W8A8-rot \
    --anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
    --calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
    --trust_remote_code True \
    --rot

3.3 Verify Multi-node Communication#

If you need to deploy a multi-node environment, verify the multi-node communication according to Verify Multi-node Communication Environment.

4 Installation#

4.1 Docker Image Installation#

You can use the official all-in-one Docker image for Qwen3 MoE models.

Docker Pull:

docker pull quay.io/ascend/vllm-ascend:v0.21.0rc1

Docker Run:

Start the docker image on your each node.

export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-a3

docker run --rm \
    --name vllm-ascend-env \
    --shm-size=1g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci8 \
    --device /dev/davinci9 \
    --device /dev/davinci10 \
    --device /dev/davinci11 \
    --device /dev/davinci12 \
    --device /dev/davinci13 \
    --device /dev/davinci14 \
    --device /dev/davinci15 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

Note

A3 has 8 NPUs with dual-die design (16 chips total: /dev/davinci[0-15]). If you are on a shared machine, map only the chips you need (e.g., /dev/davinci[0-7] for NPU 0-3).

export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1

docker run --rm \
    --name vllm-ascend-env \
    --shm-size=1g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

The default workdir is /workspace. vLLM and vLLM-Ascend are installed as Python packages in site-packages.

Installation Verification: After starting the container, run the following command to verify the installation:

docker ps | grep vllm-ascend-env

Expected result: The container is listed with status Up. You can also verify the vllm-ascend version inside the container:

pip show vllm-ascend

Expected result: The version information is displayed, matching the pulled image version.

4.2 Source Code Installation#

If you prefer not to use the Docker image, you can build from source:

  1. Clone and install vLLM:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install -e .
    
  2. Clone and install the vLLM-Ascend repository:

    git clone https://github.com/vllm-project/vllm-ascend.git
    cd vllm-ascend
    pip install -e .
    

Installation Verification:

pip show vllm-ascend

Expected result: The version information is displayed, confirming a successful installation.

Note

If deploying a multi-node environment, set up the environment on each node.

For more details, please refer to the Installation Guide.

5 Online Service Deployment#

5.1 Single-Node Online Deployment#

Single-node deployment completes both Prefill and Decode within the same node, suitable for development, testing, and small-to-medium scale inference scenarios.

Start the server:

The following command is an example configuration. Adjust the parameters based on your actual scenario.

Atlas 800I A2/A3:

export VLLM_USE_MODELSCOPE=True
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1

vllm serve your_model_path \
    --host <host_ip> \
    --port <port> \
    --tensor-parallel-size 8 \
    --data-parallel-size 1 \
    --seed 1024 \
    --quantization ascend \
    --served-model-name qwen3 \
    --max-num-seqs 32 \
    --max-model-len 131072 \
    --max-num-batched-tokens 8096 \
    --enable-expert-parallel \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_flashcomm1": true}' \
    --async-scheduling

Note

Service Verification:

If the service starts successfully, the following startup log will be displayed:

(APIServer pid=<pid>) INFO:     Started server process [<pid>]
(APIServer pid=<pid>) INFO:     Waiting for application startup.
(APIServer pid=<pid>) INFO:     Application startup complete.

5.2 Multi-Node PD Separation Deployment#

PD (Prefill-Decode) separation splits the Prefill and Decode phases across different nodes for better throughput. The following example shows the parameter configuration for a three-node A3 PD disaggregation scenario (one Prefill node + two Decode nodes):

For the detailed deployment guide, please refer to Prefill-Decode Disaggregation Mooncake Verification.

Hardware: 3 × Atlas 800 A3 (64G × 16), one for Prefill, two for Decode.

First, prepare launch_online_dp.py on each node:

import argparse
import multiprocessing
import os
import subprocess
import sys

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dp-size", type=int, required=True, help="Data parallel size.")
    parser.add_argument("--tp-size", type=int, default=1, help="Tensor parallel size.")
    parser.add_argument("--dp-size-local", type=int, default=-1, help="Local data parallel size.")
    parser.add_argument("--dp-rank-start", type=int, default=0, help="Starting rank for data parallel.")
    parser.add_argument("--dp-address", type=str, required=True, help="IP address for data parallel master node.")
    parser.add_argument("--dp-rpc-port", type=str, default=12345, help="Port for data parallel master node.")
    parser.add_argument("--vllm-start-port", type=int, default=9000, help="Starting port for the engine.")
    return parser.parse_args()

args = parse_args()
dp_size = args.dp_size
tp_size = args.tp_size
dp_size_local = args.dp_size_local
if dp_size_local == -1:
    dp_size_local = dp_size
dp_rank_start = args.dp_rank_start
dp_address = args.dp_address
dp_rpc_port = args.dp_rpc_port
vllm_start_port = args.vllm_start_port

def run_command(visible_devices, dp_rank, vllm_engine_port):
    command = [
        "bash",
        "./run_dp_template.sh",
        visible_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        dp_rpc_port,
        str(tp_size),
    ]
    subprocess.run(command, check=True)

if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)

    processes = []
    num_cards = dp_size_local * tp_size
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command, args=(visible_devices, dp_rank, vllm_engine_port))
        processes.append(process)
        process.start()

    for process in processes:
        process.join()

Then prepare run_dp_template.sh on each node.

Prefill node (set nic_name and local_ip to your own):

nic_name="<your_nic_name>"
local_ip="<your_ip>"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_NUM_THREADS=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

export ASCEND_RT_VISIBLE_DEVICES=$1

vllm serve "/data/weights/Qwen3-235B-A22B-w8a8-rot" \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --served-model-name qwen3_235b \
    --max-model-len 40960 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 24 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --enforce-eager \
    --additional-config '{"enable_flashcomm1": true, "enable_fused_mc2": 1}' \
    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
        "kv_role": "kv_producer",
        "kv_port": "30000",
        "engine_id": "0",
        "kv_connector_extra_config": {
             "use_ascend_direct": true,
             "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 8,
                    "tp_size": 4
             }
        }
        }'

Decode node 0 (set nic_name and local_ip to your own):

nic_name="<your_nic_name>"
local_ip="<your_ip>"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=1024
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_NUM_THREADS=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

export VLLM_TORCH_PROFILER_WITH_STACK=0
export ASCEND_RT_VISIBLE_DEVICES=$1

vllm serve "/data/weights/Qwen3-235B-A22B-w8a8-rot" \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --served-model-name qwen3_235b \
    --max-model-len 40960 \
    --max-num-batched-tokens 512 \
    --max-num-seqs 128 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_flashcomm1": true, "enable_fused_mc2": 2}' \
    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
        "kv_role": "kv_consumer",
        "kv_port": "30100",
        "engine_id": "1",
        "kv_connector_extra_config": {
             "use_ascend_direct": true,
             "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 8,
                    "tp_size": 4
             }
        }
        }'

Decode node 1 (set nic_name and local_ip to your own):

nic_name="<your_nic_name>"
local_ip="<your_ip>"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=1024
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_NUM_THREADS=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

export VLLM_TORCH_PROFILER_WITH_STACK=0
export ASCEND_RT_VISIBLE_DEVICES=$1

vllm serve "/data/weights/Qwen3-235B-A22B-w8a8-rot" \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --served-model-name qwen3_235b \
    --max-model-len 40960 \
    --max-num-batched-tokens 512 \
    --max-num-seqs 128 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_flashcomm1": true, "enable_fused_mc2": 2}' \
    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
        "kv_role": "kv_consumer",
        "kv_port": "30100",
        "engine_id": "1",
        "kv_connector_extra_config": {
             "use_ascend_direct": true,
             "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 8,
                    "tp_size": 4
             }
        }
        }'

Once the scripts are ready, start the servers on each node.

Prefill node:

python launch_online_dp.py \
    --dp-size 2 --tp-size 8 \
    --dp-size-local 2 --dp-rank-start 0 \
    --dp-address <prefill_ip> --dp-rpc-port 54951 \
    --vllm-start-port 9123

Decode node 0:

python launch_online_dp.py \
    --dp-size 8 --tp-size 4 \
    --dp-size-local 4 --dp-rank-start 0 \
    --dp-address <decode_ip> --dp-rpc-port 54951 \
    --vllm-start-port 9123

Decode node 1:

python launch_online_dp.py \
    --dp-size 8 --tp-size 4 \
    --dp-size-local 4 --dp-rank-start 4 \
    --dp-address <decode_ip> --dp-rpc-port 54951 \
    --vllm-start-port 9123

Request Forwarding:

Run the proxy on any machine that can reach both nodes. You can get the proxy script from the repository: load_balance_proxy_server_example.py.

unset http_proxy https_proxy

python load_balance_proxy_server_example.py \
  --port 38085 \
  --host <prefill_ip> \
  --prefiller-hosts \
    <prefill_ip> <prefill_ip> \
  --prefiller-ports \
    9123 9124 \
  --decoder-hosts \
    <decode0_ip> <decode0_ip> <decode0_ip> <decode0_ip> \
    <decode1_ip> <decode1_ip> <decode1_ip> <decode1_ip> \
  --decoder-ports \
    9123 9124 9125 9126 \
    9123 9124 9125 9126 \

Note

Service Verification:

If the service starts successfully, the following startup log will be displayed:

(APIServer pid=<pid>) INFO:     Started server process [<pid>]
(APIServer pid=<pid>) INFO:     Waiting for application startup.
(APIServer pid=<pid>) INFO:     Application startup complete.

6 Functional Verification#

After the service is started, the model can be invoked by sending a prompt:

curl http://<node0_ip>:<port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3",
        "prompt": "The future of AI is",
        "max_completion_tokens": 50,
        "temperature": 0
    }'

Expected result: HTTP 200 with a JSON response containing the choices field with generated text.

7 Accuracy Evaluation#

Using AISBench#

For details, please refer to Using AISBench.

Install from source:

git clone https://github.com/AISBench/benchmark.git
cd benchmark
pip install -e .

The following is an example configuration for the accuracy evaluation config file:

Accuracy Evaluation Config File:

# Example configuration: benchmarks/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="your_model_path",
        model="qwen3",
        request_rate = 0,
        retry = 2,
        host_ip = "127.0.0.1",
        host_port = 2001,
        max_out_len = 32768,
        batch_size = 32,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 20,
            top_p = 0.95,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]

Run the accuracy evaluation using the aime2024 dataset as an example:

ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt --debug

The –models parameter value corresponds to the abbr field in the configuration file above. Adjust max_out_len, batch_size, and dataset tasks based on your scenario.

8 Performance Evaluation#

Using AISBench#

For setup details, including installation, dataset download, and configuration, please refer to Using AISBench for details.

The following is an example configuration for the accuracy evaluation config file:

# Example configuration: benchmarks/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="your_model_path",
        model="qwen",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        host_ip="localhost",
        host_port=20002,
        max_out_len=1500,
        batch_size=140,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos = True
        ),
    )
]

Run the performance evaluation using the GSM8K dataset as an example:

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --summarizer default_perf --mode perf --num-prompts 560

Using vLLM Benchmark#

Refer to vLLM benchmark for more details.

There are three vllm bench subcommands:

  • latency: Benchmark the latency of a single batch of requests.

  • serve: Benchmark the online serving throughput.

  • throughput: Benchmark offline inference throughput.

Take serve as an example:

export VLLM_USE_MODELSCOPE=True
vllm bench serve \
    --model your_model_path \
    --dataset-name random \
    --random-input 200 \
    --num-prompts 200 \
    --request-rate 1 \
    --save-result \
    --result-dir ./

After several minutes, you will get the performance evaluation result.

9 Performance Tuning#

9.2 Tuning Guidelines#

9.2.1 General Tuning Reference#

Please refer to the Public Performance Tuning Documentation for tuning methods. Please refer to the Feature Guide for detailed feature descriptions.

10 FAQ#

For common environment, installation, and general parameter issues, please refer to the vLLM-Ascend FAQs. This section only covers issues specific to Qwen3-235B-A22B.

Q: What hardware is required for Qwen3-235B-A22B?#

For BF16: 1 Atlas 800I A3 (64G × 16) node, 1 Atlas 800I A2 (64G × 8) node, or 2 Atlas 800I A2 (32G × 8) nodes. For W8A8 quantized version, the hardware requirements are similar.

Q: How do I enable long context beyond 40K?#

Use yarn rope-scaling. For vLLM >= v0.12.0: --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'. For older versions, use --rope_scaling. Model variants like Qwen3-235B-A22B-Instruct-2507 natively support long contexts and don’t need this parameter.

Q: When should I use PD disaggregation vs single-node deployment?#

Single-node deployment is simpler and recommended when the model fits within a single node. PD disaggregation separates Prefill and Decode across nodes, enabling higher throughput for large-scale serving. For Qwen3-235B-A22B, three A3 nodes with PD disaggregation can achieve ~3× the throughput of single-node deployment.

Q: What is the difference between enable_fused_mc2=1 and =2?#

Value 1 enables the base MoE fused operator, suitable for typical EP configurations. Value 2 enables an alternative fusion strategy optimized for large-scale EP (e.g., EP32 in PD disaggregation scenarios). Both are experimental and currently only support W8A8 quantization on Atlas A3 servers.

Q: When should I use Expert Parallelism?#

Expert Parallelism (EP) should always be enabled for Qwen3-235B-A22B (an MoE model) via --enable-expert-parallel. It distributes FFN experts across NPUs to reduce per-device computation. EP works alongside TP, where MoE layers use EP and non-MoE layers use TP.

Q: How do I choose between Context Parallelism and PD Disaggregation?#

Context Parallelism (CP) splits the KV cache of a single request across multiple NPUs, suitable for long context scenarios on a single node. PD Disaggregation separates Prefill and Decode across nodes, suitable for high-throughput serving with many concurrent requests.