InternVL3.5 (InternVL3_5-38B/241B-A28B)

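1 Introduction

InternVL3.5 is a new open-source multimodal model family in the InternVL series that significantly improves versatility, reasoning capability, and inference efficiency.

InternVL3.5 models are first supported in vllm-ascend:v0.20.2.

This document walks through the main verification steps for the InternVL3_5-38B and InternVL3_5-241B-A28B models, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.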

2 Supported Features

Refer to supported features for the model's supported feature matrix.

Refer to the feature guide for feature configuration.

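3 Environment Preparation

3.1 Model Weight

The quantized models used in this document, InternVL3_5-38B-w8a8 and InternVL3_5-241B-A28B-w8a8, each require 1 Atlas 800 A3 (64G × 16) node.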
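The serve commands later in this document load weights from /root/.cache/modelscope/hub/models/vllm-ascend/, a path under the /root/.cache directory that the container mounts from the host. Before launching, it can help to confirm that a weight directory is actually populated. The following is a minimal sketch, not part of vllm-ascend: the function name check_weights_dir is hypothetical, and it assumes a Hugging Face-style checkpoint layout (a config.json plus at least one *.safetensors file), which this document does not state explicitly.

```shell
# Hypothetical helper: succeed only if the directory looks like a populated checkpoint.
# Assumption: the checkpoint holds config.json and at least one *.safetensors file.
check_weights_dir() {
    dir="$1"
    if [ ! -f "$dir/config.json" ]; then
        echo "missing config.json in $dir" >&2
        return 1
    fi
    found=0
    # If the glob matches nothing it stays literal, so test each candidate with -f.
    for f in "$dir"/*.safetensors; do
        if [ -f "$f" ]; then
            found=1
            break
        fi
    done
    if [ "$found" -ne 1 ]; then
        echo "no *.safetensors files in $dir" >&2
        return 1
    fi
    echo "weights look complete: $dir"
}
```

For example, running check_weights_dir /root/.cache/modelscope/hub/models/vllm-ascend/InternVL3_5-38B-w8a8 before vllm serve reports an incomplete download early instead of failing during model load.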

4 Installation

4.1 Docker Image Installation

You can use our official docker image to run InternVL3_5 directly.

# Replace |vllm_ascend_version| with the vllm-ascend release tag you want to run
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you run Docker with a bridge network, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
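The sixteen --device /dev/davinciN flags above follow a regular pattern, so they can be generated instead of typed by hand. The sketch below illustrates one way to do that; the helper name davinci_device_flags and the DEVICE_FLAGS variable are hypothetical and not part of the official instructions.

```shell
# Hypothetical helper: print "--device /dev/davinciN " for N = 0 .. count-1.
davinci_device_flags() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s /dev/davinci%s ' '--device' "$i"
        i=$((i + 1))
    done
}

# 16 NPUs, matching /dev/davinci0 .. /dev/davinci15 in the command above.
DEVICE_FLAGS=$(davinci_device_flags 16)
echo "$DEVICE_FLAGS"
```

When passing the result to docker run, leave $DEVICE_FLAGS unquoted so the shell splits it into separate arguments; the remaining --device and -v options stay exactly as listed above.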

To verify that the environment is installed correctly, refer to installation.

4.2 Source Code Installation

If you prefer not to use the Docker image above, you can also build everything from source.

To deploy a multi-node environment, set up the environment on every node.

5 Online Service Deployment

5.1 Single-Node Online Deployment

  • The quantized model InternVL3_5-38B-w8a8 can be deployed on 1 Atlas 800 A3 (64G × 16) node.

Run the following script to start the online inference service.

Common issues tip: if you encounter problems, refer to FAQs.

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export VLLM_ASCEND_ENABLE_FLASHCOMM1=0
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_WITH_STACK=0
export HCCL_BUFFSIZE=1536

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/InternVL3_5-38B-w8a8/ \
    --port 2002 \
    --served-model-name internvl3_5 \
    --trust-remote-code \
    --async-scheduling \
    --max-model-len 40960 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 4 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4,32,64,128,192,256,512]}' \
    --additional-config '{"enable_weight_nz_layout": true, "enable_cpu_binding": true}' \
    --mm-processor-cache-gb 0 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --safetensors-load-strategy 'prefetch' \
    --allowed-local-media-path "/"

  • The quantized model InternVL3_5-241B-A28B-w8a8 can be deployed on 1 Atlas 800 A3 (64G × 16) node.

Run the following script to start the online inference service.

Common issues tip: if you encounter problems, refer to FAQs.

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_WITH_STACK=0
export HCCL_BUFFSIZE=1536

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/InternVL3_5-241B-A28B-w8a8/ \
    --port 2001 \
    --served-model-name internvl3_5 \
    --trust-remote-code \
    --async-scheduling \
    --max-model-len 40960 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 16 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.9 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4,32,64,128,192,256,512]}' \
    --additional-config '{"enable_weight_nz_layout": true, "enable_cpu_binding": true}' \
    --mm-processor-cache-gb 0 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --enable-expert-parallel \
    --safetensors-load-strategy 'prefetch' \
    --allowed-local-media-path "/"
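Loading weights of this size takes a while, so requests sent immediately after vllm serve starts are likely to fail. A minimal sketch of a readiness poll is shown below. It assumes the server exposes the /health endpoint of vLLM's OpenAI-compatible server and that curl is installed; the function name wait_for_server and its default attempt count and delay are hypothetical.

```shell
# Hypothetical helper: poll BASE_URL/health until it answers or the attempts run out.
wait_for_server() {
    base_url="$1"
    attempts="${2:-60}"   # number of polls before giving up
    delay="${3:-5}"       # seconds to sleep between polls
    n=0
    while [ "$n" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failure, -s: silent, --max-time: bound each request
        if curl -fs --max-time 5 -o /dev/null "$base_url/health"; then
            echo "server ready: $base_url"
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    echo "server not ready after $attempts attempts: $base_url" >&2
    return 1
}
```

With the scripts above, this would be called as wait_for_server http://localhost:2002 for InternVL3_5-38B-w8a8 and wait_for_server http://localhost:2001 for InternVL3_5-241B-A28B-w8a8, matching their --port values.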

Notice:

Some configurations for optimization are shown below:

  • VLLM_ASCEND_ENABLE_FLASHCOMM1: Enables the FlashComm optimization to reduce communication and computation overhead on prefill nodes. With FlashComm enabled, the layer_sharding list cannot include o_proj.

  • VLLM_ASCEND_ENABLE_FUSED_MC2: Enables the following fused operators: dispatch_gmm_combine_decode and dispatch_ffn_combine.

For further explanation and restrictions of the environment variables above, refer to the Python file envs.py.
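The two launch scripts differ in VLLM_ASCEND_ENABLE_FLASHCOMM1 (0 for InternVL3_5-38B-w8a8, 1 for InternVL3_5-241B-A28B-w8a8), so it is easy to start a server with values left over from the other script. A small sketch that prints the effective values before launching is shown below; the function name show_launch_env is hypothetical, and the variable list is a subset of the exports in the scripts above.

```shell
# Hypothetical helper: print the exported value of selected launch variables.
show_launch_env() {
    for name in VLLM_ASCEND_ENABLE_FLASHCOMM1 VLLM_ASCEND_ENABLE_FUSED_MC2 \
                HCCL_OP_EXPANSION_MODE HCCL_BUFFSIZE TASK_QUEUE_ENABLE; do
        # printenv exits non-zero when the variable is not exported.
        value=$(printenv "$name") || value="(unset)"
        echo "$name=$value"
    done
}

show_launch_env
```

Running it in the same shell, right before vllm serve, shows exactly what the server process will inherit.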

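6 Functional Verification

Once the server is started, you can query the model with an input prompt. Send the request to the port passed to --port at startup: 2002 for the InternVL3_5-38B-w8a8 script above, 2001 for the InternVL3_5-241B-A28B-w8a8 script. The example below targets port 2002.

curl http://localhost:2002/v1/chat/completions \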
    -H "Content-Type: application/json" \
    -d '{
    "model": "internvl3_5",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'

Expected Result:

{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"internvl3_5","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **a tiger**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
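One quick sanity check on a response like the one above is that the usage counters add up: prompt_tokens plus completion_tokens should equal total_tokens. The sketch below extracts the three integers with grep and cut; the helper name json_int_field is hypothetical, and the approach assumes each field appears once with a plain integer value. A real JSON parser such as jq would be more robust if it is available.

```shell
# Hypothetical helper: extract an integer field such as "total_tokens" from a JSON string.
# Assumption: the field holds a plain integer; only the first match is used.
json_int_field() {
    printf '%s' "$1" | grep -o "\"$2\":[0-9]*" | head -n 1 | cut -d: -f2
}

# Usage counters copied from the expected result above.
response='{"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16}}'

prompt=$(json_int_field "$response" prompt_tokens)
completion=$(json_int_field "$response" completion_tokens)
total=$(json_int_field "$response" total_tokens)

if [ $((prompt + completion)) -eq "$total" ]; then
    echo "usage is consistent: $prompt + $completion = $total"
else
    echo "usage mismatch: $prompt + $completion != $total" >&2
fi
```

For the expected result above, this prints "usage is consistent: 107 + 16 = 123".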

7 Accuracy Evaluation

7.1 Using AISBench

  1. Refer to Using AISBench for details.

  2. After execution, you can get the result.

8 Performance

8.1 Using AISBench

Refer to Using AISBench for performance evaluation.

8.2 Using vLLM Benchmark

Refer to vllm benchmark for more details.

9 FAQ

  • Common issues tip: if you encounter problems, refer to FAQs.