InternVL3.5(InternVL3_5-38B/241B-A28B)#
1 Introduction#
InternVL3.5是InternVL系列中新的开源多模态模型家族,显著提升了通用性、推理能力和推理效率。
InternVL3.5模型在vllm-ascend:v0.20.2中首次获得支持
本文档将展示InternVL3_5-38B和InternVL3_5-241B-A28B模型的主要验证步骤,包括支持特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。
2 Supported Features#
请参考支持特性获取模型的支持特性矩阵。
请参考特性指南获取特性的配置。
3 Environment Preparation#
3.1 Model Weight#
require 1 Atlas 800 A3 (64G × 16) node:
InternVL3_5-38B-w8a8: requires 1 Atlas 800 A3 (64G × 16) node Download model weightInternVL3_5-241B-A28B-w8a8: requires 1 Atlas 800 A3 (64G × 16) node Download model weight
4 Installation#
4.1 Docker Image Installation#
You can use our official docker image to run InternVL3_5 directly.
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
To verify the successful installation of the environment, please refer to installation.
4.2 Source Code Installation#
In addition, if you don't want to use the docker image as above, you can also build all from source:
Install
vllm-ascendfrom source, refer to installation.
如果要部署多节点环境,需要在每个节点上设置环境。
5 Online Service Deployment#
5.1 Single-Node Online Deployment#
Quantized model
InternVL3_5-38B-w8a8can be deployed on 1 Atlas 800 A3 (64G × 16) .
Run the following script to execute online inference.
Common Issues Tip: If you encounter issues, Refer to FAQs.
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export VLLM_ASCEND_ENABLE_FLASHCOMM1=0
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_WITH_STACK=0
export HCCL_BUFFSIZE=1536
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/InternVL3_5-38B-w8a8/ \
--port 2002 \
--served-model-name internvl3_5 \
--trust-remote-code \
--async-scheduling \
--max-model-len 40960 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 4 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--async-scheduling \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4,32,64,128,192,256,512]}' \
--additional-config '{"enable_weight_nz_layout": true, "enable_cpu_binding": true}' \
--mm-processor-cache-gb 0 \
--enable-chunked-prefill \
--enable-prefix-caching \
--safetensors-load-strategy 'prefetch' \
--allowed-local-media-path "/
Quantized model
InternVL3_5-241B-A28B-w8a8can be deployed on 1 Atlas 800 A3 (64G × 16) .
Run the following script to execute online inference.
Common Issues Tip: If you encounter issues, Refer to FAQs.
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_WITH_STACK=0
export HCCL_BUFFSIZE=1536
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/InternVL3_5-241B-A28B-w8a8/ \
--port 2001 \
--served-model-name internvl3_5 \
--trust-remote-code \
--async-scheduling \
--max-model-len 40960 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 16 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.9 \
--async-scheduling \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4,32,64,128,192,256,512]}' \
--additional-config '{"enable_weight_nz_layout": true, "enable_cpu_binding": true}' \
--mm-processor-cache-gb 0 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-expert-parallel \
--safetensors-load-strategy 'prefetch' \
--allowed-local-media-path "/"
Notice:
Some configurations for optimization are shown below:
VLLM_ASCEND_ENABLE_FLASHCOMM1: Enable FlashComm optimization to reduce communication and computation overhead on prefill node. With FlashComm enabled, layer_sharding list cannot include o_proj as an element.VLLM_ASCEND_ENABLE_FUSED_MC2: Enable following fused operators: dispatch_gmm_combine_decode and dispatch_ffn_combine operator.
Please refer to the following python file for further explanation and restrictions of the environment variables above: envs.py
6 Functional Verification#
服务器启动后,您可以使用输入提示查询模型:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internvl3_5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
]
}'
Expected Result:
{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"internvl3_5","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **a tiger**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
7 Accuracy Evaluation#
7.1 Using AISBench#
Refer to Using AISBench for details.
After execution, you can get the result.
8 Performance#
8.1 Using AISBench#
Refer to Using AISBench for performance evaluation for details.
8.2 Using vLLM Benchmark#
Refer to vllm benchmark for more details.
9 FAQ#
Common Issues Tip: If you encounter issues, Refer to FAQs.