# AI QoS Feature
## Background
In inference scenarios, several types of traffic coexist, such as operator delivery, collective communication, and KVCache transfer. These traffic types share the network and interfere with each other, which increases inference latency.
For example, in the Agentic AI era, the KVCache grows as context lengths increase. To save HBM, the KVCache is offloaded to DDR, which improves inference TPS. To keep compute utilization high, a pipeline orchestration method that hides KVCache transfer behind computation is commonly used: the next layer's KVCache is prefetched while the current layer's computation/communication runs, which reduces overall latency. However, the prefetch traffic then conflicts with operator delivery and collective communication traffic, which increases inference latency and affects the SLO.

As shown in the preceding figure, traffic conflicts occur on the UB switch when intra-node device-to-device (D2D) traffic, intra-node host-to-device (H2D) traffic, and inter-node D2D traffic are transmitted.
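The prefetch pipeline described in this section can be sketched as follows. This is a minimal illustration only: `load_kvcache` and `compute_layer` are hypothetical stand-ins for the real H2D copy and layer execution, and a background thread plays the role of the transfer engine.

```python
from concurrent.futures import ThreadPoolExecutor


def load_kvcache(layer):
    """Hypothetical stand-in for copying one layer's KVCache from DDR."""
    return f"kv[{layer}]"


def compute_layer(layer, kv):
    """Hypothetical stand-in for one layer's computation/communication."""
    return f"out[{layer}] using {kv}"


def run_pipeline(num_layers):
    """Prefetch layer i+1's KVCache while layer i is being computed."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_kvcache, 0)  # fetch layer 0 up front
        for layer in range(num_layers):
            kv = pending.result()  # wait for this layer's KVCache
            if layer + 1 < num_layers:
                # Start the next transfer before computing the current layer.
                pending = pool.submit(load_kvcache, layer + 1)
            outputs.append(compute_layer(layer, kv))
    return outputs
```

In a real deployment the prefetch and the compute step both generate traffic at the same time, which is where the conflict on the UB switch comes from.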
## Introduction
When different types of traffic conflict, Virtual Lanes (VLs) can be used to isolate the traffic at the UB switch and to schedule the VLs differently. This (1) places different traffic types in separate VLs so that congestion does not spread between them, and (2) allows differentiated scheduling for each traffic type.
As shown in the following figure, different types of traffic are mapped to different VLs to isolate the traffic. In addition, each VL is assigned a priority and the strict priority (SP) scheduling mode is used. When different types of traffic reach the UB switch at the same time, the traffic in the high-priority VL is scheduled first, then the traffic in the middle-priority VL, and so on until all the traffic is scheduled. In this way, differentiated scheduling is implemented for different types of traffic.
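The strict priority behavior described above can be sketched in a few lines. This is a simplified model, not the switch implementation: the priority names in `PRIORITY_ORDER` and the packet labels are assumptions made for illustration.

```python
from collections import deque

# Assumed priority names, highest first (illustrative only).
PRIORITY_ORDER = ["high", "middle", "low"]


def strict_priority_schedule(queues):
    """Drain per-VL queues in strict priority order.

    queues maps a priority name to a deque of packets waiting in that VL.
    A lower-priority VL is served only when every higher one is empty.
    """
    sent = []
    while True:
        for prio in PRIORITY_ORDER:
            queue = queues.get(prio)
            if queue:
                sent.append(queue.popleft())
                break  # restart from the highest priority after each packet
        else:
            return sent  # no VL had anything left to send
```

For example, with one packet waiting in the high-priority VL, one in the middle-priority VL, and two in the low-priority VL, the packets leave in that order regardless of arrival order.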

Different traffic is transmitted through different channels. Therefore, the AI QoS solution implements isolation and differentiated scheduling of different traffic to meet service requirements by (1) setting priorities for different NPU channels on the host, (2) establishing the mapping between the NPU channel priority and the VL of the UB switch, and (3) performing differentiated scheduling among different VLs of the UB switch based on the priority.
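The first two steps above amount to two lookup tables. The sketch below is illustrative only: the channel names come from the manual-mode flags of `tools/ai_qos.py`, but the priorities assigned to them and the VL numbers are assumptions, not the values the tool generates.

```python
# Step 1: priority per NPU channel (assumed example values).
CHANNEL_PRIORITY = {
    "SDMA_D2D": "high",
    "SDMA_H2D": "middle",
    "PCIEDMA_H2D": "low",
}

# Step 2: priority -> VL on the UB switch (VL numbers are hypothetical).
PRIORITY_TO_VL = {"high": 0, "middle": 1, "low": 2}


def vl_for_channel(channel):
    """Resolve the VL that carries traffic from the given NPU channel."""
    return PRIORITY_TO_VL[CHANNEL_PRIORITY[channel]]


def group_channels_by_vl():
    """Return {vl: [channels]} to show which traffic shares a VL."""
    groups = {}
    for channel, prio in CHANNEL_PRIORITY.items():
        groups.setdefault(PRIORITY_TO_VL[prio], []).append(channel)
    return groups
```

Step 3, the scheduling among VLs, happens on the UB switch itself and is not modeled here.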
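## Building the AI QoS Module
Before using tools/ai_qos.py, build and install the AI QoS extension. The DSMI include and library paths depend on the environment. Locate the paths on your machine first, then replace YOUR_DSMI_INCLUDE_DIR and YOUR_DSMI_LIBRARY_FILE in the commands (for example, /usr/local/Ascend/driver/include and /usr/local/Ascend/driver/lib64/driver/libdrvdsmi_host.so).
In most deployments, these commands are run inside a container. When creating the container, make sure the DSMI header and library directories are mounted into the container filesystem; otherwise CMake cannot find these files.
Run the following commands from the vLLM-Ascend repository root: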
```bash
cmake -S tools/ai_qos -B tools/ai_qos/build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=${PWD}/vllm_ascend \
  -DDSMI_INCLUDE_DIR=YOUR_DSMI_INCLUDE_DIR \
  -DDSMI_LIBRARY=YOUR_DSMI_LIBRARY_FILE
cmake --build tools/ai_qos/build -j
cmake --install tools/ai_qos/build
```
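Because a missing mount is the usual reason CMake cannot find DSMI, it can help to check the two paths before configuring. The helper below is a hypothetical convenience, not part of the repository: it verifies that the paths exist and returns the same configure command as an argument list that could be passed to `subprocess.run`.

```python
import os


def cmake_configure_args(dsmi_include_dir, dsmi_library, repo_root="."):
    """Return the cmake configure argv after checking the DSMI paths exist."""
    if not os.path.isdir(dsmi_include_dir):
        raise FileNotFoundError(f"DSMI include dir not found: {dsmi_include_dir}")
    if not os.path.isfile(dsmi_library):
        raise FileNotFoundError(f"DSMI library not found: {dsmi_library}")
    # Mirrors ${PWD}/vllm_ascend when run from the repository root.
    prefix = os.path.join(os.path.abspath(repo_root), "vllm_ascend")
    return [
        "cmake", "-S", "tools/ai_qos", "-B", "tools/ai_qos/build",
        "-DCMAKE_BUILD_TYPE=Release",
        f"-DCMAKE_INSTALL_PREFIX={prefix}",
        f"-DDSMI_INCLUDE_DIR={dsmi_include_dir}",
        f"-DDSMI_LIBRARY={dsmi_library}",
    ]
```

If either path is missing inside the container, the helper fails early with a message naming the path, which points directly at the mount to fix.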
## Usage
The AI QoS feature supports two modes: Auto and Manual. Enter the vLLM-Ascend installation directory and run one of the following commands before starting the inference job:
### 1) Auto mode
```bash
python tools/ai_qos.py
```
AI QoS auto mode automatically classifies the priorities of different types of traffic and generates QoS tags. It also prints the UB switch configuration. Copy the output, log in to the UB switch, and apply the configuration there. Applying it overwrites the current QoS configuration on the UB switch, so back up any existing QoS configuration first.
### 2) Manual mode
```bash
python tools/ai_qos.py --mode manual --AIV_D2D {priority} --AIV_H2D {priority} --SDMA_D2D {priority} --SDMA_H2D {priority} --PCIEDMA_H2D {priority}
```
AI QoS manual mode calculates the QoS tag of each traffic type from the priorities set by the user, then generates and prints the UB switch configuration. Copy the output, log in to the UB switch, and apply the configuration there. Applying it overwrites the current QoS configuration on the UB switch, so back up any existing QoS configuration first.
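The manual-mode command can also be assembled programmatically, for example from a launch script. The helper below is a hypothetical sketch: it uses only the flags shown in the command above, and the priority value `high` in the usage note is taken from the parameter table; which other values the tool accepts is not verified here.

```python
# Traffic-type flags accepted in manual mode, as listed in the command above.
TRAFFIC_TYPES = ("AIV_D2D", "AIV_H2D", "SDMA_D2D", "SDMA_H2D", "PCIEDMA_H2D")


def manual_mode_argv(priorities):
    """Build the manual-mode argv from a {traffic_type: priority} mapping."""
    unknown = set(priorities) - set(TRAFFIC_TYPES)
    if unknown:
        raise ValueError(f"unknown traffic type(s): {sorted(unknown)}")
    argv = ["python", "tools/ai_qos.py", "--mode", "manual"]
    for name in TRAFFIC_TYPES:  # keep a stable flag order
        if name in priorities:
            argv += [f"--{name}", priorities[name]]
    return argv
```

For instance, `manual_mode_argv({"SDMA_H2D": "high"})` yields a command that sets only the `SDMA_H2D` priority; the resulting list can be passed to `subprocess.run`.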
In manual mode, you do not have to set every flag; you can specify the priority of a single traffic type only. The parameters are described as follows:
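| Name | Type | Default | Description |
|---|---|---|---|
| mode | str | auto | AI QoS mode. The default is "auto"; the other mode is "manual", which requires additional parameters. |
| qos_manual_config | / | AIV_D2D: high, | Parameters for "manual" mode; they determine the QoS priority of each traffic type. |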
How to disable AI QoS:
```bash
python tools/ai_qos.py unset
```
The command for disabling the AI QoS feature on the UB switch is printed on the screen. Log in to the UB switch and run the printed command to finish disabling the feature.
## Constraints
Due to underlying driver limitations, the QoS configurations for AIV_H2D and AIV_D2D do not take effect currently. Once the required adaptation capabilities are added in a future driver release, this feature will be delivered through a module upgrade.
The AI QoS feature supports the Atlas 800T A3 server and Atlas 900 A3 SuperPoD cluster. It must be used in privileged containers and requires the following software versions:
| Software | Required version |
|---|---|
| Ascend HDK | 25.5.2 or later |
| UB switch | LingQu Computing Network 1.5.1 or later |
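A version requirement of the form "X or later" can be checked with a simple tuple comparison. The sketch below assumes purely numeric, dot-separated version strings; how the installed version is obtained is environment-specific and not shown.

```python
def version_tuple(text):
    """Parse a dotted numeric version such as '25.5.2' into (25, 5, 2)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is the minimum or later."""
    return version_tuple(installed) >= version_tuple(minimum)
```

Comparing tuples of integers avoids the string-comparison pitfall where "25.10.0" would sort before "25.5.2".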