Patch in vLLM Ascend

vLLM Ascend is a platform plugin for vLLM. Because vLLM and vLLM Ascend have different release cycles and different hardware limitations, we need to patch some code in vLLM to make it compatible with vLLM Ascend.

In vLLM Ascend code, we provide a patch module vllm_ascend/patch to adapt to changes in vLLM.

Principle#

Keep in mind that patching is not the best way to make vLLM compatible with vLLM Ascend. It is only a temporary solution. The best way is to contribute the change to vLLM itself, so that it is compatible with vLLM Ascend from the start. vLLM Ascend follows these basic principles for its patch strategy:

  1. Less is more. Do not patch unless it is currently the only way.

  2. Once a patch is added, the plan for removing it in the future must be described.

  3. Cleaning up patch code is welcome at any time.

How it works#

In the vllm_ascend/patch directory, you can see the following code structure:

vllm_ascend
├── patch
│   ├── platform
│   │   ├── patch_xxx.py
│   ├── worker
│   │   ├── patch_yyy.py
└───────────
  • platform: The patch code in this directory patches code in the vLLM main process. It is called by vllm_ascend/platform::NPUPlatform::pre_register_and_update at a very early stage, when vLLM is initialized.

    • For online mode, the vLLM process calls the platform patch in vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args when parsing the CLI arguments.

    • For offline mode, the vLLM process calls the platform patch in vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config when parsing the input parameters.

  • worker: The patch code in this directory patches code in the vLLM worker process. It is called by vllm_ascend/worker/worker_v1::NPUWorker::__init__ when the vLLM worker process is initialized.

    • For both online and offline mode, vLLM engine core process calls the worker patch in vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker when initializing the worker process.
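The two patch phases can be illustrated with a small, self-contained sketch. It does not import vLLM or vLLM Ascend: `upstream`, `FakePlatform`, `FakeWorker`, and the two `apply_*` helpers are hypothetical stand-ins, and everything runs in a single process, whereas the real platform and worker patches run in different processes. The sketch only shows the ordering idea: platform patches take effect first, and worker patches take effect when the worker is constructed.

```python
import types

# Hypothetical stand-in for an upstream vLLM module; not a real vLLM API.
upstream = types.ModuleType("fake_upstream")
upstream.parse_args = lambda: "upstream parse_args"
upstream.run_model = lambda: "upstream run_model"


def apply_platform_patches():
    """Stand-in for importing the patches under patch/platform."""
    upstream.parse_args = lambda: "patched parse_args"


def apply_worker_patches():
    """Stand-in for importing the patches under patch/worker."""
    upstream.run_model = lambda: "patched run_model"


class FakePlatform:
    """Plays the role of NPUPlatform in this sketch."""

    @classmethod
    def pre_register_and_update(cls):
        # Early initialization: only the platform patches are applied here.
        apply_platform_patches()


class FakeWorker:
    """Plays the role of NPUWorker in this sketch."""

    def __init__(self):
        # Worker initialization: the worker patches are applied here.
        apply_worker_patches()


def demo():
    """Return the observable state before and after each phase."""
    states = [(upstream.parse_args(), upstream.run_model())]
    FakePlatform.pre_register_and_update()
    states.append((upstream.parse_args(), upstream.run_model()))
    FakeWorker()
    states.append((upstream.parse_args(), upstream.run_model()))
    return states
```

Calling `demo()` returns three snapshots: nothing patched, only the platform target patched, then both targets patched.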

How to write a patch#

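Before writing a patch, follow the principles above and change as little code as possible. If necessary, we can modify the code in the platform and worker folders. Below is an example that patches the distributed module in vLLM.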

  1. Decide which versions of vLLM to patch. In this example, analysis shows that both 0.10.0 and the main branch of vLLM need the patch.

  2. Decide which process to patch. In this example, distributed belongs to the vLLM main process, so the patch goes in platform.

  3. Create the patch file in the right folder. The file should be named patch_{module_name}.py. In this example, it is vllm_ascend/patch/platform/patch_distributed.py.

  4. Write your patch code in the new file. Here is an example:

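    # Import the submodule explicitly so the attribute path below is loaded.
    import vllm.distributed.parallel_state
    
    def patch_destroy_model_parallel():
        # your patch code
        ...
    
    # Rebind the upstream function so lookups through the module use the patch.
    vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel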
    
  5. Import the patch file in __init__.py. In this example, add import vllm_ascend.patch.platform.patch_distributed into vllm_ascend/patch/platform/__init__.py.

  6. Add a description of the patch in vllm_ascend/patch/__init__.py. The description format is as follows:

    # ** File: <The patch file name> **
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #   1. `<The target patch module in vLLM>`
    #    Why:
    #       <Describe the reason why we need to patch>
    #    How:
    #       <Describe the way to patch>
    #    Related PR (if no, explain why):
    #       <Add a link to the related PR in vLLM. If there is no related PR, explain why>
    #    Future Plan:
    #       <Describe the future plan to remove the patch>
    
  7. Add unit tests and end-to-end (E2E) tests. Any code newly added to vLLM Ascend should also include unit tests and E2E tests. For more details, see the testing guide.
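A unit test for a patch of this kind might look like the following sketch. It is self-contained and does not import vLLM: `fake_parallel_state` is a hypothetical stand-in for the patched module, and `apply_patch` is a hypothetical helper that performs the same assignment the patch file would perform at import time. The test checks that the target attribute is rebound to the patch function, and restores the original afterwards so other tests are not affected.

```python
import types
import unittest

# Hypothetical stand-in for the upstream module being patched.
fake_parallel_state = types.ModuleType("fake_parallel_state")


def _original_destroy_model_parallel():
    return "original"


fake_parallel_state.destroy_model_parallel = _original_destroy_model_parallel


def patch_destroy_model_parallel():
    # Placeholder body; a real patch would contain the replacement logic.
    return "patched"


def apply_patch(module):
    """Hypothetical helper: rebind the target the way the patch file does."""
    module.destroy_model_parallel = patch_destroy_model_parallel


class TestPatchDistributed(unittest.TestCase):
    def setUp(self):
        # Remember the original so the patch does not leak into other tests.
        self._saved = fake_parallel_state.destroy_model_parallel

    def tearDown(self):
        fake_parallel_state.destroy_model_parallel = self._saved

    def test_patch_replaces_target(self):
        apply_patch(fake_parallel_state)
        self.assertIs(
            fake_parallel_state.destroy_model_parallel,
            patch_destroy_model_parallel,
        )
        self.assertEqual(fake_parallel_state.destroy_model_parallel(), "patched")
```

Saving and restoring the original attribute in `setUp` and `tearDown` keeps the monkey patch local to each test, which matters because a module-level assignment otherwise persists for the whole test session.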

Limitations#

  1. In the V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process, and the Worker process. By default, vLLM Ascend can only patch code in the Main process and the Worker process. To patch code that runs in the EngineCore process, you must patch the EngineCore process entirely during setup. The relevant code is in vllm.v1.engine.core; override EngineCoreProc and DPEngineCoreProc entirely.

  2. If you run an edited copy of vLLM, its version string may change automatically. For example, an edited build based on v0.9.n may report its version as v0.9.nxxx. The patch for v0.9.n in vLLM Ascend then does not work as expected, because vLLM Ascend cannot tell which version of vLLM you are using. In this case, set the environment variable VLLM_VERSION to the version of vLLM you are using, and the patch for v0.9.n should then work.
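The effect of the VLLM_VERSION override can be sketched as follows. This is an illustration under stated assumptions, not vLLM Ascend's actual implementation: `resolve_vllm_version`, `patches_for`, the `PATCHES` registry, and the version strings are all hypothetical, and the sketch assumes patches are selected by an exact match on the version string.

```python
import os


def resolve_vllm_version(installed_version, environ=None):
    """Prefer VLLM_VERSION from the environment over the installed version."""
    environ = os.environ if environ is None else environ
    return environ.get("VLLM_VERSION") or installed_version


def patches_for(version, patches_by_version):
    """Look up patches by exact version string; unknown versions get none."""
    return patches_by_version.get(version, [])


# Hypothetical registry and version strings, for illustration only.
PATCHES = {"0.9.1": ["patch_distributed"]}

# An edited build reports a modified version string.
edited = "0.9.1.dev3+local"

# Without the override, the modified string matches no registered patch.
without_override = patches_for(resolve_vllm_version(edited, {}), PATCHES)

# With the override, the intended version is used and the patch is selected.
with_override = patches_for(
    resolve_vllm_version(edited, {"VLLM_VERSION": "0.9.1"}), PATCHES
)
```

Passing the environment as a plain dictionary keeps the sketch testable without modifying the real process environment.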