Mixtral-8x7B-Instruct-v0.1#

Introduction#

Mixtral-8x7B-Instruct-v0.1 is a state-of-the-art mixture-of-experts (MoE) language model developed by Mistral AI. It features 8 expert models, each with 7B parameters, and is specifically fine-tuned for instruction following tasks.

Key features of Mixtral-8x7B-Instruct-v0.1 include:

  • 8x7B parameters with sparse activation (only 2 experts activated per token)

  • Strong performance across various NLP tasks

  • Support for extended context length

  • High-quality instruction following capabilities

This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation.

The Mixtral-8x7B-Instruct-v0.1 model is supported in vllm-ascend.

Environment Preparation#

Model Weight#

  • Mixtral-8x7B-Instruct-v0.1(BF16 version): Download model weight

  • Quantized versions may be available from third-party providers.

It is recommended to download the model weight to a local directory, such as /data/models/.

Installation#

You can use our official docker image to run Mixtral-8x7B-Instruct-v0.1 directly.

Select an image based on your machine type and start the docker image on your node, refer to using docker.

# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note you should download the weight to /root/.cache in advance.
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.20.2rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

Deployment#

Single-node Deployment#

  • Mixtral-8x7B-Instruct-v0.1 can be deployed on 1 Atlas 800 A3 (64G × 16) or 1 Atlas 800 A2 (64G × 8).

Run the following script to execute online inference.

export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve "mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --tensor-parallel-size 4 \
  --max-model-len 4096 \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --block-size 128 \
  --gpu-memory-utilization 0.7

Notice: The parameters are explained as follows:

  • Setting the environment variable VLLM_ASCEND_BALANCE_SCHEDULING=1 enables balance scheduling. This may help increase output throughput and reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios.

  • --max-model-len specifies the maximum context length - that is, the sum of input and output tokens for a single request. For testing purposes, a value of 4096 is used here.

  • --dtype float16 specifies the data type for model weights and computations.

  • --trust-remote-code allows loading models with custom code.

  • --enforce-eager forces the use of eager execution mode instead of graph compilation, which can be more stable for some models.

  • --block-size specifies the block size for KV cache management, with a value of 128 used here.

  • --gpu-memory-utilization sets the proportion of NPU memory to use for the model, with a value of 0.7 used here to reduce memory usage.

Functional Verification#

Once your server is started, you can query the model with input prompts. Mixtral-8x7B-Instruct-v0.1 uses a specific prompt format with [INST] and [/INST] tags:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [
            {"role": "user", "content": "你好,介绍一下你自己"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }'

For instruction following tasks, you can use prompts like:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [
            {"role": "user", "content": "扮演一位资深架构师,评价一下在昇腾 Atlas A2 上部署 vLLM 的优势。"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }'

For MoE-related questions:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [
            {"role": "user", "content": "简单解释一下为什么 Mixtral 模型被称为\"混合专家模型\"(MoE)?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }'

Using AISBench#

  1. Refer to Using AISBench for details.

  2. After execution, you can get the result. For reference, Mixtral-8x7B-Instruct-v0.1 typically performs well on various benchmarks including reasoning, comprehension, and instruction following tasks.

Performance#

Using AISBench#

Refer to Using AISBench for performance evaluation for details.

Using vLLM Benchmark#

Run performance evaluation of Mixtral-8x7B-Instruct-v0.1 as an example.

Refer to vllm benchmark for more details.

There are three vllm bench subcommands:

  • latency: Benchmark the latency of a single batch of requests.

  • serve: Benchmark the online serving throughput.

  • throughput: Benchmark offline inference throughput.

Take the serve as an example. First, start the server:

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 512 \
    --dtype float16 \
    --trust-remote-code \
    --enforce-eager \
    --block-size 128 \
    --gpu-memory-utilization 0.7

Conclusion#

Mixtral-8x7B-Instruct-v0.1 is a powerful MoE model that offers excellent performance for instruction following tasks. With proper deployment on Ascend hardware using vllm-ascend, you can achieve high throughput and low latency for your AI applications.

For more details about model capabilities and best practices, refer to the official Mixtral documentation and vllm-ascend user guide.