vllm_gaudi.extension.unified
¶
CacheUtils
¶
Helper utilities for kv-cache
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `is_mla` | | If True, cache stores MLA latent vectors (no head dimension, single cache). If False, standard attention with per-head K/V caches. | `False` |
Source code in vllm_gaudi/extension/unified.py
__init__
¶
Source code in vllm_gaudi/extension/unified.py
_fetch_all
¶
Fetch both keys and values using the selected function
Source code in vllm_gaudi/extension/unified.py
_fetch_single_shared
¶
Fetch selected shared blocks from given cache
Source code in vllm_gaudi/extension/unified.py
_fetch_single_unique
¶
Fetch selected unique blocks from given cache
Source code in vllm_gaudi/extension/unified.py
fetch_shared
¶
HPUUnifiedAttentionMetadata
dataclass
¶
Source code in vllm_gaudi/extension/unified.py
__init__
¶
__init__(
block_size: int,
slot_mapping: tensor,
causal_bias: Optional[tensor],
causal_width: int,
shared_blocks: Optional[tensor],
shared_bias: Optional[tensor],
shared_bias_chunked: Optional[
SharedBlockChunkedBiasData
],
shared_chunk_size: int,
unique_blocks: Optional[tensor] | Optional[int],
unique_block_mapping: Optional[tensor],
unique_bias: Optional[tensor],
fmin: tensor,
feps: tensor,
inputL_hpu_tensors: Optional[Dict[tuple, Tensor]],
inputM_hpu_tensors: Optional[Dict[tuple, Tensor]],
online_merge: bool,
split_graphs: bool,
) -> None
num_blocks
¶
Source code in vllm_gaudi/extension/unified.py
SharedBlockChunkedBiasData
dataclass
¶
Data needed to compute shared block bias per-chunk during chunked attention.
This avoids materializing the full [query_len, num_shared_blocks, block_size] bias tensor, which can be prohibitively large when there are many shared blocks.
Contains dense block_usages of shape (num_query_tokens, num_shared_blocks). During chunked attention, we slice block_usages[:, chunk_start:chunk_end] and generate bias for each chunk on-the-fly.
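As a rough illustration of per-chunk bias generation, the sketch below builds the bias for one chunk of blocks from a dense usage matrix. It assumes, and this is not confirmed by the text above, that `block_usages[q, b]` holds the number of slots in block `b` visible to query token `q` (0 meaning the block is unused). NumPy stands in for HPU tensors, and `chunk_bias` and its `fmin` default are hypothetical names and values.

```python
import numpy as np

def chunk_bias(block_usages, chunk_start, chunk_end, block_size, fmin=-1e30):
    """Build the bias for blocks [chunk_start, chunk_end) only.

    ASSUMPTION: block_usages[q, b] is the count of visible slots in block b
    for query token q. The real layout in unified.py may differ.
    """
    usages = block_usages[:, chunk_start:chunk_end]        # [query_len, chunk_blocks]
    slot = np.arange(block_size)                           # [block_size]
    visible = slot[None, None, :] < usages[:, :, None]     # [query_len, chunk_blocks, block_size]
    # Visible slots get zero bias, masked slots get a large negative value.
    return np.where(visible, 0.0, fmin)

block_usages = np.array([[4, 2, 0],
                         [4, 4, 1]])
bias = chunk_bias(block_usages, chunk_start=1, chunk_end=3, block_size=4)
```

Only a `[query_len, chunk_blocks, block_size]` slice exists at any one time, which is the memory saving described above.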
Source code in vllm_gaudi/extension/unified.py
_partial_attn_shared_chunked
¶
_partial_attn_shared_chunked(
query: tensor,
blocks: tensor,
bias: Optional[tensor],
chunked_data: SharedBlockChunkedBiasData,
chunk_size: int,
fmin: tensor,
inputL_hpu_tensors: Dict[tuple, Tensor],
inputM_hpu_tensors: Dict[tuple, Tensor],
cache_utils: CacheUtils,
dtype: dtype,
w_uv: Optional[tensor] = None,
) -> tuple[tensor, tensor, tensor]
Chunked implementation of partial_attn_shared with per-chunk bias generation.
Generates bias per chunk from dense block_usages to save memory. Avoids materializing the full (query_len, num_blocks, block_size) bias tensor.
Strategy:
1. Process blocks in chunks of chunk_size
2. For each chunk, slice block_usages and generate chunk bias on-the-fly
3. Compute attention for the chunk using _partial_attn_shared_core
4. Merge chunk results using flash-attention style online softmax
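The strategy above can be sketched for a single attention head as follows. This is a simplified illustration rather than the code in `unified.py`: NumPy replaces HPU tensors, the bias is passed in precomputed and sliced per chunk, and `partial_attention` and `chunked_attention` are hypothetical helper names.

```python
import numpy as np

def partial_attention(q, k, v, bias):
    """Return (unnormalized weighted V, row max, row sum of exp) for one KV chunk."""
    scores = q @ k.T + bias                      # [tokens, chunk_len]
    local_max = scores.max(axis=-1)              # [tokens]
    p = np.exp(scores - local_max[:, None])      # stabilized exponentials
    return p @ v, local_max, p.sum(axis=-1)

def chunked_attention(q, k, v, bias, chunk_len):
    """Process KV in chunks and merge results with online-softmax rescaling."""
    acc = run_max = run_sum = None
    for start in range(0, k.shape[0], chunk_len):
        end = start + chunk_len
        attn, cmax, csum = partial_attention(q, k[start:end], v[start:end],
                                             bias[:, start:end])
        if acc is None:
            acc, run_max, run_sum = attn, cmax, csum
            continue
        new_max = np.maximum(run_max, cmax)
        old_scale = np.exp(run_max - new_max)    # rescale what was accumulated so far
        new_scale = np.exp(cmax - new_max)       # rescale the incoming chunk
        acc = acc * old_scale[:, None] + attn * new_scale[:, None]
        run_sum = run_sum * old_scale + csum * new_scale
        run_max = new_max
    return acc / run_sum[:, None]                # normalize once at the end
```

Using a large finite negative value for masked positions (instead of `-inf`) keeps a fully masked chunk from producing `nan` in the rescaling step.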
Source code in vllm_gaudi/extension/unified.py
_partial_attn_shared_core
¶
_partial_attn_shared_core(
query: tensor,
key: tensor,
value: tensor,
bias: tensor,
fmin: tensor,
inputL_hpu_tensors: Dict[tuple, Tensor],
inputM_hpu_tensors: Dict[tuple, Tensor],
kv_heads: int,
is_mla: bool,
w_uv: Optional[tensor] = None,
) -> tuple[tensor, tensor, tensor]
Core shared attention computation.
This is the inner loop extracted for reuse between full and chunked paths.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `tensor` | Query tensor, already transposed [kv_heads, q_heads_per_kv, tokens, head_dim] or similar | required |
| `key` | `tensor` | Key tensor from cache [kv_heads, q_heads_per_kv, kv_len, head_dim] | required |
| `value` | `tensor` | Value tensor from cache | required |
| `bias` | `tensor` | Attention bias [1, kv_len] (already flattened from [num_blocks, block_size]) | required |
| `fmin` | `tensor` | Minimum float for softmax stability | required |
| `inputL_hpu_tensors` | `Dict[tuple, Tensor]` | Cache for FA2 tensors | required |
| `inputM_hpu_tensors` | `Dict[tuple, Tensor]` | Cache for FA2 tensors | required |
| `kv_heads` | `int` | Number of KV heads | required |
| `is_mla` | `bool` | Whether using MLA attention | required |
| `w_uv` | `Optional[tensor]` | Optional MLA projection matrix | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[tensor, tensor, tensor]` | Tuple of (unnormalized_weighted_V, local_max, local_sum) |
Source code in vllm_gaudi/extension/unified.py
_partial_attn_shared_full
¶
_partial_attn_shared_full(
query: tensor,
blocks: tensor,
bias: tensor,
fmin: tensor,
inputL_hpu_tensors: Dict[tuple, Tensor],
inputM_hpu_tensors: Dict[tuple, Tensor],
cache_utils: CacheUtils,
w_uv: Optional[tensor] = None,
) -> tuple[tensor, tensor, tensor]
Full bias implementation of partial_attn_shared.
Source code in vllm_gaudi/extension/unified.py
block2batch
¶
convert_cl_aligned_tensor
¶
convert_cl_aligned_tensor(
input_hpu, reference_size
) -> tensor
Convert a CL-aligned tensor to the reference size
Source code in vllm_gaudi/extension/unified.py
create_softmax_fa2_input_tensors
¶
create_softmax_fa2_input_tensors(
attn: tensor,
fmin: tensor,
inputL_hpu_tensors: Dict[tuple, Tensor],
inputM_hpu_tensors: Dict[tuple, Tensor],
) -> tuple[tensor, tensor]
Create dummy input tensors for the softmax_fa2 operation.
Source code in vllm_gaudi/extension/unified.py
get_last_dim_size
¶
get_vecsize_packsize
¶
Get vecsize and packsize for given dtype
merge
¶
Merge partial attention values into final attn score
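For intuition, a single-pass merge of several partial results can be sketched like this. It is an assumption-laden illustration, not the library's implementation: partials are taken to be `(attn, max, sum)` tuples shaped `[tokens, heads, head_dim]`, `[tokens, heads]`, `[tokens, heads]`, NumPy replaces HPU tensors, and `merge_offline` and its `feps` default are hypothetical.

```python
import numpy as np

def merge_offline(partials, feps=1e-9):
    """Combine (attn, max, sum) partials in one pass; None entries are skipped."""
    partials = [p for p in partials if p is not None]
    if not partials:
        return None
    attns, maxes, sums = zip(*partials)
    global_max = np.max(np.stack(maxes), axis=0)            # [tokens, heads]
    scales = [np.exp(m - global_max) for m in maxes]         # per-partial rescale factors
    total_attn = sum(a * sc[..., None] for a, sc in zip(attns, scales))
    total_sum = sum(s * sc for s, sc in zip(sums, scales))
    # Guard the division with a small epsilon (assumed role of feps).
    return total_attn / np.maximum(total_sum, feps)[..., None]
```

All partials have to be alive at once here, which is the memory cost an incremental merge avoids.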
Source code in vllm_gaudi/extension/unified.py
online_merge
¶
Merge partial attention values using online (incremental) algorithm.
Alternative to merge() that uses online_merge_step for incremental merging. This approach is more memory efficient as it doesn't need to keep all intermediate results simultaneously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `attn_results` | `tuple[tensor, tensor, tensor]` | Variable number of (attn, max, sum) tuples | `()` |
| `feps` | `tensor` | Small epsilon for numerical stability | required |
Returns:
| Type | Description |
|---|---|
| `Optional[tensor]` | Final normalized attention output, or None if all inputs are None |
Source code in vllm_gaudi/extension/unified.py
online_merge_step
¶
online_merge_step(
acc_attn: Optional[tensor],
acc_max: Optional[tensor],
acc_sum: Optional[tensor],
new_attn: Optional[tensor],
new_max: Optional[tensor],
new_sum: Optional[tensor],
) -> tuple[
Optional[tensor], Optional[tensor], Optional[tensor]
]
Incrementally merge attention results using flash-attention style rescaling.
This implements the online softmax algorithm where we maintain running unnormalized weighted values, max, and sum. The final normalization (dividing by sum) is done at the end.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `acc_attn` | `Optional[tensor]` | Accumulated unnormalized weighted V [tokens, heads, head_dim] or None | required |
| `acc_max` | `Optional[tensor]` | Accumulated max values [tokens, heads] or None | required |
| `acc_sum` | `Optional[tensor]` | Accumulated sum of exp values [tokens, heads] or None | required |
| `new_attn` | `Optional[tensor]` | New unnormalized weighted V to merge | required |
| `new_max` | `Optional[tensor]` | New max values to merge | required |
| `new_sum` | `Optional[tensor]` | New sum of exp values to merge | required |
Returns:
| Type | Description |
|---|---|
| `tuple[Optional[tensor], Optional[tensor], Optional[tensor]]` | Tuple of (merged_attn, merged_max, merged_sum) |
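A minimal sketch of one merge step, using the shapes from the parameter table, might look as follows. The None handling shown here (pass through whichever side is present) is an assumption about the intended behavior, NumPy replaces HPU tensors, and `merge_step` is a hypothetical stand-in for the real function.

```python
import numpy as np

def merge_step(acc_attn, acc_max, acc_sum, new_attn, new_max, new_sum):
    """Fold one (attn, max, sum) partial into the running accumulator."""
    if new_attn is None:                  # nothing to add (assumed behavior)
        return acc_attn, acc_max, acc_sum
    if acc_attn is None:                  # first real partial becomes the accumulator
        return new_attn, new_max, new_sum
    merged_max = np.maximum(acc_max, new_max)                # [tokens, heads]
    acc_scale = np.exp(acc_max - merged_max)
    new_scale = np.exp(new_max - merged_max)
    merged_attn = (acc_attn * acc_scale[..., None]
                   + new_attn * new_scale[..., None])        # [tokens, heads, head_dim]
    merged_sum = acc_sum * acc_scale + new_sum * new_scale   # [tokens, heads]
    return merged_attn, merged_max, merged_sum
```

After the last step, the caller divides `merged_attn` by `merged_sum[..., None]` to obtain the normalized attention output.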
Source code in vllm_gaudi/extension/unified.py
optional
¶
Wrap an operation to support handling None values
Source code in vllm_gaudi/extension/unified.py
partial_attn_causal
¶
partial_attn_causal(
query: tensor,
key: tensor,
value: tensor,
bias: Optional[tensor],
slice_size: int,
fmin: tensor,
inputL_hpu_tensors: Dict[tuple, Tensor],
inputM_hpu_tensors: Dict[tuple, Tensor],
w_uv: Optional[tensor] = None,
) -> tuple[tensor, tensor, tensor]
Partial attention where qkv are assumed to be causal between slices
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `w_uv` | `Optional[tensor]` | Optional MLA projection matrix [num_heads, latent_dim, v_head_dim]. If provided, value is assumed to be in latent space and will be projected. | `None` |
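To illustrate what projecting latent-space values with `w_uv` means, the sketch below applies a per-head projection of shape [num_heads, latent_dim, v_head_dim] to an attention result in latent space. It assumes the projection is applied after the weighted sum, which gives the same result as projecting every value first because the operation is linear. NumPy stands in for HPU tensors and `project_latent` is a hypothetical helper.

```python
import numpy as np

def project_latent(attn_latent, w_uv):
    """Map [tokens, num_heads, latent_dim] to [tokens, num_heads, v_head_dim]."""
    # One matrix per head: contract the latent dimension against w_uv[h].
    return np.einsum("thl,hlv->thv", attn_latent, w_uv)

rng = np.random.default_rng(0)
tokens, heads, kv_len, latent_dim, v_head_dim = 2, 3, 5, 8, 4
weights = rng.random((tokens, heads, kv_len))        # attention weights per head
latent_values = rng.normal(size=(kv_len, latent_dim))
w_uv = rng.normal(size=(heads, latent_dim, v_head_dim))

attn_latent = weights @ latent_values                # weighted sum in latent space
projected = project_latent(attn_latent, w_uv)
```

Projecting once per output token, rather than once per cached value, is the usual reason to keep values in latent space until the end.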
Source code in vllm_gaudi/extension/unified.py
partial_attn_shared
¶
partial_attn_shared(
query: tensor,
blocks: tensor,
bias: Optional[tensor],
fmin: tensor,
inputL_hpu_tensors: Dict[tuple, Tensor],
inputM_hpu_tensors: Dict[tuple, Tensor],
cache_utils: CacheUtils,
dtype: dtype,
w_uv: Optional[tensor] = None,
chunked_data: Optional[
SharedBlockChunkedBiasData
] = None,
chunk_size: int = 0,
) -> tuple[tensor, tensor, tensor]
Partial attention where all shared blocks are compared with whole query.
Supports two modes:
1. Full bias mode (default): bias tensor is provided, process all blocks at once
2. Chunked mode: chunk_size > 0, process blocks in chunks
   - If bias is provided, slice from it
   - If bias is None but chunked_data is provided, generate bias per chunk from dense block_usages
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `tensor` | Query tensor [tokens, num_heads, head_dim] | required |
| `blocks` | `tensor` | Shared block indices [num_shared_blocks] | required |
| `bias` | `Optional[tensor]` | Pre-computed bias tensor [query_len, num_blocks, block_size]. Can be None for chunked generation. | required |
| `fmin` | `tensor` | Minimum float value for softmax stability | required |
| `inputL_hpu_tensors` | `Dict[tuple, Tensor]` | Cache for softmax input tensors | required |
| `inputM_hpu_tensors` | `Dict[tuple, Tensor]` | Cache for softmax input tensors | required |
| `cache_utils` | `CacheUtils` | Cache utilities for fetching KV | required |
| `dtype` | `dtype` | Output dtype for bias generation | required |
| `w_uv` | `Optional[tensor]` | Optional MLA projection matrix [num_heads, latent_dim, v_head_dim] | `None` |
| `chunked_data` | `Optional[SharedBlockChunkedBiasData]` | Metadata for chunked processing (contains dense block_usages) | `None` |
| `chunk_size` | `int` | Number of blocks per chunk (0 = full mode, >0 = chunked mode) | `0` |
Returns:
| Type | Description |
|---|---|
| `tuple[tensor, tensor, tensor]` | Tuple of (unnormalized_weighted_V, local_max, local_sum) |
Source code in vllm_gaudi/extension/unified.py
partial_attn_unique
¶
partial_attn_unique(
query: tensor,
blocks: tensor,
block_mapping: tensor,
bias: Optional[tensor],
fmin: tensor,
cache_utils: CacheUtils,
w_uv: Optional[tensor] = None,
) -> tuple[tensor, tensor, tensor]
Partial attention where each block is used by at most one query
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `w_uv` | `Optional[tensor]` | Optional MLA projection matrix [num_heads, latent_dim, v_head_dim]. If provided, assumes MLA mode where query/key/value are in latent space. | `None` |
Source code in vllm_gaudi/extension/unified.py
reduce_max
¶
Reduce local block maxima to per-group maximum
Source code in vllm_gaudi/extension/unified.py
unified_attn
¶
unified_attn(
query: tensor,
key: tensor,
value: tensor,
key_cache: tensor,
value_cache: tensor,
scale: float,
metadata: HPUUnifiedAttentionMetadata,
) -> tensor
Main entry point for unified attention
Source code in vllm_gaudi/extension/unified.py
unified_mla
¶
unified_mla(
query: Optional[tensor],
key: Optional[tensor],
value: Optional[tensor],
latent_cache: tensor,
scale: float,
metadata: HPUUnifiedAttentionMetadata,
w_uv: tensor,
query_latent: Optional[tensor] = None,
) -> tensor
Main entry point for Unified MLA
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Optional[tensor]` | Query tensor for causal path (already uncompressed) [tokens, num_heads, qk_head_dim]. None if only cached attention is needed. | required |
| `key` | `Optional[tensor]` | Key tensor for causal part [tokens, num_heads, qk_head_dim]. None for cached-only. | required |
| `value` | `Optional[tensor]` | Value tensor for causal part in latent space [tokens, num_heads, latent_dim]. None for cached-only. | required |
| `latent_cache` | `tensor` | Cached latent KV [num_blocks * block_size, latent_dim + rope_dim] | required |
| `scale` | `float` | Attention scale factor | required |
| `metadata` | `HPUUnifiedAttentionMetadata` | Unified attention metadata | required |
| `w_uv` | `tensor` | Projection matrix from latent to full V [num_heads, latent_dim, v_head_dim] | required |
| `query_latent` | `Optional[tensor]` | Query tensor for cached path (in latent space) [tokens, num_heads, latent_dim + rope_dim]. None if only causal attention is needed. | `None` |
| `use_online_merge` | | If True, use online (incremental) merge algorithm. Merges after each partial attention to avoid large intermediate buffers. If False, use offline (single-pass) merge algorithm. | required |
Returns:
| Type | Description |
|---|---|
| `tensor` | Attention output [tokens, num_heads * v_head_dim] |
Note
- For causal-only: pass query/key/value, set query_latent=None
- For cached-only: pass query_latent, set query/key/value=None
- For mixed batches: pass both query and query_latent
Source code in vllm_gaudi/extension/unified.py