vllm_gaudi.extension.bucketing.padding_aware
PaddingAwareBucketingStrategy
Source code in vllm_gaudi/extension/bucketing/padding_aware.py
get_decode_cfgs
get_prompt_cfgs
read_bucket_settings
Read bucketing configuration from env variables.
- phase is either 'prompt' or 'decode'
- dim is either 'bs', 'query' or 'block'
- param is either 'min', 'step', 'max', 'pad_max' or 'pad_percent'

Example env variable: VLLM_DECODE_BS_BUCKET_STEP=128
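The naming scheme can be illustrated with a small sketch. It assumes the variable name is built as `VLLM_<PHASE>_<DIM>_BUCKET_<PARAM>` in upper case, which is inferred only from the single example in the docstring. The helpers `bucket_env_name` and `read_bucket_setting_sketch`, and the `default` argument, are hypothetical and not part of vllm_gaudi.

```python
import os


def bucket_env_name(phase: str, dim: str, param: str) -> str:
    """Build the env variable name (pattern assumed from the docstring example)."""
    return f"VLLM_{phase}_{dim}_BUCKET_{param}".upper()


def read_bucket_setting_sketch(phase: str, dim: str, param: str, default: int) -> int:
    """Hypothetical helper: return the integer value of one setting, or `default` if unset."""
    raw = os.environ.get(bucket_env_name(phase, dim, param))
    return default if raw is None else int(raw)


# The docstring example maps to phase='decode', dim='bs', param='step'.
print(bucket_env_name("decode", "bs", "step"))
```

How the real function chooses defaults when a variable is unset is not documented here, so the `default` argument is only a stand-in.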
warmup_range_with_limits
Generate a warmup range with absolute and relative padding limits.
- Start from `bucket_min` and multiply by 2 (or add 1 when the value is 0) until `bucket_step` is reached.
- Add `bucket_step` to the value until `bucket_max` is reached, and keep the current bucket if:
  a. the next bucket exceeds the absolute padding limit `pad_max`,
  b. or the next bucket exceeds the padding ratio limit `pad_percent`,
  c. or the current bucket is a multiple of `pad_max`.
- Always include `bucket_max` as the last bucket.
Examples, with config = (bucket_min, bucket_step, bucket_max, pad_max, pad_percent):

1. For config = (0, 8, 64, 64, 0), fall back to linear bucketing without padding limits:
    ramp_up = [0, 1, 2, 4, 8]
    stable = [16, 24, 32, 40, 48, 56, 64]
    return [0, 1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]
2. For config = (0, 8, 64, 64, 50), fall back to exponential bucketing:
    ramp_up = [0, 1, 2, 4, 8]
    stable = [16, 32, 64]  # [24, 40, 48, 56] are skipped due to padding ratio limit
    return [0, 1, 2, 4, 8, 16, 32, 64]
3. For config = (0, 8, 64, 16, 50):
    ramp_up = [0, 1, 2, 4, 8]
    stable = [16, 32, 48, 64]  # [24, 40, 56] are skipped due to absolute padding limit
    return [0, 1, 2, 4, 8, 16, 32, 48, 64]
4. For config = (16, 16, 128, 32, 25):
    stable = [16, 32, 48, 64, 80, 96, 128]  # no ramp up phase
    return [16, 32, 48, 64, 80, 96, 128]
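The rules above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the code in padding_aware.py. It assumes the config tuple is ordered (bucket_min, bucket_step, bucket_max, pad_max, pad_percent), and that the padding of a candidate is measured as the gap between the last kept bucket and the next candidate, minus one. The function name `warmup_range_with_limits_sketch` is hypothetical. Under these assumptions the sketch reproduces the four documented examples.

```python
from typing import List, Tuple


def warmup_range_with_limits_sketch(config: Tuple[int, int, int, int, int]) -> List[int]:
    """Illustrative sketch only; the padding formula is an assumption."""
    bucket_min, bucket_step, bucket_max, pad_max, pad_percent = config
    buckets: List[int] = []

    # Ramp-up: double the value (or go 0 -> 1) while it is below bucket_step.
    current = bucket_min
    while current < bucket_step and current < bucket_max:
        buckets.append(current)
        current = 1 if current == 0 else current * 2

    # Stable phase: walk candidates in increments of bucket_step.
    current = max(bucket_min, bucket_step)
    while current < bucket_max:
        nxt = min(current + bucket_step, bucket_max)
        if not buckets:
            keep = True  # nothing kept yet, so the first candidate is always kept
        else:
            # Worst-case padding if `current` were skipped (assumed definition).
            padding = nxt - buckets[-1] - 1
            keep = (
                padding > pad_max                            # a. absolute limit
                or padding * 100 > pad_percent * nxt         # b. ratio limit
                or (pad_max > 0 and current % pad_max == 0)  # c. multiple of pad_max
            )
        if keep:
            buckets.append(current)
        current = nxt

    # bucket_max is always the last bucket.
    buckets.append(bucket_max)
    return buckets


print(warmup_range_with_limits_sketch((0, 8, 64, 16, 50)))
# → [0, 1, 2, 4, 8, 16, 32, 48, 64]
```

The ratio check uses integer arithmetic (`padding * 100 > pad_percent * nxt`) to avoid floating-point comparisons at the threshold.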