Info
An analysis based on qwen2.py in the SGL repository.
Qwen2 Features
- Parameters
  - Hidden Size: 3584
  - Attention Heads: 28
  - KV Heads: 4
- GQA mechanism (see the sketch below)
- RoPE
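
With 28 attention heads and 4 KV heads, GQA means each KV head is shared by 28 / 4 = 7 query heads. A minimal sketch of that grouping (illustrative only; causal mask omitted, not SGL's actual kernel):

```python
import torch

num_q_heads, num_kv_heads, head_dim = 28, 4, 128  # head_dim = 3584 / 28
group_size = num_q_heads // num_kv_heads          # 7 query heads per KV head

seq_len = 16
q = torch.randn(seq_len, num_q_heads, head_dim)
k = torch.randn(seq_len, num_kv_heads, head_dim)
v = torch.randn(seq_len, num_kv_heads, head_dim)

# GQA: each KV head is reused by `group_size` query heads.
k_exp = k.repeat_interleave(group_size, dim=1)  # [16, 28, 128]
v_exp = v.repeat_interleave(group_size, dim=1)

scores = torch.einsum("qhd,khd->hqk", q, k_exp) / head_dim**0.5  # causal mask omitted
attn = torch.softmax(scores, dim=-1)
out = torch.einsum("hqk,khd->qhd", attn, v_exp)  # [16, 28, 128]
```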
Qwen2 Attention Mechanism
Code
```python
class Qwen2Attention(nn.Module):
    def __init__(
        self,
        hidden_size: int,
        num_heads: int,
        num_kv_heads: int,
        head_dim: Optional[int] = None,
        layer_id: int = 0,
        rope_theta: float = 1000000,
        rope_scaling: Optional[Dict[str, Any]] = None,
        max_position_embeddings: int = 32768,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ) -> None:
        super().__init__()
        self.hidden_size = hidden_size
        tp_size = get_tensor_model_parallel_world_size()
        self.total_num_heads = num_heads
        assert self.total_num_heads % tp_size == 0
        self.num_heads = self.total_num_heads // tp_size
        self.total_num_kv_heads = num_kv_heads
        if self.total_num_kv_heads >= tp_size:
            # Number of KV heads is greater than TP size, so we partition
            # the KV heads across multiple tensor parallel GPUs.
            assert self.total_num_kv_heads % tp_size == 0
        else:
            # Number of KV heads is less than TP size, so we replicate
            # the KV heads across multiple tensor parallel GPUs.
            assert tp_size % self.total_num_kv_heads == 0
        self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size)
        if head_dim is not None:
            self.head_dim = head_dim
        else:
            self.head_dim = hidden_size // self.total_num_heads
        self.q_size = self.num_heads * self.head_dim
        self.kv_size = self.num_kv_heads * self.head_dim
        self.scaling = self.head_dim**-0.5
        self.rope_theta = rope_theta
        self.max_position_embeddings = max_position_embeddings

        self.qkv_proj = QKVParallelLinear(
            hidden_size,
            self.head_dim,
            self.total_num_heads,
            self.total_num_kv_heads,
            bias=True,
            quant_config=quant_config,
            prefix=add_prefix("qkv_proj", prefix),
        )
        self.o_proj = RowParallelLinear(
            self.total_num_heads * self.head_dim,
            hidden_size,
            bias=False,
            quant_config=quant_config,
            prefix=add_prefix("o_proj", prefix),
        )
        self.rotary_emb = get_rope(
            self.head_dim,
            rotary_dim=self.head_dim,
            max_position=max_position_embeddings,
            base=rope_theta,
            rope_scaling=rope_scaling,
        )
        self.attn = RadixAttention(
            self.num_heads,
            self.head_dim,
            self.scaling,
            num_kv_heads=self.num_kv_heads,
            layer_id=layer_id,
            quant_config=quant_config,
            prefix=add_prefix("attn", prefix),
        )

    def forward(
        self,
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
        forward_batch: ForwardBatch,
    ) -> torch.Tensor:
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
        q, k = self.rotary_emb(positions, q, k)
        attn_output = self.attn(q, k, v, forward_batch)
        output, _ = self.o_proj(attn_output)
        return output
```
- Attention process
  - QKV projection (i.e., multiplying hidden_states by the fused QKV weight)
    - The input hidden_states has shape [batch_size * seq_len, hidden_size]; QKVParallelLinear projects it into a fused qkv tensor of shape [batch * seq_len, Q + K + V].
  - QKV split
    - qkv is split along the last dimension into [q_size, kv_size, kv_size]; a worked size example follows below.
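
    A small sketch of the split sizes using the Qwen2-7B-style numbers above; it assumes tp_size = 1 (so q_size = 28 * 128 = 3584 and kv_size = 4 * 128 = 512) and is not SGL code:

    ```python
    import torch

    # Assumed config (tp_size = 1): 28 Q heads, 4 KV heads, head_dim = 3584 / 28 = 128.
    num_heads, num_kv_heads, head_dim = 28, 4, 128
    q_size = num_heads * head_dim      # 3584
    kv_size = num_kv_heads * head_dim  # 512

    num_tokens = 8  # batch_size * seq_len, flattened
    qkv = torch.randn(num_tokens, q_size + 2 * kv_size)  # [8, 4608], i.e. the qkv_proj output shape

    # Same split as in Qwen2Attention.forward.
    q, k, v = qkv.split([q_size, kv_size, kv_size], dim=-1)
    print(q.shape, k.shape, v.shape)  # [8, 3584], [8, 512], [8, 512]
    ```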
  - Rotary position embedding (RoPE)
    - self.rotary_emb uses vLLM's get_rope; RoPE is applied only to the Q and K vectors, while V is left unchanged. get_rope is implemented in rotary_embedding.py.
    - At initialization a cos/sin cache is built, so the trigonometric functions do not have to be recomputed on every forward pass.
    - The rotation is then applied (lines 1386-1390); a minimal sketch follows below.
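
    A minimal sketch of the idea, assuming an interleaved-pair rotation (illustrative only; vLLM's actual get_rope supports several rotation layouts and scaling schemes):

    ```python
    import torch

    def build_rope_cache(max_position: int, rotary_dim: int, base: float = 1e6):
        # Precompute cos/sin once, as described above (illustrative, not vLLM's code).
        inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
        freqs = torch.outer(torch.arange(max_position).float(), inv_freq)  # [max_position, rotary_dim // 2]
        return freqs.cos(), freqs.sin()

    def apply_rope(x, positions, cos_cache, sin_cache):
        # x: [num_tokens, num_heads, head_dim]; rotate each (even, odd) pair of the head dim.
        cos = cos_cache[positions].unsqueeze(1)  # [num_tokens, 1, head_dim // 2]
        sin = sin_cache[positions].unsqueeze(1)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    cos_cache, sin_cache = build_rope_cache(max_position=32768, rotary_dim=128)
    positions = torch.tensor([0, 1, 2, 3])
    q = apply_rope(torch.randn(4, 28, 128), positions, cos_cache, sin_cache)
    k = apply_rope(torch.randn(4, 4, 128), positions, cos_cache, sin_cache)
    # V is intentionally not rotated.
    ```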
  - Attention computation and KV cache usage
    - Initialization uses GQA grouping; the per-GPU head counts are sketched below.
      - When total_num_kv_heads >= tp_size, the KV heads are partitioned across the tensor-parallel GPUs.
      - When total_num_kv_heads < tp_size, the KV heads are replicated across the tensor-parallel GPUs.
      - num_kv_heads = max(1, total_num_kv_heads // tp_size) ensures that every GPU gets at least one KV head.
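
    A small sketch of how the per-GPU KV head count comes out for a few hypothetical tp_size values, mirroring the branching in __init__:

    ```python
    def per_gpu_kv_heads(total_num_kv_heads: int, tp_size: int) -> int:
        # Mirrors the partition/replicate branching in Qwen2Attention.__init__.
        if total_num_kv_heads >= tp_size:
            assert total_num_kv_heads % tp_size == 0  # partition KV heads across GPUs
        else:
            assert tp_size % total_num_kv_heads == 0  # replicate KV heads across GPUs
        return max(1, total_num_kv_heads // tp_size)

    for tp_size in (1, 2, 4, 8):
        print(tp_size, per_gpu_kv_heads(4, tp_size))
    # -> tp_size 1: 4 KV heads, 2: 2, 4: 1, 8: 1 (each KV head replicated on two GPUs)
    ```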
    - self.attn uses SGL's own RadixAttention, which distinguishes between
      - Prefill
      - Decode (a simplified sketch of one decode step follows below)
        - It enters FlashInferAttnBackend and first fetches the decode_wrapper for the current layer: forward_metadata stores the per-layer metadata, and _get_wrapper_idx(layer) maps the layer id to its wrapper index.
        - Determine the cache location
          - For self-attention, out_cache_loc is used.
          - For cross-attention (e.g., an encoder-decoder architecture), encoder_out_cache_loc is used.
        - Store the current token's K and V into the cache (set_kv_buffer).
        - Use the decode wrapper to compute attention against the full KV cache.
          - Q is reshaped from [1, hidden_size] or [1, num_heads * head_dim] into [1, tp_q_head_num, head_dim] (the number of Q heads after tensor parallelism).
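
    A heavily simplified, pure-PyTorch sketch of what one decode step does conceptually (write the new token's K/V into its cache slot, then attend over the whole cache). The names k_cache / v_cache / out_cache_loc are stand-ins; this is not the FlashInfer or RadixAttention API:

    ```python
    import torch

    num_q_heads, num_kv_heads, head_dim = 28, 4, 128
    group = num_q_heads // num_kv_heads  # 7 query heads share each KV head (GQA)

    # Pre-allocated cache pools; slot 12 is reserved for the new token
    # (stand-ins for SGL's KV pool and out_cache_loc, hypothetical shapes/names).
    k_cache = torch.zeros(1024, num_kv_heads, head_dim)
    v_cache = torch.zeros(1024, num_kv_heads, head_dim)
    seq_len, out_cache_loc = 12, 12

    # One new token at decode time.
    q = torch.randn(1, num_q_heads * head_dim).view(1, num_q_heads, head_dim)  # reshape as in the note
    k_new = torch.randn(num_kv_heads, head_dim)  # current token's K, one vector per KV head
    v_new = torch.randn(num_kv_heads, head_dim)

    # "set_kv_buffer": write the new token's K/V into its cache slot.
    k_cache[out_cache_loc] = k_new
    v_cache[out_cache_loc] = v_new

    # Attend over the cached prefix plus the new token; expand KV heads for GQA.
    k_all = k_cache[: seq_len + 1].repeat_interleave(group, dim=1)  # [13, 28, 128]
    v_all = v_cache[: seq_len + 1].repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k_all) / head_dim**0.5
    attn_output = torch.einsum("hqk,khd->qhd", torch.softmax(scores, dim=-1), v_all)
    attn_output = attn_output.reshape(1, num_q_heads * head_dim)  # then fed into o_proj
    ```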
  - Output projection (o_proj; a shape-only sketch follows below)
    - ……
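
    For completeness, a shape-only sketch of the output projection, using a plain nn.Linear as a stand-in for RowParallelLinear with tp_size = 1 (the real layer shards the input dimension across TP ranks and all-reduces the result):

    ```python
    import torch
    import torch.nn as nn

    num_heads, head_dim, hidden_size = 28, 128, 3584
    o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)  # stand-in for RowParallelLinear (tp_size = 1)

    attn_output = torch.randn(5, num_heads * head_dim)  # [num_tokens, 3584]
    output = o_proj(attn_output)
    print(output.shape)  # torch.Size([5, 3584]) == [num_tokens, hidden_size]
    ```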