• Features
    • Decoder-only, suited for autoregressive generation
    • MLA (Multi-head Latent Attention) instead of standard MHA, which shrinks the KV cache
    • DeepSeek MoE
      • Experts: 1 shared + 256 routed
        • 8 routed experts activated per token
    • Hidden layers: 61 = 3 dense + 58 MoE
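A toy numpy sketch of why MLA reduces KV cache storage: instead of caching full per-head K and V, only a small shared latent vector is cached per token, and keys/values are reconstructed from it on the fly. All dimensions here are illustrative, not the real model's, and the decoupled RoPE key that the actual design also caches is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 1024       # toy hidden size (illustrative)
n_heads = 16         # toy head count
d_head = 64
d_latent = 64        # compressed KV latent dim (hypothetical toy value)

# Down-projection to a shared latent, plus up-projections for K and V
W_dkv = rng.normal(size=(d_model, d_latent)) * 0.02          # compress
W_uk = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # expand to keys
W_uv = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # expand to values

h = rng.normal(size=d_model)  # one token's hidden state

# MLA caches only the latent c_kv per token ...
c_kv = h @ W_dkv              # (d_latent,)
# ... and reconstructs keys/values from it when attention runs
k = c_kv @ W_uk
v = c_kv @ W_uv

# Standard MHA would cache full K and V: 2 * n_heads * d_head floats per token
mha_cache = 2 * n_heads * d_head   # 2048 floats per token
mla_cache = d_latent               # 64 floats per token
print(mla_cache / mha_cache)       # 0.03125, i.e. ~32x smaller in this toy setup
```

The cache-size ratio depends entirely on the chosen latent dimension; the real ratio in DeepSeek-V3 differs from this toy example.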
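The expert layout above (1 shared + 256 routed, 8 routed per token) can be sketched as a minimal top-k router. This is a simplified illustration with made-up toy weights and single-linear "experts"; it ignores the real gating details (bias-based load balancing, sigmoid scoring, etc.).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16          # toy hidden size
n_routed = 256        # routed experts
top_k = 8             # routed experts activated per token

# Hypothetical toy weights: each "expert" is just one linear layer here
shared_w = rng.normal(size=(d_model, d_model)) * 0.02
routed_w = rng.normal(size=(n_routed, d_model, d_model)) * 0.02
gate_w = rng.normal(size=(d_model, n_routed)) * 0.02

def moe_layer(x):
    """Route one token through the shared expert + top-8 of 256 routed experts."""
    scores = x @ gate_w                     # (n_routed,) router logits
    topk_idx = np.argsort(scores)[-top_k:]  # the 8 highest-scoring experts
    # Softmax over only the selected scores to get mixing weights
    sel = scores[topk_idx]
    weights = np.exp(sel - sel.max())
    weights /= weights.sum()
    # Shared expert always runs; selected routed outputs are weighted and summed
    out = x @ shared_w
    for w, i in zip(weights, topk_idx):
        out = out + w * (x @ routed_w[i])
    return out

token = rng.normal(size=d_model)
y = moe_layer(token)
print(y.shape)  # (16,)
```

Although 257 expert tensors exist, each token only touches 9 of them (1 shared + 8 routed), which is what keeps per-token compute far below the total parameter count.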