• Features
    • Decoder-only, suited for autoregressive generation
    • MLA (Multi-head Latent Attention) instead of standard MHA, which shrinks the KV cache
    • DeepSeek MoE
      • Experts: 1 shared + 256 routed
        • 8 routed experts activated per token
    • Hidden layers: 61 = 3 dense + 58 MoE
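A toy numpy sketch of why MLA reduces KV cache storage: instead of caching full per-head K and V, only a small shared latent vector is cached per token, and keys/values are reconstructed from it on the fly. All dimensions here are illustrative, not the real model's, and the decoupled RoPE key that the actual design also caches is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 1024       # toy hidden size (illustrative)
n_heads = 16         # toy head count
d_head = 64
d_latent = 64        # compressed KV latent dim (hypothetical toy value)

# Down-projection to a shared latent, plus up-projections for K and V
W_dkv = rng.normal(size=(d_model, d_latent)) * 0.02          # compress
W_uk = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # expand to keys
W_uv = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # expand to values

h = rng.normal(size=d_model)  # one token's hidden state

# MLA caches only the latent c_kv per token ...
c_kv = h @ W_dkv              # (d_latent,)
# ... and reconstructs keys/values from it when attention runs
k = c_kv @ W_uk
v = c_kv @ W_uv

# Standard MHA would cache full K and V: 2 * n_heads * d_head floats per token
mha_cache = 2 * n_heads * d_head   # 2048 floats per token
mla_cache = d_latent               # 64 floats per token
print(mla_cache / mha_cache)       # 0.03125, i.e. ~32x smaller in this toy setup
```

The cache-size ratio depends entirely on the chosen latent dimension; the real ratio in DeepSeek-V3 differs from this toy example.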
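The expert layout above (1 shared + 256 routed, 8 routed per token) can be sketched as a minimal top-k router. This is a simplified illustration with made-up toy weights and single-linear "experts"; it ignores the real gating details (bias-based load balancing, sigmoid scoring, etc.).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16          # toy hidden size
n_routed = 256        # routed experts
top_k = 8             # routed experts activated per token

# Hypothetical toy weights: each "expert" is just one linear layer here
shared_w = rng.normal(size=(d_model, d_model)) * 0.02
routed_w = rng.normal(size=(n_routed, d_model, d_model)) * 0.02
gate_w = rng.normal(size=(d_model, n_routed)) * 0.02

def moe_layer(x):
    """Route one token through the shared expert + top-8 of 256 routed experts."""
    scores = x @ gate_w                     # (n_routed,) router logits
    topk_idx = np.argsort(scores)[-top_k:]  # the 8 highest-scoring experts
    # Softmax over only the selected scores to get mixing weights
    sel = scores[topk_idx]
    weights = np.exp(sel - sel.max())
    weights /= weights.sum()
    # Shared expert always runs; selected routed outputs are weighted and summed
    out = x @ shared_w
    for w, i in zip(weights, topk_idx):
        out = out + w * (x @ routed_w[i])
    return out

token = rng.normal(size=d_model)
y = moe_layer(token)
print(y.shape)  # (16,)
```

Although 257 expert tensors exist, each token only touches 9 of them (1 shared + 8 routed), which is what keeps per-token compute far below the total parameter count.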