q_lora_rank: 1536
kv_lora_rank: 512
qk_nope_head_dim: 128
v_head_dim: 128
qk_rope_head_dim: 64
num_attention_heads: 128
num_key_value_heads: 128
n_routed_experts: 256
num_experts_per_tok: 8
n_shared_experts: 1
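As a quick check, these values can be read from the checkpoint's config.json (a minimal sketch; the field names follow the Hugging Face config format and the local path is assumed):

import json

with open("config.json") as f:  # assumed to sit next to the *.safetensors shards
    cfg = json.load(f)

for key in ("q_lora_rank", "kv_lora_rank", "qk_nope_head_dim", "v_head_dim",
            "qk_rope_head_dim", "num_attention_heads", "num_key_value_heads",
            "n_routed_experts", "n_shared_experts", "num_experts_per_tok"):
    print(key, cfg.get(key))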
The weights can be printed with the following code:
from safetensors import safe_open

# path = "model-00001-of-000163.safetensors"
path = "model-00100-of-000163.safetensors"
with safe_open(path, framework="pt", device="cpu") as f:
    for k in f.keys():
        tensor = f.get_tensor(k)
        print(f"{k}: shape={tensor.shape}, dtype={tensor.dtype}")

Note that the rows and columns of the weights in the table below are reversed: the shape comments are written in the [in, out] math convention, while the stored tensors use nn.Linear's [out_features, in_features] layout, as the short example after this note shows.
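A minimal sketch illustrating that layout, using the down projection (18432 → 7168) as an example:

import torch

# down_proj maps mlp_intermediate_size (18432) back to hidden_size (7168),
# but the stored weight is [out_features, in_features] = [7168, 18432].
down_proj = torch.nn.Linear(18432, 7168, bias=False)
print(down_proj.weight.shape)  # torch.Size([7168, 18432])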
input_layernorm.weight torch.Size([7168]) bfloat16
# out [ mlp_intermediate_size, hidden_size ]
mlp.down_proj.weight torch.Size([7168, 18432]) fp8
mlp.down_proj.weight_scale_inv torch.Size([56, 144])
# in1, in2 [ hidden_size, mlp_intermediate_size ]
mlp.gate_proj.weight torch.Size([18432, 7168])
mlp.gate_proj.weight_scale_inv torch.Size([144, 56])
mlp.up_proj.weight torch.Size([18432, 7168])
mlp.up_proj.weight_scale_inv torch.Size([144, 56])
post_attention_layernorm.weight torch.Size([7168])
self_attn.kv_a_layernorm.weight torch.Size([512])
# [ hidden_size, kv_lora_rank + qk_rope_head_dim ]
self_attn.kv_a_proj_with_mqa.weight torch.Size([576, 7168])
self_attn.kv_a_proj_with_mqa.weight_scale_inv torch.Size([5, 56])
# [ kv_lora_rank, num_key_value_heads * (qk_nope_head_dim + v_head_dim) ]
self_attn.kv_b_proj.weight torch.Size([32768, 512])
self_attn.kv_b_proj.weight_scale_inv torch.Size([256, 4])
self_attn.o_proj.weight torch.Size([7168, 16384])
self_attn.o_proj.weight_scale_inv torch.Size([56, 128])
self_attn.q_a_layernorm.weight torch.Size([1536])
# [ hidden_size, q_lora_rank ]
self_attn.q_a_proj.weight torch.Size([1536, 7168])
self_attn.q_a_proj.weight_scale_inv torch.Size([12, 56])
# [ q_lora_rank, num_attention_heads * (qk_nope_head_dim + qk_rope_head_dim) ]
self_attn.q_b_proj.weight torch.Size([24576, 1536])
self_attn.q_b_proj.weight_scale_inv torch.Size([192, 12])
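The *_scale_inv tensors match 128 × 128 block-wise FP8 quantization: one scale entry per 128 × 128 tile, so the scale shape is the weight shape divided by 128 and rounded up (e.g. [7168, 18432] → [56, 144], [576, 7168] → [5, 56]). A minimal sketch of that shape check and the corresponding dequantization, assuming a block size of 128 and that the stored factor is multiplied during dequantization (as in DeepSeek's reference weight_dequant kernel):

import math
import torch
from safetensors import safe_open

BLOCK = 128  # assumed quantization block size

def expected_scale_shape(weight_shape, block=BLOCK):
    # One scale entry per block x block tile, rounded up on both dims.
    return tuple(math.ceil(d / block) for d in weight_shape)

def dequantize(weight_fp8, scale_inv, block=BLOCK):
    # Expand each per-tile scale to the weight's resolution, then multiply.
    w = weight_fp8.to(torch.float32)
    s = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return w * s[: w.shape[0], : w.shape[1]]

path = "model-00100-of-000163.safetensors"
with safe_open(path, framework="pt", device="cpu") as f:
    keys = set(f.keys())
    for k in sorted(keys):
        if k.endswith(".weight") and k + "_scale_inv" in keys:
            w = f.get_tensor(k)
            s = f.get_tensor(k + "_scale_inv")
            assert tuple(s.shape) == expected_scale_shape(w.shape), k
            w_deq = dequantize(w, s)
            print(k, w_deq.shape, w_deq.dtype)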
The expert weights are as follows:
layers.3-60.mlp.(experts.0-255|shared_experts).down_proj.weight: shape=torch.Size([7168, 2048]), dtype=torch.float8_e4m3fn
layers.3-60.mlp.(experts.0-255|shared_experts).down_proj.weight_scale_inv: shape=torch.Size([56, 16]), dtype=torch.float32
layers.3-60.mlp.(experts.0-255|shared_experts).gate_proj.weight: shape=torch.Size([2048, 7168]), dtype=torch.float8_e4m3fn
layers.3-60.mlp.(experts.0-255|shared_experts).gate_proj.weight_scale_inv: shape=torch.Size([16, 56]), dtype=torch.float32
layers.3-60.mlp.(experts.0-255|shared_experts).up_proj.weight: shape=torch.Size([2048, 7168]), dtype=torch.float8_e4m3fn
layers.3-60.mlp.(experts.0-255|shared_experts).up_proj.weight_scale_inv: shape=torch.Size([16, 56]), dtype=torch.float32
The first 3 layers are dense layers; the remaining 58 layers are sparse MoE layers.
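The dense/MoE layer split can be confirmed from the shard index without loading any tensors (a minimal sketch; model.safetensors.index.json is the standard Hugging Face shard index and is assumed to be present):

import json
import re

with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

moe_layers, mlp_layers = set(), set()
for key in weight_map:
    m = re.match(r"(?:model\.)?layers\.(\d+)\.mlp\.", key)
    if m:
        layer = int(m.group(1))
        mlp_layers.add(layer)
        if ".mlp.experts." in key:
            moe_layers.add(layer)

dense_layers = mlp_layers - moe_layers
print(sorted(dense_layers))                               # expected: [0, 1, 2]
print(min(moe_layers), max(moe_layers), len(moe_layers))  # expected: 3 60 58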
For the dense layers:
Each of the first 3 layers uses an ordinary MLP of 7168 → 18432 → 7168; the intermediate size 18432 is exactly 9 × 2048, i.e. the combined width of 8 routed + 1 shared experts.
Weight size: 7168 * 2048 * 3 * 9 * 3 = 1,189,085,184 (hidden × expert width × 3 projections × 9 × 3 layers)
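A self-contained check of that arithmetic (assumed values: hidden_size 7168, dense intermediate size 18432 = 9 × 2048, 3 projections per MLP, 3 dense layers):

hidden_size = 7168
dense_intermediate = 18432                              # = 9 * 2048
dense_total = hidden_size * dense_intermediate * 3 * 3  # (gate, up, down) x 3 layers
print(f"{dense_total:,}")  # 1,189,085,184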
For the sparse MoE layers:
256 routed experts with hidden size 7168; each token selects 8 of the 256 experts, and 1 shared expert is always active.
Each expert is a 3-projection MLP: 7168 → 2048 → 7168.
So the weight size of the remaining 58 MoE layers is (2048 * 7168 * 3 * 257 + 7168 * 256 + 256) * 58 = 656,569,547,264, where 257 = 256 routed + 1 shared experts, 7168 * 256 is the router (gate) weight, and the trailing 256 is the router's per-expert bias.
The total expert (plus dense MLP) weight size is 656,569,547,264 + 1,189,085,184 = 657,758,632,448,
which is roughly 658 B parameters.
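The MoE and total figures can be checked the same way (assumptions as above: 256 routed + 1 shared experts of width 2048 per MoE layer, a 7168 × 256 router weight plus a 256-entry per-expert bias, 58 MoE layers):

hidden_size = 7168
expert_intermediate = 2048
num_routed = 256

expert_params = hidden_size * expert_intermediate * 3             # gate, up, down per expert
router_params = hidden_size * num_routed + num_routed             # gate weight + per-expert bias
per_moe_layer = expert_params * (num_routed + 1) + router_params  # 256 routed + 1 shared
moe_total = per_moe_layer * 58

dense_total = hidden_size * 18432 * 3 * 3
print(f"{moe_total:,}")                # 656,569,547,264
print(f"{moe_total + dense_total:,}")  # 657,758,632,448, roughly 658 B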