q_lora_rank: 1536
kv_lora_rank: 512
qk_nope_head_dim: 128
v_head_dim: 128
qk_rope_head_dim: 64
num_attention_heads: 128
num_key_value_heads: 128
n_routed_experts: 256
n_shared_experts: 1
num_experts_per_tok: 8
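These dimensions determine the shapes of the MLA projection matrices printed below. As a sanity check, here is a minimal sketch deriving the expected output dimensions; hidden_size = 7168 is an assumption taken from the weight shapes further down, not from the list above:

# Minimal sketch: derive MLA projection dims from the config values above.
hidden_size = 7168  # assumption, taken from the weight shapes below
q_lora_rank, kv_lora_rank = 1536, 512
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
num_attention_heads = 128

q_a_out = q_lora_rank                                                   # q_a_proj: 7168 -> 1536
q_b_out = num_attention_heads * (qk_nope_head_dim + qk_rope_head_dim)   # 128 * 192 = 24576
kv_a_out = kv_lora_rank + qk_rope_head_dim                              # 512 + 64 = 576
kv_b_out = num_attention_heads * (qk_nope_head_dim + v_head_dim)        # 128 * 256 = 32768
o_in = num_attention_heads * v_head_dim                                 # 128 * 128 = 16384

print(q_a_out, q_b_out, kv_a_out, kv_b_out, o_in)  # 1536 24576 576 32768 16384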
The weights can be printed with the following code:
from safetensors import safe_open

# path = "model-00001-of-000163.safetensors"
path = "model-00100-of-000163.safetensors"

with safe_open(path, framework="pt", device="cpu") as f:
    for k in f.keys():
        tensor = f.get_tensor(k)
        print(f"{k}: shape={tensor.shape}, dtype={tensor.dtype}")
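To dump every shard instead of a single file, a minimal sketch (assuming all 163 shard files sit in the current directory) can simply glob over them:

import glob
from safetensors import safe_open

# Minimal sketch, assuming the 163 shards are in the current directory.
for path in sorted(glob.glob("model-*-of-000163.safetensors")):
    with safe_open(path, framework="pt", device="cpu") as f:
        for k in f.keys():
            tensor = f.get_tensor(k)
            print(f"{k}: shape={tensor.shape}, dtype={tensor.dtype}")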
Note that the weight shapes below are transposed relative to the comments: nn.Linear stores weights as [out_features, in_features], while the comments describe the logical [in_features, out_features] order:
input_layernorm.weight torch.Size([7168]) bfloat16
# out [ mlp_intermediate_size, hidden_size ]
mlp.down_proj.weight torch.Size([7168, 18432]) fp8
mlp.down_proj.weight_scale_inv torch.Size([56, 144])
# in1, in2 [ hidden_size, mlp_intermediate_size ]
mlp.gate_proj.weight torch.Size([18432, 7168])
mlp.gate_proj.weight_scale_inv torch.Size([144, 56])
mlp.up_proj.weight torch.Size([18432, 7168])
mlp.up_proj.weight_scale_inv torch.Size([144, 56])
post_attention_layernorm.weight torch.Size([7168])
self_attn.kv_a_layernorm.weight torch.Size([512])
# [ hidden_size, kv_lora_rank + qk_rope_head_dim ]
self_attn.kv_a_proj_with_mqa.weight torch.Size([576, 7168])
self_attn.kv_a_proj_with_mqa.weight_scale_inv torch.Size([5, 56])
# [ kv_lora_rank, num_key_value_heads * (qk_nope_head_dim + v_head_dim) ]
self_attn.kv_b_proj.weight torch.Size([32768, 512])
self_attn.kv_b_proj.weight_scale_inv torch.Size([256, 4])
self_attn.o_proj.weight torch.Size([7168, 16384])
self_attn.o_proj.weight_scale_inv torch.Size([56, 128])
self_attn.q_a_layernorm.weight torch.Size([1536])
# [ hidden_size, q_lora_rank ]
self_attn.q_a_proj.weight torch.Size([1536, 7168])
self_attn.q_a_proj.weight_scale_inv torch.Size([12, 56])
# [ q_lora_rank, num_attention_heads * (qk_nope_head_dim + qk_rope_head_dim) ]
self_attn.q_b_proj.weight torch.Size([24576, 1536])
self_attn.q_b_proj.weight_scale_inv torch.Size([192, 12])
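The weight_scale_inv tensors are the dequantization scales for the FP8 weights. Assuming 128×128 block-wise quantization (one scale per 128×128 tile of the weight matrix), their shapes are just the weight shapes divided by 128 and rounded up; a minimal sketch:

import math

# Minimal sketch: under assumed 128x128 block-wise FP8 quantization, each
# weight_scale_inv entry covers one 128x128 tile, so its shape is
# ceil(dim / 128) along each axis.
BLOCK = 128

def scale_shape(out_features, in_features, block=BLOCK):
    return (math.ceil(out_features / block), math.ceil(in_features / block))

print(scale_shape(7168, 18432))   # (56, 144)  -> mlp.down_proj.weight_scale_inv
print(scale_shape(576, 7168))     # (5, 56)    -> kv_a_proj_with_mqa.weight_scale_inv
print(scale_shape(32768, 512))    # (256, 4)   -> kv_b_proj.weight_scale_inv
print(scale_shape(24576, 1536))   # (192, 12)  -> q_b_proj.weight_scale_inv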
The expert weights are as follows:
layers.3-60.mlp.(experts.0-255|shared_experts).down_proj.weight: shape=torch.Size([7168, 2048]), dtype=torch.float8_e4m3fn
layers.3-60.mlp.(experts.0-255|shared_experts).down_proj.weight_scale_inv: shape=torch.Size([56, 16]), dtype=torch.float32
layers.3-60.mlp.(experts.0-255|shared_experts).gate_proj.weight: shape=torch.Size([2048, 7168]), dtype=torch.float8_e4m3fn
layers.3-60.mlp.(experts.0-255|shared_experts).gate_proj.weight_scale_inv: shape=torch.Size([16, 56]), dtype=torch.float32
layers.3-60.mlp.(experts.0-255|shared_experts).up_proj.weight: shape=torch.Size([2048, 7168]), dtype=torch.float8_e4m3fn
layers.3-60.mlp.(experts.0-255|shared_experts).up_proj.weight_scale_inv: shape=torch.Size([16, 56]), dtype=torch.float32
The first 3 layers are dense layers; the remaining 58 layers are sparse MoE layers.
For the dense layers:
Each dense layer is a plain MLP with intermediate size 18432 = 9 × 2048, i.e., the same total width as the 8 routed + 1 shared experts activated per token in an MoE layer.
Weight size: 7168 * 2048 * 3 * 9 * 3 = 1,189,085,184 (3 matrices per MLP, 9 × 2048 intermediate width, 3 layers).
For the sparse MoE layers:
256 routed experts, hidden size 7168; each token selects 8 of the 256 routed experts, plus 1 shared expert.
Each expert is a 3-matrix MLP (gate/up/down): 7168 → 2048 → 7168.
So the weights of the 58 MoE layers total (2048 * 7168 * 3 * 257 + 7168 * 256 + 256) * 58 = 656,569,547,264, where 257 = 256 routed + 1 shared experts and 7168 * 256 + 256 is the router gate weight plus its per-expert bias.
The total FFN/expert weight count is 656,569,547,264 + 1,189,085,184 = 657,758,632,448, roughly 658 B parameters (the arithmetic is reproduced in the sketch below).
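A minimal sketch reproducing the arithmetic above; only the FFN/expert weights are counted (attention, layernorm, and embedding weights are excluded), with layer counts and dimensions taken from the weight dumps:

# Minimal sketch: reproduce the FFN / expert parameter counts above.
hidden_size = 7168
moe_intermediate = 2048          # expert MLP width
dense_intermediate = 9 * 2048    # 18432, dense MLP width
n_dense_layers, n_moe_layers = 3, 58
n_routed, n_shared = 256, 1

# One MLP/expert = gate + up + down projections.
def mlp_params(inter):
    return hidden_size * inter * 3

dense_params = mlp_params(dense_intermediate) * n_dense_layers
# Per MoE layer: 257 experts (256 routed + 1 shared) + router gate weight + per-expert bias.
moe_params = (mlp_params(moe_intermediate) * (n_routed + n_shared)
              + hidden_size * n_routed + n_routed) * n_moe_layers

print(dense_params)               # 1,189,085,184
print(moe_params)                 # 656,569,547,264
print(dense_params + moe_params)  # 657,758,632,448, roughly 658 B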