D4 Profile
| 层级 | 模块 | Kernel | h800 ds prof | b 200 d1 mtp on |
|---|---|---|---|---|
| 端到端 | 96ms | |||
| 61层执行耗时 | 87.32ms | |||
| attention层 | self.input_layernorm | flashinfer::norm::FusedAddRM | 4 | |
| attention层 | q_a_kv_a | deep_gemm::sm100_fp8_gem | 13 | |
| attention层 | self.input_layernorm | rmsnorm_split_col_kernel | 2 | |
| attention层 | q_proj_b | per_token_group_quant_fp8_ | 2 | |
| attention层 | q_proj_b | deep_gemm::sm100_fp8_gem | 19 | |
| attention层 | torch.bmm transpose(0,1) | nvjet_tst_256x128_64x5_2x1 | 16 | |
| attention层 | self.rotary_emb | elementwise_kernel | ||
| attention层 | self.rotary_emb | rotary_embedding_kernel | 7 | |
| attention层 | self.rotary_emb | unrolled_elementwise_kernel | ||
| attention层 | self.rotary_emb | index_elementwise_kernel | ||
| attention层 | self.attn_mqa | mhaSm100fKernel_Qkv | 255 | 225.767 |
| attention层 | torch.bmm transpose(0,1) | nvjet_tst_64x128_64x13_2x1 | 15 | |
| attention层 | self.o_proj | elementwise_kernel | ||
| attention层 | self.o_proj | per_token_group_quant_fp8_ | 5 | |
| attention层 | self.o_proj | deep_gemm::sm100_fp8_gem | 50 | 43.512 |
| attention层 | sum | 388 | ||
| gate层 | self.input_layernorm | flashinfer::norm::FusedAddRM | 3 | |
| gate层 | nvjet_tst_64x32_64x16_2x4_2 | 6 | ||
| gate层 | cublasLt::splitKreduce_kernel | 2 | ||
| gate层 | distribution_elementwise_grid | |||
| gate层 | topk | 8 | ||
| gate层 | sum | 19 | ||
| moe层 | dispatch | deep_ep::internode_ll::dispatch | 187 | 48.212 |
| moe层 | up_gate_gemm | deep_gemm::sm100_fp8_gem | 81 | 195.136 |
| moe层 | up_gate_gemm | vectorized_elementwise_kernel | ||
| moe层 | silu | silu_and_mul_kernel_ep_index | 8 | 40.491 |
| moe层 | down_gemm | deep_gemm::sm100_fp8_gem | 39 | 104.671 |
| moe层 | combine | 394 | ||
| moe层 | sum | |||
| combine/share overlap | deep_ep::internode_ll::combine | |||
| combine/share overlap | per_token_group_quant_fp8_ | 3 | ||
| combine/share overlap | deep_gemm::sm100_fp8_gem | 17 | ||
| combine/share overlap | act_and_mul_kernel | 4 | ||
| combine/share overlap | deep_gemm::sm100_fp8_gem | 10 | ||
| combine/share overlap | deep_ep::internode_ll::combine |