When calculating a chip's MFU (Model FLOPs Utilization), you must account for the data type (precision format) in use, e.g. FP8, BF16, or FP16, because the theoretical peak differs per precision.
| Term | Meaning |
|---|---|
| Theoretical peak FLOPs | The chip's maximum theoretical throughput at the given precision (e.g. the FP8 peak differs from the BF16 peak) |
| Measured FLOPs | The effective floating-point operations actually performed during model execution |
| MFU (Model FLOPs Utilization) | The fraction of the theoretical peak that the model actually achieves at runtime |
The calculation is typically:
\[\text{MFU} = \frac{\text{Measured FLOPs}}{\text{Theoretical peak FLOPs for the same precision}}\]
The Tensor Cores on the H200 are a single set of hardware units that support multiple precision modes; the difference in peak FLOPS across precisions comes from each Tensor Core's effective throughput in that mode, not from a different number of Tensor Cores being provisioned per precision.
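For example, in the H200 SXM table further below, the FP8 peak (3,958 TFLOPS) is exactly twice the BF16/FP16 peak (1,979 TFLOPS), so the same measured throughput yields half the MFU when judged against the FP8 peak. A minimal sketch of the calculation, using those H200 SXM peaks and a hypothetical measured throughput of 800 TFLOPS per GPU:

```python
# Minimal MFU sketch. Peak values are the H200 SXM numbers from the table
# below (in TFLOPS); the measured throughput is a hypothetical example.

H200_SXM_PEAK_TFLOPS = {
    "fp8": 3958,
    "fp16": 1979,
    "bf16": 1979,
    "tf32": 989,
}

def mfu(measured_tflops_per_gpu: float, precision: str) -> float:
    """MFU = measured FLOPs throughput / theoretical peak at the same precision."""
    return measured_tflops_per_gpu / H200_SXM_PEAK_TFLOPS[precision]

# E.g. a training step that sustains 800 TFLOPS per GPU in BF16:
print(f"BF16 MFU: {mfu(800, 'bf16'):.1%}")  # -> BF16 MFU: 40.4%
# The same throughput judged against the FP8 peak would be half that:
print(f"FP8 MFU:  {mfu(800, 'fp8'):.1%}")   # -> FP8 MFU:  20.2%
```

Note that numerator and denominator must be in the same units, typically FLOP/s: the measured side is usually the model's analytic FLOPs per step divided by the step time.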
| Technical Specifications¹ | GB200 NVL72 | HGX B200 |
|---|---|---|
| Blackwell GPUs | 72 | 8 |
| Grace CPUs | 36 | 0 |
| CPU Cores | 2,592 Arm Neoverse V2 Cores | - |
| Total FP4 Tensor Core | 1,440 PFLOPS | 144 PFLOPS |
| Total FP8/FP6 Tensor Core | 720 PFLOPS | 72 PFLOPS |
| Total Fast Memory | Up to 30TB | Up to 1.4TB |
| Total Memory Bandwidth | Up to 576TB/s | Up to 62TB/s |
| Total NVLink Bandwidth | 130TB/s | 14.4TB/s |
| Individual Blackwell GPU Specifications | | |
| FP4 Tensor Core | 20 PFLOPS | 18 PFLOPS |
| FP8/FP6 Tensor Core | 10 PFLOPS | 9 PFLOPS |
| INT8 Tensor Core | 10 POPS | 9 POPS |
| FP16/BF16 Tensor Core | 5 PFLOPS | 4.5 PFLOPS |
| TF32 Tensor Core | 2.5 PFLOPS | 2.2 PFLOPS |
| FP32 | 80 TFLOPS | 75 TFLOPS |
| FP64/FP64 Tensor Core | 40 TFLOPS | 37 TFLOPS |
| GPU Memory | 186GB HBM3E | 180GB HBM3E |
| GPU Memory Bandwidth | 8TB/s | 7.7TB/s |
| Multi-Instance GPU (MIG) | 7 | 7 |
| Decompression Engine | Yes | Yes |
| Decoders | 7 NVDEC², 7 nvJPEG | 7 NVDEC², 7 nvJPEG |
| Max Thermal Design Power (TDP) | Configurable up to 1,200W | Configurable up to 1,000W |
| Interconnect | 5th Generation NVLink: 1.8TB/s; PCIe Gen5: 128GB/s | 5th Generation NVLink: 1.8TB/s; PCIe Gen5: 128GB/s |
| Server Options | NVIDIA GB200 NVL72 partner and NVIDIA-Certified Systems™ with 72 GPUs | NVIDIA HGX B200 partner and NVIDIA-Certified Systems with 8 GPUs |
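The system-level peaks in this table are simply the per-GPU peaks scaled by the GPU count, which is also the denominator to use when computing MFU over a whole node or rack. Checking the FP8 rows:

\[72 \times 10\,\text{PFLOPS} = 720\,\text{PFLOPS (GB200 NVL72)},\qquad 8 \times 9\,\text{PFLOPS} = 72\,\text{PFLOPS (HGX B200)}\]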
| Specification | H200 SXM | H200 NVL |
|---|---|---|
| FP64 | 34 TFLOPS | 30 TFLOPS |
| FP64 Tensor Core | 67 TFLOPS | 60 TFLOPS |
| FP32 | 67 TFLOPS | 60 TFLOPS |
| TF32 Tensor Core | 989 TFLOPS | 835 TFLOPS |
| BFLOAT16 Tensor Core | 1,979 TFLOPS | 1,671 TFLOPS |
| FP16 Tensor Core | 1,979 TFLOPS | 1,671 TFLOPS |
| FP8 Tensor Core | 3,958 TFLOPS | 3,341 TFLOPS |
| INT8 Tensor Core | 3,958 TFLOPS | 3,341 TFLOPS |
| GPU Memory | 141 GB | 141 GB |
| GPU Memory Bandwidth | 4.8 TB/s | 4.8 TB/s |