When calculating a chip's MFU (Model FLOPs Utilization), you must account for the data type (precision format) in use, e.g. FP8, BF16, or FP16, because the theoretical peak differs per precision.
| Term | Meaning |
|---|---|
| Theoretical peak FLOPs | The chip's maximum theoretical throughput at the given precision (e.g. the FP8 peak differs from the BF16 peak) |
| Measured FLOPs | The effective floating-point operations actually performed during model execution |
| MFU (Model FLOPs Utilization) | The fraction of the theoretical peak that the model actually achieves at runtime |
The calculation is typically:
\[\text{MFU} = \frac{\text{Measured FLOPs}}{\text{Theoretical peak FLOPs for the same precision}}\]
The Tensor Cores on the H200 are a single set of hardware units that support multiple precision modes; the difference in peak FLOPS across precisions comes from each Tensor Core's effective throughput in that mode, not from a different number of Tensor Cores being provisioned per precision.
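For example, in the H200 SXM table further below, the FP8 peak (3,958 TFLOPS) is exactly twice the BF16/FP16 peak (1,979 TFLOPS), so the same measured throughput yields half the MFU when judged against the FP8 peak. A minimal sketch of the calculation, using those H200 SXM peaks and a hypothetical measured throughput of 800 TFLOPS per GPU:

```python
# Minimal MFU sketch. Peak values are the H200 SXM numbers from the table
# below (in TFLOPS); the measured throughput is a hypothetical example.

H200_SXM_PEAK_TFLOPS = {
    "fp8": 3958,
    "fp16": 1979,
    "bf16": 1979,
    "tf32": 989,
}

def mfu(measured_tflops_per_gpu: float, precision: str) -> float:
    """MFU = measured FLOPs throughput / theoretical peak at the same precision."""
    return measured_tflops_per_gpu / H200_SXM_PEAK_TFLOPS[precision]

# E.g. a training step that sustains 800 TFLOPS per GPU in BF16:
print(f"BF16 MFU: {mfu(800, 'bf16'):.1%}")  # -> BF16 MFU: 40.4%
# The same throughput judged against the FP8 peak would be half that:
print(f"FP8 MFU:  {mfu(800, 'fp8'):.1%}")   # -> FP8 MFU:  20.2%
```

Note that numerator and denominator must be in the same units, typically FLOP/s: the measured side is usually the model's analytic FLOPs per step divided by the step time.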
| Technical Specifications¹ | GB200 NVL72 | HGX B200 |
|---|---|---|
| Blackwell GPUs | 72 | 8 |
| Grace CPUs | 36 | 0 |
| CPU Cores | 2,592 Arm Neoverse V2 Cores | - |
| Total FP4 Tensor Core | 1,440 PFLOPS | 144 PFLOPS |
| Total FP8/FP6 Tensor Core | 720 PFLOPS | 72 PFLOPS |
| Total Fast Memory | Up to 30TB | Up to 1.4TB |
| Total Memory Bandwidth | Up to 576TB/s | Up to 62TB/s |
| Total NVLink Bandwidth | 130TB/s | 14.4TB/s |
| Individual Blackwell GPU Specifications | | |
| FP4 Tensor Core | 20 PFLOPS | 18 PFLOPS |
| FP8/FP6 Tensor Core | 10 PFLOPS | 9 PFLOPS |
| INT8 Tensor Core | 10 POPS | 9 POPS |
| FP16/BF16 Tensor Core | 5 PFLOPS | 4.5 PFLOPS |
| TF32 Tensor Core | 2.5 PFLOPS | 2.2 PFLOPS |
| FP32 | 80 TFLOPS | 75 TFLOPS |
| FP64/FP64 Tensor Core | 40 TFLOPS | 37 TFLOPS |
| GPU Memory | 186GB HBM3E | 180GB HBM3E |
| GPU Memory Bandwidth | 8TB/s | 7.7TB/s |
| Multi-Instance GPU (MIG) | 7 | 7 |
| Decompression Engine | Yes | Yes |
| Decoders | 7 NVDEC², 7 nvJPEG | 7 NVDEC², 7 nvJPEG |
| Max Thermal Design Power (TDP) | Configurable up to 1,200W | Configurable up to 1,000W |
| Interconnect | 5th Generation NVLink: 1.8TB/s; PCIe Gen5: 128GB/s | 5th Generation NVLink: 1.8TB/s; PCIe Gen5: 128GB/s |
| Server Options | NVIDIA GB200 NVL72 partner and NVIDIA-Certified Systems™ with 72 GPUs | NVIDIA HGX B200 partner and NVIDIA-Certified Systems with 8 GPUs |
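The system-level peaks in this table are simply the per-GPU peaks scaled by the GPU count, which is also the denominator to use when computing MFU over a whole node or rack. Checking the FP8 rows:

\[72 \times 10\,\text{PFLOPS} = 720\,\text{PFLOPS (GB200 NVL72)},\qquad 8 \times 9\,\text{PFLOPS} = 72\,\text{PFLOPS (HGX B200)}\]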
| Specification | H200 SXM | H200 NVL |
|---|---|---|
| FP64 | 34 TFLOPS | 30 TFLOPS |
| FP64 Tensor Core | 67 TFLOPS | 60 TFLOPS |
| FP32 | 67 TFLOPS | 60 TFLOPS |
| TF32 Tensor Core | 989 TFLOPS | 835 TFLOPS |
| BFLOAT16 Tensor Core | 1,979 TFLOPS | 1,671 TFLOPS |
| FP16 Tensor Core | 1,979 TFLOPS | 1,671 TFLOPS |
| FP8 Tensor Core | 3,958 TFLOPS | 3,341 TFLOPS |
| INT8 Tensor Core | 3,958 TFLOPS | 3,341 TFLOPS |
| GPU Memory | 141 GB | 141 GB |
| GPU Memory Bandwidth | 4.8 TB/s | 4.8 TB/s |