H100 and H200 NVLink Cluster Scaling: Maximum Configurations and Performance
NVLink 4.0 Topology, Bridge Bandwidth, and the Architecture Behind the Largest GPU Clusters
A technical deep dive into NVLink lane allocation, SXM vs NVL form factors, NVSwitch fabric design, and how interconnect topology determines the theoretical maximum size and real training throughput of H100 and H200 GPU clusters.
NVLink 4.0: The Foundation of Hopper Interconnect
NVLink 4.0, introduced with the NVIDIA Hopper architecture, represents the fourth generation of NVIDIA's proprietary GPU interconnect technology. Every H100 and H200 GPU uses NVLink 4.0 as its primary mechanism for high-bandwidth, low-latency communication with peer GPUs. Understanding the physical layer of NVLink 4.0 is essential for evaluating cluster topology choices and their performance implications.
Each NVLink 4.0 link operates at 50 GB/s bidirectional (25 GB/s in each direction). A single H100 or H200 GPU exposes 18 NVLink 4.0 links, providing a theoretical aggregate bandwidth of 900 GB/s bidirectional per GPU. This is the total interconnect budget available to each GPU, and how those 18 links are allocated across peer connections defines the fundamental difference between the SXM and NVL form factors.
NVLink 4.0 Per-Link Specifications
Previous NVLink generations operated at lower aggregate bandwidth: NVLink 3.0 (Ampere/A100) provided 50 GB/s per link with 12 links per GPU, totaling 600 GB/s. NVLink 2.0 (Volta/V100) provided 50 GB/s per link (25 GB/s in each direction) with 6 links, totaling 300 GB/s. The per-link bidirectional rate has thus held steady at 50 GB/s across generations; the jump from Ampere to Hopper added 6 more links per GPU, increasing aggregate bandwidth by 50%.
These 18 links are physical connections routed through the GPU package. The way they are wired in the system board or connected through bridges determines the NVLink topology, which in turn determines the NVLink domain size: the number of GPUs that can communicate at full NVLink bandwidth without traversing a slower interconnect.
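The link-count arithmetic behind these generations is simple enough to sanity-check in a few lines (Python used here purely for illustration; the figures are the bidirectional per-link rates quoted above):

```python
# Aggregate NVLink bandwidth per GPU = link count x bidirectional per-link rate.
NVLINK_GENERATIONS = {
    # generation: (links per GPU, GB/s per link bidirectional, example GPU)
    "NVLink 2.0": (6, 50, "V100"),
    "NVLink 3.0": (12, 50, "A100"),
    "NVLink 4.0": (18, 50, "H100/H200"),
}

def aggregate_bw_gbps(generation: str) -> int:
    """Total bidirectional NVLink bandwidth per GPU, in GB/s."""
    links, per_link, _ = NVLINK_GENERATIONS[generation]
    return links * per_link

for gen, (links, rate, gpu) in NVLINK_GENERATIONS.items():
    print(f"{gen} ({gpu}): {links} x {rate} = {aggregate_bw_gbps(gen)} GB/s")
```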
NVLink Bridge Topology: How Lanes Connect GPUs
The physical mechanism for connecting NVLink lanes between GPUs depends on the form factor. In the NVL (NVLink) configurations, GPUs are connected using NVLink bridges: physical connectors that sit atop the GPU modules and carry the NVLink lanes between them. In SXM configurations, GPUs connect through NVSwitch chips on the baseboard. This difference has profound consequences for topology, bandwidth distribution, and maximum NVLink domain size.
H100 NVL: Two GPUs, One Bridge, Full Bandwidth
The H100 NVL is a PCIe form-factor product that pairs two H100 GPUs, each on its own PCIe card, connected by NVLink bridges. In this configuration, each GPU dedicates all 18 of its NVLink 4.0 links to its single peer GPU through the bridges. This means the two GPUs share 900 GB/s of bidirectional bandwidth between them.
Because all 18 links terminate at the same peer, the H100 NVL pair achieves the theoretical maximum point-to-point NVLink bandwidth. There is no bandwidth sharing with other GPUs. For workloads that only require two-GPU parallelism (such as inference with model sizes between 80 GB and 188 GB, where a single GPU's memory is insufficient but two GPUs suffice), this topology is optimal. The NVLink domain size is 2 GPUs.
Beyond the NVLink pair, each H100 NVL GPU also has a PCIe Gen 5 x16 connection to the host system, providing 128 GB/s of bidirectional bandwidth (64 GB/s in each direction) to CPU memory and the rest of the server fabric. Communication with GPUs in other NVL pairs must traverse PCIe, InfiniBand, or Ethernet, all of which operate at significantly lower bandwidth than the NVLink connection.
H100 NVL Topology Summary
- NVLink domain: 2 GPUs
- Links per GPU to peer: 18 (all dedicated)
- Bandwidth between pair: 900 GB/s bidirectional
- Memory per GPU: 94 GB HBM3 (188 GB total per pair)
- Host connection: PCIe Gen 5 x16 per GPU
- Inter-pair communication: PCIe, InfiniBand, or Ethernet
H200 NVL: Same Bridge Topology, More Memory
The H200 NVL retains the same NVLink bridge topology as the H100 NVL. Two GPUs are connected through an NVLink bridge, with all 18 lanes per GPU dedicated to the single peer connection. The bandwidth between the pair remains 900 GB/s bidirectional.
The critical difference is memory: each H200 GPU carries 141 GB of HBM3e (up from 94 GB HBM3 on the H100 NVL), bringing the total per pair to 282 GB. Memory bandwidth also increases, from 3.9 TB/s on the H100 NVL to 4.8 TB/s per GPU. For inference workloads, this means models up to approximately 270 GB (accounting for KV cache and activation memory overhead) can fit within a single NVL pair, a substantial improvement over the H100 NVL's effective capacity.
For training workloads, the larger memory per GPU reduces the minimum degree of tensor parallelism required for a given model size. A model that previously needed 4-way tensor parallelism across 4 H100 GPUs (requiring NVSwitch or cross-pair communication) might fit in 2-way tensor parallelism on an H200 NVL pair, keeping all tensor-parallel communication within the NVLink domain.
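As a rough illustration of this effect (a sketch with an assumed 20% overhead factor, not a vendor sizing rule), the minimum tensor-parallel degree is just the model footprint divided by per-GPU memory, rounded up:

```python
import math

def min_tp_degree(model_gb: float, gpu_mem_gb: float, overhead: float = 0.2) -> int:
    """Smallest TP degree whose pooled memory holds the model weights plus
    an assumed fractional overhead for activations and workspace."""
    return math.ceil(model_gb * (1 + overhead) / gpu_mem_gb)

# ~90B parameters in FP16 is ~180 GB of weights:
print(min_tp_degree(180, 94))   # H100 NVL (94 GB): 3 GPUs -> exceeds the 2-GPU pair
print(min_tp_degree(180, 141))  # H200 NVL (141 GB): 2 GPUs -> fits within one pair
```

A model that spills past TP=2 on H100 NVL but not on H200 NVL keeps all tensor-parallel traffic on the bridge, which is exactly the scenario described above.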
Double Bridge Configurations: Splitting NVLink Lanes
The concept of double bridges introduces an important tradeoff. When NVLink lanes are split across two bridges to connect a GPU to two different peers simultaneously, each bridge carries approximately half the available lanes. With 18 total NVLink 4.0 links per GPU, a double-bridge configuration allocates roughly 9 links per bridge, providing approximately 450 GB/s bidirectional to each of two peers.
This lane-splitting is not commonly used in the H100 NVL product line. NVIDIA's standard H100 NVL product dedicates all 18 links to a single peer. However, the principle of lane splitting is fundamental to understanding why NVSwitch exists and why direct bridge topologies cannot scale to 8 GPUs.
Consider the mathematical constraint: an H100 GPU has 18 NVLink lanes. To directly connect to 7 other GPUs (as in an 8-GPU fully connected mesh), each connection would receive at most 2 to 3 lanes, providing only 100 to 150 GB/s per peer. This severe bandwidth reduction per connection makes direct bridging impractical for large NVLink domains. The solution is NVSwitch.
Lane Splitting Bandwidth Impact
| Configuration | Peers | Links per Peer | BW per Peer |
|---|---|---|---|
| Single bridge (NVL) | 1 | 18 | 900 GB/s |
| Double bridge | 2 | 9 | 450 GB/s |
| Triple bridge (theoretical) | 3 | 6 | 300 GB/s |
| 7 peers (direct mesh, theoretical) | 7 | ~2.6 | ~129 GB/s |
| NVSwitch (SXM, 8-GPU) | 7 | 18 (switched) | 900 GB/s aggregate |
NVSwitch solves the lane-splitting problem by acting as a non-blocking crossbar, allowing all 18 links per GPU to be used at full bandwidth regardless of which peer is the destination.
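The table's per-peer numbers follow directly from dividing the fixed lane budget, which can be sketched as follows (assuming lanes split evenly across bridges):

```python
def bw_per_peer_gbps(num_peers: int, total_links: int = 18,
                     link_bw_gbps: float = 50.0) -> float:
    """Bidirectional bandwidth to each peer when a GPU's NVLink lanes are
    divided evenly across direct point-to-point bridges."""
    return total_links / num_peers * link_bw_gbps

print(bw_per_peer_gbps(1))  # single bridge (NVL pair): 900.0 GB/s
print(bw_per_peer_gbps(2))  # double bridge: 450.0 GB/s
print(bw_per_peer_gbps(7))  # 8-GPU direct mesh: ~128.6 GB/s per peer
```

The steep falloff at 7 peers is the mathematical reason direct bridging cannot scale to an 8-GPU domain, and why NVSwitch exists.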
SXM vs NVL: Form Factor and Topology Comparison
The choice between SXM and NVL determines the NVLink domain size, which is the single most important factor in cluster performance for distributed training. The SXM form factor creates an 8-GPU fully connected NVLink domain. The NVL form factor creates 2-GPU pairs. Everything downstream, from parallelism strategy to achievable training throughput, follows from this distinction.
H100 SXM5: 8 GPUs Fully Connected via NVSwitch
The H100 SXM5 is the form factor used in the NVIDIA DGX H100 and HGX H100 baseboards. Eight H100 GPUs mount directly onto the baseboard in SXM5 sockets. Four third-generation NVSwitch chips are integrated into the baseboard, creating a non-blocking, fully connected fabric among all 8 GPUs.
Each GPU connects its 18 NVLink lanes to the NVSwitch fabric. The NVSwitch acts as a crossbar switch: any GPU can communicate with any other GPU at the full 900 GB/s aggregate bandwidth. Unlike direct bridging, the NVSwitch does not require dedicating specific lanes to specific peers. The switching fabric dynamically routes traffic, so a GPU performing an all-reduce across all 8 GPUs uses the same 18 links it would use for a point-to-point transfer to a single peer.
Each third-generation NVSwitch chip has 64 NVLink 4.0 ports (3.2 TB/s of switching capacity per chip), and the four chips together comfortably carry the domain's aggregate of 7.2 TB/s (8 GPUs x 900 GB/s). The fabric provides full bisection bandwidth: in the worst-case traffic pattern (4 GPUs sending to 4 other GPUs simultaneously), it sustains full line rate without congestion.
H100 SXM5 (DGX/HGX H100)
- NVLink domain: 8 GPUs
- Interconnect: NVSwitch (4 chips)
- Per-GPU bandwidth: 900 GB/s
- Memory per GPU: 80 GB HBM3
- Total GPU memory: 640 GB
- Network: 8x ConnectX-7 (400 Gb/s each)
- All-reduce within domain: NVLink only
H100 NVL (PCIe Bridge)
- NVLink domain: 2 GPUs
- Interconnect: NVLink bridge (direct)
- Per-GPU bandwidth: 900 GB/s (to peer)
- Memory per GPU: 94 GB HBM3
- Total GPU memory: 188 GB per pair
- Network: PCIe Gen 5 x16 per GPU
- All-reduce beyond pair: PCIe/IB/Ethernet
H200 SXM and H200 NVL: Memory Upgrade, Same Topology
NVIDIA offers the H200 in both SXM and NVL form factors. The H200 SXM is used in the DGX H200, providing 8 GPUs with 141 GB HBM3e each (1,128 GB total) connected through the same NVSwitch fabric as the H100 SXM. The H200 NVL pairs two GPUs at 141 GB each (282 GB per pair) through the same direct bridge topology as the H100 NVL.
The NVLink topology is identical across both generations. The H200 does not introduce new NVLink lanes, new NVSwitch revisions, or different bridge architectures. The improvement is entirely in the memory subsystem: more capacity (141 GB vs 80/94 GB) and higher bandwidth (4.8 TB/s vs 3.35 TB/s per GPU). For cluster scaling, this means the NVLink topology analysis for H100 applies directly to H200 systems, with the added benefit that larger per-GPU memory can reduce the degree of parallelism required for a given model.
Why NVLink Domain Size Matters for Training
The NVLink domain size directly determines the maximum degree of tensor parallelism that can operate at NVLink speed. Tensor parallelism splits individual matrix operations across GPUs, requiring all-reduce communication after every layer's forward and backward pass. This communication is latency-sensitive and bandwidth-intensive.
In an 8-GPU SXM domain, tensor parallelism degree 8 (TP=8) keeps all tensor-parallel communication within NVLink at 900 GB/s. In a 2-GPU NVL domain, TP is limited to 2 before communication must cross PCIe or the network. For large language models where TP=8 is standard (GPT-3 scale and above), this means SXM is effectively mandatory for competitive training throughput.
The performance penalty for crossing the NVLink domain boundary is severe. NVLink delivers 900 GB/s; a single InfiniBand ConnectX-7 port delivers 50 GB/s (400 Gb/s). That is an 18x bandwidth reduction. Even with 8 InfiniBand ports per node (400 GB/s aggregate), the per-GPU effective bandwidth for cross-node tensor-parallel communication is 50 GB/s, a fraction of the within-domain bandwidth. This is why cluster architects keep tensor parallelism within the NVLink domain and use pipeline parallelism or data parallelism for cross-node scaling.
Theoretical Maximum Cluster Sizes
The theoretical maximum cluster size for H100 and H200 systems is bounded by the scale-out network, not by NVLink. NVLink defines the intra-node (or intra-pair) domain, while InfiniBand or Ethernet handles inter-node communication. NVIDIA's reference architectures and real-world deployments define practical limits.
DGX H100 (8x SXM): Up to 32,768 GPUs
The DGX H100 node contains 8 H100 SXM5 GPUs connected via NVSwitch, plus 8 ConnectX-7 InfiniBand adapters providing 400 Gb/s (50 GB/s) each. The standard scale-out fabric is NVIDIA Quantum-2 InfiniBand at 400 Gb/s NDR.
NVIDIA's DGX SuperPOD reference architecture specifies scalable unit pods. A single SuperPOD rack group contains 32 DGX H100 nodes (256 GPUs). Multiple SuperPODs connect through spine InfiniBand switches. The largest validated SuperPOD configuration is the Eos supercomputer at 576 DGX H100 nodes (4,608 GPUs).
In practice, Meta's AI Research SuperCluster (RSC) scaled to roughly 16,000 A100 GPUs over InfiniBand, and the company's subsequent Grand Teton-based H100 clusters each reached 24,576 GPUs. The theoretical limit commonly cited for a fat-tree InfiniBand topology with Quantum-2 switches is approximately 32,768 GPUs, constrained by switch port count and the number of tiers in the fabric.
DGX H100 Cluster Scale Reference
| Scale | Nodes | GPUs |
|---|---|---|
| Single DGX H100 node | 1 | 8 |
| SuperPOD scalable unit | 32 | 256 |
| Eos (largest validated SuperPOD) | 576 | 4,608 |
| Fat-tree theoretical maximum | ~4,096 | ~32,768 |
H100 NVL: Pair-Based Scaling
H100 NVL clusters scale by populating servers with multiple NVL pairs. Common server configurations include 4 GPUs (2 pairs), 8 GPUs (4 pairs), or 10 GPUs (5 pairs), depending on the OEM platform. Each pair has its own NVLink domain of 2 GPUs, and cross-pair communication uses PCIe-based interconnects or network adapters.
Since the NVLink domain is only 2 GPUs, the InfiniBand or Ethernet network handles a larger share of the communication burden compared to SXM clusters. The maximum cluster size is again limited by the network fabric rather than NVLink. Practically, NVL-based clusters can scale to thousands of GPUs, but training throughput per GPU is lower than equivalent SXM clusters for workloads that benefit from TP greater than 2.
The primary use case for large H100 NVL clusters is inference at scale, where the 2-GPU NVLink domain is sufficient for serving most models, and the higher per-GPU memory (94 GB on H100 NVL, 141 GB on H200 NVL) reduces the number of GPUs needed per model replica.
DGX H200: SXM Topology with Expanded Memory
The DGX H200 uses the same 8-GPU NVSwitch topology as the DGX H100 but with H200 GPUs providing 141 GB HBM3e each. Total node memory increases from 640 GB to 1,128 GB. The scale-out network remains 8x ConnectX-7 at 400 Gb/s.
For cluster scaling, the DGX H200 offers an interesting advantage: because each GPU has 76% more memory than the H100 SXM (141 GB vs 80 GB), the same model can be trained with fewer GPUs or with a lower degree of tensor parallelism. A model that required TP=8 on H100 (using all 8 GPUs in the NVLink domain for tensor parallelism) might work with TP=4 on H200, freeing the remaining 4 GPUs for increased data parallelism. This can improve cluster utilization and training efficiency.
The maximum cluster size for DGX H200 follows the same InfiniBand fabric constraints as DGX H100. NVIDIA's reference architecture supports the same SuperPOD configurations, with the cluster's theoretical maximum remaining in the range of 32,768 GPUs.
Speed Implications at Scale: NVLink vs InfiniBand Bandwidth
The performance of a large GPU cluster depends on two bandwidth tiers: the NVLink domain bandwidth for intra-node communication, and the scale-out network bandwidth for inter-node communication. The ratio between these two tiers determines the optimal parallelism strategy and the achievable training throughput.
Within the NVLink Domain: 900 GB/s
Inside the NVLink domain (8 GPUs on SXM, 2 GPUs on NVL), all-reduce operations run at 900 GB/s aggregate per GPU. For an 8-GPU ring all-reduce on SXM, the effective bandwidth utilization approaches 900 * (N-1)/N = 787.5 GB/s per GPU, where N=8. This bandwidth is sufficient to overlap communication with computation for most transformer layer sizes, achieving near-linear scaling within the node.
The latency of NVLink communication is approximately 2 to 5 microseconds for small messages, making it suitable for fine-grained tensor-parallel communication where every layer's activations must be synchronized.
Across InfiniBand: 400 GB/s per Node
A DGX H100/H200 node has 8 ConnectX-7 InfiniBand ports, each at 400 Gb/s (50 GB/s). The aggregate inter-node bandwidth is 400 GB/s per node, or 50 GB/s per GPU. This is 18x less bandwidth per GPU than NVLink.
For data-parallel all-reduce across 1,000 nodes (8,000 GPUs), the InfiniBand fabric must handle the gradient synchronization traffic. With ring all-reduce, each node sends and receives approximately 2 * (N-1)/N * gradient_size bytes per training step, which is essentially independent of node count. For a 175B parameter model (350 GB of FP16 gradients), that is roughly 700 GB per node per step; at 400 GB/s aggregate bandwidth, the exposed cost is on the order of 1.75 seconds, which must be overlapped with backward-pass computation. In practice, tensor and pipeline parallelism shard the gradients across the GPUs of each replica, cutting the per-node volume proportionally.
However, if tensor parallelism crosses the InfiniBand boundary (as it must for NVL clusters with TP greater than 2), the communication volume per step is much larger and latency-sensitive. This is where the 18x bandwidth gap becomes a critical bottleneck.
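A back-of-envelope version of this estimate, using the standard ring all-reduce cost in which each participant transfers 2 * (N-1)/N times the payload (a sketch that ignores overlap and hierarchical reductions):

```python
def ring_allreduce_seconds(payload_gb: float, n: int, bw_gbps: float) -> float:
    """Exposed time for one ring all-reduce: each of n participants sends
    and receives 2*(n-1)/n * payload at bw_gbps (GB/s per participant)."""
    return 2 * (n - 1) / n * payload_gb / bw_gbps

# 350 GB of FP16 gradients (175B params) across 1,000 nodes at 400 GB/s/node:
print(ring_allreduce_seconds(350, 1000, 400))  # ~1.75 s per step to hide
# The same payload inside an 8-GPU NVLink domain at 900 GB/s per GPU:
print(ring_allreduce_seconds(350, 8, 900))     # ~0.68 s
```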
Bandwidth Tier Comparison (per GPU)
| Interconnect | Bandwidth (per GPU) | Latency | Use Case |
|---|---|---|---|
| NVLink 4.0 (SXM) | 900 GB/s | 2 to 5 us | Tensor parallelism |
| InfiniBand NDR400 | 50 GB/s (single port) | 1 to 3 us | Pipeline, data parallelism |
| PCIe Gen 5 x16 | 64 GB/s per direction | ~1 us | Host memory, NVL inter-pair |
| Ethernet 400GbE | 50 GB/s (single port) | 5 to 15 us | Data parallelism (RoCE) |
The NVLink-to-InfiniBand Ratio and Parallelism Strategy
The 18:1 ratio of NVLink to InfiniBand bandwidth per GPU is the key number for cluster architects. It means that communication-intensive parallelism strategies (tensor parallelism) must stay within the NVLink domain, while less communication-intensive strategies (data parallelism, pipeline parallelism) can tolerate the network.
For a 175B parameter model on a 256-GPU DGX H100 cluster (32 nodes), a typical configuration is TP=8 (within each node), PP=4 (across 4 nodes), DP=8 (8 data-parallel replicas). The tensor-parallel communication (all-reduce per layer) stays within NVLink at 900 GB/s. The pipeline-parallel communication (activation transfers between stages) crosses InfiniBand, but the volume is smaller (only the boundary activations, not the full gradient). The data-parallel gradient all-reduce also crosses InfiniBand but can be overlapped with backward pass computation.
On an equivalent H100 NVL cluster, TP is limited to 2. This forces more pipeline stages or requires expert parallelism to distribute the model, increasing the communication that must traverse the network. The result is lower per-GPU throughput (measured in tokens per second per GPU) compared to an SXM cluster running the same model.
NVLink Domain Size and Tensor Parallelism Degree
Tensor parallelism (TP) splits individual weight matrices across multiple GPUs. For a transformer model, this means each GPU holds a shard of every layer's weights, and the GPUs must perform an all-reduce operation after each matrix multiplication to combine partial results. The communication volume per layer is proportional to the hidden dimension size, and this communication happens on the critical path, meaning it cannot be hidden behind computation.
The practical maximum TP degree is equal to the NVLink domain size. Going beyond the NVLink domain forces tensor-parallel all-reduce operations across InfiniBand, which reduces throughput by the bandwidth ratio (approximately 18x). Research from NVIDIA, Microsoft, and Google consistently shows that TP should never exceed the NVLink domain size for optimal performance.
Tensor Parallelism Degree by Platform
| Platform | NVLink Domain | Max TP (practical) | Model Size at TP Max |
|---|---|---|---|
| H100 NVL | 2 GPUs | TP=2 | ~35B (FP16) |
| H200 NVL | 2 GPUs | TP=2 | ~55B (FP16) |
| DGX H100 (SXM) | 8 GPUs | TP=8 | ~140B (FP16) |
| DGX H200 (SXM) | 8 GPUs | TP=8 | ~220B (FP16) |
| DGX GB200 NVL72 | 72 GPUs | TP=72 | ~2T+ (FP16) |
Model size estimates assume weight sharding only. Actual per-GPU memory must also accommodate optimizer states, activations, and KV cache. FP16 weight sizes are computed as 2 bytes per parameter.
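The table's capacity estimates can be reproduced approximately by assuming only ~40% of domain memory is available for weights at 2 bytes per parameter (the 40% fraction is an assumption chosen to match the figures above, not an NVIDIA formula):

```python
def weight_capacity_bparams(num_gpus: int, mem_gb_per_gpu: float,
                            usable_fraction: float = 0.4,
                            bytes_per_param: int = 2) -> float:
    """Billions of FP16 parameters whose weights fit in an NVLink domain,
    reserving the rest of memory for activations, optimizer state, or KV cache."""
    usable_gb = num_gpus * mem_gb_per_gpu * usable_fraction
    return usable_gb / bytes_per_param  # GB / (bytes per param) = billions

print(weight_capacity_bparams(2, 94))   # H100 NVL pair: close to the ~35B row
print(weight_capacity_bparams(8, 141))  # DGX H200 domain: close to the ~220B row
```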
For models larger than the TP-limited capacity of the NVLink domain, pipeline parallelism (PP) distributes different layers across different nodes. PP communication volume is much smaller (only boundary activations between pipeline stages), making it tolerable over InfiniBand. The combination of TP within the NVLink domain and PP across nodes is the standard approach for training large language models on Hopper-class hardware.
Expert parallelism (EP), used in mixture-of-experts models, introduces a third dimension. EP distributes different experts across GPUs, with all-to-all communication routing tokens to the appropriate expert. This communication pattern benefits from large NVLink domains because the all-to-all traffic can stay within NVLink when TP and EP are co-located in the same domain.
Comparison to Blackwell: NVLink 5.0 and the 72-GPU Domain
NVIDIA's Blackwell architecture introduces NVLink 5.0, which doubles per-GPU bandwidth to 1,800 GB/s bidirectional (18 links at 100 GB/s each, enabled by doubled signaling rates). This alone would be a significant improvement, but the more transformative change is the NVLink domain size.
The DGX GB200 NVL72 configuration connects 72 Blackwell GPUs through fifth-generation NVSwitch chips in a single NVLink domain. This is a 9x increase over Hopper's 8-GPU domain. All 72 GPUs can perform all-reduce operations entirely within NVLink, without touching InfiniBand.
Generational NVLink Comparison
| Specification | Hopper (H100/H200) | Blackwell (B200/GB200) |
|---|---|---|
| NVLink generation | 4.0 | 5.0 |
| BW per GPU (bidirectional) | 900 GB/s | 1,800 GB/s |
| NVLink domain (SXM/NVL72) | 8 GPUs | 72 GPUs |
| Max TP degree (NVLink only) | 8 | 72 |
| NVSwitch generation | 3rd gen | 5th gen |
| Aggregate NVLink domain BW | 7.2 TB/s | 129.6 TB/s |
The 72-GPU NVLink domain changes the scaling calculus fundamentally. With Hopper, a 1-trillion-parameter model requires multiple nodes for tensor parallelism (TP=8 per node, pipeline across nodes). With Blackwell, the same model fits in a single 72-GPU NVLink domain, keeping all tensor-parallel communication at NVLink bandwidth. The result is dramatically higher training throughput for very large models, because the pipeline-parallel overhead (bubble time between stages) is eliminated for models that fit within the NVLink domain.
For organizations currently deploying H100 or H200 clusters, the Blackwell comparison is relevant for capacity planning. If you are building a cluster today for a workload that will grow over 2 to 3 years, the NVLink domain size limitation of Hopper will become increasingly significant as model sizes grow. NVIDIA DGX systems offer a path to Blackwell when the next generation becomes available, while Hopper remains the optimal choice for immediate deployment needs.
Cluster Sizing Guide: Model Parameters to Hardware
Sizing a GPU cluster requires mapping your model's parameter count, training data volume, and time-to-train target to specific hardware configurations. The interconnect topology determines which configurations are feasible and which will bottleneck on communication.
Step 1: Calculate Memory Requirements
For mixed-precision training (FP16 weights, FP32 optimizer states), the minimum GPU memory per parameter is approximately 18 to 20 bytes when using Adam optimizer. This accounts for 2 bytes for FP16 weights, 4 bytes for FP32 master weights, 4 bytes for FP32 first moment, 4 bytes for FP32 second moment, and 2 to 4 bytes for gradients, plus activation memory that depends on batch size and sequence length.
A 70B parameter model requires approximately 1.26 to 1.4 TB of GPU memory for training. A single DGX H100 node provides 640 GB, so a minimum of 3 nodes (24 GPUs) is needed. A DGX H200 node provides 1,128 GB, so 2 nodes (16 GPUs) may suffice with activation checkpointing.
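The Step 1 arithmetic as a small helper (using the 20-byte upper end of the range stated above; activation memory is excluded, as in the text):

```python
import math

BYTES_PER_PARAM = 20  # FP16 weights (2) + FP32 master weights (4)
                      # + Adam first/second moments (4 + 4) + gradients and headroom

def training_nodes_needed(params_billion: float, node_mem_gb: float) -> int:
    """Minimum node count whose pooled GPU memory covers model state."""
    state_gb = params_billion * BYTES_PER_PARAM  # 1e9 params x bytes -> GB
    return math.ceil(state_gb / node_mem_gb)

print(training_nodes_needed(70, 640))    # DGX H100 (8 x 80 GB): 3 nodes
print(training_nodes_needed(70, 1128))   # DGX H200 (8 x 141 GB): 2 nodes
```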
Step 2: Choose Parallelism Strategy
Based on the memory requirement and NVLink domain size, determine the parallelism configuration:
- Model fits in 1 GPU: Use data parallelism only. Any form factor works.
- Model fits in NVLink domain (2 GPUs for NVL, 8 for SXM): Use TP within domain, DP across nodes.
- Model exceeds NVLink domain: Use TP within domain + PP across nodes + DP for replicas. SXM strongly preferred.
- 1T+ parameter model: Full 3D parallelism (TP + PP + DP) mandatory. SXM required for competitive throughput. Consider waiting for Blackwell NVL72 if timeline permits.
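The decision rules above can be sketched as a simple heuristic (illustrative only; production configurations are tuned empirically and must also account for activation memory and batch size):

```python
import math

def choose_parallelism(model_state_gb: float, gpu_mem_gb: float,
                       nvlink_domain: int) -> dict:
    """Grow TP up to the NVLink domain size, then add pipeline stages
    across nodes; assumes model state shards evenly across GPUs."""
    shards = math.ceil(model_state_gb / gpu_mem_gb)  # GPUs needed to hold the model
    tp = min(shards, nvlink_domain)
    pp = math.ceil(shards / tp)
    return {"tp": tp, "pp": pp, "gpus_per_replica": tp * pp}

# ~1.4 TB of training state for a 70B model on H100 SXM (80 GB per GPU):
print(choose_parallelism(1400, 80, nvlink_domain=8))
# -> TP=8 within the node, PP=3 across nodes, 24 GPUs per replica
```

Data parallelism then replicates this 24-GPU unit as many times as the cluster allows.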
Step 3: Estimate Training Time
Training time in GPU-hours can be estimated from the scaling-law approximation: total compute is roughly 6 * N * D FLOPs, where N is parameter count and D is token count, divided by effective per-GPU throughput (and by 3,600 to convert seconds to hours). For an H100 SXM, effective throughput is roughly 40% of the theoretical 989 TFLOPS dense FP16 (about 396 TFLOPS), reduced further by communication overhead (~85% scaling efficiency); on that basis, a 70B model trained on 2T tokens requires approximately 700,000 GPU-hours.
On a 512-GPU DGX H100 cluster (64 nodes), 700,000 GPU-hours corresponds to approximately 60 days of continuous training at 95% cluster uptime. On a 1,024-GPU cluster, this drops to approximately 30 days.
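The Step 3 estimate, with the 40% MFU and 85% communication-efficiency factors from the text folded into effective throughput (a rough planning sketch, not a benchmark):

```python
def training_gpu_hours(params: float, tokens: float, peak_tflops: float = 989,
                       mfu: float = 0.40, comm_eff: float = 0.85) -> float:
    """Total compute (6*N*D FLOPs) divided by effective per-GPU FLOP/s."""
    effective_flops = peak_tflops * 1e12 * mfu * comm_eff
    return 6 * params * tokens / effective_flops / 3600

def wall_clock_days(gpu_hours: float, num_gpus: int, uptime: float = 0.95) -> float:
    return gpu_hours / (num_gpus * uptime * 24)

hours = training_gpu_hours(70e9, 2e12)          # 70B params, 2T tokens
print(f"{hours:,.0f} GPU-hours")                # on the order of 700,000
print(f"{wall_clock_days(hours, 512):.0f} days on 512 GPUs")
```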
Quick Sizing Reference
| Model Size | Minimum GPUs (H100 SXM) | Recommended TP | Form Factor |
|---|---|---|---|
| 7B | 1 | TP=1 | NVL or SXM |
| 13B | 2 | TP=2 | NVL or SXM |
| 70B | 8 | TP=8 | SXM required |
| 175B | 32 | TP=8, PP=4 | SXM required |
| 540B | 128 | TP=8, PP=16 | SXM required |
| 1T+ | 512+ | TP=8, PP=32+ | SXM required, consider Blackwell |
GPU counts in this table are weight-capacity minimums for the listed parallelism layout, assuming sharded optimizer states and activation checkpointing; without those optimizations, minimums rise (per Step 1, full mixed-precision training of a 70B model needs roughly 24 H100 GPUs).
Step 4: Network Fabric Design
For clusters with more than one node, the InfiniBand fabric design is critical. NVIDIA recommends a rail-optimized topology for DGX clusters: each of the 8 GPUs in a node connects to a dedicated InfiniBand rail, and each rail uses a separate leaf switch. This ensures that GPU 0 on every node communicates through the same switch, GPU 1 through another switch, and so on. This topology optimizes for the data-parallel all-reduce pattern, where GPUs at the same position across nodes exchange gradients.
For clusters over 256 GPUs, a two-tier (leaf-spine) InfiniBand fabric is typical. For clusters over 2,048 GPUs, a three-tier fat-tree is required to maintain full bisection bandwidth. The network fabric cost can represent 20% to 30% of the total cluster cost, making it a significant factor in the total cost of ownership.
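The tier thresholds follow from the standard full-bisection fat-tree capacity of radix^tiers / 2^(tiers-1) endpoints (NVIDIA Quantum-2 switches expose 64 ports at 400 Gb/s):

```python
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a full-bisection fat-tree built from
    switches with `radix` ports."""
    return radix ** tiers // 2 ** (tiers - 1)

print(fat_tree_endpoints(64, 2))  # two-tier leaf-spine: 2,048 endpoints
print(fat_tree_endpoints(64, 3))  # three-tier fat-tree: 65,536 endpoints
```

Practical deployments, such as the ~32,768-GPU figure cited earlier, stop well short of the three-tier theoretical maximum for reasons of cost, cabling, and routing.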
Petronella Technology Group designs and deploys complete GPU cluster solutions, from single-node AI development systems to multi-rack InfiniBand-connected clusters. Our team handles the complete stack: hardware procurement, rack and power infrastructure, InfiniBand fabric design, NVLink topology validation, NVIDIA Base Command configuration, and distributed training software optimization.
Frequently Asked Questions
What is the maximum NVLink domain size for the H100 SXM?
In the H100 SXM (DGX H100) configuration, the maximum NVLink domain is 8 GPUs. All 8 GPUs are fully connected through four third-generation NVSwitch chips on the baseboard, giving every GPU 900 GB/s bidirectional bandwidth to every other GPU in the domain. Scaling beyond 8 GPUs requires InfiniBand or Ethernet networking across nodes.
How do the H100 NVL and H100 SXM NVLink topologies differ?
The H100 NVL uses direct NVLink bridges to connect a pair of GPUs, while the H100 SXM uses NVSwitch to create a fully connected mesh of 8 GPUs. The NVL configuration dedicates all 18 NVLink lanes per GPU to a single peer, delivering 900 GB/s between the pair. The SXM configuration distributes 18 lanes across NVSwitch to reach all 7 other GPUs, still delivering 900 GB/s aggregate per GPU but shared across multiple peers.
What happens to bandwidth when NVLink lanes are split across two bridges?
When 18 NVLink lanes are split across two bridges, each bridge carries approximately 9 lanes, providing roughly 450 GB/s bidirectional per bridge instead of the full 900 GB/s on a single connection. This enables connecting to two peers rather than one, but each individual link has half the bandwidth. This tradeoff is fundamental to understanding why SXM systems use NVSwitch instead of direct bridges for 8-GPU connectivity.
Can H100 NVL clusters scale as large as SXM clusters?
Yes, H100 NVL clusters can scale to thousands of GPUs using InfiniBand or Ethernet for inter-node communication, just like SXM clusters. However, the NVLink domain size is only 2 GPUs (a single pair) versus 8 GPUs in SXM. This means NVL clusters must use network-based communication for any operation involving more than 2 GPUs, while SXM clusters can perform 8-GPU all-reduce operations entirely within the NVLink domain. For large language model training, SXM clusters typically achieve higher training throughput per GPU because tensor parallelism across 8 GPUs stays within the NVLink domain.
How does NVLink bandwidth compare to InfiniBand in a DGX H100?
Inside a DGX H100, NVLink provides 900 GB/s bidirectional per GPU (7.2 TB/s aggregate across 8 GPUs). Each node has eight 400 Gb/s (50 GB/s) InfiniBand ConnectX-7 ports, providing 400 GB/s aggregate inter-node bandwidth. The NVLink-to-InfiniBand ratio is approximately 18:1 per GPU (900 GB/s vs 50 GB/s per port), or about 18:1 for the full node (7.2 TB/s vs 400 GB/s). This ratio determines the ideal balance between tensor parallelism (NVLink-bound) and pipeline or data parallelism (network-bound).
What does the H200 change for cluster scaling compared to the H100?
The H200 retains the same NVLink 4.0 topology as the H100 (900 GB/s, 18 links, same NVSwitch architecture) but upgrades the memory subsystem from 80 GB HBM3 to 141 GB HBM3e per GPU, with memory bandwidth increasing from 3.35 TB/s to 4.8 TB/s. For cluster scaling, the larger memory per GPU means larger model shards per device, which reduces the degree of tensor parallelism required and can reduce inter-GPU communication volume. An 8-GPU DGX H200 node provides 1,128 GB of total GPU memory versus 640 GB on the DGX H100.
How does Blackwell change NVLink scaling?
NVIDIA Blackwell introduces NVLink 5.0 at 1,800 GB/s bidirectional per GPU, double the Hopper generation. More importantly, the NVLink domain expands from 8 GPUs to 72 GPUs in the DGX GB200 NVL72 configuration using fifth-generation NVSwitch. This means 72 GPUs can perform all-reduce operations entirely within the NVLink domain, dramatically reducing reliance on InfiniBand for large-scale training. The combination of 2x bandwidth per GPU and 9x larger NVLink domain fundamentally changes the scaling math for distributed training.
Ready to Design Your GPU Cluster?
Petronella Technology Group designs and deploys H100, H200, and Blackwell GPU clusters for AI training and inference. Our CMMC-RP certified team handles everything from NVLink topology validation to InfiniBand fabric design to compliance hardening.
Whether you need a single DGX node or a multi-rack SuperPOD, we configure hardware to match your model architecture, parallelism strategy, and performance targets.
Related Hardware Resources
NVIDIA DGX Systems
DGX B300, B200, H200, and DGX Station. The complete DGX lineup for enterprise AI.
AI Development Systems
Custom-configured AI workstations and servers for development, fine-tuning, and inference.
Tensor vs Pipeline Parallelism
Deep dive into distributed training strategies and how they map to GPU interconnect topology.
SXM Total Cost of Ownership
Full TCO analysis for SXM-based GPU systems including power, cooling, and networking.