
Tensor vs Pipeline Parallelism: Choosing the Right Strategy for Your AI Cluster

When a model outgrows a single GPU, you face a fundamental engineering decision: split layers across GPUs (tensor parallelism) or split the model into stages (pipeline parallelism). The bandwidth of the interconnect between your GPUs determines which approach wins. This guide provides the math, the specs, and the real-world configurations to make the right call.

Tensor Parallelism

Splitting Individual Layers Across GPUs

Tensor parallelism (TP) takes a single layer's weight matrix and slices it across multiple GPUs. Every GPU holds a shard of every layer. When a forward pass runs, each GPU computes its portion of the matrix multiplication, then all GPUs synchronize their partial results through an all-reduce collective operation before moving to the next layer.

Consider a transformer's self-attention layer. The query, key, and value projection matrices each have dimensions [hidden_size x hidden_size]. For a 70B model like Llama 2 70B, hidden_size is 8,192. With 8-way tensor parallelism, each GPU holds a [8192 x 1024] slice of each projection matrix. After each GPU computes its partial output, the results must be combined through an all-reduce to reconstruct the full attention output.

This pattern repeats for every layer in the network. A 70B model has 80 transformer layers, meaning at least 80 all-reduce operations per forward pass (in practice two per layer, one after attention and one after the FFN block) and as many again during the backward pass. That is 160 or more synchronization points per training step where every GPU must wait for every other GPU to finish its computation and exchange data.

The critical requirement: Tensor parallelism demands extremely high bandwidth and low latency between GPUs because all-reduce operations happen after every single layer. If the interconnect is slow, GPUs spend more time waiting for data than computing.
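The mechanics are easy to verify numerically. The sketch below simulates a row-parallel linear layer split four ways, with the all-reduce modeled as a plain sum of partial outputs; sizes are scaled down from the article's hidden_size of 8,192, and NumPy stands in for the GPU math.

```python
import numpy as np

# Toy 4-way tensor parallelism on one linear layer (row-parallel split).
# Each "GPU" holds a contiguous block of the weight rows and computes a
# partial output; the all-reduce is modeled as a plain sum.
hidden, tp = 64, 4                # scaled down from hidden_size=8192
shard = hidden // tp

rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden))        # [batch, hidden]
w = rng.standard_normal((hidden, hidden))   # full weight, for reference only

# Rank i sees only its slice of the input features and weight rows.
partials = [
    x[:, i * shard:(i + 1) * shard] @ w[i * shard:(i + 1) * shard, :]
    for i in range(tp)
]

y = sum(partials)                 # the "all-reduce" step
assert np.allclose(y, x @ w)      # matches the unsharded layer exactly
```

The same pattern holds on real hardware, except the sum runs over NVLink as an NCCL all-reduce instead of a Python `sum`.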

Tensor Parallelism: 4-Way TP on One Layer

Input Tensor X [batch, seq, 8192]
        |
        v
+-----------+-----------+-----------+-----------+
|   GPU 0   |   GPU 1   |   GPU 2   |   GPU 3   |
| W[0:2048] | W[2048:   | W[4096:   | W[6144:   |
|           |    4096]  |    6144]  |    8192]  |
|    Y_0    |    Y_1    |    Y_2    |    Y_3    |
+-----------+-----------+-----------+-----------+
      |           |           |           |
      +-----------+-----+-----+-----------+
                        |
                  ALL-REDUCE
                        |
                        v
           Y = Y_0 + Y_1 + Y_2 + Y_3
                        |
                        v
           Next layer (repeat all-reduce)
                        |
                       ...
           80 layers = 80 all-reduce ops

Key Characteristics

  • Every GPU participates in every forward/backward pass
  • All-reduce after every layer (collective, not point-to-point)
  • Latency-sensitive: one slow GPU stalls the entire group
  • Memory efficient: each GPU stores only 1/N of each layer

Pipeline Parallelism: 4 Stages Across 4 GPUs

Model: 80 layers total

GPU 0 (Stage 1): Layers  0-19  --> activation
GPU 1 (Stage 2): Layers 20-39  --> activation
GPU 2 (Stage 3): Layers 40-59  --> activation
GPU 3 (Stage 4): Layers 60-79  --> output

Time -->
         MB1    MB2    MB3    MB4
GPU 0:  [FWD]  [FWD]  [FWD]  [FWD]  [BWD]...
GPU 1:  [idle] [FWD]  [FWD]  [FWD]  [FWD]...
GPU 2:  [idle] [idle] [FWD]  [FWD]  [FWD]...
GPU 3:  [idle] [idle] [idle] [FWD]  [FWD]...
         ^^^^^^^^^^^^^^^^^^^
         Pipeline "bubble" (idle time)
                        

Key Characteristics

  • Each GPU processes a different subset of layers
  • Point-to-point transfers: only activation tensors between adjacent stages
  • Bandwidth tolerant: works over InfiniBand and even Ethernet
  • Pipeline bubbles reduce efficiency (mitigated by micro-batching)

Pipeline Parallelism

Splitting the Model Into Sequential Stages

Pipeline parallelism (PP) divides the model vertically: the first N layers go on GPU 0, the next N layers on GPU 1, and so on. Data flows through the pipeline sequentially. GPU 0 processes a micro-batch, sends the resulting activation tensor to GPU 1, then immediately starts processing the next micro-batch. Each GPU operates on a different micro-batch at any given time.

The communication pattern is fundamentally different from tensor parallelism. Instead of all-reduce operations involving every GPU after every layer, pipeline parallelism uses simple point-to-point sends between adjacent stages. GPU 0 sends to GPU 1, GPU 1 sends to GPU 2, and so on. The data transferred is an activation tensor, typically much smaller than the collective data moved during an all-reduce.

For that same 70B model split across 4 pipeline stages, each stage handles 20 layers. The activation tensor passed between stages has dimensions [micro_batch_size x sequence_length x hidden_size]. With a micro-batch of 1, sequence length of 4,096, and hidden_size of 8,192, that activation is about 67 MB in FP16. Compare this to tensor parallelism's all-reduce, which must exchange partial sums totaling hundreds of megabytes across all GPUs simultaneously.

The tradeoff: Pipeline parallelism is bandwidth-friendly but time-inefficient. GPUs at later stages sit idle while earlier stages process the first micro-batches. This idle time is called the pipeline bubble, and minimizing it is the central challenge of pipeline parallelism.
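The fill-drain schedule is simple enough to simulate. This toy (forward passes only) prints the staircase pattern from the diagram above and counts the bubble slots:

```python
# Toy fill-drain schedule for a p-stage pipeline, forward passes only.
# Stage s runs micro-batch t - s at time step t; "--" marks a bubble slot.
def forward_schedule(p, m):
    steps = p + m - 1                      # time to push m micro-batches through
    return [
        ["F%d" % (t - s) if 0 <= t - s < m else "--" for t in range(steps)]
        for s in range(p)
    ]

rows = forward_schedule(4, 4)
for s, row in enumerate(rows):
    print("GPU %d: %s" % (s, " ".join(row)))

# Every stage idles for p - 1 slots: p * (p - 1) = 12 bubble slots here.
assert sum(row.count("--") for row in rows) == 4 * 3
```

With more micro-batches the staircase at each end stays the same size while the busy middle grows, which is exactly why micro-batching shrinks the bubble fraction.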

The Deciding Factor

Interconnect Bandwidth Determines Everything

The entire tensor-vs-pipeline decision collapses to a single question: what is the bandwidth of the link between GPUs? Three interconnect technologies define three distinct regimes.

NVLink (Within a Node)

NVLink 5.0 (Blackwell) 1,800 GB/s
NVLink 4.0 (Hopper) 900 GB/s
NVLink 3.0 (Ampere) 600 GB/s

Tensor Parallelism: Excellent

This bandwidth makes all-reduce operations fast enough that communication overhead stays a small fraction of total step time (roughly 2-20%, depending on batch size). Tensor parallelism is the default within NVLink-connected nodes.

InfiniBand (Between Nodes)

ConnectX-7, single port 400 Gb/s (50 GB/s)
ConnectX-7, dual bonded 800 Gb/s (100 GB/s)
vs NVLink 4.0 (900 GB/s) 9x to 18x slower

Pipeline Parallelism: Ideal

Point-to-point activation transfers are small enough that InfiniBand handles them with minimal overhead. Tensor parallelism across InfiniBand, by contrast, creates severe bottlenecks at every all-reduce.

PCIe (Fallback)

PCIe Gen 5 x16 64 GB/s bidir
PCIe Gen 4 x16 32 GB/s bidir
vs NVLink 4.0 (900 GB/s) 14x to 28x slower

Tensor Parallelism: Not Viable

PCIe cannot sustain the all-reduce throughput that tensor parallelism requires. Only pipeline parallelism with small activation tensors is practical over PCIe interconnects.

The Math: Why Tensor Parallelism Fails Across InfiniBand

Consider 8-way tensor parallelism for a 70B model (Llama 2 70B architecture: 80 layers, hidden_size 8192, FP16 weights). Each all-reduce operation transfers 2 x (N-1)/N x message_size bytes using the ring all-reduce algorithm, where N is the number of GPUs.

Over NVLink (900 GB/s)

Message per all-reduce: ~1.6 GB
Ring all-reduce data:   2 x (7/8) x 1.6 GB = 2.8 GB
NVLink bandwidth:       900 GB/s
Time per all-reduce:    2.8 / 900 = 3.1 ms

80 layers x 2 (fwd+bwd) = 160 all-reduces
Total comm time:        160 x 3.1 ms = 496 ms

H100 compute time:      ~2,400 ms per step
Comm overhead:          496 / 2400 = 20.7%
                        

Viable: communication is a fraction of compute.

Over InfiniBand (50 GB/s single port)

Message per all-reduce: ~1.6 GB
Ring all-reduce data:   2 x (7/8) x 1.6 GB = 2.8 GB
IB bandwidth:           50 GB/s (single ConnectX-7)
Time per all-reduce:    2.8 / 50 = 56 ms

80 layers x 2 (fwd+bwd) = 160 all-reduces
Total comm time:        160 x 56 ms = 8,960 ms

H100 compute time:      ~2,400 ms per step
Comm overhead:          8960 / 2400 = 373%
                        

Disastrous: GPUs idle 3.7x longer than they compute.

The 18x gap is the entire story. NVLink at 900 GB/s makes tensor parallelism communication a manageable overhead. InfiniBand at 50 GB/s turns it into the dominant cost. Even with dual-bonded 800 Gb/s InfiniBand (100 GB/s), the overhead is still 9x higher than NVLink, making tensor parallelism across nodes impractical for models with many layers.
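The arithmetic above packs into a few lines. This sketch assumes the text's numbers: a ~1.6 GB message (which implies a larger global batch than the micro-batch-of-1 example later in the guide), 8 GPUs, 160 all-reduces per step, and ~2.4 s of H100 compute per step.

```python
# Re-derive the NVLink-vs-InfiniBand all-reduce comparison above.
def ring_allreduce_seconds(message_gb, n_gpus, bw_gb_s):
    # Ring all-reduce moves 2 * (N - 1) / N times the message size.
    return 2 * (n_gpus - 1) / n_gpus * message_gb / bw_gb_s

ops, compute_s = 80 * 2, 2.4   # 160 all-reduces per step, H100 compute time

for name, bw in [("NVLink 900 GB/s", 900), ("InfiniBand 50 GB/s", 50)]:
    t = ops * ring_allreduce_seconds(1.6, 8, bw)
    print(f"{name}: {t * 1e3:.0f} ms comm, {t / compute_s:.0%} overhead")
```

Swapping in 100 GB/s for dual-bonded InfiniBand still yields several seconds of communication per step, confirming why cross-node TP is avoided.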

When to Use Each Strategy

The decision follows directly from the interconnect between your GPUs. Here are the three scenarios and the optimal strategy for each.


Tensor Parallelism: Within a Single NVLink Node

Use tensor parallelism when your model's individual layers are too large for a single GPU's memory and all GPUs are connected via NVLink within a single server node. This is the default for systems like NVIDIA DGX with 8 GPUs connected by NVLink at 900 GB/s. Tensor parallelism is the best way to spread a single large layer across multiple GPUs because every GPU contributes to every computation, maximizing utilization.

Typical tensor parallelism degrees: 2-way, 4-way, or 8-way, matching the number of GPUs in a single node. Going beyond 8-way TP is unusual because most server nodes max out at 8 GPUs, and extending TP across nodes introduces the InfiniBand bottleneck described above. For H100 and H200 NVLink clusters, 8-way TP within a node is the standard configuration.

Best for:

Models where each layer exceeds single-GPU memory (hidden_size > 12,288 in FP16 typically). Examples: Llama 70B, Falcon 180B, GPT-4 class models. Run 8-way TP within a DGX node.


Pipeline Parallelism: Across Nodes Over InfiniBand or Ethernet

Use pipeline parallelism when you need more GPUs than fit in a single node and must communicate across InfiniBand or Ethernet. Pipeline parallelism sends only activation tensors between adjacent pipeline stages. These tensors are small (tens of megabytes) compared to the gigabytes moved by tensor parallelism all-reduce operations, making pipeline parallelism tolerant of lower-bandwidth interconnects.

Pipeline parallelism across 2, 4, 8, or 16 nodes is common in large-scale training. The main cost is pipeline bubbles, not communication overhead. With proper micro-batching (32 or more micro-batches per step), bubble overhead drops below 10%, making pipeline parallelism highly efficient even over modest network links.

Best for:

Multi-node clusters connected by InfiniBand (400-800 Gb/s) or high-speed Ethernet (100-400 GbE). Examples: 2 to 64 nodes of 8xH100, multi-GPU inference with RTX 6000 Pro across servers.


Hybrid (TP + PP): What Real Training Runs Actually Use

Nearly every large-scale training run uses a hybrid approach: tensor parallelism within each NVLink-connected node and pipeline parallelism across nodes over InfiniBand. This exploits the strength of each strategy exactly where its interconnect supports it. High-bandwidth NVLink handles the frequent all-reduce operations of tensor parallelism. Lower-bandwidth InfiniBand handles the occasional, small activation transfers of pipeline parallelism.

Most production configurations add a third dimension: data parallelism (DP) across groups of pipeline-parallel GPUs. The full notation is TP x PP x DP. For example, TP=8, PP=4, DP=16 uses 512 GPUs total (8 x 4 x 16). Each group of 32 GPUs (8 TP x 4 PP) processes one copy of the model, and 16 such groups process different data shards in parallel, synchronizing gradients periodically.

Best for:

Any training run requiring more than 8 GPUs. This is the standard for models above 30B parameters. Meta's Llama 3.1 405B used TP=8, PP=16, DP=128 across 16,384 H100 GPUs.

The Pipeline Bubble Problem and How to Solve It

Pipeline parallelism's biggest weakness is idle GPU time. Understanding the bubble fraction and the techniques to minimize it is essential for efficient cluster utilization.

The Bubble Fraction Formula

With p pipeline stages and m micro-batches per training step, the bubble fraction is:

bubble_fraction = (p - 1) / m
                        

With 4 pipeline stages and only 4 micro-batches, the bubble fraction is 3/4 = 75%: idle time equal to three-quarters of the useful compute time (as a share of the total step, (p-1)/(m+p-1) = 3/7, about 43%). Either way, the waste is severe. The solution is straightforward: increase the number of micro-batches.

p=4, m=4 75.0% bubble
p=4, m=8 37.5% bubble
p=4, m=16 18.8% bubble
p=4, m=32 9.4% bubble
p=4, m=64 4.7% bubble
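The table follows directly from the formula; a short script reproduces it:

```python
# bubble_fraction = (p - 1) / m, as defined above.
def bubble_fraction(p, m):
    return (p - 1) / m

for m in (4, 8, 16, 32, 64):
    print(f"p=4, m={m:2d}: {bubble_fraction(4, m):5.1%} bubble")
```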

Mitigation Techniques

Micro-Batching

Split the global batch into many micro-batches. With 32+ micro-batches per step, the bubble fraction drops below 10%. The tradeoff: more micro-batches means more activation memory stored simultaneously, since each stage must hold activations for all in-flight micro-batches during the backward pass. Activation checkpointing (recomputation) mitigates this memory cost.

1F1B (One Forward, One Backward) Scheduling

Instead of running all forward passes first and then all backward passes (the "fill-drain" approach), 1F1B interleaves forward and backward passes. After the pipeline fills, each GPU alternates between one forward micro-batch and one backward micro-batch. This keeps the peak activation memory bounded to p micro-batches instead of m, significantly reducing memory pressure while maintaining the same bubble fraction.

Interleaved Stages

Assign each GPU multiple non-contiguous stages (for example, GPU 0 gets stages 1 and 5, GPU 1 gets stages 2 and 6). This effectively doubles the number of pipeline stages while keeping the same number of GPUs. The bubble fraction becomes (p-1)/(m*v) where v is the number of virtual stages per GPU. Megatron-LM implements this as "interleaved pipeline parallelism," reducing bubble overhead by 2x to 4x compared to standard scheduling.
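Under the interleaved formula quoted above, doubling v halves the bubble at the same micro-batch count. A quick check:

```python
# Interleaved bubble: (p - 1) / (m * v), with v virtual stages per GPU.
def interleaved_bubble(p, m, v=1):
    return (p - 1) / (m * v)

print(interleaved_bubble(4, 16))     # 0.1875  (standard schedule)
print(interleaved_bubble(4, 16, 2))  # 0.09375 (two virtual stages: halved)
```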

Zero Bubble Pipeline Parallelism

Recent research (Qi et al., 2023) introduced zero-bubble scheduling by splitting the backward pass into two parts: backward for input gradients (B) and backward for weight gradients (W). By reordering B and W computations across micro-batches, GPUs can stay occupied during what would otherwise be bubble time. This achieves near-zero bubble overhead in theory, with practical implementations reaching below 3% bubble fraction.

Real-World Configuration Examples

Practical parallelism configurations for common model sizes and cluster topologies. These assume FP16/BF16 training with activation checkpointing enabled.

Hybrid TP+PP

70B Model on 2 Nodes of 8xH100 (16 GPUs)

A 70B parameter model in FP16 requires approximately 140 GB for weights alone. With optimizer states (Adam: 2x weights for momentum and variance in FP32), total memory per model copy is roughly 560 GB. Across 16 H100 GPUs (80 GB each = 1,280 GB total), this leaves comfortable headroom for activations.

Configuration: TP=8, PP=2, DP=1. Each node runs 8-way tensor parallelism over NVLink. The two nodes form a 2-stage pipeline over InfiniBand. Each GPU holds 70B / 16 = 4.375B parameters, requiring about 8.75 GB in FP16. With optimizer states distributed via ZeRO Stage 1, each GPU's total memory footprint is approximately 35 GB, well within the 80 GB HBM3 per H100.
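The per-GPU weight figure is worth double-checking. A back-of-envelope calculation with the stated values (70e9 parameters, 16 GPUs, FP16 at 2 bytes per parameter):

```python
# Per-GPU parameter and weight-memory figures for 70B across 16 GPUs.
params, gpus = 70e9, 16
weight_gb_per_gpu = params / gpus * 2 / 1e9   # 2 bytes per FP16 param
print(f"{params / gpus / 1e9:.3f}B params/GPU, "
      f"{weight_gb_per_gpu:.2f} GB weights/GPU")
```

Optimizer states, gradients, and activations come on top of this, which is where the ~35 GB total footprint in the text comes from.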

Cluster Topology

Node 0 (NVLink 900 GB/s):
  GPU0-GPU7: TP=8, Layers 0-39

        | InfiniBand 400 Gb/s |

Node 1 (NVLink 900 GB/s):
  GPU0-GPU7: TP=8, Layers 40-79

TP=8 (intra-node), PP=2 (inter-node)
Total: 16 GPUs
Memory/GPU: ~35 GB / 80 GB available
                            
Hybrid TP+PP+DP

405B Model on 8 Nodes of 8xH100 (64 GPUs)

A 405B model (like Llama 3.1 405B) in FP16 needs approximately 810 GB for weights. With optimizer states, the full training state exceeds 3.2 TB. Across 64 GPUs at 80 GB each (5,120 GB total), you have enough memory, but the parallelism strategy must be carefully designed to keep communication efficient.

Configuration: TP=8, PP=8, DP=1. Eight-way tensor parallelism within each node over NVLink. Eight pipeline stages across nodes, with each stage handling approximately 16 of the model's 126 layers. Each GPU holds 405B / 64 = 6.3B parameters (12.7 GB in FP16). With 32+ micro-batches per step, pipeline bubble overhead stays below 10%.
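A quick sanity check of the per-GPU figures, using the values stated above (405B parameters, 64 GPUs, 126 layers, 8 stages):

```python
# Per-GPU and per-stage figures for 405B on 64 GPUs, 126 layers, 8 stages.
params, gpus, layers, stages = 405e9, 64, 126, 8
print(f"{params / gpus / 1e9:.1f}B params/GPU")              # ~6.3B
print(f"{params / gpus * 2 / 1e9:.1f} GB FP16 weights/GPU")  # ~12.7 GB
print(f"{layers / stages:.2f} layers/stage average")         # 15.75
```

Since 126 is not divisible by 8, stages hold a mix of 15 and 16 layers, as the topology below shows.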

To add data parallelism, double the cluster to 16 nodes (128 GPUs) and use TP=8, PP=8, DP=2. This doubles throughput with minimal additional communication overhead since gradient synchronization across DP ranks can overlap with computation.

Cluster Topology

Node 0: TP=8, PP Stage 1 (Layers 0-15)
Node 1: TP=8, PP Stage 2 (Layers 16-31)
Node 2: TP=8, PP Stage 3 (Layers 32-47)
Node 3: TP=8, PP Stage 4 (Layers 48-63)
Node 4: TP=8, PP Stage 5 (Layers 64-78)
Node 5: TP=8, PP Stage 6 (Layers 79-93)
Node 6: TP=8, PP Stage 7 (Layers 94-109)
Node 7: TP=8, PP Stage 8 (Layers 110-125)

All inter-node: InfiniBand 400-800 Gb/s
All intra-node: NVLink 900 GB/s
Micro-batches: 32+ (bubble < 10%)
                            
TP for Inference

70B Model Inference on a Single 8xH100 Node

Inference has different requirements than training. There are no optimizer states and no backward pass, so a 70B model in FP16 needs only about 140 GB, comfortably fitting across 8 H100 GPUs (640 GB total). The priority for inference is latency, not throughput.

Configuration: TP=8, PP=1. Pure tensor parallelism across all 8 GPUs within a single node. Every GPU participates in every token generation, minimizing per-token latency. Pipeline parallelism would add latency because each stage must wait for the previous stage to finish. For latency-sensitive inference serving, tensor parallelism over NVLink is always preferred.

Frameworks like vLLM and TensorRT-LLM default to tensor parallelism for inference. They shard the KV-cache across GPUs alongside the model weights, keeping memory balanced and latency minimal.

Inference Topology

Single Node (NVLink 900 GB/s):
  GPU0: 1/8 of every layer + KV shard
  GPU1: 1/8 of every layer + KV shard
  GPU2: 1/8 of every layer + KV shard
  GPU3: 1/8 of every layer + KV shard
  GPU4: 1/8 of every layer + KV shard
  GPU5: 1/8 of every layer + KV shard
  GPU6: 1/8 of every layer + KV shard
  GPU7: 1/8 of every layer + KV shard

TP=8, PP=1
All-reduce per layer: ~1.8 ms (NVLink)
Per-token latency: ~15-25 ms
                            

Bandwidth Calculation Deep Dive

Step-by-step calculations that show why interconnect bandwidth is the gating factor for parallelism strategy selection.

Tensor Parallelism: All-Reduce Volume Per Step

For a transformer layer with hidden dimension H and tensor parallelism degree T, the all-reduce after the column-parallel linear layer transfers a tensor of size [batch x seq_len x H] in FP16.

Model: Llama 2 70B
  H = 8192, layers = 80, T = 8
  micro_batch = 1, seq_len = 4096

Activation size per all-reduce:
  1 x 4096 x 8192 x 2 bytes = 67 MB

All-reduces per layer: 2 (attn + FFN)
All-reduces per step (fwd+bwd):
  80 layers x 2 x 2 = 320

Ring all-reduce volume per op:
  2 x (7/8) x 67 MB = 117 MB

Total communication per step:
  320 x 117 MB ≈ 37.5 GB
                        

Roughly 37.5 GB of all-reduce data per training step. Over NVLink at 900 GB/s, this takes about 42 milliseconds. Over single-port InfiniBand at 50 GB/s, it takes roughly 750 milliseconds. The compute time for the step itself is roughly 2,400 milliseconds on H100, so NVLink adds under 2% overhead while InfiniBand adds about 31% overhead. With larger batch sizes, the activation tensors grow proportionally, making the gap even worse.
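The volume can be recomputed directly. This sketch uses the deep-dive assumptions (micro-batch 1, seq_len 4,096, hidden 8,192, FP16, two all-reduces per layer); exact byte counting lands near 37.5 GB, within rounding of the figures above:

```python
# Per-step all-reduce volume for 8-way TP on Llama 2 70B, exact byte count.
H, layers, T = 8192, 80, 8
act_bytes = 1 * 4096 * H * 2          # micro-batch 1, seq 4096, FP16: ~67 MB
ring = 2 * (T - 1) / T                # ring all-reduce factor = 1.75
ops = layers * 2 * 2                  # attn + FFN all-reduces, fwd + bwd
total_gb = ops * ring * act_bytes / 1e9
print(f"{total_gb:.1f} GB per step")
print(f"NVLink 900 GB/s: {total_gb / 900 * 1e3:.0f} ms")
print(f"IB 50 GB/s:      {total_gb / 50 * 1e3:.0f} ms")
```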

Pipeline Parallelism: Activation Transfer Volume Per Step

Pipeline parallelism transfers only the activation tensor at each stage boundary. This is a point-to-point send, not a collective operation. Each stage sends one activation per micro-batch.

Model: Llama 2 70B, PP = 4 stages
  H = 8192
  micro_batch = 1, seq_len = 4096
  micro_batches_per_step = 32

Activation size per transfer:
  1 x 4096 x 8192 x 2 bytes = 67 MB

Transfers per step per boundary:
  32 (fwd) + 32 (bwd) = 64

Stage boundaries: 3 (between 4 stages)

Total communication per step:
  64 x 67 MB x 3 = 12.9 GB

But each boundary is independent:
  Per-boundary: 64 x 67 MB = 4.3 GB

Over IB at 50 GB/s: 86 ms per boundary
Pipelined (overlapped): ~86 ms total
                        

Pipeline parallelism transfers can overlap with computation because stage N can be sending its output to stage N+1 while simultaneously receiving input from stage N-1 and computing on a different micro-batch. The effective communication overhead is often just a single boundary's latency, around 86 ms over single-port InfiniBand. Compared to compute time of 2,400 ms, that is only 3.6% overhead.
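The pipeline side of the comparison uses the same worked-example values (67 MB activations, 32 micro-batches, 50 GB/s InfiniBand):

```python
# Per-boundary pipeline transfer volume for the worked example above.
act_gb = 1 * 4096 * 8192 * 2 / 1e9    # ~0.067 GB activation per micro-batch
transfers = 32 * 2                    # 32 micro-batches, fwd + bwd
per_boundary_gb = transfers * act_gb
print(f"{per_boundary_gb:.1f} GB per boundary")
print(f"IB 50 GB/s: {per_boundary_gb / 50 * 1e3:.0f} ms")
```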

Metric                           | Tensor Parallelism         | Pipeline Parallelism
---------------------------------+----------------------------+---------------------------
Communication pattern            | All-reduce (collective)    | Point-to-point (send/recv)
Data moved per step (70B, 8-way) | ~37.5 GB                   | 4.3 GB per boundary
Overhead on NVLink (900 GB/s)    | 1.7%                       | 0.2%
Overhead on IB (50 GB/s)         | ~31%                       | 3.6%
Overhead on PCIe Gen 5 (64 GB/s) | ~24%                       | 2.8%
GPU idle time source             | Synchronization stalls     | Pipeline bubbles
Primary bottleneck               | Interconnect bandwidth     | Micro-batch count
Inference latency impact         | Low (all GPUs in parallel) | Higher (sequential stages)

Advanced Considerations for Cluster Design

Sequence Parallelism

Sequence parallelism (SP) extends tensor parallelism by also partitioning the sequence dimension across GPUs for operations that do not involve weight matrices (LayerNorm, Dropout). In standard tensor parallelism, these operations are replicated on every GPU, wasting memory. Sequence parallelism eliminates this redundancy. Megatron-LM enables SP alongside TP with zero additional communication cost because the required data permutations piggyback on existing all-reduce operations.

For very long sequences (32K+ tokens), sequence parallelism becomes essential for memory. Without it, the activation memory for LayerNorm and Dropout alone can consume 10+ GB per GPU at 128K sequence length.

Expert Parallelism for MoE Models

Mixture-of-Experts (MoE) models like Mixtral 8x7B and DeepSeek V3 introduce a fourth parallelism dimension: expert parallelism (EP). Each expert is placed on a different GPU, and tokens are routed to the appropriate expert via all-to-all communication. EP works well across InfiniBand because the data volume per token is small (only the tokens routed to a given expert need to be sent to that GPU).

For MoE models, a common configuration is TP within the node for attention layers, EP across nodes for expert layers, and PP across node groups for the full model. The all-to-all pattern of EP is more bandwidth-friendly than all-reduce, making cross-node expert placement more practical than cross-node tensor parallelism.

Context Parallelism for Long Sequences

When training on very long sequences (100K+ tokens), the KV-cache and attention computation become the bottleneck rather than model weights. Context parallelism (CP) splits the sequence across GPUs, with each GPU processing a chunk of the sequence and exchanging KV pairs with other GPUs via ring attention or similar algorithms.

CP is typically applied within a node over NVLink because the KV exchange happens frequently (once per attention layer). The full parallelism stack for a long-context 405B model might be TP=8, CP=2, PP=16, DP=64, using 16,384 GPUs across 2,048 nodes, with TP and CP sharing the 8 NVLink-connected GPUs in each node.

Network Topology Matters

InfiniBand networks use fat-tree or dragonfly topologies. In a fat-tree, bandwidth is uniform between any two nodes. In a dragonfly topology, bandwidth within a group is higher than between groups. When designing pipeline parallelism across nodes, place adjacent pipeline stages on nodes within the same network group to minimize cross-group traffic.

Petronella Technology Group designs custom GPU cluster topologies optimized for your specific model architecture and parallelism requirements. The network fabric is as important as the GPUs themselves. A poorly designed network can leave expensive GPUs idle waiting for data.

Quick Decision Framework

Follow this decision tree to determine the right parallelism configuration for your workload.

1. Model fits on 1 GPU?

No parallelism needed. Use data parallelism only if you want to increase throughput by processing more batches across GPUs.

2. Fits on 1 node (8 GPUs)?

Use tensor parallelism (TP=2, 4, or 8) over NVLink. This is the simplest and most efficient configuration for models up to ~140 GB.

3. Needs multiple nodes?

Hybrid: TP=8 within each node (NVLink) and PP across nodes (InfiniBand). Add DP if you have extra nodes for throughput.

4. Inference priority?

Maximize TP within a node for lowest latency. Use PP across nodes only if the model does not fit on a single node. Never use PP for latency-sensitive serving if TP is an option.

Frequently Asked Questions

What is the difference between tensor parallelism and pipeline parallelism?

Tensor parallelism splits individual layers across multiple GPUs, requiring every GPU to participate in every forward and backward pass through all-reduce synchronization. Pipeline parallelism splits the model into sequential stages, assigning groups of complete layers to each GPU. The fundamental difference is communication: tensor parallelism requires high-bandwidth collective operations (all-reduce) after every layer, while pipeline parallelism uses low-volume point-to-point transfers (activation tensors) between adjacent stages. NVLink bandwidth (900 GB/s on H100) supports tensor parallelism; InfiniBand (50-100 GB/s per port) is sufficient for pipeline parallelism.

Why does tensor parallelism need NVLink-class bandwidth?

Tensor parallelism runs an all-reduce operation after every transformer layer, synchronizing partial computation results across all GPUs in the tensor-parallel group. For a 70B model with 80 layers, that is up to 320 all-reduce operations per training step (two per layer, forward and backward). Each all-reduce moves approximately 117 MB of data using ring all-reduce with 8 GPUs. The total communication volume per step is about 37.5 GB. NVLink at 900 GB/s handles this in roughly 42 milliseconds (under 2% overhead on a 2,400 ms compute step). InfiniBand at 50 GB/s would take roughly 750 milliseconds (about 31% overhead), making tensor parallelism across InfiniBand wasteful. The 18x bandwidth gap between NVLink and single-port InfiniBand is the core reason.

What causes pipeline bubbles, and how are they minimized?

Pipeline bubbles are periods of GPU idle time at the start and end of each training step. With p pipeline stages, the last stage sits idle for (p-1) micro-batch durations while the pipeline fills. The bubble fraction is (p-1)/m where m is the number of micro-batches per step. With 4 stages and 4 micro-batches, bubble time equals 75% of useful compute time. The primary fix is increasing micro-batches: at 32 micro-batches, the bubble drops to 9.4%. Advanced techniques include 1F1B scheduling (interleaving forward and backward passes to limit memory usage), interleaved stages (assigning multiple non-contiguous stages to each GPU), and zero-bubble scheduling (splitting backward passes into input-gradient and weight-gradient components that can be reordered to fill idle slots).

What is the right configuration for a 70B model on 2 nodes of 8xH100?

With 2 nodes of 8xH100, use TP=8 within each node and PP=2 across nodes. The 8-way tensor parallelism leverages NVLink at 900 GB/s for the frequent all-reduce operations. The 2-way pipeline parallelism sends activation tensors (~67 MB each) across InfiniBand at 400 Gb/s, which completes in roughly 1.3 milliseconds per transfer at line rate. Each GPU holds approximately 4.375B parameters (8.75 GB in FP16), leaving roughly 70 GB of the H100's 80 GB HBM3 for activations, KV-cache, and optimizer states. With 32 micro-batches per step, pipeline bubble overhead is only 3.1%.

Can you run tensor parallelism over PCIe?

It is technically possible for very small models, but impractical for anything above 13B parameters. PCIe Gen 5 delivers 64 GB/s bidirectional bandwidth, roughly 14x slower than NVLink 4.0 at 900 GB/s. For an 8-way tensor-parallel 70B model, each all-reduce would take about 25 milliseconds over PCIe versus 1.8 milliseconds over NVLink. With 160 all-reduces per step, the total communication overhead is 4,000 ms on PCIe versus 288 ms on NVLink. Since the compute time is about 2,400 ms, PCIe communication takes longer than the actual computation. You would get better performance with fewer GPUs using tensor parallelism over NVLink than more GPUs over PCIe. Pipeline parallelism over PCIe is viable because activation transfers are much smaller and less frequent.

How did Meta parallelize Llama 3.1 405B training?

Meta trained Llama 3.1 405B on 16,384 H100 GPUs using TP=8, PP=16, DP=128. The 8-way tensor parallelism runs within each 8-GPU DGX node over NVLink. The 16-way pipeline parallelism spans 16 nodes, with each pipeline stage handling approximately 8 of the 126 transformer layers. The 128-way data parallelism replicates 128 copies of the full TP x PP pipeline, each processing different training data. Gradients are synchronized across DP ranks using ring all-reduce over InfiniBand. This configuration achieves strong scaling because the bandwidth-intensive operations (tensor parallelism all-reduce) stay within NVLink, while the bandwidth-light operations (pipeline stage transitions, gradient sync) traverse InfiniBand.

How much inter-node bandwidth does pipeline parallelism actually need?

Much less than you might expect. Pipeline parallelism transfers activation tensors between adjacent stages. For a 70B model with micro-batch size 1 and sequence length 4,096, each activation is approximately 67 MB. With 32 micro-batches per step and some communication-computation overlap, you need roughly 2-5 GB/s of sustained throughput between nodes. A single ConnectX-7 port at 400 Gb/s (50 GB/s) delivers 10x to 25x more bandwidth than required. Even 100 Gb/s Ethernet (12.5 GB/s) is sufficient for pipeline parallelism. The real bottleneck for pipeline parallelism is not bandwidth; it is the pipeline bubble overhead, which is solved by increasing micro-batch count.

Build the Right Cluster for Your Parallelism Strategy

Petronella Technology Group designs custom GPU clusters optimized for your specific parallelism requirements. We select the right NVLink topology, InfiniBand fabric, and network architecture to maximize training throughput and inference performance for your model size and workload.

From single-node DGX deployments to multi-node NVLink cluster builds and full AI infrastructure design. Call for a free consultation.


Petronella Technology Group | 5540 Centerview Dr, Suite 200, Raleigh, NC 27606 | Since 2002


(919) 348-4912 | Founded 2002 | 2,500+ Clients

CMMC-RP Certified Team: Craig Petronella, Blake Rea, Justin Summers, Jonathan Wood

Craig Petronella: CMMC-RP, CCNA, CWNE, DFE #604180