
Tensor vs Pipeline Parallelism: Choosing the Right Strategy for Your AI Cluster

When a model outgrows a single GPU, you face a fundamental engineering decision: split layers across GPUs (tensor parallelism) or split the model into stages (pipeline parallelism). The bandwidth of the interconnect between your GPUs determines which approach wins. This guide provides the math, the specs, and the real-world configurations to make the right call.

Tensor Parallelism

Splitting Individual Layers Across GPUs

Tensor parallelism (TP) takes a single layer's weight matrix and slices it across multiple GPUs. Every GPU holds a shard of every layer. When a forward pass runs, each GPU computes its portion of the matrix multiplication, then all GPUs synchronize their partial results through an all-reduce collective operation before moving to the next layer.

Consider a transformer's self-attention layer. The query, key, and value projection matrices each have dimensions [hidden_size x hidden_size]. For a 70B model like Llama 2 70B, hidden_size is 8,192. With 8-way tensor parallelism, each GPU holds a [8192 x 1024] slice of each projection matrix. After each GPU computes its partial output, the results must be combined through an all-reduce to reconstruct the full attention output.

This pattern repeats for every layer in the network. A 70B model has 80 transformer layers, meaning at least 80 all-reduce operations per forward pass (in practice two per layer, one after attention and one after the FFN block) and as many again during the backward pass. That is 160 or more synchronization points per training step where every GPU must wait for every other GPU to finish its computation and exchange data.

The critical requirement: Tensor parallelism demands extremely high bandwidth and low latency between GPUs because all-reduce operations happen after every single layer. If the interconnect is slow, GPUs spend more time waiting for data than computing.
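The mechanics are easy to verify numerically. The sketch below simulates a row-parallel linear layer split four ways, with the all-reduce modeled as a plain sum of partial outputs; sizes are scaled down from the article's hidden_size of 8,192, and NumPy stands in for the GPU math.

```python
import numpy as np

# Toy 4-way tensor parallelism on one linear layer (row-parallel split).
# Each "GPU" holds a contiguous block of the weight rows and computes a
# partial output; the all-reduce is modeled as a plain sum.
hidden, tp = 64, 4                # scaled down from hidden_size=8192
shard = hidden // tp

rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden))        # [batch, hidden]
w = rng.standard_normal((hidden, hidden))   # full weight, for reference only

# Rank i sees only its slice of the input features and weight rows.
partials = [
    x[:, i * shard:(i + 1) * shard] @ w[i * shard:(i + 1) * shard, :]
    for i in range(tp)
]

y = sum(partials)                 # the "all-reduce" step
assert np.allclose(y, x @ w)      # matches the unsharded layer exactly
```

The same pattern holds on real hardware, except the sum runs over NVLink as an NCCL all-reduce instead of a Python `sum`.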

Tensor Parallelism: 4-Way TP on One Layer

Input Tensor X [batch, seq, 8192]
        |
        v
+-----------+-----------+-----------+-----------+
|   GPU 0   |   GPU 1   |   GPU 2   |   GPU 3   |
| W[0:2048] | W[2048:   | W[4096:   | W[6144:   |
|           |    4096]  |    6144]  |    8192]  |
|    Y_0    |    Y_1    |    Y_2    |    Y_3    |
+-----------+-----------+-----------+-----------+
      |           |           |           |
      +-----------+-----+-----+-----------+
                        |
                  ALL-REDUCE
                        |
                        v
           Y = Y_0 + Y_1 + Y_2 + Y_3
                        |
                        v
           Next layer (repeat all-reduce)
                        |
                       ...
           80 layers = 80 all-reduce ops

Key Characteristics

  • Every GPU participates in every forward/backward pass
  • All-reduce after every layer (collective, not point-to-point)
  • Latency-sensitive: one slow GPU stalls the entire group
  • Memory efficient: each GPU stores only 1/N of each layer

Pipeline Parallelism: 4 Stages Across 4 GPUs

Model: 80 layers total

GPU 0 (Stage 1): Layers  0-19  --> activation
GPU 1 (Stage 2): Layers 20-39  --> activation
GPU 2 (Stage 3): Layers 40-59  --> activation
GPU 3 (Stage 4): Layers 60-79  --> output

Time -->
         MB1    MB2    MB3    MB4
GPU 0:  [FWD]  [FWD]  [FWD]  [FWD]  [BWD]...
GPU 1:  [idle] [FWD]  [FWD]  [FWD]  [FWD]...
GPU 2:  [idle] [idle] [FWD]  [FWD]  [FWD]...
GPU 3:  [idle] [idle] [idle] [FWD]  [FWD]...
         ^^^^^^^^^^^^^^^^^^^
         Pipeline "bubble" (idle time)
                        

Key Characteristics

  • Each GPU processes a different subset of layers
  • Point-to-point transfers: only activation tensors between adjacent stages
  • Bandwidth tolerant: works over InfiniBand and even Ethernet
  • Pipeline bubbles reduce efficiency (mitigated by micro-batching)

Pipeline Parallelism

Splitting the Model Into Sequential Stages

Pipeline parallelism (PP) divides the model vertically: the first N layers go on GPU 0, the next N layers on GPU 1, and so on. Data flows through the pipeline sequentially. GPU 0 processes a micro-batch, sends the resulting activation tensor to GPU 1, then immediately starts processing the next micro-batch. Each GPU operates on a different micro-batch at any given time.

The communication pattern is fundamentally different from tensor parallelism. Instead of all-reduce operations involving every GPU after every layer, pipeline parallelism uses simple point-to-point sends between adjacent stages. GPU 0 sends to GPU 1, GPU 1 sends to GPU 2, and so on. The data transferred is an activation tensor, typically much smaller than the collective data moved during an all-reduce.

For that same 70B model split across 4 pipeline stages, each stage handles 20 layers. The activation tensor passed between stages has dimensions [micro_batch_size x sequence_length x hidden_size]. With a micro-batch of 1, sequence length of 4,096, and hidden_size of 8,192, that activation is about 67 MB in FP16. Compare this to tensor parallelism's all-reduce, which must exchange partial sums totaling hundreds of megabytes across all GPUs simultaneously.

The tradeoff: Pipeline parallelism is bandwidth-friendly but time-inefficient. GPUs at later stages sit idle while earlier stages process the first micro-batches. This idle time is called the pipeline bubble, and minimizing it is the central challenge of pipeline parallelism.
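The fill-drain schedule is simple enough to simulate. This toy (forward passes only) prints the staircase pattern from the diagram above and counts the bubble slots:

```python
# Toy fill-drain schedule for a p-stage pipeline, forward passes only.
# Stage s runs micro-batch t - s at time step t; "--" marks a bubble slot.
def forward_schedule(p, m):
    steps = p + m - 1                      # time to push m micro-batches through
    return [
        ["F%d" % (t - s) if 0 <= t - s < m else "--" for t in range(steps)]
        for s in range(p)
    ]

rows = forward_schedule(4, 4)
for s, row in enumerate(rows):
    print("GPU %d: %s" % (s, " ".join(row)))

# Every stage idles for p - 1 slots: p * (p - 1) = 12 bubble slots here.
assert sum(row.count("--") for row in rows) == 4 * 3
```

With more micro-batches the staircase at each end stays the same size while the busy middle grows, which is exactly why micro-batching shrinks the bubble fraction.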

The Deciding Factor

Interconnect Bandwidth Determines Everything

The entire tensor-vs-pipeline decision collapses to a single question: what is the bandwidth of the link between GPUs? Three interconnect technologies define three distinct regimes.

NVLink (Within a Node)

NVLink 5.0 (Blackwell) 1,800 GB/s
NVLink 4.0 (Hopper) 900 GB/s
NVLink 3.0 (Ampere) 600 GB/s

Tensor Parallelism: Excellent

This bandwidth makes all-reduce operations fast enough that communication overhead stays a small fraction of total step time (roughly 2-20%, depending on batch size). Tensor parallelism is the default within NVLink-connected nodes.

InfiniBand (Between Nodes)

ConnectX-7, single port 400 Gb/s (50 GB/s)
ConnectX-7, dual bonded 800 Gb/s (100 GB/s)
vs NVLink 4.0 (900 GB/s) 9x to 18x slower

Pipeline Parallelism: Ideal

Point-to-point activation transfers are small enough that InfiniBand handles them with minimal overhead. Tensor parallelism across InfiniBand, by contrast, creates severe bottlenecks at every all-reduce.

PCIe (Fallback)

PCIe Gen 5 x16 64 GB/s bidir
PCIe Gen 4 x16 32 GB/s bidir
vs NVLink 4.0 (900 GB/s) 14x to 28x slower

Tensor Parallelism: Not Viable

PCIe cannot sustain the all-reduce throughput that tensor parallelism requires. Only pipeline parallelism with small activation tensors is practical over PCIe interconnects.

The Math: Why Tensor Parallelism Fails Across InfiniBand

Consider 8-way tensor parallelism for a 70B model (Llama 2 70B architecture: 80 layers, hidden_size 8192, FP16 weights). Each all-reduce operation transfers 2 x (N-1)/N x message_size bytes using the ring all-reduce algorithm, where N is the number of GPUs.

Over NVLink (900 GB/s)

Message per all-reduce: ~1.6 GB
Ring all-reduce data:   2 x (7/8) x 1.6 GB = 2.8 GB
NVLink bandwidth:       900 GB/s
Time per all-reduce:    2.8 / 900 = 3.1 ms

80 layers x 2 (fwd+bwd) = 160 all-reduces
Total comm time:        160 x 3.1 ms = 496 ms

H100 compute time:      ~2,400 ms per step
Comm overhead:          496 / 2400 = 20.7%
                        

Viable: communication is a fraction of compute.

Over InfiniBand (50 GB/s single port)

Message per all-reduce: ~1.6 GB
Ring all-reduce data:   2 x (7/8) x 1.6 GB = 2.8 GB
IB bandwidth:           50 GB/s (single ConnectX-7)
Time per all-reduce:    2.8 / 50 = 56 ms

80 layers x 2 (fwd+bwd) = 160 all-reduces
Total comm time:        160 x 56 ms = 8,960 ms

H100 compute time:      ~2,400 ms per step
Comm overhead:          8960 / 2400 = 373%
                        

Disastrous: GPUs idle 3.7x longer than they compute.

The 18x gap is the entire story. NVLink at 900 GB/s makes tensor parallelism communication a manageable overhead. InfiniBand at 50 GB/s turns it into the dominant cost. Even with dual-bonded 800 Gb/s InfiniBand (100 GB/s), the overhead is still 9x higher than NVLink, making tensor parallelism across nodes impractical for models with many layers.
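The arithmetic above packs into a few lines. This sketch assumes the text's numbers: a ~1.6 GB message (which implies a larger global batch than the micro-batch-of-1 example later in the guide), 8 GPUs, 160 all-reduces per step, and ~2.4 s of H100 compute per step.

```python
# Re-derive the NVLink-vs-InfiniBand all-reduce comparison above.
def ring_allreduce_seconds(message_gb, n_gpus, bw_gb_s):
    # Ring all-reduce moves 2 * (N - 1) / N times the message size.
    return 2 * (n_gpus - 1) / n_gpus * message_gb / bw_gb_s

ops, compute_s = 80 * 2, 2.4   # 160 all-reduces per step, H100 compute time

for name, bw in [("NVLink 900 GB/s", 900), ("InfiniBand 50 GB/s", 50)]:
    t = ops * ring_allreduce_seconds(1.6, 8, bw)
    print(f"{name}: {t * 1e3:.0f} ms comm, {t / compute_s:.0%} overhead")
```

Swapping in 100 GB/s for dual-bonded InfiniBand still yields several seconds of communication per step, confirming why cross-node TP is avoided.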

When to Use Each Strategy

The decision follows directly from the interconnect between your GPUs. Here are the three scenarios and the optimal strategy for each.


Tensor Parallelism: Within a Single NVLink Node

Use tensor parallelism when your model's individual layers are too large for a single GPU's memory and all GPUs are connected via NVLink within a single server node. This is the default for systems like NVIDIA DGX with 8 GPUs connected by NVLink at 900 GB/s. Tensor parallelism is the best way to spread a single large layer across multiple GPUs because every GPU contributes to every computation, maximizing utilization.

Typical tensor parallelism degrees: 2-way, 4-way, or 8-way, matching the number of GPUs in a single node. Going beyond 8-way TP is unusual because most server nodes max out at 8 GPUs, and extending TP across nodes introduces the InfiniBand bottleneck described above. For H100 and H200 NVLink clusters, 8-way TP within a node is the standard configuration.

Best for:

Models where each layer exceeds single-GPU memory (hidden_size > 12,288 in FP16 typically). Examples: Llama 70B, Falcon 180B, GPT-4 class models. Run 8-way TP within a DGX node.


Pipeline Parallelism: Across Nodes Over InfiniBand or Ethernet

Use pipeline parallelism when you need more GPUs than fit in a single node and must communicate across InfiniBand or Ethernet. Pipeline parallelism sends only activation tensors between adjacent pipeline stages. These tensors are small (tens of megabytes) compared to the gigabytes moved by tensor parallelism all-reduce operations, making pipeline parallelism tolerant of lower-bandwidth interconnects.

Pipeline parallelism across 2, 4, 8, or 16 nodes is common in large-scale training. The main cost is pipeline bubbles, not communication overhead. With proper micro-batching (32 or more micro-batches per step), bubble overhead drops below 10%, making pipeline parallelism highly efficient even over modest network links.

Best for:

Multi-node clusters connected by InfiniBand (400-800 Gb/s) or high-speed Ethernet (100-400 GbE). Examples: 2 to 64 nodes of 8xH100, multi-GPU inference with RTX 6000 Pro across servers.


Hybrid (TP + PP): What Real Training Runs Actually Use

Nearly every large-scale training run uses a hybrid approach: tensor parallelism within each NVLink-connected node and pipeline parallelism across nodes over InfiniBand. This exploits the strength of each strategy exactly where its interconnect supports it. High-bandwidth NVLink handles the frequent all-reduce operations of tensor parallelism. Lower-bandwidth InfiniBand handles the occasional, small activation transfers of pipeline parallelism.

Most production configurations add a third dimension: data parallelism (DP) across groups of pipeline-parallel GPUs. The full notation is TP x PP x DP. For example, TP=8, PP=4, DP=16 uses 512 GPUs total (8 x 4 x 16). Each group of 32 GPUs (8 TP x 4 PP) processes one copy of the model, and 16 such groups process different data shards in parallel, synchronizing gradients periodically.

Best for:

Any training run requiring more than 8 GPUs. This is the standard for models above 30B parameters. Meta's Llama 3.1 405B used TP=8, PP=16, DP=128 across 16,384 H100 GPUs.

The Pipeline Bubble Problem and How to Solve It

Pipeline parallelism's biggest weakness is idle GPU time. Understanding the bubble fraction and the techniques to minimize it is essential for efficient cluster utilization.

The Bubble Fraction Formula

With p pipeline stages and m micro-batches per training step, the bubble fraction is:

bubble_fraction = (p - 1) / m
                        

With 4 pipeline stages and only 4 micro-batches, the bubble fraction is 3/4 = 75%: idle time equal to three-quarters of the useful compute time (as a share of the total step, (p-1)/(m+p-1) = 3/7, about 43%). Either way, the waste is severe. The solution is straightforward: increase the number of micro-batches.

p=4, m=4 75.0% bubble
p=4, m=8 37.5% bubble
p=4, m=16 18.8% bubble
p=4, m=32 9.4% bubble
p=4, m=64 4.7% bubble
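The table follows directly from the formula; a short script reproduces it:

```python
# bubble_fraction = (p - 1) / m, as defined above.
def bubble_fraction(p, m):
    return (p - 1) / m

for m in (4, 8, 16, 32, 64):
    print(f"p=4, m={m:2d}: {bubble_fraction(4, m):5.1%} bubble")
```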

Mitigation Techniques

Micro-Batching

Split the global batch into many micro-batches. With 32+ micro-batches per step, the bubble fraction drops below 10%. The tradeoff: more micro-batches means more activation memory stored simultaneously, since each stage must hold activations for all in-flight micro-batches during the backward pass. Activation checkpointing (recomputation) mitigates this memory cost.

1F1B (One Forward, One Backward) Scheduling

Instead of running all forward passes first and then all backward passes (the "fill-drain" approach), 1F1B interleaves forward and backward passes. After the pipeline fills, each GPU alternates between one forward micro-batch and one backward micro-batch. This keeps the peak activation memory bounded to p micro-batches instead of m, significantly reducing memory pressure while maintaining the same bubble fraction.

Interleaved Stages

Assign each GPU multiple non-contiguous stages (for example, GPU 0 gets stages 1 and 5, GPU 1 gets stages 2 and 6). This effectively doubles the number of pipeline stages while keeping the same number of GPUs. The bubble fraction becomes (p-1)/(m*v) where v is the number of virtual stages per GPU. Megatron-LM implements this as "interleaved pipeline parallelism," reducing bubble overhead by 2x to 4x compared to standard scheduling.
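Under the interleaved formula quoted above, doubling v halves the bubble at the same micro-batch count. A quick check:

```python
# Interleaved bubble: (p - 1) / (m * v), with v virtual stages per GPU.
def interleaved_bubble(p, m, v=1):
    return (p - 1) / (m * v)

print(interleaved_bubble(4, 16))     # 0.1875  (standard schedule)
print(interleaved_bubble(4, 16, 2))  # 0.09375 (two virtual stages: halved)
```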

Zero Bubble Pipeline Parallelism

Recent research (Qi et al., 2023) introduced zero-bubble scheduling by splitting the backward pass into two parts: backward for input gradients (B) and backward for weight gradients (W). By reordering B and W computations across micro-batches, GPUs can stay occupied during what would otherwise be bubble time. This achieves near-zero bubble overhead in theory, with practical implementations reaching below 3% bubble fraction.

Real-World Configuration Examples

Practical parallelism configurations for common model sizes and cluster topologies. These assume FP16/BF16 training with activation checkpointing enabled.

Hybrid TP+PP

70B Model on 2 Nodes of 8xH100 (16 GPUs)

A 70B parameter model in FP16 requires approximately 140 GB for weights alone. With optimizer states (Adam: 2x weights for momentum and variance in FP32), total memory per model copy is roughly 560 GB. Across 16 H100 GPUs (80 GB each = 1,280 GB total), this leaves comfortable headroom for activations.

Configuration: TP=8, PP=2, DP=1. Each node runs 8-way tensor parallelism over NVLink. The two nodes form a 2-stage pipeline over InfiniBand. Each GPU holds 70B / 16 = 4.375B parameters, requiring about 8.75 GB in FP16. With optimizer states distributed via ZeRO Stage 1, each GPU's total memory footprint is approximately 35 GB, well within the 80 GB HBM3 per H100.
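The per-GPU weight figure is worth double-checking. A back-of-envelope calculation with the stated values (70e9 parameters, 16 GPUs, FP16 at 2 bytes per parameter):

```python
# Per-GPU parameter and weight-memory figures for 70B across 16 GPUs.
params, gpus = 70e9, 16
weight_gb_per_gpu = params / gpus * 2 / 1e9   # 2 bytes per FP16 param
print(f"{params / gpus / 1e9:.3f}B params/GPU, "
      f"{weight_gb_per_gpu:.2f} GB weights/GPU")
```

Optimizer states, gradients, and activations come on top of this, which is where the ~35 GB total footprint in the text comes from.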

Cluster Topology

Node 0 (NVLink 900 GB/s):
  GPU0-GPU7: TP=8, Layers 0-39

        | InfiniBand 400 Gb/s |

Node 1 (NVLink 900 GB/s):
  GPU0-GPU7: TP=8, Layers 40-79

TP=8 (intra-node), PP=2 (inter-node)
Total: 16 GPUs
Memory/GPU: ~35 GB / 80 GB available
                            
Hybrid TP+PP+DP

405B Model on 8 Nodes of 8xH100 (64 GPUs)

A 405B model (like Llama 3.1 405B) in FP16 needs approximately 810 GB for weights. With optimizer states, the full training state exceeds 3.2 TB. Across 64 GPUs at 80 GB each (5,120 GB total), you have enough memory, but the parallelism strategy must be carefully designed to keep communication efficient.

Configuration: TP=8, PP=8, DP=1. Eight-way tensor parallelism within each node over NVLink. Eight pipeline stages across nodes, with each stage handling approximately 16 of the model's 126 layers. Each GPU holds 405B / 64 = 6.3B parameters (12.7 GB in FP16). With 32+ micro-batches per step, pipeline bubble overhead stays below 10%.
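A quick sanity check of the per-GPU figures, using the values stated above (405B parameters, 64 GPUs, 126 layers, 8 stages):

```python
# Per-GPU and per-stage figures for 405B on 64 GPUs, 126 layers, 8 stages.
params, gpus, layers, stages = 405e9, 64, 126, 8
print(f"{params / gpus / 1e9:.1f}B params/GPU")              # ~6.3B
print(f"{params / gpus * 2 / 1e9:.1f} GB FP16 weights/GPU")  # ~12.7 GB
print(f"{layers / stages:.2f} layers/stage average")         # 15.75
```

Since 126 is not divisible by 8, stages hold a mix of 15 and 16 layers, as the topology below shows.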

To add data parallelism, double the cluster to 16 nodes (128 GPUs) and use TP=8, PP=8, DP=2. This doubles throughput with minimal additional communication overhead since gradient synchronization across DP ranks can overlap with computation.

Cluster Topology

Node 0: TP=8, PP Stage 1 (Layers 0-15)
Node 1: TP=8, PP Stage 2 (Layers 16-31)
Node 2: TP=8, PP Stage 3 (Layers 32-47)
Node 3: TP=8, PP Stage 4 (Layers 48-63)
Node 4: TP=8, PP Stage 5 (Layers 64-78)
Node 5: TP=8, PP Stage 6 (Layers 79-93)
Node 6: TP=8, PP Stage 7 (Layers 94-109)
Node 7: TP=8, PP Stage 8 (Layers 110-125)

All inter-node: InfiniBand 400-800 Gb/s
All intra-node: NVLink 900 GB/s
Micro-batches: 32+ (bubble < 10%)
                            
TP for Inference

70B Model Inference on a Single 8xH100 Node

Inference has different requirements than training. There are no optimizer states and no backward pass, so a 70B model in FP16 needs only about 140 GB, comfortably fitting across 8 H100 GPUs (640 GB total). The priority for inference is latency, not throughput.

Configuration: TP=8, PP=1. Pure tensor parallelism across all 8 GPUs within a single node. Every GPU participates in every token generation, minimizing per-token latency. Pipeline parallelism would add latency because each stage must wait for the previous stage to finish. For latency-sensitive inference serving, tensor parallelism over NVLink is always preferred.

Frameworks like vLLM and TensorRT-LLM default to tensor parallelism for inference. They shard the KV-cache across GPUs alongside the model weights, keeping memory balanced and latency minimal.

Inference Topology

Single Node (NVLink 900 GB/s):
  GPU0: 1/8 of every layer + KV shard
  GPU1: 1/8 of every layer + KV shard
  GPU2: 1/8 of every layer + KV shard
  GPU3: 1/8 of every layer + KV shard
  GPU4: 1/8 of every layer + KV shard
  GPU5: 1/8 of every layer + KV shard
  GPU6: 1/8 of every layer + KV shard
  GPU7: 1/8 of every layer + KV shard

TP=8, PP=1
All-reduce per layer: ~1.8 ms (NVLink)
Per-token latency: ~15-25 ms
                            

Bandwidth Calculation Deep Dive

Step-by-step calculations that show why interconnect bandwidth is the gating factor for parallelism strategy selection.

Tensor Parallelism: All-Reduce Volume Per Step

For a transformer layer with hidden dimension H and tensor parallelism degree T, the all-reduce after the column-parallel linear layer transfers a tensor of size [batch x seq_len x H] in FP16.

Model: Llama 2 70B
  H = 8192, layers = 80, T = 8
  micro_batch = 1, seq_len = 4096

Activation size per all-reduce:
  1 x 4096 x 8192 x 2 bytes = 67 MB

All-reduces per layer: 2 (attn + FFN)
All-reduces per step (fwd+bwd):
  80 layers x 2 x 2 = 320

Ring all-reduce volume per op:
  2 x (7/8) x 67 MB = 117 MB

Total communication per step:
  320 x 117 MB ≈ 37.5 GB
                        

Roughly 37.5 GB of all-reduce data per training step. Over NVLink at 900 GB/s, this takes about 42 milliseconds. Over single-port InfiniBand at 50 GB/s, it takes roughly 750 milliseconds. The compute time for the step itself is roughly 2,400 milliseconds on H100, so NVLink adds under 2% overhead while InfiniBand adds about 31% overhead. With larger batch sizes, the activation tensors grow proportionally, making the gap even worse.
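The volume can be recomputed directly. This sketch uses the deep-dive assumptions (micro-batch 1, seq_len 4,096, hidden 8,192, FP16, two all-reduces per layer); exact byte counting lands near 37.5 GB, within rounding of the figures above:

```python
# Per-step all-reduce volume for 8-way TP on Llama 2 70B, exact byte count.
H, layers, T = 8192, 80, 8
act_bytes = 1 * 4096 * H * 2          # micro-batch 1, seq 4096, FP16: ~67 MB
ring = 2 * (T - 1) / T                # ring all-reduce factor = 1.75
ops = layers * 2 * 2                  # attn + FFN all-reduces, fwd + bwd
total_gb = ops * ring * act_bytes / 1e9
print(f"{total_gb:.1f} GB per step")
print(f"NVLink 900 GB/s: {total_gb / 900 * 1e3:.0f} ms")
print(f"IB 50 GB/s:      {total_gb / 50 * 1e3:.0f} ms")
```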

Pipeline Parallelism: Activation Transfer Volume Per Step

Pipeline parallelism transfers only the activation tensor at each stage boundary. This is a point-to-point send, not a collective operation. Each stage sends one activation per micro-batch.

Model: Llama 2 70B, PP = 4 stages
  H = 8192
  micro_batch = 1, seq_len = 4096
  micro_batches_per_step = 32

Activation size per transfer:
  1 x 4096 x 8192 x 2 bytes = 67 MB

Transfers per step per boundary:
  32 (fwd) + 32 (bwd) = 64

Stage boundaries: 3 (between 4 stages)

Total communication per step:
  64 x 67 MB x 3 = 12.9 GB

But each boundary is independent:
  Per-boundary: 64 x 67 MB = 4.3 GB

Over IB at 50 GB/s: 86 ms per boundary
Pipelined (overlapped): ~86 ms total
                        

Pipeline parallelism transfers can overlap with computation because stage N can be sending its output to stage N+1 while simultaneously receiving input from stage N-1 and computing on a different micro-batch. The effective communication overhead is often just a single boundary's latency, around 86 ms over single-port InfiniBand. Compared to compute time of 2,400 ms, that is only 3.6% overhead.
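The pipeline side of the comparison uses the same worked-example values (67 MB activations, 32 micro-batches, 50 GB/s InfiniBand):

```python
# Per-boundary pipeline transfer volume for the worked example above.
act_gb = 1 * 4096 * 8192 * 2 / 1e9    # ~0.067 GB activation per micro-batch
transfers = 32 * 2                    # 32 micro-batches, fwd + bwd
per_boundary_gb = transfers * act_gb
print(f"{per_boundary_gb:.1f} GB per boundary")
print(f"IB 50 GB/s: {per_boundary_gb / 50 * 1e3:.0f} ms")
```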

Metric                           | Tensor Parallelism         | Pipeline Parallelism
---------------------------------+----------------------------+---------------------------
Communication pattern            | All-reduce (collective)    | Point-to-point (send/recv)
Data moved per step (70B, 8-way) | ~37.5 GB                   | 4.3 GB per boundary
Overhead on NVLink (900 GB/s)    | 1.7%                       | 0.2%
Overhead on IB (50 GB/s)         | ~31%                       | 3.6%
Overhead on PCIe Gen 5 (64 GB/s) | ~24%                       | 2.8%
GPU idle time source             | Synchronization stalls     | Pipeline bubbles
Primary bottleneck               | Interconnect bandwidth     | Micro-batch count
Inference latency impact         | Low (all GPUs in parallel) | Higher (sequential stages)

Advanced Considerations for Cluster Design

Sequence Parallelism

Sequence parallelism (SP) extends tensor parallelism by also partitioning the sequence dimension across GPUs for operations that do not involve weight matrices (LayerNorm, Dropout). In standard tensor parallelism, these operations are replicated on every GPU, wasting memory. Sequence parallelism eliminates this redundancy. Megatron-LM enables SP alongside TP with zero additional communication cost because the required data permutations piggyback on existing all-reduce operations.

For very long sequences (32K+ tokens), sequence parallelism becomes essential for memory. Without it, the activation memory for LayerNorm and Dropout alone can consume 10+ GB per GPU at 128K sequence length.

Expert Parallelism for MoE Models

Mixture-of-Experts (MoE) models like Mixtral 8x7B and DeepSeek V3 introduce a fourth parallelism dimension: expert parallelism (EP). Each expert is placed on a different GPU, and tokens are routed to the appropriate expert via all-to-all communication. EP works well across InfiniBand because the data volume per token is small (only the tokens routed to a given expert need to be sent to that GPU).

For MoE models, a common configuration is TP within the node for attention layers, EP across nodes for expert layers, and PP across node groups for the full model. The all-to-all pattern of EP is more bandwidth-friendly than all-reduce, making cross-node expert placement more practical than cross-node tensor parallelism.

Context Parallelism for Long Sequences

When training on very long sequences (100K+ tokens), the KV-cache and attention computation become the bottleneck rather than model weights. Context parallelism (CP) splits the sequence across GPUs, with each GPU processing a chunk of the sequence and exchanging KV pairs with other GPUs via ring attention or similar algorithms.

CP is typically applied within a node over NVLink because the KV exchange happens frequently (once per attention layer). The full parallelism stack for a long-context 405B model might be TP=8, CP=2, PP=16, DP=64, using 16,384 GPUs across 2,048 nodes, with TP and CP sharing the 8 NVLink-connected GPUs in each node.

Network Topology Matters

InfiniBand networks use fat-tree or dragonfly topologies. In a fat-tree, bandwidth is uniform between any two nodes. In a dragonfly topology, bandwidth within a group is higher than between groups. When designing pipeline parallelism across nodes, place adjacent pipeline stages on nodes within the same network group to minimize cross-group traffic.

Petronella Technology Group designs custom GPU cluster topologies optimized for your specific model architecture and parallelism requirements. The network fabric is as important as the GPUs themselves. A poorly designed network can leave expensive GPUs idle waiting for data.

Quick Decision Framework

Follow this decision tree to determine the right parallelism configuration for your workload.

1. Model fits on 1 GPU?

No parallelism needed. Use data parallelism only if you want to increase throughput by processing more batches across GPUs.

2. Fits on 1 node (8 GPUs)?

Use tensor parallelism (TP=2, 4, or 8) over NVLink. This is the simplest and most efficient configuration for models up to ~140 GB.

3. Needs multiple nodes?

Hybrid: TP=8 within each node (NVLink) and PP across nodes (InfiniBand). Add DP if you have extra nodes for throughput.

4. Inference priority?

Maximize TP within a node for lowest latency. Use PP across nodes only if the model does not fit on a single node. Never use PP for latency-sensitive serving if TP is an option.

Frequently Asked Questions

What is the difference between tensor parallelism and pipeline parallelism?

Tensor parallelism splits individual layers across multiple GPUs, requiring every GPU to participate in every forward and backward pass through all-reduce synchronization. Pipeline parallelism splits the model into sequential stages, assigning groups of complete layers to each GPU. The fundamental difference is communication: tensor parallelism requires high-bandwidth collective operations (all-reduce) after every layer, while pipeline parallelism uses low-volume point-to-point transfers (activation tensors) between adjacent stages. NVLink bandwidth (900 GB/s on H100) supports tensor parallelism; InfiniBand (50-100 GB/s per port) is sufficient for pipeline parallelism.

Why does tensor parallelism need NVLink-class bandwidth?

Tensor parallelism runs an all-reduce operation after every transformer layer, synchronizing partial computation results across all GPUs in the tensor-parallel group. For a 70B model with 80 layers, that is up to 320 all-reduce operations per training step (two per layer, forward and backward). Each all-reduce moves approximately 117 MB of data using ring all-reduce with 8 GPUs. The total communication volume per step is about 37.5 GB. NVLink at 900 GB/s handles this in roughly 42 milliseconds (under 2% overhead on a 2,400 ms compute step). InfiniBand at 50 GB/s would take roughly 750 milliseconds (about 31% overhead), making tensor parallelism across InfiniBand wasteful. The 18x bandwidth gap between NVLink and single-port InfiniBand is the core reason.

What causes pipeline bubbles, and how are they minimized?

Pipeline bubbles are periods of GPU idle time at the start and end of each training step. With p pipeline stages, the last stage sits idle for (p-1) micro-batch durations while the pipeline fills. The bubble fraction is (p-1)/m where m is the number of micro-batches per step. With 4 stages and 4 micro-batches, bubble time equals 75% of useful compute time. The primary fix is increasing micro-batches: at 32 micro-batches, the bubble drops to 9.4%. Advanced techniques include 1F1B scheduling (interleaving forward and backward passes to limit memory usage), interleaved stages (assigning multiple non-contiguous stages to each GPU), and zero-bubble scheduling (splitting backward passes into input-gradient and weight-gradient components that can be reordered to fill idle slots).

What is the right configuration for a 70B model on 2 nodes of 8xH100?

With 2 nodes of 8xH100, use TP=8 within each node and PP=2 across nodes. The 8-way tensor parallelism leverages NVLink at 900 GB/s for the frequent all-reduce operations. The 2-way pipeline parallelism sends activation tensors (~67 MB each) across InfiniBand at 400 Gb/s, which completes in roughly 1.3 milliseconds per transfer at line rate. Each GPU holds approximately 4.375B parameters (8.75 GB in FP16), leaving roughly 70 GB of the H100's 80 GB HBM3 for activations, KV-cache, and optimizer states. With 32 micro-batches per step, pipeline bubble overhead is only 3.1%.

Can you run tensor parallelism over PCIe?

It is technically possible for very small models, but impractical for anything above 13B parameters. PCIe Gen 5 delivers 64 GB/s bidirectional bandwidth, roughly 14x slower than NVLink 4.0 at 900 GB/s. For an 8-way tensor-parallel 70B model, each all-reduce would take about 25 milliseconds over PCIe versus 1.8 milliseconds over NVLink. With 160 all-reduces per step, the total communication overhead is 4,000 ms on PCIe versus 288 ms on NVLink. Since the compute time is about 2,400 ms, PCIe communication takes longer than the actual computation. You would get better performance with fewer GPUs using tensor parallelism over NVLink than more GPUs over PCIe. Pipeline parallelism over PCIe is viable because activation transfers are much smaller and less frequent.

How did Meta parallelize Llama 3.1 405B training?

Meta trained Llama 3.1 405B on 16,384 H100 GPUs using TP=8, PP=16, DP=128. The 8-way tensor parallelism runs within each 8-GPU DGX node over NVLink. The 16-way pipeline parallelism spans 16 nodes, with each pipeline stage handling approximately 8 of the 126 transformer layers. The 128-way data parallelism replicates 128 copies of the full TP x PP pipeline, each processing different training data. Gradients are synchronized across DP ranks using ring all-reduce over InfiniBand. This configuration achieves strong scaling because the bandwidth-intensive operations (tensor parallelism all-reduce) stay within NVLink, while the bandwidth-light operations (pipeline stage transitions, gradient sync) traverse InfiniBand.

How much inter-node bandwidth does pipeline parallelism actually need?

Much less than you might expect. Pipeline parallelism transfers activation tensors between adjacent stages. For a 70B model with micro-batch size 1 and sequence length 4,096, each activation is approximately 67 MB. With 32 micro-batches per step and some communication-computation overlap, you need roughly 2-5 GB/s of sustained throughput between nodes. A single ConnectX-7 port at 400 Gb/s (50 GB/s) delivers 10x to 25x more bandwidth than required. Even 100 Gb/s Ethernet (12.5 GB/s) is sufficient for pipeline parallelism. The real bottleneck for pipeline parallelism is not bandwidth; it is the pipeline bubble overhead, which is solved by increasing micro-batch count.

Build the Right Cluster for Your Parallelism Strategy

Petronella Technology Group designs custom GPU clusters optimized for your specific parallelism requirements. We select the right NVLink topology, InfiniBand fabric, and network architecture to maximize training throughput and inference performance for your model size and workload.

From single-node DGX deployments to multi-node NVLink cluster builds and full AI infrastructure design. Call for a free consultation.


Petronella Technology Group | 5540 Centerview Dr, Suite 200, Raleigh, NC 27606 | Since 2002


(919) 348-4912 | Founded 2002 | 2,500+ Clients

CMMC-RP Certified Team: Craig Petronella, Blake Rea, Justin Summers, Jonathan Wood

Craig Petronella: CMMC-RP, CCNA, CWNE, DFE #604180