NVIDIA MIG: GPU Slicing for Datacenter Concurrency
Partition One GPU Into Up to Seven Isolated Instances
Multi-Instance GPU technology delivers hardware-level isolation for compute, memory, and bandwidth on every slice. Run multiple workloads on a single A100, H100, or H200 with zero contention and zero context-switching overhead.
The GPU Contention Problem
Modern datacenter GPUs like the A100 80GB and H100 80GB are enormously powerful. A single H100 delivers up to 3,958 TFLOPS of FP8 throughput (with sparsity) and 80GB of HBM3 memory. The problem is that most inference workloads, development tasks, and small training jobs use only a fraction of that capacity.
Without MIG: Waste or Contention
When a datacenter GPU serves one workload at a time, the utilization story is grim. A small language model inference job might consume 10GB of memory and 15% of available compute on an 80GB H100. The remaining 70GB of memory and 85% of compute sit idle, burning power and generating zero revenue.
The traditional workaround is time-slicing, where the GPU scheduler rapidly switches between workloads. This introduces real penalties: context-switching overhead that degrades throughput by 10-30%, no memory isolation between tenants (one workload can see or corrupt another's memory), and unpredictable tail latency caused by scheduling jitter. For production inference serving, those tail latency spikes translate directly into SLA violations.
In multi-tenant environments like managed Kubernetes clusters, shared AI development platforms, or inference-as-a-service deployments, the absence of hardware isolation creates both a performance problem and a security problem. Tenants cannot trust that their data is isolated from other users on the same GPU.
With MIG: True Hardware Partitioning
NVIDIA Multi-Instance GPU (MIG) solves this by partitioning a single physical GPU into multiple isolated instances at the hardware level. Each MIG instance receives dedicated Streaming Multiprocessors (SMs), dedicated memory, and dedicated memory bandwidth controllers. This is not software scheduling or virtualization. It is physical partitioning of the GPU's resources.
A single A100 80GB can be split into seven independent 1g.10gb instances, each behaving like a small standalone GPU with its own compute engines, its own 10GB memory region, and its own fraction of the memory bus. Workloads on different MIG instances cannot interfere with each other, cannot access each other's memory, and do not contend for scheduling resources.
The result is deterministic performance with consistent latency characteristics on every slice, full memory isolation between tenants, and GPU utilization rates that approach 100% instead of the 15-30% typical of single-workload deployments.
The core value proposition: MIG turns one expensive datacenter GPU into multiple independent GPUs, each with hardware-enforced isolation. You get the multi-tenancy of a cloud GPU service with the performance predictability of dedicated hardware.
How MIG Works
MIG operates through a two-level hierarchy of GPU Instances (GIs) and Compute Instances (CIs) that map directly to hardware partitioning boundaries inside the GPU.
GPU Instances (GIs)
A GPU Instance is the top-level partition. Each GI receives a fixed allocation of Streaming Multiprocessors, memory, and memory bandwidth. The GPU's L2 cache is also partitioned across GIs so that one instance's cache activity cannot evict another's data.
GIs are the isolation boundary. Each GI has its own video decoders, DMA engines, and memory controllers. A fault in one GI (even a GPU hang caused by a buggy kernel) cannot propagate to other GIs on the same physical GPU.
Compute Instances (CIs)
Within each GPU Instance, you can optionally create one or more Compute Instances. CIs subdivide the GI's Streaming Multiprocessors among multiple processes. All CIs within the same GI share that GI's memory allocation.
For most datacenter use cases, you create one CI per GI (a 1:1 mapping). CIs become useful when you want to run multiple small processes within a single memory domain, such as pre-processing and inference pipelines that share tensors.
Hardware-Level Isolation
Each MIG instance appears to the operating system and CUDA runtime as a separate GPU device, identified by its own MIG UUID and exposed through dedicated capability device nodes rather than the parent GPU's device file. Applications require zero code changes. Any CUDA application that runs on a full GPU runs identically on a MIG instance.
Memory protection is enforced by the GPU's Memory Management Unit (MMU). Even a malicious workload running on one MIG instance cannot read or write memory belonging to another instance. This is the same level of isolation you get from separate physical GPUs.
Enabling MIG on a Datacenter GPU
MIG is controlled through nvidia-smi, the NVIDIA System Management Interface. The workflow involves enabling MIG mode on the GPU, creating GPU Instances with specific profiles, and then creating Compute Instances within each GI.
# Enable MIG mode on GPU 0 (requires GPU reset, no active processes)
$ sudo nvidia-smi -i 0 -mig 1
# List available GPU Instance profiles
$ nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU Instance Profiles: |
| GPU Name ID Instances Memory SMs Memory BW |
| 1g.10gb 19 7 9.75 GB 14 ~285 GB/s |
| 2g.20gb 14 3 19.50 GB 28 ~570 GB/s |
| 3g.40gb 9 2 39.25 GB 42 ~950 GB/s |
| 4g.40gb 5 1 39.25 GB 56 ~1140 GB/s |
| 7g.80gb 0 1 79.00 GB 98 ~2039 GB/s |
+-----------------------------------------------------------------------------+
# Create seven 1g.10gb GPU Instances (maximum density)
$ sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -i 0
# Create one Compute Instance in each GPU Instance
$ sudo nvidia-smi mig -cci -i 0
# Verify: each instance appears as a separate device
$ nvidia-smi -L
GPU 0: NVIDIA A100 80GB (MIG 7x 1g.10gb)
MIG 1g.10gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
MIG 1g.10gb Device 1: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
MIG 1g.10gb Device 2: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
MIG 1g.10gb Device 3: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
MIG 1g.10gb Device 4: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
MIG 1g.10gb Device 5: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
MIG 1g.10gb Device 6: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
Each MIG device gets its own UUID and can be assigned to containers, VMs, or individual processes using CUDA_VISIBLE_DEVICES or Kubernetes device plugin resource requests. Applications see a standard GPU device and require no modification.
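Pinning a process to one slice works by exporting the MIG UUID before the CUDA runtime initializes. A minimal Python sketch of that pattern follows; the UUID below is a hypothetical placeholder, so substitute a real value reported by nvidia-smi -L:

```python
import os

# Hypothetical placeholder UUID; use a real one from `nvidia-smi -L`.
MIG_DEVICE = "MIG-00000000-0000-0000-0000-000000000000"

# Must be set before any CUDA framework (PyTorch, TensorFlow, ...) loads,
# because the CUDA runtime reads this variable once at initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = MIG_DEVICE

# From here on, any CUDA application in this process sees exactly one
# device (device 0), which maps to the selected MIG slice.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same variable works for containers (passed via the container runtime) and for systemd services, which is why no application changes are needed.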
MIG Profiles Explained
MIG profiles define how a GPU's resources are divided. The naming convention follows the pattern: {GPU_slice_count}g.{memory}gb. Larger slices get more SMs, more memory, and more memory bandwidth.
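The naming convention is mechanical enough to parse programmatically, which is handy when inventorying a fleet. A small sketch (the function name is illustrative, not an NVIDIA API):

```python
import re

def parse_mig_profile(name: str) -> tuple[int, int]:
    """Parse a MIG profile name like '3g.40gb' into
    (compute_slices, memory_gb)."""
    m = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if m is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_mig_profile("3g.40gb"))  # → (3, 40)
print(parse_mig_profile("1g.10gb"))  # → (1, 10)
```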
NVIDIA A100 80GB MIG Profiles
| Profile | Max Instances | SMs per Instance | Memory | Memory BW | Ideal Workload |
|---|---|---|---|---|---|
| 1g.10gb | 7 | 14 | 9.75 GB | ~285 GB/s | Small model inference, dev sandboxes |
| 2g.20gb | 3 | 28 | 19.50 GB | ~570 GB/s | Medium model inference, RAG pipelines |
| 3g.40gb | 2 | 42 | 39.25 GB | ~950 GB/s | Large model inference, fine-tuning |
| 4g.40gb | 1 | 56 | 39.25 GB | ~1140 GB/s | Compute-heavy inference, can pair with 3g.40gb |
| 7g.80gb | 1 | 98 | 79.00 GB | ~2039 GB/s | Full GPU (training, large model serving) |
NVIDIA H100 80GB MIG Profiles
| Profile | Max Instances | SMs per Instance | Memory | Memory BW | Key Advantage over A100 |
|---|---|---|---|---|---|
| 1g.10gb | 7 | 16 (+14%) | 9.63 GB | ~464 GB/s | 63% more memory bandwidth per slice |
| 1g.20gb | 7 | 16 | 19.25 GB | ~464 GB/s | 2x memory vs 1g.10gb, same SM count |
| 2g.20gb | 3 | 32 | 19.25 GB | ~928 GB/s | 63% more BW, Transformer Engine per slice |
| 3g.40gb | 2 | 48 | 38.50 GB | ~1392 GB/s | FP8 support, 47% more BW |
| 7g.80gb | 1 | 114 | 79.00 GB | ~3250 GB/s | Full H100: 3x A100 FP8 throughput |
Mixed Profile Configurations
MIG allows mixing certain profile sizes on the same GPU, enabling heterogeneous workload scheduling on a single device. The GPU's memory and SMs are divided into 7 slices internally. Profiles consume 1, 2, 3, 4, or 7 of these slices, and valid combinations must sum to 7 or fewer.
- Maximum Density: 7x 1g.10gb (7 small inference endpoints)
- Balanced Mix: 1x 3g.40gb + 2x 2g.20gb (1 large model + 2 medium models)
- Train + Serve: 1x 4g.40gb + 1x 3g.40gb (fine-tuning + production inference)
- Dev Team: 1x 3g.40gb + 4x 1g.10gb (1 staging GPU + 4 developer sandboxes)
- Medium Fleet: 3x 2g.20gb + 1x 1g.10gb (3 medium services + 1 monitoring)
- Full GPU: 1x 7g.80gb (single workload, maximum performance)
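The slice-budget rule is simple to check in code. The sketch below models only the sum-to-7 constraint; real GPUs add placement and alignment rules (covered under Limitations below), so passing this check is necessary but not sufficient:

```python
# Compute slices consumed by each A100 MIG profile.
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}

def fits(profiles: list[str]) -> bool:
    """True if the requested profiles fit within the GPU's 7 compute slices.
    Ignores memory-boundary alignment rules, which can reject some
    combinations that pass this check."""
    return sum(SLICES[p] for p in profiles) <= 7

print(fits(["1g.10gb"] * 7))                    # True: maximum density
print(fits(["3g.40gb", "2g.20gb", "2g.20gb"]))  # True: balanced mix
print(fits(["4g.40gb", "4g.40gb"]))             # False: 8 slices > 7
```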
MIG Use Cases in the Datacenter
MIG transforms how organizations deploy and operate GPU infrastructure. These are the production scenarios where MIG delivers the highest return on investment.
Multi-Tenant Inference Serving
Serve different models to different customers on a single GPU. A managed AI provider can deploy a text classification model, a summarization model, an embedding model, and a small chat model on four separate MIG instances of the same H100. Each tenant gets guaranteed compute and memory with no noisy-neighbor effects.
This is the highest-ROI MIG use case. Instead of provisioning one GPU per model, you provision one GPU per 3 to 7 models, reducing hardware costs by 60 to 85% for inference fleets.
Development Environments
Give each ML engineer a dedicated GPU slice for interactive development. A team of seven developers can share a single A100 80GB, with each person receiving a 1g.10gb instance for prototyping, debugging, and experimentation. Developers get predictable performance because their slice is hardware-isolated from teammates' workloads.
When a developer needs to run a larger experiment, the admin can reconfigure the GPU to provide fewer, larger slices during off-hours. This flexibility eliminates the common pattern of buying one GPU per developer and seeing most of them idle 90% of the time.
Mixed Workload Scheduling
Run training and inference simultaneously on the same GPU. Configure one 4g.40gb instance for fine-tuning a model while a 3g.40gb instance serves production inference traffic. The training job cannot impact inference latency because the instances are hardware-isolated.
This pattern is particularly valuable for continuous learning systems where a model is retrained periodically while the previous version continues serving requests. Without MIG, this requires two separate GPUs or careful time-sharing that risks SLA violations during training windows.
Kubernetes GPU Sharing
In Kubernetes clusters, MIG provides proper GPU sharing with isolation guarantees that time-slicing cannot deliver. Each MIG device appears as a schedulable resource type. Pods request specific MIG profile sizes, and the Kubernetes scheduler places them on nodes with available MIG instances of the requested size.
This is a fundamental improvement over the default Kubernetes GPU model, where one pod claims an entire GPU. With MIG, a single 8-GPU node with all GPUs partitioned into 1g.10gb slices can serve 56 concurrent GPU pods instead of 8.
MIG vs. Time-Slicing vs. MPS
NVIDIA provides three methods for sharing a GPU among multiple workloads. Each has distinct isolation, performance, and compatibility characteristics. Choosing the right method depends on your workload profile and isolation requirements.
| Characteristic | MIG | Time-Slicing | MPS |
|---|---|---|---|
| Isolation Level | Hardware (MMU-enforced) | None (shared everything) | Partial (shared memory space) |
| Memory Isolation | Full (dedicated partition) | None | Configurable limits, not enforced |
| Fault Isolation | Full (GI-level containment) | None (one crash affects all) | None (MPS server crash kills all) |
| Context-Switch Overhead | Zero (parallel execution) | 10 to 30% throughput loss | Near zero (concurrent kernels) |
| Latency Predictability | Deterministic per slice | High jitter, unpredictable p99 | Moderate jitter |
| Max Concurrent Workloads | 7 (hardware limit) | Unlimited (software scheduling) | 48 (MPS client limit) |
| GPU Compatibility | A100, A30, H100, H200, B200 | All NVIDIA GPUs | Volta and newer |
| Best For | Multi-tenant production | Dev/test, low-priority sharing | Cooperative HPC workloads |
Choose MIG When:
- You serve multiple tenants on shared infrastructure
- Inference latency SLAs require predictable p99
- Memory isolation is a security or compliance requirement
- You run Kubernetes and need proper GPU resource scheduling
- Workloads are diverse (different models, different resource needs)
Choose Time-Slicing When:
- You use consumer or older datacenter GPUs without MIG
- Workloads are low-priority and latency is not critical
- You need more than 7 concurrent workloads per GPU
- Development and testing environments where isolation is optional
- Short-lived batch jobs that tolerate scheduling delays
Choose MPS When:
- Multiple processes from the same trusted application
- HPC workloads with cooperative multi-process patterns
- Workloads that individually underutilize SMs and benefit from overlap
- Controlled environments where all processes are trusted
- You can combine MPS inside a MIG instance for maximum density
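The decision criteria above can be condensed into a first-pass chooser. This is a heuristic sketch with simplified inputs (the function and parameter names are illustrative), not NVIDIA guidance:

```python
def pick_sharing_method(mig_capable: bool,
                        needs_isolation: bool,
                        concurrent_workloads: int,
                        trusted_cooperative: bool) -> str:
    """First-pass GPU-sharing choice following the decision lists above."""
    # MIG wins when hardware isolation is required and fits the 7-slice limit.
    if needs_isolation and mig_capable and concurrent_workloads <= 7:
        return "MIG"
    # MPS suits trusted, cooperative processes that can share a memory space.
    if trusted_cooperative and not needs_isolation:
        return "MPS"
    # Time-slicing is the fallback: works everywhere, isolates nothing.
    return "time-slicing"

print(pick_sharing_method(True, True, 4, False))    # → MIG
print(pick_sharing_method(False, False, 20, True))  # → MPS
print(pick_sharing_method(False, False, 20, False)) # → time-slicing
```

A real deployment would also weigh latency SLAs and per-GPU economics, but this captures the isolation-first ordering the tables describe.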
MIG in Kubernetes
The NVIDIA GPU Operator and device plugin provide native Kubernetes integration for MIG. Pods request specific MIG profile types through standard resource specifications, and the scheduler handles placement automatically.
Device Plugin Configuration
The NVIDIA device plugin supports three MIG strategies that determine how MIG devices appear to Kubernetes:
- none: MIG devices are not exposed. The entire GPU appears as a single resource. Use this when MIG is disabled.
- single: All MIG devices on a GPU must be the same profile. Exposed as nvidia.com/gpu. Simplest to manage, but no mixed profiles.
- mixed: Different MIG profiles can coexist on the same GPU. Each profile type is exposed as a distinct resource (for example, nvidia.com/mig-1g.10gb, nvidia.com/mig-3g.40gb). This is the recommended strategy for production clusters.
Pod Scheduling Example
Request a specific MIG profile in your pod spec. The scheduler will place the pod on a node with an available instance of the requested type:
# Pod requesting a 2g.20gb MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: model-server
    image: nvcr.io/nvidia/tritonserver:24.01
    resources:
      limits:
        nvidia.com/mig-2g.20gb: 1
This pod receives exactly one 2g.20gb MIG instance: 28 SMs (A100) or 32 SMs (H100), ~20GB memory, and dedicated memory bandwidth. No other pod can access this instance's resources.
GPU Operator MIG Configuration
The NVIDIA GPU Operator manages MIG configuration at the cluster level through a MIGConfig custom resource. This allows administrators to define MIG profiles per node and reconfigure them without SSH access to individual machines.
# GPU Operator MIG config: 3 profiles on each GPU
apiVersion: nvidia.com/v1alpha1
kind: MIGConfig
metadata:
  name: mixed-workload-config
spec:
  mig-strategy: mixed
  devices:
  - device-filter: "NVIDIA H100 80GB HBM3"
    mig-enabled: true
    mig-devices:
      3g.40gb: 1  # Large model serving
      2g.20gb: 1  # Medium inference
      1g.10gb: 2  # Small tasks
The GPU Operator will drain the node, reconfigure MIG profiles, and uncordon the node automatically. This process takes 30 to 60 seconds per GPU and can be orchestrated across the cluster with rolling updates to avoid downtime.
Performance Characteristics
MIG's hardware partitioning model delivers near-zero overhead per slice. The performance you lose is the resources allocated to other slices, not overhead from the partitioning mechanism itself.
Partitioning Overhead
NVIDIA benchmarks show less than 1% overhead from MIG partitioning itself. A 1g.10gb instance delivers 1/7th of the GPU's compute at 1/7th of the memory bandwidth, with no additional penalty for the isolation mechanism.
The overhead comes from the small amount of memory and SMs reserved for the MIG management infrastructure, which is why usable memory per slice is slightly less than a perfect 1/7th split (9.75 GB per slice on an 80GB GPU, not 11.4 GB).
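The gap between a perfect split and the published slice size is easy to quantify from the numbers above:

```python
total_gb = 80.0          # A100 80GB total memory
slices = 7
usable_per_slice = 9.75  # published 1g.10gb slice size

# What each slice would get if nothing were reserved.
perfect_split = total_gb / slices            # ~11.43 GB
# Gap per slice vs. the published size (reserved memory plus the
# memory-partitioning granularity, per the text above).
gap = perfect_split - usable_per_slice       # ~1.68 GB

print(round(perfect_split, 2), round(gap, 2))  # → 11.43 1.68
```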
Memory Bandwidth Scaling
Memory bandwidth scales linearly with profile size. An A100 80GB delivers ~2039 GB/s total. A 1g.10gb slice gets ~285 GB/s, a 3g.40gb slice gets ~950 GB/s, and a 7g.80gb slice gets the full ~2039 GB/s. This linear scaling is critical for inference workloads where memory bandwidth (not compute) is the bottleneck.
On the H100, the per-slice bandwidth is 63% higher than the A100 for equivalent profile sizes, thanks to HBM3's increased bandwidth. This makes H100 MIG slices particularly effective for bandwidth-limited LLM inference.
Tail Latency
MIG delivers deterministic tail latency because there is no contention between instances. In time-slicing, p99 latency can spike to 3 to 5x the median when the scheduler preempts your workload to service another. With MIG, the p99/p50 ratio stays below 1.3x because your slice runs continuously without interruption.
For inference services with strict SLAs (response within 100ms at p99), this predictability is what makes MIG production-viable. The alternative, overprovisioning dedicated GPUs per service, costs 3 to 7x more in hardware.
Performance summary: A MIG instance behaves like a smaller GPU with a proportional share of compute, memory, and bandwidth. Within that allocation, workloads run at full speed with no scheduling penalties. The key insight is that MIG trades flexibility (you cannot dynamically resize slices) for determinism (your slice's performance is guaranteed and invariant).
MIG Limitations and Constraints
MIG is a powerful tool, but it has hard constraints that affect how and when you should deploy it. Understanding these limitations is essential for designing a correct MIG deployment.
Hardware Requirements
MIG requires NVIDIA Ampere architecture or newer with specific hardware partitioning circuitry. Supported GPUs: A100 (40GB and 80GB), A30, H100, H200, and B200. MIG is not available on consumer GeForce GPUs, professional RTX GPUs (including RTX 6000 Ada), or older datacenter GPUs like the V100 and T4. If your fleet includes non-MIG GPUs, you will need to use time-slicing or MPS on those devices.
Static Partitioning
MIG profiles cannot be resized while workloads are running. Changing the partition layout requires stopping all processes on the GPU, destroying existing MIG instances, creating new ones, and restarting workloads. In Kubernetes, this means draining the node. There is no dynamic scaling: if a workload needs more compute than its slice provides, it must be migrated to a larger slice through a reconfiguration cycle.
Maximum 7 Instances
The hardware supports a maximum of 7 GPU Instances per physical GPU (using the smallest 1g profile). If you need to serve more than 7 concurrent workloads per GPU, you must either use time-slicing within MIG instances, run MPS inside a MIG instance, or accept that some workloads share an instance. For very small workloads, the 7-instance limit may underutilize the available memory in each slice.
No Cross-Instance Communication
MIG instances cannot communicate with each other through NVLink, peer-to-peer GPU memory access, or shared memory. Each instance is fully isolated. This means MIG is not suitable for workloads that require multi-GPU parallelism (such as large model training with tensor parallelism). For those workloads, use full GPUs without MIG enabled.
Profile Alignment Constraints
Not all profile combinations are valid on a single GPU. The GPU's internal memory and SM partitioning follows alignment rules that prevent arbitrary mixing. For example, you cannot create 2x 2g.20gb + 2x 1g.10gb + 1x 1g.10gb on an A100 because the 2g profiles must start at specific memory boundaries. Always consult the NVIDIA MIG User Guide for your specific GPU model to verify that your desired configuration is valid.
Driver and Software Requirements
MIG requires NVIDIA driver 450.80.02 or newer (470+ recommended for full feature support). CUDA 11.0 or newer is required. Some features, such as the 1g.20gb profile on H100, require CUDA 12.0+. Container runtimes must be configured with the NVIDIA Container Toolkit to properly map MIG devices into containers. Always test your specific CUDA application version against MIG before production deployment.
Cost Optimization with MIG
MIG changes the economics of GPU infrastructure by converting one high-cost GPU into multiple usable devices. The TCO impact is substantial for inference-heavy deployments.
Scenario: 1x H100 80GB with MIG
- ✓ 7 isolated inference endpoints on one GPU
- ✓ 1 PCIe or SXM slot consumed
- ✓ ~700W total power draw
- ✓ 1 GPU to monitor, maintain, and replace
- ✓ Hardware-level isolation between all 7 workloads
- ✓ Single driver update, single firmware update
- ✓ ~9.6GB memory per instance with HBM3 bandwidth
Estimated cost: $30,000 to $40,000 for the GPU (varies by form factor and vendor). Annual power: ~$920 at $0.15/kWh.
Alternative: 7x Smaller GPUs
- ✓ 7 dedicated GPUs (e.g., L4 or T4)
- ✓ 7 PCIe slots consumed (2+ servers required)
- ✓ ~500W total power draw (7x ~72W per L4)
- ✓ 7 GPUs to monitor, maintain, and replace
- ✓ Natural isolation (separate physical devices)
- ✓ 7 driver updates, firmware updates, health checks
- ✓ 24GB memory per L4, but GDDR6 (not HBM3)
Estimated cost: $21,000 to $35,000 for 7x L4 GPUs plus 2 server chassis, additional NICs, rack space, and cabling. Annual power: ~$657 at $0.15/kWh.
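The annual power figures in both scenarios follow from one formula: kWh = watts × hours ÷ 1000, priced at $0.15/kWh. A quick check:

```python
def annual_power_cost(watts: float, price_per_kwh: float = 0.15) -> float:
    """Annual electricity cost for a continuous load at the given draw."""
    hours_per_year = 24 * 365  # 8760 hours
    kwh = watts / 1000 * hours_per_year
    return kwh * price_per_kwh

print(round(annual_power_cost(700)))  # H100 scenario → 920
print(round(annual_power_cost(500)))  # 7x L4 scenario → 657
```

This assumes 24/7 operation at full draw; real fleets idle below TDP, so these are upper bounds.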
Where MIG Wins on TCO
The GPU hardware cost may be comparable, but MIG's total cost of ownership advantage comes from infrastructure consolidation:
- 50 to 75% less rack space (1 server vs. 2+)
- 3 to 5x higher memory bandwidth per instance vs. consumer GPUs
- 85% fewer management touchpoints (1 GPU vs. 7)
- 1 to 2 RU single server footprint for all 7 workloads
MIG is most cost-effective when workloads fit within a single slice's memory and compute budget. For workloads that require the full 80GB of memory or all 98/114 SMs, MIG adds no value because the workload cannot be partitioned. Petronella Technology Group helps clients analyze their workload profiles to determine the optimal MIG configuration for their specific use case. See our SXM TCO analysis for more on datacenter GPU economics.
Petronella MIG Configuration Services
Petronella Technology Group configures MIG-optimized datacenter GPU deployments for organizations running production AI workloads.
Workload Profiling
We analyze your inference models, training jobs, and development workflows to determine the optimal MIG profile configuration for your GPU fleet. This includes memory footprint analysis, compute utilization profiling, and bandwidth requirement mapping.
Kubernetes Integration
Full GPU Operator deployment with MIG-aware scheduling, including device plugin configuration, resource quota policies, and automated MIG profile management through custom resources. We configure both single and mixed MIG strategies based on your workload diversity.
Hardware Selection
Guidance on A100 vs. H100 vs. H200 for your MIG use case, including the NVIDIA DGX and custom AI development systems. We evaluate whether MIG, time-slicing, or dedicated GPUs provide the best TCO for each workload category.
Compliance Hardening
MIG's hardware isolation makes it suitable for compliance-sensitive multi-tenant deployments. Our CMMC-RP certified team configures MIG with audit logging, access controls, and tenant isolation documentation for HIPAA, CMMC, and NIST 800-171 environments.
Explore our full range of AI infrastructure services for enterprise deployments.
Frequently Asked Questions
Which GPUs support MIG?
MIG is supported on NVIDIA A100 (40GB and 80GB), A30, H100, H200, and B200 datacenter GPUs. It is not available on consumer GeForce GPUs, professional RTX GPUs, or older datacenter GPUs like the V100 or T4. MIG requires the Ampere architecture or newer with specific hardware partitioning circuitry built into the GPU die.
How many MIG instances can a single GPU support?
The maximum is 7 instances on a single A100 or H100 GPU using the 1g.10gb profile (the smallest slice with ~10GB memory each). Larger profiles reduce the instance count: 3 instances with 2g.20gb, 2 instances with 3g.40gb, or 1 full-GPU instance with 7g.80gb. You can also mix certain compatible profile sizes on the same GPU.
How much performance overhead does MIG add?
MIG itself adds less than 1% overhead because isolation is enforced at the hardware level, not through software scheduling. Each MIG instance gets dedicated Streaming Multiprocessors and dedicated memory controllers. The only trade-off is that each instance receives a fraction of the full GPU's compute and memory resources. Within its allocated slice, a workload runs at full speed with no context-switching penalty.
How does MIG compare to time-slicing and MPS?
MIG provides hardware-level isolation with dedicated compute, memory, and memory bandwidth per instance. Time-slicing shares the GPU by rapidly switching between workloads, which introduces context-switch overhead and provides no memory isolation. MPS (Multi-Process Service) allows concurrent kernel execution from multiple processes with shared memory space but no fault isolation. MIG is preferred for production multi-tenant environments, time-slicing for lightweight development sharing, and MPS for tightly coupled cooperative workloads from trusted sources.
Does Kubernetes support MIG?
Yes. The NVIDIA GPU Operator and NVIDIA device plugin for Kubernetes fully support MIG. You configure MIG profiles on each node and schedule pods to specific MIG device types using standard Kubernetes resource requests (for example, nvidia.com/mig-2g.20gb: 1). This allows fine-grained GPU allocation where different pods receive different slice sizes based on their workload requirements.
Can MIG be reconfigured without a reboot?
You can destroy and recreate MIG instances without a full system reboot, but all running workloads on that GPU must be stopped first. Enabling or disabling MIG mode itself requires a GPU reset. In Kubernetes environments, the GPU Operator automates this process: it drains the node, reconfigures MIG profiles (typically 30 to 60 seconds), and uncordons the node. Dynamic resizing while workloads are running is not supported.
How does MIG reduce infrastructure costs?
MIG on a single H100 80GB can replace up to 7 smaller inference GPUs while consuming one PCIe slot, one power connection, and one cooling footprint. The TCO savings come from reduced server count, lower operational complexity, less rack space, simplified networking, and fewer points of failure. For inference workloads that fit within a MIG slice, the per-query cost is significantly lower than dedicating an entire GPU per workload. The trade-off is that each MIG slice has less total memory and compute than a dedicated L4 or T4. Petronella can help you model both options for your specific workload mix.
Deploy MIG-Optimized GPU Infrastructure
Petronella Technology Group configures MIG-optimized datacenter GPU deployments for organizations running multi-tenant inference, shared development environments, and mixed AI workloads. Our CMMC-RP certified team handles hardware selection, MIG profile design, Kubernetes integration, and compliance hardening.
Call now for a free GPU infrastructure consultation. We will analyze your workloads and design the optimal MIG configuration.
Or schedule a call at a time that works for you
Petronella Technology Group | 5540 Centerview Dr, Suite 200, Raleigh, NC 27606 | Since 2002