Unsloth Fine-Tuning: ship private LLMs without the cloud bill
Unsloth trains Llama, Mistral, Gemma, Qwen, and Phi models roughly 2x faster while cutting VRAM use by up to 70 percent. Petronella Technology Group helps technical teams go from a notebook script to a hardened on-prem fine-tune pipeline that fits HIPAA, CMMC, and internal data policies.
What Unsloth actually is, and why it keeps winning benchmarks
Unsloth is an open-source Python library that makes fine-tuning large language models faster and far cheaper on a single consumer or data-center GPU. It is not a model. It is not a cloud service. It is a drop-in replacement for the HuggingFace transformers + trl training loop, with custom Triton kernels, manual backprop for the expensive ops, and memory-layout tricks that eliminate the waste HuggingFace leaves on the table.
The practical promise is simple. You write the same training script you already know. You swap AutoModelForCausalLM for FastLanguageModel. You keep the HuggingFace dataset APIs, the TRL SFTTrainer, and the PEFT LoRA config. Training then runs about 2x faster and uses roughly 70 percent less VRAM, which means a 4090 can fine-tune models that previously required an H100, and an H100 can fine-tune models that previously required a multi-GPU node.
Those numbers come from the Unsloth team's published benchmarks on their official site and the GitHub README. Our job at Petronella is not to re-benchmark the library. Our job is to translate those gains into production wins on your hardware, your data, and your compliance boundary.
How Unsloth gets 2x faster, explained for engineers
If you have ever profiled a HuggingFace fine-tune, you already know the bottleneck is rarely the math. It is memory bandwidth, redundant kernel launches, and the Python-C++ boundary. Unsloth rewrites the hot paths end to end. The specific wins worth knowing:
Custom Triton kernels for cross-entropy, RoPE, and RMSNorm
The default HuggingFace implementations keep intermediate tensors around that the training loop never reads. Unsloth fuses the forward and backward passes for these ops into single Triton kernels, which removes the extra allocation and the extra HBM round-trip. On a Llama 3 8B fine-tune, this alone saves several gigabytes of activation memory per step.
Manual backprop for LoRA
PyTorch autograd is general-purpose. LoRA does not need general-purpose autograd because the graph is narrow and predictable. Unsloth writes the backward pass by hand for the LoRA layers, which eliminates a chunk of autograd bookkeeping and further cuts memory.
4-bit QLoRA with bitsandbytes NF4
Quantizing base weights to 4-bit NF4 shrinks the model footprint by roughly 4x. Unsloth integrates this cleanly with the Triton kernels so you do not pay a performance penalty for the dequantize-on-the-fly math. In practice, a Llama 3.1 8B QLoRA fine-tune with sequence length 2048 fits comfortably in 16 GB of VRAM. A 4090 handles it with room for longer contexts.
Gradient checkpointing rewritten
PyTorch's default torch.utils.checkpoint re-runs the forward pass to recover activations. Unsloth ships a variant, enabled with use_gradient_checkpointing="unsloth", that pre-computes and stores only what the backward actually needs, trading a small amount of compute for a big drop in peak memory. On context lengths above 4K this is the setting that lets you fit the whole batch.
use_gradient_checkpointing="unsloth" with packing=True in SFTTrainer gives you short-sequence packing and low-memory long-context handling in the same run. Most teams forget the second flag and burn 30 percent of their wall-clock training time on padding tokens.Multi-GPU
The open-source single-GPU path is the core of Unsloth. Multi-GPU training is available today and continues to ship upgrades. For teams that need to scale past a single H100 or past a single 48 GB card, we help with DDP, FSDP, and careful attention to gradient accumulation so the effective batch size matches what your eval harness expects.
When Unsloth is the right tool (and when it is not)
Unsloth is not a silver bullet. It is the right pick for a specific, common workload: parameter-efficient fine-tuning of a transformer decoder model on a single GPU or small multi-GPU node. That covers the majority of real enterprise fine-tuning work, but not all of it.
| Workload | Unsloth fit | Better alternative |
|---|---|---|
| QLoRA fine-tune, 7B-70B, single GPU | Best-in-class | None |
| Full fine-tune, 8B-class on an H100 | Strong, uses FP8 / bf16 | torchtune for research comparisons |
| Massive multi-node pretraining | Not the goal | NeMo, Megatron-LM, or torchtitan |
| DPO, ORPO, GRPO RL fine-tuning | Supported, ~80 percent VRAM savings on GRPO | Stock TRL if you need bleeding-edge algorithms |
| Vision or speech adapter training | Supported, growing | HuggingFace direct for unusual modalities |
| Mixing into axolotl YAML workflows | Possible via kernel patches | axolotl native config for team familiarity |
If your team already lives in axolotl and the YAML-driven pipeline works, we will often keep you there and patch in Unsloth kernels where they help. If your team is starting fresh and needs to fine-tune a Llama 3.3 70B against a private dataset on a pair of H100s without a month of plumbing, Unsloth is the fastest path from zero to running.
Supported models and the hardware sweet spot
Unsloth supports more than 500 model checkpoints spanning text generation, vision-language models, text-to-speech, and embedding models. The ones enterprise teams ask us about most often:
NVIDIA hardware, practical guidance
Unsloth runs on NVIDIA Ampere, Ada, Hopper, and Blackwell. Bf16 support is required for best performance, which rules out GTX cards and older Pascal or Volta. The sweet spots we deploy most often:
- RTX 4090 (24 GB): ideal for 7B-class QLoRA fine-tunes, fast iteration, and proof-of-concept work. Fits Llama 3.1 8B with 2K context comfortably.
- RTX 6000 Ada (48 GB): the workstation card we recommend for teams that want a single machine under their desk for 8B-13B fine-tunes with long contexts.
- A100 80 GB: the data-center workhorse. Handles 70B QLoRA fine-tunes, long contexts, and heavier batch sizes.
- H100 / H200: required if you want FP8 training, multi-node Infiniband, or bleeding-edge throughput on 70B-class models.
- NVIDIA DGX Spark and DGX Station: Unsloth explicitly supports these smaller workgroup systems, which land in a price band many mid-market teams can actually justify.
Macs with MLX support and AMD GPUs are progressing. Today we recommend MLX for running fine-tuned models locally on Apple silicon and CUDA for the training step itself. If you are evaluating hardware, our AI training workstation builds and full AI workstation lineup walk through the trade-offs we see every week.
Worked example: fine-tune Llama 3.1 8B on a 4090 with QLoRA
This is the shortest honest path from a fresh environment to a trained LoRA adapter you can merge and serve. Assume you have Python 3.11, CUDA 12.x, and a clean virtual environment. Install Unsloth along with its training stack:
pip install unsloth pip install --upgrade --no-deps "trl<0.9.0" peft accelerate bitsandbytes
Load a 4-bit base model with FastLanguageModel.from_pretrained. The key parameters are the model name, maximum sequence length, and the 4-bit flag. Unsloth auto-detects bf16 vs fp16 for you:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
max_seq_length = max_seq_length,
dtype = None, # auto: bf16 on Ada/Hopper, fp16 on Ampere
load_in_4bit = True,
use_gradient_checkpointing = "unsloth",
)
Attach LoRA adapters. The defaults below are what we ship for 90 percent of customer projects. Start with r=16 and target the full attention and MLP projection set. You can raise rank later if eval shows the adapter is underfitting:
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 3407,
)
Point TRL's SFTTrainer at your dataset. We prefer packing=True because it crushes wall-clock time on mixed-length data. Use adamw_8bit for the optimizer to save more memory:
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
dataset = load_dataset("your-org/internal-tickets", split="train")
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
packing = True,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_ratio = 0.1,
num_train_epochs = 2,
learning_rate = 2e-4,
bf16 = torch.cuda.is_bf16_supported(),
fp16 = not torch.cuda.is_bf16_supported(),
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
logging_steps = 10,
output_dir = "outputs",
seed = 3407,
),
)
trainer.train()
Save the LoRA adapter, or merge it into the base weights and export to GGUF for llama.cpp, Ollama, or vLLM serving:
# Save LoRA adapter only (small, shareable)
model.save_pretrained("lora-adapter")
# Or merge to full weights and save in 16-bit
model.save_pretrained_merged("merged-16bit", tokenizer, save_method = "merged_16bit")
# Or export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("merged-gguf", tokenizer, quantization_method = "q4_k_m")
That is the full loop. On a 4090 with 24 GB of VRAM, this configuration runs at roughly 2x the throughput of the equivalent stock HuggingFace pipeline, with headroom to spare for dataset preprocessing. Your mileage will vary by dataset, sequence length, and packing ratio, but the shape of the result holds.
Failure modes we see in production, and how to fix them
Unsloth works. Your training job can still fail. These are the issues we have debugged repeatedly for customers and what the real fix looks like.
OOM at first forward pass
Almost always one of three things. Your max_seq_length is higher than your actual data requires, which inflates the KV cache. Your batch size is above 1 on a 24 GB card with an 8B model and long context. You forgot to set use_gradient_checkpointing="unsloth". The fix order is sequence length first, then gradient checkpointing, then batch size. Gradient accumulation gets you back to your target effective batch.
Loss spikes to NaN a few hundred steps in
Bf16 is usually fine. Fp16 with a small LoRA rank is the usual culprit because the tiny gradients underflow. Switch to bf16 if the card supports it (Ada, Hopper, Blackwell all do). If you are stuck on Ampere without bf16 headroom, drop the learning rate from 2e-4 to 1e-4 and add a longer warmup.
Model converges but eval is worse than the base model
The training ran. Your adapter captured something that is not what you wanted. The fix is rarely more training. It is almost always data hygiene: duplicate examples that push the model toward a narrow response pattern, label leakage between your SFT set and your eval set, or a template mismatch where the chat template you used at training time does not match the one you use at inference.
Generated outputs hallucinate worse than before
Classic sign of a too-high learning rate or too many epochs. LoRA is lossy. Over-training a LoRA adapter past the natural loss plateau causes catastrophic forgetting of the pretrained capabilities. Stop at the plateau, not past it. Use an eval harness that runs generation on held-out prompts and inspect qualitatively every few hundred steps.
GGUF export produces a broken model
The single biggest cause is a tokenizer that was modified during fine-tuning without re-saving, so the GGUF embeds the original vocab while the adapter expects the modified one. Always save tokenizer and model together, even if you think the tokenizer is unchanged.
Getting from a working notebook to a production pipeline
A script that trains cleanly in a notebook is about 15 percent of the work. The remaining 85 percent is the part that breaks on Monday morning. Here is the operational scaffolding we build with every Petronella engagement.
Data pipeline
Raw records rarely arrive in a training-ready format. We build idempotent preprocessing jobs, usually on Airflow or a lightweight n8n pipeline, that normalize, deduplicate, PII-scrub, split, and version your dataset.
- Versioned splits so eval stays reproducible
- Automatic PII redaction for HIPAA and CMMC scopes
- Template enforcement so chat format matches inference
Training orchestration
Unsloth runs inside a container. We wrap it with a job runner so every training run is tagged, logged, and reproducible.
- Docker image pinned to a known CUDA and Unsloth version
- Hyperparameter config in YAML under git
- MLflow or Weights and Biases for run tracking
Eval harness
If you cannot measure it, do not ship it. We wire up an automated eval that runs on every checkpoint against your real business tasks, not just perplexity.
- Task-specific rubrics, graded by an LLM judge or humans
- Regression tests against a known-good baseline
- Blind A-B comparisons against the stock base model
Serving
A LoRA adapter is not a product. We merge and quantize for the target serving engine, then stand up a monitored endpoint behind your existing auth.
- vLLM for high-throughput GPU serving
- llama.cpp / Ollama for CPU or Mac fallbacks
- OpenAI-compatible API so downstream code does not change
On-prem, cloud, or hybrid: which makes sense
The cloud-versus-on-prem decision for fine-tuning is not religious. It is a three-way trade between data gravity, cost predictability, and compliance obligation. Here is how we frame it with clients.
Choose on-prem when
- Your training data includes PHI, FCI, CUI, or anything covered by HIPAA, CMMC, ITAR, or a board-level data sovereignty policy.
- You fine-tune more than once a quarter. Hardware pays back inside 12 months at that cadence.
- You need predictable latency for inference and cannot accept shared-tenancy variance.
- Your security team will not sign off on exporting raw data to a managed service, even with a BAA.
Choose cloud GPUs when
- You are running one-off experiments and do not want to own hardware.
- You need transient access to H100s or B200s for a specific burst.
- Your data is already in a major cloud and moving it costs more than training.
Choose hybrid when
- Most fine-tuning happens on-prem, but occasional large training runs borrow cloud capacity.
- Data preparation happens in the cloud (where your source systems live) and training happens on-prem (where compliance wants it).
For Petronella clients operating under CMMC or HIPAA, on-prem or private-tenant cloud is almost always the answer. The combination of Unsloth's efficiency and a single workstation-class GPU means you do not need a six-figure hardware budget to make it work. An AI training workstation with a single RTX 6000 Ada or a pair of 4090s lands under $25K and handles the typical enterprise fine-tune schedule with room to spare.
What does an Unsloth fine-tune actually cost
Cost conversations tend to collapse into three scenarios. We walk clients through all three during discovery.
Scenario A: cloud API fine-tuning
Managed services charge per token of training data and per token of inference. For a modest 50 MB JSONL dataset, training typically runs a few hundred dollars per run. The hidden cost is inference: every production call costs per-token for the lifetime of the product. A chatbot that serves 10 million tokens per day can easily reach five figures monthly.
Scenario B: renting cloud GPUs
An H100 on a hyperscaler runs $2-$5 per hour on-demand. A typical 8B QLoRA fine-tune with Unsloth finishes in 1-6 hours depending on dataset size. Call it $20 per run. Good for experiments. The economics flip the moment you need recurring training or predictable inference.
Scenario C: on-prem with Unsloth
A single RTX 6000 Ada workstation is a one-time purchase under $12K. Electricity at 450W running 8 hours per day is roughly $15 per month at typical US rates. Training runs are free after hardware amortization. Inference uses the same box. For teams doing more than two fine-tunes per quarter, on-prem breaks even inside a year and keeps going.
Private data, never leaves your network
The most common reason Petronella clients choose Unsloth over a managed service is simple. The data cannot leave the premises. Healthcare providers, defense contractors, law firms, accounting firms, and financial services firms all face regulatory regimes that either forbid or heavily restrict sending sensitive data to a third-party AI service.
Unsloth lets you train the model the data stays with. That changes the compliance conversation from "how do we get a BAA with OpenAI" to "how do we harden our existing infrastructure." We bring to that conversation:
- A CMMC-RP-certified team (Craig Petronella, Blake Rea, Justin Summers, Jonathan Wood) that works CMMC assessments as a day job.
- HIPAA experience rooted in Craig's Amazon-published HIPAA book and a client portfolio covering dental, medical, and legal verticals.
- PPSB accreditation and BBB A+ since 2003.
- Standing infrastructure for on-prem GPU deployment, including physical, network, and identity controls that map to CMMC Level 2 practices.
For covered entities, the combination of Unsloth on an on-prem GPU plus our HIPAA compliance consulting produces an environment where you can train on production PHI safely. We document the boundary, the key management, and the audit trail so you can pass a regulator's questions without sweating.
When to call Petronella
Plenty of teams do not need us. If you have two ML engineers who have shipped a fine-tune before, a clean dataset, and an on-prem GPU already racked, follow the Unsloth docs and ship. The library is genuinely that approachable.
The conversations we do have usually fall into one of these patterns:
Compliance-bound fine-tuning
You have PHI, CUI, or internal data a managed service will not take. You need the training to happen behind your firewall and the pipeline to survive an auditor. We design and build the environment, then hand the runbook to your team. This is the core AI services engagement and typically runs four to eight weeks.
GPU hardware selection and deployment
You know you need on-prem, but you do not know whether a 4090, a 6000 Ada, an H100, or a DGX Station is the right fit. We size the hardware to your actual workload, source it, rack it, and leave you with a working training environment. See the hardware catalog for what we typically deploy.
Notebook to production lift
Your team got a fine-tune working in a notebook. Now you need it to run on a schedule, with versioned data, tracked runs, an eval harness, and a serving endpoint. We build the scaffolding and train your team on it.
Multi-modal or specialty training
Vision-language adapters, speech TTS, embedding fine-tunes, or GRPO-style RL. These all work with Unsloth but have more edges. We have done them. Our custom AI development team takes on the weird stuff.
Strategic AI planning
You are earlier in the journey and are not sure whether fine-tuning, RAG, or prompt engineering is the right move. Our AI consulting in Raleigh and the Research Triangle walks through the decision before anyone writes training code.
Questions engineers actually ask
Straight answers based on what we see in production deployments.
Does Unsloth really match HuggingFace accuracy, or is something being traded away?
Can Unsloth fine-tune a 70B model on a single GPU?
How does Unsloth compare to axolotl?
Can we fine-tune on PHI or CUI with Unsloth?
What about multi-GPU? Is that Pro-only?
How long does a typical fine-tune take?
Do we need Flash Attention installed?
What Python and CUDA versions are supported?
Can Unsloth export to GGUF for llama.cpp and Ollama?
model.save_pretrained_gguf() handles the merge and quantization in a single call. We typically export Q4_K_M for balance of size and quality, Q5_K_M when quality matters more, or Q8_0 when you want near-lossless and do not care about file size.How much does a Petronella Unsloth engagement cost?
Ready to ship a private fine-tuned model?
Tell us the model, the data, and the compliance boundary. We will tell you what the fastest honest path to production looks like, on your hardware, under your rules.