Unsloth Fine-Tuning: ship private LLMs without the cloud bill

Unsloth trains Llama, Mistral, Gemma, Qwen, and Phi models roughly 2x faster while cutting VRAM use by up to 70 percent. Petronella Technology Group helps technical teams go from a notebook script to a hardened on-prem fine-tune pipeline that fits HIPAA, CMMC, and internal data policies.

Single-GPU and multi-GPU | QLoRA, LoRA, full fine-tune | On-prem, air-gapped, or private cloud

What it is

What Unsloth actually is, and why it keeps winning benchmarks

Unsloth is an open-source Python library that makes fine-tuning large language models faster and far cheaper on a single consumer or data-center GPU. It is not a model. It is not a cloud service. It is a drop-in replacement for the HuggingFace transformers + trl training loop, with custom Triton kernels, manual backprop for the expensive ops, and memory-layout tricks that eliminate the waste HuggingFace leaves on the table.

The practical promise is simple. You write the same training script you already know. You swap AutoModelForCausalLM for FastLanguageModel. You keep the HuggingFace dataset APIs, the TRL SFTTrainer, and the PEFT LoRA config. Training then runs about 2x faster and uses roughly 70 percent less VRAM, which means a 4090 can fine-tune models that previously required an H100, and an H100 can fine-tune models that previously required a multi-GPU node.

  • ~2x: training throughput vs HuggingFace + FA2 baseline
  • ~70%: VRAM reduction on QLoRA fine-tunes
  • 500+: supported models across text, vision, audio, embeddings
  • 0: accuracy loss claimed on published benchmarks

Those numbers come from the Unsloth team's published benchmarks on their official site and the GitHub README. Our job at Petronella is not to re-benchmark the library. Our job is to translate those gains into production wins on your hardware, your data, and your compliance boundary.

Under the hood

How Unsloth gets 2x faster, explained for engineers

If you have ever profiled a HuggingFace fine-tune, you already know the bottleneck is rarely the math. It is memory bandwidth, redundant kernel launches, and the Python-C++ boundary. Unsloth rewrites the hot paths end to end. The specific wins worth knowing:

Custom Triton kernels for cross-entropy, RoPE, and RMSNorm

The default HuggingFace implementations keep intermediate tensors around that the training loop never reads. Unsloth fuses the forward and backward passes for these ops into single Triton kernels, which removes the extra allocation and the extra HBM round-trip. On a Llama 3 8B fine-tune, this alone saves several gigabytes of activation memory per step.
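To get a feel for why the cross-entropy step is worth fusing, it helps to size the logits tensor that an unfused implementation materializes. The sketch below is back-of-envelope arithmetic, not a measurement of Unsloth. The batch shape and the 128,256-entry Llama 3 vocabulary are assumptions chosen for illustration.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int, bytes_per_value: int) -> int:
    """Bytes needed to hold one full logits tensor of shape (batch, seq, vocab)."""
    return batch_size * seq_len * vocab_size * bytes_per_value


GIB = 1024 ** 3

# Assumed shape: batch 2, 2048 tokens, Llama 3 vocabulary of 128,256 entries.
bf16_logits = logits_bytes(2, 2048, 128_256, 2)
fp32_logits = logits_bytes(2, 2048, 128_256, 4)  # an upcast copy doubles the footprint

print(f"bf16 logits: {bf16_logits / GIB:.2f} GiB")
print(f"fp32 logits: {fp32_logits / GIB:.2f} GiB")
```

Under these assumptions a single logits tensor is close to 1 GiB in bf16 and close to 2 GiB in fp32, before counting its gradient. Avoiding extra copies of a tensor that size is where the savings come from.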

Manual backprop for LoRA

PyTorch autograd is general-purpose. LoRA does not need general-purpose autograd because the graph is narrow and predictable. Unsloth writes the backward pass by hand for the LoRA layers, which eliminates a chunk of autograd bookkeeping and further cuts memory.
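To make "writing the backward pass by hand" concrete, here is a toy pure-Python sketch of a LoRA layer with a hand-derived gradient, checked against a finite difference. It illustrates the idea only. It is not Unsloth's Triton implementation, and every name in it is hypothetical.

```python
import random


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def lora_forward(W, A, B, x, scale):
    """y = W x + scale * B (A x); also returns the cached low-rank activation h = A x."""
    h = matvec(A, x)
    y = [base + scale * low for base, low in zip(matvec(W, x), matvec(B, h))]
    return y, h


def lora_backward(B, x, h, grad_y, scale):
    """Hand-derived gradients for the trainable A and B only; W is frozen."""
    grad_B = [[scale * g for g in row] for row in outer(grad_y, h)]
    grad_h = matvec(transpose(B), grad_y)
    grad_A = [[scale * g for g in row] for row in outer(grad_h, x)]
    return grad_A, grad_B


def loss(W, A, B, x, scale):
    y, _ = lora_forward(W, A, B, x, scale)
    return 0.5 * sum(v * v for v in y)


random.seed(0)
d_in, d_out, rank, scale = 4, 3, 2, 0.5
W = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]
B = [[random.uniform(-1, 1) for _ in range(rank)] for _ in range(d_out)]
x = [random.uniform(-1, 1) for _ in range(d_in)]

y, h = lora_forward(W, A, B, x, scale)
# For L = 0.5 * |y|^2 the upstream gradient dL/dy is simply y.
grad_A, grad_B = lora_backward(B, x, h, grad_y=y, scale=scale)

# Finite-difference check on one entry of A.
eps = 1e-6
A_plus = [row[:] for row in A]
A_plus[0][1] += eps
numeric = (loss(W, A_plus, B, x, scale) - loss(W, A, B, x, scale)) / eps
print(abs(numeric - grad_A[0][1]) < 1e-4)
```

The backward function needs only x, h, and the upstream gradient. Knowing that ahead of time is what lets a hand-written kernel skip the general-purpose bookkeeping.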

4-bit QLoRA with bitsandbytes NF4

Quantizing base weights to 4-bit NF4 shrinks the model footprint by roughly 4x. Unsloth integrates this cleanly with the Triton kernels so you do not pay a performance penalty for the dequantize-on-the-fly math. In practice, a Llama 3.1 8B QLoRA fine-tune with sequence length 2048 fits comfortably in 16 GB of VRAM. A 4090 handles it with room for longer contexts.
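The 4x figure follows directly from bits per weight. A minimal sketch of the arithmetic, assuming a round 8 billion parameters and ignoring the small overhead of quantization constants and any layers kept in higher precision:

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / 1024 ** 3


params_8b = 8.0e9  # assumed round parameter count for an "8B" model

bf16 = weight_gib(params_8b, 16)
nf4 = weight_gib(params_8b, 4)
print(f"bf16 weights: {bf16:.1f} GiB, 4-bit weights: {nf4:.1f} GiB, ratio: {bf16 / nf4:.0f}x")
```

The weights drop from roughly 15 GiB to under 4 GiB, which is what leaves room for activations, adapters, and optimizer state on a 16 GB or 24 GB card.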

Gradient checkpointing rewritten

PyTorch's default torch.utils.checkpoint re-runs the forward pass to recover activations. Unsloth ships a variant, enabled with use_gradient_checkpointing="unsloth", that pre-computes and stores only what the backward actually needs, trading a small amount of compute for a big drop in peak memory. On context lengths above 4K this is the setting that lets you fit the whole batch.

Production tip. Combining use_gradient_checkpointing="unsloth" with packing=True in SFTTrainer gives you short-sequence packing and low-memory long-context handling in the same run. Most teams forget the second flag and burn 30 percent of their wall-clock training time on padding tokens.
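The padding cost is easy to estimate from your own length distribution. The sketch below compares padding every example to max_seq_length against a simple greedy packing. The token counts are hypothetical, and the first-fit packer is a stand-in used only for the arithmetic: TRL's own packing differs in detail.

```python
def padding_fraction(lengths, max_seq_length):
    """Fraction of token slots wasted when every example is padded to max_seq_length."""
    return 1 - sum(lengths) / (len(lengths) * max_seq_length)


def pack_first_fit(lengths, max_seq_length):
    """Greedy first-fit-decreasing packing; returns the used token count per packed row.

    Assumes every example already fits within max_seq_length.
    """
    rows = []
    for n in sorted(lengths, reverse=True):
        for i, used in enumerate(rows):
            if used + n <= max_seq_length:
                rows[i] = used + n
                break
        else:
            rows.append(n)
    return rows


lengths = [1800, 1200, 900, 700, 512, 400, 300, 256, 128, 64]  # hypothetical token counts
max_len = 2048

rows = pack_first_fit(lengths, max_len)
packed_waste = 1 - sum(rows) / (len(rows) * max_len)

print(f"padded: {len(lengths)} rows, {padding_fraction(lengths, max_len):.0%} padding")
print(f"packed: {len(rows)} rows, {packed_waste:.0%} padding")
```

On this made-up distribution, ten padded rows collapse into four packed rows and the wasted share falls from about 69 percent to about 24 percent. Run the same calculation on your real token counts before deciding how much packing will buy you.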

Multi-GPU

The open-source single-GPU path is the core of Unsloth. Multi-GPU training is available today and continues to ship upgrades. For teams that need to scale past a single H100 or past a single 48 GB card, we help with DDP, FSDP, and careful attention to gradient accumulation so the effective batch size matches what your eval harness expects.
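Keeping the effective batch size constant when you add GPUs is simple arithmetic, but it is the step that most often gets missed. A minimal sketch, with hypothetical helper names, assuming plain data parallelism where every GPU processes its own micro-batch:

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def accum_steps_for(target_effective: int, per_device_batch: int, num_gpus: int) -> int:
    """Gradient accumulation steps needed to hit a target effective batch size."""
    per_step = per_device_batch * num_gpus
    if target_effective % per_step != 0:
        raise ValueError("target is not a multiple of per_device_batch * num_gpus")
    return target_effective // per_step


single_gpu = effective_batch_size(per_device_batch=2, grad_accum_steps=4, num_gpus=1)  # 8
steps_on_4_gpus = accum_steps_for(single_gpu, per_device_batch=2, num_gpus=4)          # 1
print(single_gpu, steps_on_4_gpus)
```

If the accumulation steps are left unchanged when moving from one GPU to four, the effective batch quadruples and the learning rate that worked before may no longer be appropriate.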

Use it or skip it

When Unsloth is the right tool (and when it is not)

Unsloth is not a silver bullet. It is the right pick for a specific, common workload: parameter-efficient fine-tuning of a transformer decoder model on a single GPU or small multi-GPU node. That covers the majority of real enterprise fine-tuning work, but not all of it.

Workload | Unsloth fit | Better alternative
QLoRA fine-tune, 7B-70B, single GPU | Best-in-class | None
Full fine-tune, 8B-class on an H100 | Strong, uses FP8 / bf16 | torchtune for research comparisons
Massive multi-node pretraining | Not the goal | NeMo, Megatron-LM, or torchtitan
DPO, ORPO, GRPO RL fine-tuning | Supported, ~80 percent VRAM savings on GRPO | Stock TRL if you need bleeding-edge algorithms
Vision or speech adapter training | Supported, growing | HuggingFace direct for unusual modalities
Mixing into axolotl YAML workflows | Possible via kernel patches | axolotl native config for team familiarity

If your team already lives in axolotl and the YAML-driven pipeline works, we will often keep you there and patch in Unsloth kernels where they help. If your team is starting fresh and needs to fine-tune a Llama 3.3 70B against a private dataset on a pair of H100s without a month of plumbing, Unsloth is the fastest path from zero to running.

Models and hardware

Supported models and the hardware sweet spot

Unsloth supports more than 500 model checkpoints spanning text generation, vision-language models, text-to-speech, and embedding models. The ones enterprise teams ask us about most often:

  • Llama 3.1 8B / 70B
  • Llama 3.2 1B / 3B
  • Llama 3.3 70B
  • Mistral 7B / Small
  • Gemma 2 / Gemma 3
  • Qwen 2.5 / Qwen 3
  • Phi-4
  • DeepSeek
  • gpt-oss
  • embeddinggemma

NVIDIA hardware, practical guidance

Unsloth runs on NVIDIA Ampere, Ada, Hopper, and Blackwell. Bf16 support is required for best performance, which rules out GTX cards and older Pascal or Volta. The sweet spots we deploy most often:

  • RTX 4090 (24 GB): ideal for 7B-class QLoRA fine-tunes, fast iteration, and proof-of-concept work. Fits Llama 3.1 8B with 2K context comfortably.
  • RTX 6000 Ada (48 GB): the workstation card we recommend for teams that want a single machine under their desk for 8B-13B fine-tunes with long contexts.
  • A100 80 GB: the data-center workhorse. Handles 70B QLoRA fine-tunes, long contexts, and heavier batch sizes.
  • H100 / H200: required if you want FP8 training, multi-node InfiniBand, or bleeding-edge throughput on 70B-class models.
  • NVIDIA DGX Spark and DGX Station: Unsloth explicitly supports these smaller workgroup systems, which land in a price band many mid-market teams can actually justify.

Support for Apple silicon (via MLX) and AMD GPUs is progressing. Today we recommend MLX for running fine-tuned models locally on Apple silicon and CUDA for the training step itself. If you are evaluating hardware, our AI training workstation builds and full AI workstation lineup walk through the trade-offs we see every week.

Hands-on

Worked example: fine-tune Llama 3.1 8B on a 4090 with QLoRA

This is the shortest honest path from a fresh environment to a trained LoRA adapter you can merge and serve. Assume you have Python 3.11, CUDA 12.x, and a clean virtual environment. Install Unsloth along with its training stack:

bash
pip install unsloth
pip install --upgrade --no-deps "trl<0.9.0" peft accelerate bitsandbytes

Load a 4-bit base model with FastLanguageModel.from_pretrained. The key parameters are the model name, maximum sequence length, and the 4-bit flag. Unsloth auto-detects bf16 vs fp16 for you:

python, model loading
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = None,                # auto: bf16 on Ampere and newer, fp16 on older cards
    load_in_4bit = True,
    use_gradient_checkpointing = "unsloth",
)

Attach LoRA adapters. The defaults below are what we ship for 90 percent of customer projects. Start with r=16 and target the full attention and MLP projection set. You can raise rank later if eval shows the adapter is underfitting:

python, LoRA config
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
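For a sense of how small this adapter is, the trainable parameter count can be worked out from the layer shapes. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dimension 128). The helper name is hypothetical.

```python
# Assumed Llama 3.1 8B shapes: hidden 4096, MLP 14336, 32 layers,
# 8 KV heads x 128 head dim = 1024 outputs for k_proj and v_proj.
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

PROJECTIONS = {  # name: (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Each adapted projection adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * LAYERS


trainable = lora_params(16)
print(f"{trainable:,} trainable parameters")  # 41,943,040
```

At r=16 that is roughly half a percent of the base model. Doubling the rank doubles the adapter size, which is the trade-off to weigh if eval suggests underfitting.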

Point TRL's SFTTrainer at your dataset. We prefer packing=True because it crushes wall-clock time on mixed-length data. Use adamw_8bit for the optimizer to save more memory:

python, training loop
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

dataset = load_dataset("your-org/internal-tickets", split="train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 2,
        learning_rate = 2e-4,
        bf16 = torch.cuda.is_bf16_supported(),
        fp16 = not torch.cuda.is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        logging_steps = 10,
        output_dir = "outputs",
        seed = 3407,
    ),
)

trainer.train()

Save the LoRA adapter on its own, merge it into 16-bit weights for vLLM serving, or export to GGUF for llama.cpp and Ollama:

python, save + export
# Save LoRA adapter only (small, shareable); keep the tokenizer alongside it
model.save_pretrained("lora-adapter")
tokenizer.save_pretrained("lora-adapter")

# Or merge to full weights and save in 16-bit
model.save_pretrained_merged("merged-16bit", tokenizer, save_method = "merged_16bit")

# Or export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("merged-gguf", tokenizer, quantization_method = "q4_k_m")

That is the full loop. On a 4090 with 24 GB of VRAM, this configuration runs at roughly 2x the throughput of the equivalent stock HuggingFace pipeline, with headroom to spare for dataset preprocessing. Your mileage will vary by dataset, sequence length, and packing ratio, but the shape of the result holds.

Debugging

Failure modes we see in production, and how to fix them

Unsloth works. Your training job can still fail. These are the issues we have debugged repeatedly for customers and what the real fix looks like.

OOM at first forward pass

Almost always one of three things. Your max_seq_length is higher than your actual data requires, which inflates activation memory. Your batch size is above 1 on a 24 GB card with an 8B model and long context. You forgot to set use_gradient_checkpointing="unsloth". The fix order is sequence length first, then gradient checkpointing, then batch size. Gradient accumulation gets you back to your target effective batch.
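One way to right-size max_seq_length is to pick it from the token-length distribution of your own data instead of guessing. A minimal sketch, assuming you have already tokenized the dataset and collected per-example token counts. The function name and sample numbers are hypothetical.

```python
import math


def suggest_max_seq_length(token_counts, coverage=0.99, multiple=128):
    """Smallest multiple of `multiple` that fully covers `coverage` of the examples."""
    if not token_counts:
        raise ValueError("token_counts is empty")
    ordered = sorted(token_counts)
    index = min(len(ordered) - 1, math.ceil(coverage * len(ordered)) - 1)
    return math.ceil(ordered[index] / multiple) * multiple


# Hypothetical per-example token counts with one long outlier.
counts = [180, 220, 240, 310, 350, 400, 420, 515, 600, 3900]

print(suggest_max_seq_length(counts, coverage=0.9))  # 640
print(suggest_max_seq_length(counts, coverage=1.0))  # 3968
```

Here a single outlier would push the setting from 640 to 3968. Examples longer than the chosen value get truncated, so check whether the outliers matter before cutting them.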

Loss spikes to NaN a few hundred steps in

Bf16 is usually fine. Fp16 with a small LoRA rank is the usual culprit because fp16's narrow dynamic range lets tiny gradients underflow and large intermediate values overflow. Switch to bf16 if the card supports it (Ampere, Ada, Hopper, and Blackwell all do). If you are stuck on fp16, for example on a pre-Ampere card, drop the learning rate from 2e-4 to 1e-4 and add a longer warmup.
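The difference in dynamic range is easy to see without a GPU. The sketch below round-trips two hypothetical values through IEEE half precision using the standard library, and approximates bf16 by truncating a float32 to its top 16 bits. It is an illustration of the number formats, not of any training framework's behavior.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE binary16; values too large to represent become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    raw = struct.pack(">f", x)
    return struct.unpack(">f", raw[:2] + b"\x00\x00")[0]


tiny_grad = 1e-8      # hypothetical small gradient value
big_value = 1e5       # hypothetical large intermediate value

print(to_fp16(tiny_grad), to_bf16(tiny_grad))  # fp16 flushes to 0.0, bf16 keeps a nonzero value
print(to_fp16(big_value), to_bf16(big_value))  # fp16 overflows to inf, bf16 stays finite
```

Bf16 gives up mantissa precision to keep float32's exponent range, which is why it tolerates the extremes that break fp16 runs.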

Model converges but eval is worse than the base model

The training ran. Your adapter captured something that is not what you wanted. The fix is rarely more training. It is almost always data hygiene: duplicate examples that push the model toward a narrow response pattern, label leakage between your SFT set and your eval set, or a template mismatch where the chat template you used at training time does not match the one you use at inference.
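Two of these checks, exact duplicates and train/eval overlap, can be automated in a few lines. This is a minimal sketch with hypothetical function names. It catches only examples that match after lowercasing and whitespace normalization, so near-duplicates need a fuzzier method.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(examples):
    """Keep the first occurrence of each normalized example, preserving order."""
    seen, unique = set(), []
    for text in examples:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def leaked(train_examples, eval_examples):
    """Eval examples whose normalized text also appears in the training set."""
    train_keys = {fingerprint(t) for t in train_examples}
    return [e for e in eval_examples if fingerprint(e) in train_keys]


train = ["Reset my password", "reset  my password ", "Close ticket 42"]
held_out = ["Close ticket 42", "Open a new ticket"]

print(dedupe(train))            # ['Reset my password', 'Close ticket 42']
print(leaked(train, held_out))  # ['Close ticket 42']
```

Running both checks before every training job is cheap insurance: a non-empty leak list means the eval score cannot be trusted.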

Generated outputs hallucinate worse than before

Classic sign of a too-high learning rate or too many epochs. LoRA is lossy. Over-training a LoRA adapter past the natural loss plateau causes catastrophic forgetting of the pretrained capabilities. Stop at the plateau, not past it. Use an eval harness that runs generation on held-out prompts and inspect qualitatively every few hundred steps.

GGUF export produces a broken model

The single biggest cause is a tokenizer that was modified during fine-tuning without re-saving, so the GGUF embeds the original vocab while the adapter expects the modified one. Always save tokenizer and model together, even if you think the tokenizer is unchanged.

When in doubt, reduce scope. Train on 10 percent of your data for one epoch. If that fails, the full run will also fail, and you learn the lesson in 20 minutes instead of 20 hours.

Notebook to production

Getting from a working notebook to a production pipeline

A script that trains cleanly in a notebook is about 15 percent of the work. The remaining 85 percent is the part that breaks on Monday morning. Here is the operational scaffolding we build with every Petronella engagement.

Data pipeline

Raw records rarely arrive in a training-ready format. We build idempotent preprocessing jobs, usually on Airflow or a lightweight n8n pipeline, that normalize, deduplicate, PII-scrub, split, and version your dataset.

  • Versioned splits so eval stays reproducible
  • Automatic PII redaction for HIPAA and CMMC scopes
  • Template enforcement so chat format matches inference
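Versioned, reproducible splits can be as simple as hashing a stable record id. The sketch below is one possible approach, not a description of any specific pipeline. The record ids and function names are hypothetical, and the version tag here fingerprints ids only, where a real pipeline would normally hash record content as well.

```python
import hashlib


def assign_split(record_id: str, eval_percent: int = 10) -> str:
    """Stable split: the same id always lands in the same bucket across reruns."""
    bucket = int(hashlib.sha256(record_id.encode("utf-8")).hexdigest(), 16) % 100
    return "eval" if bucket < eval_percent else "train"


def dataset_version(record_ids) -> str:
    """Short tag derived from the set of ids; reordering the records does not change it."""
    digest = hashlib.sha256()
    for rid in sorted(record_ids):
        digest.update(rid.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


ids = [f"ticket-{n}" for n in range(1000)]  # hypothetical record ids
splits = {rid: assign_split(rid) for rid in ids}

print(sum(1 for s in splits.values() if s == "eval"), "eval records")
print("dataset version:", dataset_version(ids))
```

Because the split depends only on the id, adding new records never moves an existing example from train to eval, which keeps eval numbers comparable across dataset versions.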

Training orchestration

Unsloth runs inside a container. We wrap it with a job runner so every training run is tagged, logged, and reproducible.

  • Docker image pinned to a known CUDA and Unsloth version
  • Hyperparameter config in YAML under git
  • MLflow or Weights and Biases for run tracking

Eval harness

If you cannot measure it, do not ship it. We wire up an automated eval that runs on every checkpoint against your real business tasks, not just perplexity.

  • Task-specific rubrics, graded by an LLM judge or humans
  • Regression tests against a known-good baseline
  • Blind A-B comparisons against the stock base model
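A regression gate against a known-good baseline can start as a plain comparison of per-task scores. The sketch below is a minimal version. The task names, scores, and tolerance are all hypothetical placeholders.

```python
def regression_failures(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Tasks where the candidate is missing or scores more than `tolerance` below baseline."""
    failures = {}
    for task, base_score in baseline.items():
        new_score = candidate.get(task)
        if new_score is None or new_score < base_score - tolerance:
            failures[task] = (base_score, new_score)
    return failures


# Hypothetical rubric scores in [0, 1].
baseline = {"ticket_triage": 0.82, "summary_quality": 0.77, "refusal_safety": 0.95}
candidate = {"ticket_triage": 0.88, "summary_quality": 0.70, "refusal_safety": 0.95}

failed = regression_failures(baseline, candidate)
print("ship" if not failed else f"blocked by: {sorted(failed)}")
```

In this example the checkpoint improves triage but regresses on summaries, so the gate blocks it. An aggregate score would have hidden that trade.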

Serving

A LoRA adapter is not a product. We merge and quantize for the target serving engine, then stand up a monitored endpoint behind your existing auth.

  • vLLM for high-throughput GPU serving
  • llama.cpp / Ollama for CPU or Mac fallbacks
  • OpenAI-compatible API so downstream code does not change

On-prem vs cloud

On-prem, cloud, or hybrid: which makes sense

The cloud-versus-on-prem decision for fine-tuning is not religious. It is a three-way trade between data gravity, cost predictability, and compliance obligation. Here is how we frame it with clients.

Choose on-prem when

  • Your training data includes PHI, FCI, CUI, or anything covered by HIPAA, CMMC, ITAR, or a board-level data sovereignty policy.
  • You fine-tune more than once a quarter. Hardware pays back inside 12 months at that cadence.
  • You need predictable latency for inference and cannot accept shared-tenancy variance.
  • Your security team will not sign off on exporting raw data to a managed service, even with a BAA.

Choose cloud GPUs when

  • You are running one-off experiments and do not want to own hardware.
  • You need transient access to H100s or B200s for a specific burst.
  • Your data is already in a major cloud and moving it costs more than training.

Choose hybrid when

  • Most fine-tuning happens on-prem, but occasional large training runs borrow cloud capacity.
  • Data preparation happens in the cloud (where your source systems live) and training happens on-prem (where compliance wants it).

For Petronella clients operating under CMMC or HIPAA, on-prem or private-tenant cloud is almost always the answer. The combination of Unsloth's efficiency and a single workstation-class GPU means you do not need a six-figure hardware budget to make it work. An AI training workstation with a single RTX 6000 Ada or a pair of 4090s lands under $25K and handles the typical enterprise fine-tune schedule with room to spare.

Cost analysis

What does an Unsloth fine-tune actually cost

Cost conversations tend to collapse into three scenarios. We walk clients through all three during discovery.

Scenario A: cloud API fine-tuning

Managed services charge per token of training data and per token of inference. For a modest 50 MB JSONL dataset, training typically runs a few hundred dollars per run. The hidden cost is inference: every production call costs per-token for the lifetime of the product. A chatbot that serves 10 million tokens per day can easily reach five figures monthly.

Scenario B: renting cloud GPUs

An H100 on a hyperscaler runs $2-$5 per hour on-demand. A typical 8B QLoRA fine-tune with Unsloth finishes in 1-6 hours depending on dataset size. That works out to $2-$30 per run; call it $20 as a conservative planning number. Good for experiments. The economics flip the moment you need recurring training or predictable inference.
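The per-run figure is just the hourly rate multiplied by the run length. A short sketch of the arithmetic using the ranges quoted above. The monthly retraining cadence is an assumption added for illustration.

```python
def run_cost_range(rate_low, rate_high, hours_low, hours_high):
    """Cheapest and most expensive plausible cost for one training run, in dollars."""
    return rate_low * hours_low, rate_high * hours_high


low, high = run_cost_range(2.0, 5.0, 1, 6)
print(f"one run: ${low:.0f} to ${high:.0f}")

runs_per_year = 12  # assumed monthly retraining cadence
print(f"yearly training spend at the high end: ${high * runs_per_year:.0f}")
```

Even at the high end, training alone stays in the hundreds of dollars per year. The cost that changes the decision is recurring inference, not the fine-tune itself.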

Scenario C: on-prem with Unsloth

A single RTX 6000 Ada workstation is a one-time purchase under $12K. Electricity at 450W running 8 hours per day is roughly $15 per month at typical US rates. Training runs are free after hardware amortization. Inference uses the same box. For teams doing more than two fine-tunes per quarter, on-prem breaks even inside a year and keeps going.
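The running-cost estimate can be reproduced in a few lines. The electricity rate of $0.15 per kWh and the 24-month write-off period are assumptions for this sketch. Substitute your own utility rate and depreciation policy.

```python
def monthly_electricity(watts, hours_per_day, usd_per_kwh, days=30):
    """Monthly electricity cost in dollars for a device drawing `watts` while running."""
    return watts / 1000 * hours_per_day * days * usd_per_kwh


def monthly_amortization(hardware_usd, months):
    """Straight-line hardware cost per month."""
    return hardware_usd / months


power = monthly_electricity(450, 8, usd_per_kwh=0.15)  # 0.15 USD/kWh is an assumed rate
hardware = monthly_amortization(12_000, months=24)     # 24-month write-off is an assumption

print(f"electricity: ${power:.2f}/month, hardware: ${hardware:.0f}/month")
```

At the assumed rate the electricity comes to about $16 per month, close to the rough figure above, and the hardware itself dominates at $500 per month over the assumed period.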

The quiet win. On-prem with Unsloth also removes per-token inference fees. For products with sustained usage, the inference savings often dwarf the training savings. We have seen mid-market clients replace $6K-$15K per month in OpenAI bills with a single $12K-$25K workstation amortized over 24 months.

Security and compliance

Private data, never leaves your network

The most common reason Petronella clients choose Unsloth over a managed service is simple. The data cannot leave the premises. Healthcare providers, defense contractors, law firms, accounting firms, and financial services firms all face regulatory regimes that either forbid or heavily restrict sending sensitive data to a third-party AI service.

Unsloth lets you train the model where the data already lives. That changes the compliance conversation from "how do we get a BAA with OpenAI" to "how do we harden our existing infrastructure." We bring to that conversation:

  • A CMMC-RP-certified team (Craig Petronella, Blake Rea, Justin Summers, Jonathan Wood) that works CMMC assessments as a day job.
  • HIPAA experience rooted in Craig's Amazon-published HIPAA book and a client portfolio covering dental, medical, and legal verticals.
  • PPSB accreditation and BBB A+ since 2003.
  • Standing infrastructure for on-prem GPU deployment, including physical, network, and identity controls that map to CMMC Level 2 practices.

For covered entities, the combination of Unsloth on an on-prem GPU plus our HIPAA compliance consulting produces an environment where you can train on production PHI safely. We document the boundary, the key management, and the audit trail so you can pass a regulator's questions without sweating.

How we help

When to call Petronella

Plenty of teams do not need us. If you have two ML engineers who have shipped a fine-tune before, a clean dataset, and an on-prem GPU already racked, follow the Unsloth docs and ship. The library is genuinely that approachable.

The conversations we do have usually fall into one of these patterns:

Compliance-bound fine-tuning

You have PHI, CUI, or internal data a managed service will not take. You need the training to happen behind your firewall and the pipeline to survive an auditor. We design and build the environment, then hand the runbook to your team. This is the core AI services engagement and typically runs four to eight weeks.

GPU hardware selection and deployment

You know you need on-prem, but you do not know whether a 4090, a 6000 Ada, an H100, or a DGX Station is the right fit. We size the hardware to your actual workload, source it, rack it, and leave you with a working training environment. See the hardware catalog for what we typically deploy.

Notebook to production lift

Your team got a fine-tune working in a notebook. Now you need it to run on a schedule, with versioned data, tracked runs, an eval harness, and a serving endpoint. We build the scaffolding and train your team on it.

Multi-modal or specialty training

Vision-language adapters, speech TTS, embedding fine-tunes, or GRPO-style RL. These all work with Unsloth but have more edges. We have done them. Our custom AI development team takes on the weird stuff.

Strategic AI planning

You are earlier in the journey and are not sure whether fine-tuning, RAG, or prompt engineering is the right move. Our AI consulting in Raleigh and the Research Triangle walks through the decision before anyone writes training code.

FAQ

Questions engineers actually ask

Straight answers based on what we see in production deployments.

Does Unsloth really match HuggingFace accuracy, or is something being traded away?
The published benchmarks and our observed results both say yes, accuracy matches. The speed gains come from memory and kernel efficiency, not from cutting corners in the math. The one subtle exception: if you use aggressive 4-bit quantization and a very small LoRA rank, you can leave capacity on the table. That is a configuration choice, not an Unsloth limitation. At standard settings (r=16, NF4 quant, bf16 compute) we have not seen a delta we could not attribute to normal run-to-run variance.
Can Unsloth fine-tune a 70B model on a single GPU?
Yes, with QLoRA on an 80 GB H100 or A100, or on a 48 GB RTX 6000 Ada with careful sequence length and batch size settings. A 4090 with 24 GB is not enough for 70B. On a 4090 you can go as large as models in the 30B range with aggressive settings if you keep sequence length modest. For reliable 70B work, plan on 80 GB of VRAM.
How does Unsloth compare to axolotl?
They solve different problems. Axolotl is a YAML-driven training orchestrator that supports many backends. Unsloth is a kernel-level speed and memory library. You can run Unsloth kernels inside an axolotl pipeline and get the best of both. Teams comfortable with YAML tend to stay in axolotl. Teams that prefer writing Python scripts directly tend to use Unsloth native. Both are valid.
Can we fine-tune on PHI or CUI with Unsloth?
Yes, provided the entire pipeline runs on hardware you control and the network boundary is scoped correctly. Unsloth itself has no phone-home behavior. We run fine-tunes on PHI and CUI routinely for healthcare and defense-industrial-base clients, with appropriate system hardening, encryption at rest, and audit logging. The library is compatible. The compliance work is mostly outside the library.
What about multi-GPU? Is that Pro-only?
Multi-GPU training is available in the open-source library today and continues to receive upgrades. For single-node multi-GPU setups (2x or 4x H100) the DDP path works well. If your team needs multi-node or exotic topologies, that is where we spend the integration time. Plan on a few days of engineering to stabilize gradient accumulation and checkpoint sync.
How long does a typical fine-tune take?
Highly dependent on dataset size and sequence length, but a useful reference: a Llama 3.1 8B QLoRA fine-tune on 50,000 high-quality instruction examples with 2K context takes roughly 1-3 hours on a 4090 and 30-90 minutes on an H100. A 70B fine-tune on the same dataset is closer to 8-24 hours on an 80 GB card. Longer contexts and larger datasets scale roughly linearly.
Do we need Flash Attention installed?
Unsloth ships its own attention kernels. Flash Attention 2 is not required. If it is already installed in your environment, Unsloth will still work. Do not install FA2 purely for Unsloth.
What Python and CUDA versions are supported?
Python 3.10 through 3.12 cover the majority of working configurations today, with 3.13 support available. CUDA 12.1 and later is where we live. Older CUDA works for some combinations but gets harder to support over time. We standardize new client deployments on Python 3.11 + CUDA 12.4 unless there is a good reason to deviate.
Can Unsloth export to GGUF for llama.cpp and Ollama?
Yes, natively. model.save_pretrained_gguf() handles the merge and quantization in a single call. We typically export Q4_K_M for balance of size and quality, Q5_K_M when quality matters more, or Q8_0 when you want near-lossless and do not care about file size.
How much does a Petronella Unsloth engagement cost?
It depends on scope, hardware, and compliance overhead. A focused notebook-to-production lift with an existing dataset typically lands in a two-to-four-week engagement. A full on-prem build including GPU sourcing, environment hardening, pipeline, and compliance documentation takes longer. Start with a scoping call and we will come back with a fixed-fee proposal.

Ready to ship a private fine-tuned model?

Tell us the model, the data, and the compliance boundary. We will tell you what the fastest honest path to production looks like, on your hardware, under your rules.