
How to Run a Private LLM (HIPAA Compliant): Complete Setup Guide

Posted: March 11, 2026 to Compliance.

A private large language model (LLM) is an AI system deployed entirely on infrastructure you own or control, with no data leaving your network. For healthcare organizations bound by HIPAA, running a private LLM eliminates the compliance risks of cloud-based AI services by keeping protected health information (PHI) within your security perimeter at all times.


Key Takeaways

  • Running a private LLM on-premise is the only guaranteed way to prevent PHI from reaching third-party servers
  • Ollama and vLLM are the two leading open-source inference engines for self-hosted AI deployment
  • A single high-VRAM GPU such as the NVIDIA RTX 5090 (32 GB) can serve a quantized 70B-parameter model for a team of 10-15 concurrent users
  • HIPAA compliance requires encryption at rest and in transit, access controls, audit logging, and a signed BAA with any cloud provider involved
  • Total cost for a production-ready private LLM server starts at approximately $8,000 to $15,000, replacing $50,000 or more per year in per-seat AI licensing fees

Why Healthcare Organizations Need Private AI

The 2025 OCR enforcement data tells a clear story: 67% of HIPAA breaches involving AI tools stemmed from data sent to third-party cloud services. When a clinician pastes patient notes into ChatGPT or Copilot, that data leaves your network. Even with a Business Associate Agreement, you lose direct control over where that PHI resides and who can access it.

A private LLM changes the equation entirely. The model runs on your hardware. Patient data never crosses your firewall. You control the encryption keys, the access logs, and the retention policies.

At Petronella Technology Group, we have deployed private LLM infrastructure across healthcare practices, defense contractors, and financial services firms since 2024. This guide documents the exact process we use, including hardware specifications, software configuration, and the HIPAA safeguards that make it audit-ready.

Hardware Requirements

The hardware you need depends on the model size you plan to run. Larger models produce better outputs but require more GPU memory (VRAM).

Recommended Configurations

Use Case | Model Size | GPU | VRAM | RAM | Storage | Approximate Cost
Small practice (1-5 users) | 7B-13B parameters | NVIDIA RTX 4070 Ti Super | 16 GB | 32 GB DDR5 | 1 TB NVMe | $3,000 - $5,000
Mid-size clinic (5-15 users) | 70B parameters | NVIDIA RTX 5090 | 32 GB | 128 GB DDR5 | 2 TB NVMe | $8,000 - $12,000
Hospital department (15-50 users) | 70B quantized, multiple models | 2x NVIDIA RTX 5090 | 64 GB total | 256 GB DDR5 | 4 TB NVMe RAID | $15,000 - $25,000
Enterprise (50+ users) | 405B parameters | NVIDIA H100 or 4x RTX 5090 | 80-128 GB | 512 GB | 8 TB NVMe | $30,000 - $80,000

For most healthcare practices with 5 to 20 staff members, a single RTX 5090 with 32 GB VRAM running a quantized 70B model delivers response quality comparable to GPT-4 for clinical summarization, coding assistance, and documentation tasks.

CPU and Memory Considerations

The CPU matters less than the GPU for inference, but you still need enough system RAM to load the model initially. A general rule: system RAM should be at least 2x the model size. For a 70B model at 4-bit quantization (approximately 35 GB on disk), plan for 64 GB of RAM minimum, with 128 GB recommended for concurrent request handling.
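The 2x rule works out as a quick back-of-the-envelope check (a sketch using the rounded figures from this section; runtime overhead such as the KV cache is ignored):

```shell
# Rough sizing for a quantized model: bits per parameter determines disk/VRAM footprint
model_params_b=70                                # model size in billions of parameters
quant_bits=4                                     # 4-bit quantization
disk_gb=$(( model_params_b * quant_bits / 8 ))   # ~35 GB of weights on disk
min_ram_gb=$(( disk_gb * 2 ))                    # 2x rule of thumb for system RAM
# In practice, round up to the next standard configuration (64 or 128 GB DIMM kits)
echo "Weights on disk: ~${disk_gb} GB; 2x rule suggests ~${min_ram_gb} GB system RAM"
```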

We recommend AMD Ryzen 9000 or Intel Xeon W processors. The Ryzen 9950X3D offers excellent single-threaded performance for model loading and preprocessing at roughly half the cost of comparable Xeon configurations.

Software Stack: Ollama vs vLLM

Two open-source inference engines dominate private LLM deployment. Your choice depends on your team size and performance requirements.

Ollama: Simplicity First

Ollama wraps llama.cpp in an easy-to-use CLI and REST API. Installation takes under five minutes on most Linux distributions.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a HIPAA-suitable model (no telemetry, fully local)
ollama pull llama3.1:70b-instruct-q4_K_M

# Start serving
ollama serve

Ollama automatically detects NVIDIA GPUs and loads models into VRAM. The REST API runs on port 11434 by default:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q4_K_M",
  "prompt": "Summarize these clinical notes...",
  "stream": false
}'

Best for: Practices with 1-10 concurrent users, teams that want minimal configuration, environments where simplicity reduces compliance risk.

vLLM: Performance at Scale

vLLM uses PagedAttention to serve 3-5x more concurrent users on the same hardware. It requires more configuration but delivers significantly higher throughput.

# Install vLLM with CUDA support
pip install vllm

# Start the OpenAI-compatible API server
# Note: --quantization awq expects AWQ-quantized weights, so point --model at an
# AWQ checkpoint of Llama 3.1 70B rather than the full-precision repository
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

vLLM exposes an OpenAI-compatible API, which means existing applications built for the OpenAI API can connect with a single endpoint change. No code modifications required.
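In many cases the "single endpoint change" is just two environment variables. This sketch assumes the client honors OPENAI_BASE_URL, as the official OpenAI SDKs do; the hostname is the illustrative internal name used elsewhere in this guide:

```shell
# Point an OpenAI-SDK-based app at the local vLLM server instead of api.openai.com
export OPENAI_BASE_URL="https://llm.internal.example.com/v1"
export OPENAI_API_KEY="not-used-locally"   # many clients refuse to start with an empty key
echo "Requests now go to: ${OPENAI_BASE_URL}"
```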

Best for: Organizations with 10+ concurrent users, applications requiring high throughput, teams with Linux administration experience.

For a detailed performance comparison, see our guide on Ollama vs vLLM for enterprise deployment.

HIPAA Compliance: The Non-Negotiable Safeguards

Running the model locally is necessary but not sufficient. HIPAA requires specific administrative, physical, and technical safeguards around any system that processes PHI.

Technical Safeguards

Encryption at rest: Enable full-disk encryption on the LLM server. On Linux, use LUKS:

# Verify disk encryption is active
sudo cryptsetup status /dev/mapper/root

Encryption in transit: All API communication must use TLS 1.2 or higher. Place an NGINX reverse proxy in front of Ollama or vLLM:

server {
    listen 443 ssl;
    server_name llm.internal.example.com;

    ssl_certificate /etc/ssl/certs/llm-internal.crt;
    ssl_certificate_key /etc/ssl/private/llm-internal.key;
    ssl_protocols TLSv1.2 TLSv1.3;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Access controls: Implement role-based access at the API layer. Every request should include an authenticated user token. Log the user identity, timestamp, and query content for audit purposes.
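One minimal way to implement per-user tokens is to issue a random secret and store only its hash server-side. This is a sketch, not a specific product: the token file name and format are placeholders for whatever your auth layer consumes.

```shell
# Issue a per-user API token; the server keeps only the SHA-256 hash
user="jdoe"
token=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')    # 64-char random token
token_hash=$(printf '%s' "$token" | sha256sum | cut -d' ' -f1)
echo "${user}:${token_hash}" >> api-tokens.txt                  # auth layer checks hashes only
echo "hand to user out-of-band: ${token}"
```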

Audit logging: HIPAA requires six-year log retention. Configure centralized logging with rsyslog or a SIEM:

# Example: log all API requests to a dedicated file
# In your NGINX config:
access_log /var/log/llm-api/access.log combined;
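A matching retention policy can be sketched with logrotate, assuming rsyslog or NGINX writes to the file above; 72 monthly rotations approximates the six-year requirement. The config is written to a local file here for inspection; on the server it would live at /etc/logrotate.d/llm-api.

```shell
# Logrotate policy: keep 72 compressed monthly rotations (~six years)
cat > llm-api-logrotate <<'EOF'
/var/log/llm-api/access.log {
    monthly
    rotate 72
    compress
    missingok
    notifempty
}
EOF
```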

Administrative Safeguards

  • Designate a HIPAA Security Officer responsible for the LLM infrastructure
  • Document the LLM in your system inventory and risk assessment
  • Create an Acceptable Use Policy specifying what types of PHI can be processed
  • Train all staff who interact with the LLM on HIPAA requirements
  • Conduct annual risk assessments that include the LLM infrastructure

Physical Safeguards

  • The server must be in a locked room with restricted badge access
  • Maintain a visitor log for the server room
  • Implement environmental controls (temperature monitoring, fire suppression)
  • Secure disposal procedures for any drives that stored model data or PHI

For a comprehensive HIPAA checklist, review our HIPAA Security Guide.

Step-by-Step Deployment Process

Step 1: Prepare the Server

Install Ubuntu Server 24.04 LTS or NixOS (our preferred platform for reproducible deployments). Install NVIDIA drivers and CUDA toolkit:

# Ubuntu
sudo apt install nvidia-driver-560 nvidia-cuda-toolkit

# Verify GPU detection
nvidia-smi

Step 2: Enable Full-Disk Encryption

If not configured during OS installation, encrypt the data partition:

sudo cryptsetup luksFormat /dev/sdb1
sudo cryptsetup open /dev/sdb1 llm-data
sudo mkfs.ext4 /dev/mapper/llm-data
sudo mount /dev/mapper/llm-data /opt/llm

Step 3: Install and Configure the Inference Engine

Choose Ollama for simplicity or vLLM for scale (see comparison above). Configure the service to listen only on localhost:

# For Ollama: set environment variable
echo "OLLAMA_HOST=127.0.0.1:11434" >> /etc/environment
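A small guard in your deploy script can catch an accidental network-wide binding before the service ever starts. This function is illustrative and not part of Ollama or vLLM:

```shell
# Refuse any OLLAMA_HOST value that is not bound to loopback
check_loopback() {
  case "$1" in
    127.0.0.1:*|localhost:*) return 0 ;;
    *) return 1 ;;
  esac
}
check_loopback "127.0.0.1:11434" && echo "ok: API bound to loopback"
check_loopback "0.0.0.0:11434"  || echo "refusing: 0.0.0.0 would expose the API to the LAN"
```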

Step 4: Deploy TLS Reverse Proxy

Install NGINX and configure TLS termination using internal CA certificates. Never expose the raw inference API to the network.

Step 5: Implement Authentication

Deploy an authentication layer. Options include OAuth2 Proxy, Authentik, or a custom JWT validation middleware. Every API request must map to an identified user.

Step 6: Configure Audit Logging

Send all access logs to your SIEM or a dedicated log server with six-year retention. Include user identity, timestamp, source IP, query content hash (not the full PHI query, to limit log exposure), and response status.
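The hash-not-content rule takes only a few lines; the log field names here are illustrative, not a required format:

```shell
# Log a SHA-256 hash of the query instead of the query itself (keeps PHI out of log files)
query='Summarize these clinical notes...'
query_hash=$(printf '%s' "$query" | sha256sum | cut -d' ' -f1)
printf 'user=%s ts=%s query_sha256=%s status=%s\n' \
  "jdoe" "$(date -u +%FT%TZ)" "$query_hash" "200"
```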

Step 7: Network Segmentation

Place the LLM server on a dedicated VLAN with firewall rules that restrict access to authorized client IPs only. Block all outbound internet access from the LLM server.

# Example iptables rules (allow established traffic first, or the server cannot answer its clients)
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -s 10.0.1.0/24 -p tcp --dport 443 -j ACCEPT  # Allow clinic network
iptables -A INPUT -j DROP  # Drop everything else
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT  # Replies to inbound requests
iptables -A OUTPUT -o eth0 -j DROP  # Block all new outbound connections

Step 8: Test and Validate

Run a penetration test against the LLM API endpoint. Verify that PHI cannot be extracted from model responses about other patients. Document the test results for your HIPAA audit file.

Cost Comparison: Private LLM vs Cloud AI

Factor | Private LLM (Year 1) | Private LLM (Year 2+) | Cloud AI (ChatGPT Team)
Hardware | $8,000 - $15,000 | $0 (owned) | $0
Software | $0 (open source) | $0 | $0
Licensing per user | $0 | $0 | $300/user/year
Electricity | $500 - $800/year | $500 - $800/year | $0
Setup and config | $5,000 - $10,000 (consultant) | $0 | $0
HIPAA risk | Low (data stays on-premise) | Low | High (PHI in cloud)
Total (20 users) | $13,500 - $25,800 | $500 - $800 | $6,000/year

By year two, the private LLM costs under $1,000 per year to operate while delivering unlimited usage with zero per-seat fees. The cloud alternative costs $6,000 annually for 20 users, and that number grows with every new hire.
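The year-one totals can be re-derived from the line items in the table:

```shell
# Year-one totals for the 20-user scenario (hardware + setup + electricity)
hw_lo=8000;    hw_hi=15000     # hardware
setup_lo=5000; setup_hi=10000  # setup and config
elec_lo=500;   elec_hi=800     # electricity
echo "Year 1 total: \$$(( hw_lo + setup_lo + elec_lo )) - \$$(( hw_hi + setup_hi + elec_hi ))"
# prints: Year 1 total: $13500 - $25800
```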

Models We Recommend for Healthcare

Not every open-source model is suitable for clinical use. Based on our testing across multiple healthcare clients:

  • Llama 3.1 70B Instruct (Q4_K_M quantized): Best general-purpose model for clinical summarization, chart review, and documentation assistance. Fits on a single RTX 5090.
  • Mistral Large 2 (Q4_K_M): Strong reasoning capabilities for complex clinical decision support. Requires 48 GB+ VRAM.
  • Qwen 2.5 72B Instruct: Excellent multilingual support for practices serving diverse patient populations.
  • DeepSeek R1 70B (distilled): Strong reasoning for diagnostic differential generation. Note: use only distilled versions hosted locally, never the cloud API.

All models should be accessed through your authenticated, encrypted API layer. Never download models from unofficial sources.

Common Mistakes to Avoid

Mistake 1: Skipping network segmentation. Running the LLM on the same network segment as workstations creates lateral movement risk. Always isolate it.

Mistake 2: Using cloud-hosted model APIs "temporarily." Temporary solutions become permanent. If PHI touches a cloud API even once without a BAA, you have a potential breach.

Mistake 3: Forgetting to disable telemetry. Some inference frameworks and model management tools phone home by default. Check each component's documentation for telemetry settings, disable them, and verify at the firewall that no outbound connections are possible from the server.

Mistake 4: No input/output filtering. Implement guardrails that prevent the model from generating content outside its intended scope. Tools like NVIDIA NeMo Guardrails or LlamaGuard add a safety layer.

Get Expert Help with Private AI Deployment

Petronella Technology Group has deployed private LLM infrastructure for healthcare practices, defense contractors, and financial services firms across the Southeast since 2024. Our team handles the full stack: hardware procurement, HIPAA-compliant configuration, staff training, and ongoing support.

We combine custom AI development expertise with 23 years of cybersecurity experience, including CMMC Registered Practitioner certification (RP-1372). That combination matters because AI without security is a liability, not an asset.

Call 919-348-4912 or visit petronellatech.com/contact/ to schedule a private AI readiness assessment.


About the Author: Craig Petronella is the CEO of Petronella Technology Group, Inc., a Raleigh, NC-based cybersecurity and AI consultancy. With over 30 years of experience in IT security and a CMMC Registered Practitioner credential (RP-1372), Craig has helped hundreds of organizations implement secure technology infrastructure. He hosts the Petronella Technology Group podcast and has authored multiple books on cybersecurity compliance.


Frequently Asked Questions

Can I use ChatGPT for HIPAA-covered work if I sign a BAA?

OpenAI offers a BAA for ChatGPT Enterprise and API customers, but the BAA does not eliminate risk. Your data still resides on OpenAI servers, and you cannot audit their infrastructure directly. A private LLM removes this dependency entirely. For organizations processing sensitive PHI, on-premise deployment is the most defensible approach.

What size model do I need for clinical documentation?

For clinical note summarization, chart review, and basic documentation assistance, a 7B to 13B parameter model handles most tasks adequately. For complex reasoning, differential diagnosis support, or multi-step clinical workflows, a 70B parameter model delivers significantly better accuracy. Most practices find the 70B sweet spot balances quality with hardware cost.

How much does it cost to run a private LLM?

Hardware costs range from $3,000 for a basic single-user setup to $25,000 for a department-level server supporting 50 concurrent users. Annual operating costs (electricity, maintenance) run $500 to $2,000. Compare this to cloud AI licensing at $25 to $30 per user per month, and the private LLM pays for itself within 12 to 18 months for organizations with 10 or more users.

Do I need a dedicated IT team to manage a private LLM?

No. Ollama and vLLM are designed for straightforward administration. A single IT administrator can manage the infrastructure with approximately 2 to 4 hours of maintenance per month. For organizations without in-house IT, managed service providers like Petronella Technology Group handle the deployment and ongoing management.

Is a private LLM as good as GPT-4?

Modern open-source models like Llama 3.1 70B and Mistral Large 2 perform within 5 to 10% of GPT-4 on most benchmarks relevant to healthcare documentation. For specialized tasks like clinical summarization, fine-tuned open-source models can match or exceed GPT-4 performance because they are optimized for the specific domain.

What happens if the server goes down?

Plan for redundancy. Options include a hot standby server with model weights pre-loaded, failover to a smaller model on a backup GPU, or a documented manual workflow for the interim. HIPAA does not require 100% uptime, but your contingency plan must be documented and tested.

Can I fine-tune the model on our clinical data?

Yes, and this is one of the strongest advantages of private deployment. Fine-tuning on your organization's clinical notes, templates, and terminology improves output quality dramatically. Tools like Unsloth enable efficient fine-tuning on a single GPU. All training data stays on your hardware, maintaining HIPAA compliance throughout the process. See our AI automation services for fine-tuning support.


