Most teams discover the air-gap constraint the hard way: a compliance review three weeks before launch that rules out every hosted LLM API they've built against. Suddenly api.openai.com is a forbidden destination, and the whole inference layer has to move inside the network boundary. This is the deployment pattern regulated buyers are actually asking for — and the one most vendors can't deliver.
Key Takeaways
- Air-gapped LLM deployment has two working architectures: self-hosted open-weight models (Llama 3.3, Qwen2.5) served with vLLM, Ollama, or NVIDIA NIM, or AWS Bedrock via a VPC interface endpoint — both with zero internet egress
- A 70B model needs roughly 140 GB of VRAM in FP16 (2× 80 GB GPUs) or ~40 GB with 4-bit quantization — GPU sizing is the first budget decision, not an afterthought
- Model weights and container images enter the environment through a controlled ingestion process (private S3 bucket + private registry), never a runtime pull from Hugging Face or Docker Hub
- Bedrock via VPC endpoint reaches production fastest; self-hosted gives full model control and avoids per-token costs at scale — Prodinit has shipped both
Air-gapped LLM deployment runs large language models on infrastructure with zero internet egress — model weights, prompts, and inference never cross your security boundary. Two architectures dominate: self-hosted open-weight models (Llama 3.3, Qwen2.5, Mistral) served with vLLM, Ollama, or NVIDIA NIM, or Amazon Bedrock accessed through a VPC interface endpoint so foundation-model calls stay inside the AWS network.
The distinction that trips teams up: air-gapped is stricter than "on-prem" or "private cloud." On-prem describes who owns the hardware; air-gapped describes the network boundary. Zero egress is the defining constraint. For a full definition, see our explainer on what air-gapped AI is. This guide is about deploying it.
What does air-gapped LLM deployment actually require?
Air-gapped LLM deployment requires four things to hold simultaneously: no outbound internet route from the compute running inference, model weights ingested through a controlled process rather than pulled at runtime, all supporting services (databases, registries, object storage) reached over private networking, and an audit trail for every inference call. Miss any one and you have a private-ish deployment, not an air-gapped one.
The trap is that standard LLM tooling assumes internet access everywhere. Ollama pulls models from its registry. vLLM downloads weights from Hugging Face on first run. Container images pull base layers from Docker Hub. Every one of those calls fails silently the moment egress is removed — and usually not in staging, where someone left a NAT gateway attached, but in the isolated production environment where it actually matters.
So the real work is inversion: instead of "connect and pull," everything becomes "pre-stage and reference." Weights land in a private S3 bucket or on a local volume before deployment. Images get mirrored into a private registry. Runtimes are configured to read only from those private sources.
The two architectures: self-hosted models vs Bedrock
There are exactly two production-grade paths to air-gapped LLM inference, and the choice comes down to control versus operational overhead. Self-hosted open-weight models give you full ownership of the model and no per-token cost, but you size and patch the GPU fleet yourself. Amazon Bedrock via a VPC interface endpoint keeps foundation-model calls inside the AWS network with no servers to manage, but you're bounded by the model catalog and per-token pricing.
| Dimension | Self-hosted open-weight | Bedrock via VPC endpoint |
|---|---|---|
| Models | Llama 3.3, Qwen2.5, Mistral, DeepSeek (open weights) | Claude, Llama, Titan, Mistral (managed catalog) |
| Egress | Zero — weights served from private storage | Zero — traffic stays on the AWS network via PrivateLink |
| Infra you run | GPU nodes, serving runtime, autoscaling | None — fully managed |
| Cost model | Fixed GPU cost; cheaper at high volume | Per-token; cheaper at low/spiky volume |
| Time to production | Weeks (GPU sizing, serving, evals) | Days (endpoint + IAM scoping) |
| Compliance | You own the entire boundary | Covered under AWS BAA where applicable |
For most regulated buyers, the honest answer is a hybrid: Bedrock via VPC endpoint for the primary inference workload where a managed foundation model is acceptable, and self-hosted open-weight models for anything requiring full data ownership or a fine-tuned domain model. Prodinit deployed exactly this split for a regulated fintech client — Bedrock via VPC endpoint for primary inference and Amazon Transcribe via VPC endpoint for audio, all inside a zero-egress VPC. The full infrastructure teardown is in our air-gapped AWS EKS deployment guide.
How to deploy a self-hosted LLM with zero egress
A self-hosted air-gapped LLM deployment follows five steps, and the order matters because each one removes a dependency on the public internet before it becomes a runtime failure. The sequence is: size the GPU, ingest the weights, mirror the runtime image, serve behind an internal endpoint, and validate the zero-egress path end to end.
1. Size the GPU for the model
GPU memory is the binding constraint. A rough rule: an FP16 model needs about 2 GB of VRAM per billion parameters, plus 20–40% headroom for the KV cache under concurrency.
- 7–8B models (Llama 3.1 8B, Qwen2.5 7B): ~16 GB FP16 → fits on a single 24 GB GPU (L4, A10G)
- 70B models (Llama 3.3 70B, Qwen2.5 72B): ~140 GB FP16 → 2× 80 GB GPUs (A100/H100) with tensor parallelism, or ~40 GB with 4-bit AWQ/GPTQ quantization on a single 48 GB card
- Vision-language models (Qwen2.5-VL): budget extra for the vision encoder and higher-resolution image tokens
Quantization is the lever that makes self-hosting affordable. 4-bit AWQ on a 70B model cuts VRAM roughly 3.5× with minimal quality loss on most production tasks — the difference between a two-GPU node and one.
2. Ingest the weights through a controlled process
This is where the air gap is enforced. Download the weights outside the isolated environment, scan them, then stage them in a private S3 bucket (reached via gateway endpoint) or a local volume:
# Outside the air gap: fetch and stage
huggingface-cli download Qwen/Qwen2.5-72B-Instruct --local-dir ./qwen2.5-72b
aws s3 sync ./qwen2.5-72b s3://private-model-bucket/qwen2.5-72b/
# Inside the air gap: pull from private S3 only (no Hugging Face call)
aws s3 sync s3://private-model-bucket/qwen2.5-72b/ /models/qwen2.5-72b/
Never let the serving runtime resolve a model name against a public registry. With vLLM, always point --model at a local path; with Ollama, pre-load the model into the private volume rather than running ollama pull at runtime.
3. Mirror the serving runtime image
vLLM, Ollama, NVIDIA NIM, and Text Generation Inference all ship as container images from public registries. Mirror the one you're using into a private registry (Amazon ECR, Harbor) before deployment — the same discipline that applies to every Kubernetes controller in an air-gapped cluster:
docker pull vllm/vllm-openai:latest
docker tag vllm/vllm-openai:latest \
123456789.dkr.ecr.us-east-1.amazonaws.com/vllm:v0.6.3
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/vllm:v0.6.3
4. Serve behind an internal endpoint
Run the model with an OpenAI-compatible API on a private address so application code needs almost no changes — it just points at an internal DNS name instead of api.openai.com:
python -m vllm.entrypoints.openai.api_server \
--model /models/qwen2.5-72b \
--tensor-parallel-size 2 \
--quantization awq \
--served-model-name qwen2.5-72b \
--host 0.0.0.0 --port 8000
On Kubernetes, this becomes a Deployment on GPU node groups with a ClusterIP Service — no ingress, no public load balancer for the model itself. NVIDIA NIM packages this pattern with optimized TensorRT-LLM engines if you want a supported runtime rather than raw vLLM.
5. Validate the zero-egress path
Don't call it done until you've proven no packet can leave. Run inference, then confirm the pod has no route out: attempt an outbound call to a public endpoint from inside the container and verify it times out. Check VPC flow logs for any egress attempts. In an air-gapped setup, "it works" and "it's actually isolated" are two different tests — run both.
Security, secrets, and audit for air-gapped LLMs
Air-gapped LLM deployments still need internal security controls, because the air gap protects against external egress — not against an over-privileged pod or a credential sitting in a manifest. The three that regulated auditors look for: secrets pulled from a managed store at runtime, IAM/RBAC scoped to specific models, and an audit log of every inference call.
Kubernetes Secret objects are base64-encoded, not encrypted — any cluster admin with RBAC read access decodes them in one command. Use the External Secrets Operator with AWS Secrets Manager so credentials sync into pods at startup and never appear in a manifest or image layer. For Bedrock, scope IAM policies to specific model ARNs (bedrock:InvokeModel on named models) rather than bedrock:*.
The audit trail is the part teams forget until the SOC 2 assessment. Every model call — prompt, response metadata, latency, model version — should land in CloudWatch Logs or an equivalent private log store. For most financial and healthcare frameworks, "we can't show which model saw which data when" is an automatic finding.
How Prodinit deploys air-gapped LLMs
Prodinit has shipped air-gapped LLM systems on both architectures, which is why the recommendation is usually a hybrid rather than a religious position. The two reference deployments cover the full spectrum from managed foundation models to fully self-hosted open-weight inference.
For a regulated fintech, Prodinit deployed a fully air-gapped AWS EKS platform in four weeks — zero internet egress, 10+ VPC interface endpoints, private ECR, and Amazon Bedrock via VPC endpoint for primary inference, with the whole thing documented in our regulated air-gapped case study. For a document-heavy enterprise, Prodinit built a fully air-gapped document-processing pipeline using PaddleOCR and the Qwen2.5-VL vision-language model served through Ollama, with NVIDIA NIM for optimized inference — no document, prompt, or response ever left the network.
The pattern that repeats across both: the hard part isn't the model, it's the environment. Get the ingestion process, private registry, and endpoint list right, and air-gapped inference is repeatable. Skip that discipline and the cluster won't even bootstrap. This is the core of our AI infrastructure and LLMOps practice.
Get Prodinit's AI engineering guides in your inbox
Deep-dives on production LLMs, voice AI, and MLOps — published weekly. No sales emails.
Frequently Asked Questions
Yes. Open-weight models like Llama 3.3, Qwen2.5, and Mistral can be downloaded once, staged in private storage, and served entirely offline with vLLM, Ollama, or NVIDIA NIM. The only constraint is GPU capacity — you provision and scale the serving hardware yourself instead of relying on a hosted API. Alternatively, Amazon Bedrock via a VPC interface endpoint keeps foundation-model calls inside the AWS network.
It depends on volume. Bedrock's per-token pricing is cheaper for low or spiky workloads because you pay nothing when idle. Self-hosted open-weight models carry fixed GPU cost — a 70B model on two 80 GB GPUs runs continuously whether or not requests arrive — so they become cheaper past a steady, high request volume. High, predictable throughput favors self-hosting; bursty or early-stage usage favors Bedrock.
Roughly 140 GB of VRAM to serve a 70B model in FP16 — typically two 80 GB GPUs (A100 or H100) with tensor parallelism. With 4-bit quantization (AWQ or GPTQ), that drops to about 40 GB, which fits on a single 48 GB card with minimal quality loss for most production tasks. Add 20–40% headroom for the KV cache under concurrent requests.
Through a controlled ingestion process, not an automatic pull. New model weights or container images are downloaded and scanned outside the isolated environment, then deliberately staged into the private S3 bucket or registry the runtime reads from. This keeps the air gap intact while still allowing patches and upgrades — the update is an explicit, audited action rather than a background download.
Not strictly, but it's the cleanest way to satisfy the underlying requirement. HIPAA and PCI-DSS don't mandate an air gap; they require that sensitive data isn't exposed to unauthorized parties. Air-gapping removes the third-party-API exposure path entirely at the infrastructure level, which is far easier to prove in an audit than policy-based controls. The key question for any managed model is whether the provider will sign a BAA — Bedrock supports this, most third-party LLM APIs do not.