How to Deploy an LLM on Kubernetes: GPU Nodes, Model Serving, and Autoscaling
Running LLMs in production on Kubernetes means GPU node management, model serving (vLLM or Triton), resource limits that actually work, and KEDA-based autoscaling. Here's the full picture.

Running an LLM isn't like running a web service.
Whether you chose Kubernetes, Docker Swarm, or Nomad for your orchestrator, the requirements for LLM workloads are distinct and unforgiving. A web service can burst CPU for a few seconds, share a node with a dozen other services, and restart from a crash in under five seconds. An LLM serving pod owns a GPU exclusively, takes 3-5 minutes to start (model weights take time to load), and will OOMKill if you get the memory configuration wrong — not gradually degrade, but hard-crash.
Most guides on "LLM on Kubernetes" cover the happy path. This one covers the happy path and the failure modes: what actually breaks, why, and how to prevent it before it causes a production incident.
The Architecture We're Building
Components:
- GPU node pool on EKS with `g5.xlarge` (NVIDIA A10G, 24GB VRAM)
- NVIDIA GPU Operator — installs device plugin, DCGM exporter, container toolkit
- vLLM — our model server, OpenAI-compatible API, PagedAttention for memory efficiency
- Model weights PVC — pre-populated EFS volume, shared across serving pods
- KEDA — scales vLLM pods based on the `vllm:num_requests_waiting` Prometheus metric
- Karpenter — provisions new GPU nodes when KEDA scales out beyond current capacity
- DCGM Exporter — GPU utilization and memory metrics for dashboards and alerting
Step 1: GPU Node Pool Setup
EKS Managed Node Group
GPU instances on EKS require a specific AMI (Amazon Linux 2 GPU-optimized) and appropriate instance types. The g5 family (A10G GPUs) is the current sweet spot for inference:
- `g5.xlarge`: 1x A10G (24GB VRAM), 4 vCPU, 16GB RAM — good for 7B models
- `g5.2xlarge`: 1x A10G, 8 vCPU, 32GB RAM — more CPU headroom for preprocessing
- `g5.12xlarge`: 4x A10G — for 70B models with tensor parallelism
For Llama 3 8B in FP16, a single g5.xlarge gives you comfortable headroom. For 70B models, you need either g5.48xlarge (8 GPUs) or multiple nodes with tensor parallelism.
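For reference, the multi-GPU case is mostly one flag in vLLM. A minimal sketch of the serving command for a g5.12xlarge (4x A10G) — this assumes a quantized AWQ build of the model, since FP16 70B weights (~140GB) won't fit in 4x24GB of VRAM:

```bash
# Sketch: shard a 70B model across the node's 4 GPUs with tensor parallelism.
# /models/llama-3-70b-awq is a hypothetical quantized build.
python3 -m vllm.entrypoints.openai.api_server \
  --model /models/llama-3-70b-awq \
  --tensor-parallel-size 4 \
  --quantization awq
```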
```hcl
# Terraform: EKS GPU managed node group
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "gpu-inference"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  # GPU-optimized AMI
  ami_type       = "AL2_x86_64_GPU"
  instance_types = ["g5.xlarge"]

  scaling_config {
    desired_size = 0 # Start at 0, let Karpenter provision on demand
    min_size     = 0
    max_size     = 10
  }

  labels = {
    "workload-type"  = "gpu-inference"
    "nvidia.com/gpu" = "true"
  }

  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }

  tags = {
    "karpenter.sh/discovery" = var.cluster_name
  }
}
```

The critical pieces:
ami_type = "AL2_x86_64_GPU"— this AMI has NVIDIA drivers pre-installed. Do not run the GPU Operator's driver installer on top of this AMI (more on this below).- The
NoScheduletaint onnvidia.com/gpuensures no regular workloads land on expensive GPU nodes accidentally. desired_size = 0with Karpenter managing provisioning keeps costs at zero when no LLM requests are in flight.
Karpenter NodePool for GPU Instances
With GPU nodes starting at $1.00/hour per node, you don't want idle GPU nodes sitting around. Karpenter's node provisioner + consolidation handles this automatically:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        workload-type: gpu-inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # Switch to spot for dev/staging
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1
        kind: EC2NodeClass
        name: gpu-inference
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```

`consolidateAfter: 5m` — Karpenter will decommission a GPU node 5 minutes after its last vLLM pod terminates. GPU instances are expensive; don't leave them idle.
Step 2: NVIDIA GPU Operator
The GPU Operator is a Kubernetes operator that manages the NVIDIA software stack on your nodes: device plugin (exposes GPUs as schedulable resources), container toolkit (configures Docker/containerd to use NVIDIA runtime), DCGM exporter (GPU metrics for Prometheus), and optionally the NVIDIA driver itself.
Installation
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# driver.enabled=false: the AL2 GPU AMI already ships with drivers
# dcgmExporter.serviceMonitor.enabled=true: only if using Prometheus Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true
```

Critical flag: `driver.enabled=false`. The EKS GPU-optimized AMI (`AL2_x86_64_GPU`) ships with NVIDIA drivers pre-installed. If you let the GPU Operator install drivers on top, you'll get driver conflicts that cause all GPU pods to fail with `Failed to initialize NVML: Driver/library version mismatch`. I've watched three teams hit this exact issue. Always check whether drivers are pre-installed in your AMI before enabling the GPU Operator's driver component.
Validating GPU Availability
Once the Operator is running, GPUs appear as schedulable resources:
```bash
kubectl get nodes -o json | jq '.items[] | {
  name: .metadata.name,
  gpu: .status.capacity["nvidia.com/gpu"]
}'
```

Expected output:
```json
{
  "name": "ip-10-0-1-42.ec2.internal",
  "gpu": "1"
}
```

If `nvidia.com/gpu` is null or missing, the device plugin DaemonSet pod on that node has a problem. Check its logs:
```bash
kubectl logs -n gpu-operator \
  -l app.kubernetes.io/component=nvidia-device-plugin-daemonset \
  --tail=50
```

Step 3: Model Weight Storage
This is a decision that has significant operational consequences. You have three options.
Option A: Bake Weights Into Container Image
Simple. Wrong for anything above 3B parameters.
A 7B parameter model in FP16 is ~14GB of weights. Baking that into a container image means a 15GB+ image that every new node has to pull, that every image scan has to process, and that every image rebuild regenerates unnecessarily when you update serving code. ECR image pulls of 15GB on a cold node take 8-12 minutes.
Use this pattern only for tiny models (< 2B parameters) in development environments.
Option B: Download Weights at Startup (Init Container)
Better, but creates cold-start problems at scale.
```yaml
initContainers:
  - name: download-model
    image: amazon/aws-cli:latest
    command:
      - sh
      - -c
      - |
        aws s3 sync s3://my-model-bucket/llama-3-8b/ /models/llama-3-8b/ \
          --region us-east-1 \
          --no-progress
    volumeMounts:
      - name: model-storage
        mountPath: /models
    resources:
      requests:
        cpu: "1"
        memory: "4Gi"
```

Downloading ~14GB from S3 to a new pod takes 2-4 minutes on a well-provisioned node. If KEDA scales you from 1 to 5 pods during a traffic spike, all 4 new pods are downloading in parallel, adding S3 transfer costs and extending your scale-out latency.
This pattern works for small models and environments where cold-start latency is acceptable.
Option C: Shared PVC (Recommended for Production)
Pre-populate an EFS volume with model weights once. Mount it ReadWriteMany across all serving pods. No download on startup — weights are already on the volume.
```yaml
# 1. Create the PVC (EFS StorageClass for ReadWriteMany)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
---
# 2. One-time Job to populate the PVC from S3
apiVersion: batch/v1
kind: Job
metadata:
  name: model-downloader
  namespace: llm-serving
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: amazon/aws-cli:latest
          command:
            - sh
            - -c
            - |
              aws s3 sync s3://my-model-bucket/llama-3-8b/ /models/llama-3-8b/ \
                --region us-east-1
          volumeMounts:
            - name: model-weights
              mountPath: /models
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights
      restartPolicy: Never
```

Run the Job once when you add a new model version. After that, all serving pods mount the PVC and start in seconds (just the model load from local NFS, not S3 download).
EFS throughput is provisioned — ensure you're on Elastic throughput mode for inference workloads. Alternatively, use EBS gp3 volumes if you have a single replica (ReadWriteOnce) or are willing to use node-local storage.
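For completeness, a minimal sketch of the `efs-sc` StorageClass referenced above, using the AWS EFS CSI driver — the `fileSystemId` is a placeholder you'd replace with your own:

```yaml
# Sketch: EFS StorageClass backing the model-weights PVC
# (fileSystemId is a placeholder, not a real filesystem)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
```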
Step 4: Deploying vLLM
vLLM is the right choice for most production LLM serving workloads in 2026. Its PagedAttention algorithm manages the KV cache like OS virtual memory, dramatically reducing memory fragmentation compared to static allocation. The result: significantly higher throughput at the same VRAM budget.
It also ships an OpenAI-compatible API — POST /v1/chat/completions works exactly like the OpenAI API, so clients written for OpenAI work without modification.
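To sanity-check the endpoint once a pod is ready, a port-forward plus one curl is enough. The model name matches the `--served-model-name` flag in the Deployment below:

```bash
# Forward the in-cluster Service locally, then hit the OpenAI-compatible API
kubectl -n llm-serving port-forward svc/vllm-llama3-8b 8080:80 &

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32
  }'
```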
The Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  replicas: 1 # KEDA will manage this
  selector:
    matchLabels:
      app: vllm
      model: llama3-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama3-8b
    spec:
      # Tolerate the GPU taint
      tolerations:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule

      # Schedule only on GPU nodes
      nodeSelector:
        workload-type: gpu-inference

      # Don't start until model weights are mounted
      initContainers:
        - name: check-model
          image: busybox:latest
          command: ["sh", "-c", "test -d /models/llama-3-8b && echo 'Model found' || exit 1"]
          volumeMounts:
            - name: model-weights
              mountPath: /models

      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.0
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3-8b
            - --served-model-name=llama-3-8b
            - --host=0.0.0.0
            - --port=8000
            - --gpu-memory-utilization=0.90 # Reserve 10% for CUDA overhead
            - --max-model-len=8192 # Cap context to control KV cache size
            - --max-num-seqs=64 # Max concurrent sequences
            - --dtype=half # FP16 — required for A10G
            - --enforce-eager # Disable CUDA graph capture if you see OOM on startup
          ports:
            - containerPort: 8000
              name: http

          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "20Gi"
              nvidia.com/gpu: "1" # Always set GPU limit = request (GPU is not compressible)

          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120 # Model loading takes time — give it space
            periodSeconds: 15
            failureThreshold: 20

          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            failureThreshold: 3

          volumeMounts:
            - name: model-weights
              mountPath: /models

          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token

      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights

      # Prevent eviction during model loading
      terminationGracePeriodSeconds: 300
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model: llama3-8b
  ports:
    - port: 80
      targetPort: 8000
      name: http
```

Key Configuration Parameters Explained
`--gpu-memory-utilization=0.90`: vLLM pre-allocates this fraction of total GPU VRAM for the model weights plus KV cache. Leave at least 10% for CUDA kernels, loading buffers, and overhead. Setting this to 1.0 will cause OOM during startup.
`--max-model-len=8192`: The maximum context window in tokens. KV cache size scales linearly with context length, so the context cap is your main memory lever. A 128k context window on a 24GB GPU is only possible with quantized models. If you're using FP16 Llama 3 8B, 8192-16384 is the practical range on a g5.xlarge.
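To see why 8192 is the practical ceiling, a back-of-the-envelope calculation using Llama 3 8B's published architecture (32 layers, 8 KV heads via GQA, head dim 128, FP16 = 2 bytes):

```text
KV cache per token = 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes
                   = 131,072 bytes ≈ 128 KiB/token

One full 8192-token sequence ≈ 8192 x 128 KiB ≈ 1 GiB of KV cache
```

With ~16GB of FP16 weights and a 0.90 utilization budget (~21.6GB of 24GB), that leaves roughly 5-6GB of KV cache — about five full-context sequences in flight at once.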
`--max-num-seqs=64`: Maximum concurrent sequences (active requests being processed). Higher values increase throughput but also peak memory usage.
`--dtype=half`: A10G GPUs (Ampere generation) run FP16 efficiently. bfloat16 requires Ampere or newer, so on older GPUs like the T4, half is the practical choice; on newer hardware (H100), bfloat16 is preferred for its wider dynamic range.
`readinessProbe.initialDelaySeconds=120`: Llama 3 8B in FP16 takes 60-90 seconds to load from an EFS volume. Set your initial delays generously. A readiness probe that fails during model loading just keeps the pod out of rotation, but an aggressive liveness probe will kill the pod mid-load and restart it in a loop — you'll spend an afternoon debugging a non-problem.
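A cleaner alternative to a large `initialDelaySeconds` is a startupProbe, which suspends liveness checks until it succeeds once. A minimal sketch against the same `/health` endpoint as above:

```yaml
# Sketch: startupProbe holds off liveness checks until the first success,
# replacing the need for a large initialDelaySeconds on the liveness probe
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60 # allow up to 10 minutes for model load before giving up
```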
Resource Limits: GPU Is Not Compressible
CPU and memory have soft and hard limits in Kubernetes. CPU can be throttled; memory triggers OOMKill. GPUs have no throttling mechanism. A GPU resource limit in Kubernetes is a scheduling constraint — it controls which pods can schedule onto GPU nodes — but it doesn't prevent a running pod from using more VRAM than it requested.
This means:
- Always set `nvidia.com/gpu` request == limit (fractional GPU requests are not supported by the standard device plugin)
- GPU OOM is not the same as CPU OOM — a GPU out-of-memory error in CUDA causes a CUDA error in the model server process, not a kernel OOMKill. vLLM will catch this and either fail the request or crash the process
- Monitor `DCGM_FI_DEV_FB_USED` (framebuffer used) and `DCGM_FI_DEV_FB_FREE` via the DCGM exporter — set an alert when VRAM usage exceeds 90% of capacity
Step 5: KEDA Autoscaling
Standard HPA is the wrong tool for LLM autoscaling. Here's why:
HPA scales on CPU/memory utilization. During an LLM request, the GPU is maxed but CPU and memory might be moderate. More importantly, HPA reacts to current resource usage — by the time CPU spikes from a queue buildup, you're already degrading. New GPU pods take 3-5 minutes to start (including node provisioning by Karpenter), so you need to scale ahead of saturation, not in response to it.
KEDA's solution: scale on request queue depth using a Prometheus metric that vLLM exports natively — vllm:num_requests_waiting.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama3-8b

  # Min/max replicas
  minReplicaCount: 1
  maxReplicaCount: 5

  # How long to keep a scaled-up pod around after queue drains
  cooldownPeriod: 300 # 5 minutes — GPU pods are expensive, don't yo-yo

  # How aggressively to scale up
  pollingInterval: 15

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: vllm_requests_waiting
        query: |
          sum(vllm:num_requests_waiting{namespace="llm-serving", model="llama3-8b"})
        threshold: "5" # Scale up when 5+ requests are waiting for a GPU
        activationThreshold: "1" # Activate scaling when >= 1 request is waiting
```

When `vllm:num_requests_waiting >= 5`, KEDA adds another replica. Each new replica triggers Karpenter to provision a new GPU node (since there's likely no spare capacity — GPU nodes are tainted and not shared). The new pod won't serve traffic for 4-6 minutes while the node provisions and the model loads. Plan your threshold accordingly: scale early, not late.
Scale-to-zero consideration: minReplicaCount: 1 keeps one pod warm at all times. Setting minReplicaCount: 0 enables scale-to-zero — the cluster has zero GPU nodes running when idle, maximizing cost savings. The tradeoff: the first request after a quiet period waits 5+ minutes for a cold start. For development clusters or low-latency-tolerant workloads, scale-to-zero is worth it. For production chatbots, keep at least one replica warm.
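KEDA also passes scale-down behavior through to the underlying HPA. If replicas still flap despite the cooldown, a stabilization window helps; a sketch to add under `spec` in the ScaledObject above:

```yaml
# Sketch: smooth scale-down via the underlying HPA's behavior settings
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600 # require 10 min of low queue depth
        policies:
          - type: Pods
            value: 1
            periodSeconds: 120 # drop at most one replica per 2 minutes
```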
Step 6: Monitoring GPU Workloads
DCGM Exporter (installed by the GPU Operator) exports these metrics to Prometheus:
| Metric | What It Tells You |
|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | GPU compute utilization % (0-100) |
| `DCGM_FI_DEV_FB_USED` | VRAM used (bytes) |
| `DCGM_FI_DEV_FB_FREE` | VRAM free (bytes) |
| `DCGM_FI_DEV_SM_CLOCK` | Streaming multiprocessor clock speed |
| `DCGM_FI_DEV_POWER_USAGE` | GPU power draw (watts) |
| `DCGM_FI_DEV_GPU_TEMP` | GPU temperature (°C) |
Alerts to configure:
```yaml
# Alert: VRAM nearly full (will cause GPU OOM on next large request)
- alert: GPUVRAMCritical
  expr: |
    DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.92
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "GPU VRAM usage above 92% on {{ $labels.instance }}"

# Alert: GPU utilization sustained at 100% (probably a stuck request)
- alert: GPUSustainedFullUtilization
  expr: DCGM_FI_DEV_GPU_UTIL > 99
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU at 100% utilization for 10+ minutes — check for stuck requests"

# Alert: GPU temperature critical (throttling imminent)
- alert: GPUTempCritical
  expr: DCGM_FI_DEV_GPU_TEMP > 85
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU temperature critical on {{ $labels.instance }}"
```

vLLM also exports its own Prometheus metrics at `:8000/metrics`:
- `vllm:num_requests_running` — requests currently being processed
- `vllm:num_requests_waiting` — requests in queue (the KEDA trigger metric)
- `vllm:gpu_cache_usage_perc` — KV cache utilization (high values = PagedAttention under pressure)
- `vllm:generation_tokens_total` — total tokens generated (useful for billing/cost attribution)
Step 7: Cost Controls
GPU instances are expensive. A g5.xlarge in us-east-1 runs ~$1.00/hour on-demand. At 720 hours/month, that's $720/month per always-on node. Controls that matter:
Scale to zero in non-production: Use minReplicaCount: 0 in dev/staging KEDA configs. Accept the cold-start latency; developers can wait 5 minutes.
Spot instances for non-critical workloads: g5 spot instances run at a 60-70% discount. The risk is interruption (2-minute warning). vLLM handles SIGTERM gracefully — in-flight requests complete, then the pod shuts down. Configure Karpenter's NodePool with `capacity-type: spot` where that interruption risk is acceptable:
```yaml
# Karpenter NodePool addition for spot GPU
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"] # Try spot first, fall back to on-demand
```

Right-size for the model: Don't pay for g5.2xlarge if g5.xlarge works. Verify VRAM utilization under load before committing to a larger instance type.
Token-based cost attribution: `vllm:generation_tokens_total` labeled by model and namespace lets you build per-team cost allocation in Kubecost or a custom dashboard. At $0.002-0.01 per 1000 tokens for internal serving (rough GPU cost), this adds up at scale.
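A rough sketch of the attribution query — the `model_name` label is what current vLLM versions attach, but check your version's metric labels before wiring this into a dashboard:

```promql
# Tokens generated per served model over the last 24h;
# multiply by your blended $/1K-token estimate for a chargeback figure
sum by (model_name) (increase(vllm:generation_tokens_total[24h]))
```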
For broader Kubernetes cost optimization patterns, see Kubernetes Cost Optimization on AWS.
Debugging Common Failures
Insufficient nvidia.com/gpu
Pod stuck in Pending with this event:
```
0/5 nodes are available: 5 Insufficient nvidia.com/gpu
```
Causes:
- No GPU nodes in the cluster yet (if using scale-to-zero, Karpenter needs a moment to provision)
- GPU device plugin not running — check `kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-device-plugin-daemonset`
- Pod doesn't tolerate the `nvidia.com/gpu: NoSchedule` taint — check `tolerations` in your Deployment
- GPU nodes exist but are cordoned or have the wrong labels for the `nodeSelector`
GPU OOM During Serving (Not Pod OOMKill)
The pod stays running but requests return HTTP 500 with a CUDA error. This is a GPU out-of-memory error, not a Kubernetes OOM event.
Fix: Reduce `--gpu-memory-utilization` (try 0.85), reduce `--max-model-len` (shorter context = smaller KV cache), or reduce `--max-num-seqs`.
Pod OOMKill (CPU Memory, Not GPU)
The pod is killed and restarted with OOMKilled reason. This is the CPU memory limit, not GPU VRAM.
vLLM uses CPU memory for tokenization, request batching, and intermediate buffers. A `memory: 16Gi` request is usually sufficient for 7B models, but large-context requests with many concurrent users can push CPU memory past the limit. Increase the memory limit or reduce `--max-num-seqs`.
Slow Cold Start (Pod Takes 8+ Minutes to Become Ready)
Two causes:
- Model downloading at startup — switch to the pre-populated PVC pattern (Option C above)
- CUDA graph capture — vLLM captures CUDA graphs at startup for inference optimization. This takes 1-3 minutes for large models. If you see OOM during this phase, add `--enforce-eager` to skip CUDA graph capture (reduces throughput ~10% but eliminates the OOM risk during startup)
Pod Stuck Pending Because Node Not Provisioned
If you're using Karpenter and a new pod is stuck Pending, check whether Karpenter is actually trying to provision:
```bash
kubectl get events --field-selector reason=ProvisioningFailed -n kube-system
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100
```

Common cause: EC2 capacity limits in your availability zone. Add g5.2xlarge and g5.4xlarge as fallback instance types in your NodePool to increase the capacity pool Karpenter draws from, as in the sketch below.
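A minimal sketch of the widened requirement, applied to the same NodePool from Step 1:

```yaml
# Sketch: widen the instance-type pool so Karpenter can fall back
# when g5.xlarge capacity is exhausted in a zone
requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["g5.xlarge", "g5.2xlarge", "g5.4xlarge"]
```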
Security Considerations
Network policy: LLM inference endpoints should not be publicly accessible. Restrict access to known namespaces or service accounts:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-vllm-access
  namespace: llm-serving
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway # Only the gateway namespace can call vLLM
      ports:
        - port: 8000
```

API key authentication: vLLM supports an API key via `--api-key`. Mount the key from a Kubernetes Secret. Your ingress/gateway should validate the key before traffic reaches vLLM.
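One way to wire this into the Deployment from Step 4 — a sketch assuming a Secret named `vllm-api-key` (hypothetical). Recent vLLM versions also read `VLLM_API_KEY` from the environment; if yours doesn't, pass `--api-key` as a container arg instead:

```yaml
# Sketch: inject the API key from a Secret (Secret name/key are assumptions)
env:
  - name: VLLM_API_KEY
    valueFrom:
      secretKeyRef:
        name: vllm-api-key
        key: token
```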
Model integrity: Verify model checksum after download. A corrupted model can produce coherent-looking but incorrect output — harder to detect than a crash.
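A minimal sketch of that check, assuming you publish a `SHA256SUMS` manifest (a hypothetical artifact you'd generate at model publish time) alongside the weights:

```bash
# Verify downloaded weights against the published manifest before serving
cd /models/llama-3-8b && sha256sum --check SHA256SUMS \
  || { echo "checksum mismatch — refusing to serve"; exit 1; }
```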
Further Reading
For scaling strategies beyond simple KEDA Prometheus triggers, KEDA: Event-Driven Autoscaling for Kubernetes covers queue-based scaling patterns in depth.
For cost controls at the Kubernetes level beyond just GPU workloads, Kubernetes Cost Optimization on AWS covers right-sizing, spot strategies, and Karpenter configuration.
For multi-cluster patterns when you need GPU inference spread across regions, Multi-Cluster Kubernetes Patterns and Pitfalls covers the architecture and failure modes.
GPU workloads on Kubernetes have a higher operational ceiling than typical web services, but the patterns are learnable. If you're deploying LLMs for the first time and want a second opinion on your architecture — node sizing, model serving choice, or autoscaling configuration — reach out via the contact page.
Frequently Asked Questions
Which GPU is best for LLM inference on EKS?
The NVIDIA A10G (found in AWS g5 instances) is currently the best price-to-performance choice for models like Llama 3 8B. For 70B+ models, you'll need the higher VRAM and bandwidth of NVIDIA A100 (p4 instances) or H100 (p5 instances).
Why use vLLM over other model servers?
vLLM's PagedAttention algorithm is the primary reason. It allows you to run higher batch sizes and longer context lengths by managing GPU memory with near-zero fragmentation. Other servers (Triton, TGI) have their strengths, but vLLM is often the fastest to deploy and most memory-efficient for LLM inference.
How much can we save with scale-to-zero?
A single always-on g5.xlarge costs ~$720/month; scale-to-zero saves you the idle fraction of that — roughly $360/month per node if it would otherwise sit idle half the time. At scale, this can be tens of thousands of dollars. The only tradeoff is the ~5-minute "cold start" wait for the first request.
Is Kubernetes enough for LLM security?
Standard Kubernetes security (RBAC and Network Policy) is a good start, but consider eBPF-based runtime security like Tetragon (from the Cilium project) to detect and prevent unauthorized shell execution or unusual syscalls inside your GPU pods.


