Kubernetes
14 min read · March 28, 2026

How to Deploy an LLM on Kubernetes: GPU Nodes, Model Serving, and Autoscaling

Running LLMs in production on Kubernetes means GPU node management, model serving (vLLM or Triton), resource limits that actually work, and KEDA-based autoscaling. Here's the full picture.

Ajeet Yadav
Platform & Cloud Engineer

Running an LLM isn't like running a web service.

Whether you chose Kubernetes, Docker Swarm, or Nomad for your orchestrator, the requirements for LLM workloads are distinct and unforgiving. A web service can burst CPU for a few seconds, share a node with a dozen other services, and restart from a crash in under five seconds. An LLM serving pod owns a GPU exclusively, takes 3-5 minutes to start (model weights take time to load), and will OOMKill if you get the memory configuration wrong — not gradually degrade, but hard-crash.

Most guides on "LLM on Kubernetes" cover the happy path. This one covers the happy path and the failure modes: what actually breaks, why, and how to prevent it before it causes a production incident.


The Architecture We're Building


Components:

  • GPU node pool on EKS with g5.xlarge (NVIDIA A10G, 24GB VRAM)
  • NVIDIA GPU Operator — installs device plugin, DCGM exporter, container toolkit
  • vLLM — our model server, OpenAI-compatible API, PagedAttention for memory efficiency
  • Model weights PVC — pre-populated EFS volume, shared across serving pods
  • KEDA — scales vLLM pods based on vllm:num_requests_waiting Prometheus metric
  • Karpenter — provisions new GPU nodes when KEDA scales out beyond current capacity
  • DCGM Exporter — GPU utilization and memory metrics for dashboards and alerting

Step 1: GPU Node Pool Setup

EKS Managed Node Group

GPU instances on EKS require a specific AMI (Amazon Linux 2 GPU-optimized) and appropriate instance types. The g5 family (A10G GPUs) is the current sweet spot for inference:

  • g5.xlarge: 1x A10G (24GB VRAM), 4 vCPU, 16GB RAM — good for 7B models
  • g5.2xlarge: 1x A10G, 8 vCPU, 32GB RAM — more CPU headroom for preprocessing
  • g5.12xlarge: 4x A10G — for 70B models with tensor parallelism (quantized; FP16 70B doesn't fit in the combined 96GB)

For Llama 3 8B in FP16, a single g5.xlarge gives you comfortable headroom. For 70B models in FP16, you need either a g5.48xlarge (8 GPUs) or multiple nodes with tensor parallelism; quantized 70B checkpoints can run on a single g5.12xlarge.
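
vLLM handles the multi-GPU case with its --tensor-parallel-size flag, and the pod then has to request every GPU on the node. A minimal sketch of the relevant container fragment for a g5.12xlarge, assuming an AWQ-quantized 70B checkpoint at a placeholder path:

yaml
# Sketch: vLLM container fragment for 4-way tensor parallelism on one node.
# Model path and served name are placeholders; quantization keeps a 70B model inside 4x24GB.
args:
  - --model=/models/llama-3-70b-awq
  - --served-model-name=llama-3-70b
  - --tensor-parallel-size=4      # shard the model across all 4 A10Gs on the node
  - --quantization=awq
resources:
  requests:
    nvidia.com/gpu: "4"           # the pod must own every GPU it shards across
  limits:
    nvidia.com/gpu: "4"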

hcl
# Terraform: EKS GPU managed node group
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "gpu-inference"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  # GPU-optimized AMI
  ami_type       = "AL2_x86_64_GPU"
  instance_types = ["g5.xlarge"]

  scaling_config {
    desired_size = 0  # Start at 0, let Karpenter provision on demand
    min_size     = 0
    max_size     = 10
  }

  labels = {
    "workload-type" = "gpu-inference"
    "nvidia.com/gpu" = "true"
  }

  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }

  tags = {
    "karpenter.sh/discovery" = var.cluster_name
  }
}

The critical pieces:

  • ami_type = "AL2_x86_64_GPU" — this AMI has NVIDIA drivers pre-installed. Do not run the GPU Operator's driver installer on top of this AMI (more on this below).
  • The NoSchedule taint on nvidia.com/gpu ensures no regular workloads land on expensive GPU nodes accidentally.
  • desired_size = 0 with Karpenter managing provisioning keeps costs at zero when no LLM requests are in flight.

Karpenter NodePool for GPU Instances

With GPU nodes starting at $1.00/hour per node, you don't want idle GPU nodes sitting around. Karpenter's node provisioner + consolidation handles this automatically:

yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        workload-type: gpu-inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]  # Switch to spot for dev/staging
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws  # Karpenter v1 uses "group" here, not "apiVersion"
        kind: EC2NodeClass
        name: gpu-inference
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m

consolidateAfter: 5m — Karpenter will decommission a GPU node 5 minutes after its last vLLM pod terminates. GPU instances are expensive; don't leave them idle.
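
The NodePool's nodeClassRef points at an EC2NodeClass named gpu-inference, which carries the AWS-side configuration. A minimal sketch using Karpenter v1 field names; the role name and discovery tag values are placeholders for your cluster:

yaml
# Sketch: EC2NodeClass backing the gpu-inference NodePool.
# Role name and discovery tag values are placeholders.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-inference
spec:
  amiSelectorTerms:
    - alias: al2@latest        # Karpenter selects the GPU-accelerated AL2 variant for GPU instance types
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster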


Step 2: NVIDIA GPU Operator

The GPU Operator is a Kubernetes operator that manages the NVIDIA software stack on your nodes: device plugin (exposes GPUs as schedulable resources), container toolkit (configures Docker/containerd to use NVIDIA runtime), DCGM exporter (GPU metrics for Prometheus), and optionally the NVIDIA driver itself.

Installation

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# driver.enabled=false: the AL2 GPU AMI already ships with NVIDIA drivers
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true  # If using Prometheus Operator

Critical flag: driver.enabled=false. The EKS GPU-optimized AMI (AL2_x86_64_GPU) ships with NVIDIA drivers pre-installed. If you let the GPU Operator install drivers on top, you'll get driver conflicts that cause all GPU pods to fail with Failed to initialize NVML: Driver/library version mismatch. I've watched three teams hit this exact issue. Always check whether drivers are pre-installed in your AMI before enabling the GPU Operator's driver component.

Validating GPU Availability

Once the Operator is running, GPUs appear as schedulable resources:

bash
kubectl get nodes -o json | jq '.items[] | {
  name: .metadata.name,
  gpu: .status.capacity["nvidia.com/gpu"]
}'

Expected output:

json
{
  "name": "ip-10-0-1-42.ec2.internal",
  "gpu": "1"
}

If nvidia.com/gpu is null or missing, the device plugin DaemonSet pod on that node has a problem. Check its logs:

bash
kubectl logs -n gpu-operator \
  -l app.kubernetes.io/component=nvidia-device-plugin-daemonset \
  --tail=50

Step 3: Model Weight Storage

This is a decision that has significant operational consequences. You have three options.

Option A: Bake Weights Into Container Image

Simple. Wrong for anything above 3B parameters.

A 7B parameter model in FP16 is ~14GB of weights. Baking that into a container image means a 15GB+ image that every pod pull has to download, that every image scan has to process, and that gets rebuilt unnecessarily every time you update the serving code. ECR image pulls of 15GB on a cold node take 8-12 minutes.

Use this pattern only for tiny models (< 2B parameters) in development environments.

Option B: Download Weights at Startup (Init Container)

Better, but creates cold-start problems at scale.

yaml
initContainers:
  - name: download-model
    image: amazon/aws-cli:latest
    command:
      - sh
      - -c
      - |
        aws s3 sync s3://my-model-bucket/llama-3-8b/ /models/llama-3-8b/ \
          --region us-east-1 \
          --no-progress
    volumeMounts:
      - name: model-storage
        mountPath: /models
    resources:
      requests:
        cpu: "1"
        memory: "4Gi"
Downloading ~14GB from S3 to a new pod takes 2-4 minutes on a well-provisioned node. If KEDA scales you from 1 to 5 pods during a traffic spike, all 4 new pods are downloading in parallel, adding S3 transfer costs and extending your scale-out latency.

This pattern works for small models and environments where cold-start latency is acceptable.

Option C: Pre-Populated PVC on EFS

This is the pattern the architecture above uses. Pre-populate an EFS volume with model weights once. Mount it ReadWriteMany across all serving pods. No download on startup — weights are already on the volume.

yaml
# 1. Create the PVC (EFS StorageClass for ReadWriteMany)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
---
# 2. One-time Job to populate the PVC from S3
apiVersion: batch/v1
kind: Job
metadata:
  name: model-downloader
  namespace: llm-serving
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: amazon/aws-cli:latest
          command:
            - sh
            - -c
            - |
              aws s3 sync s3://my-model-bucket/llama-3-8b/ /models/llama-3-8b/ \
                --region us-east-1
          volumeMounts:
            - name: model-weights
              mountPath: /models
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights
      restartPolicy: Never
Run the Job once when you add a new model version. After that, all serving pods mount the PVC and start in seconds (just the model load from local NFS, not S3 download).

EFS throughput depends on the throughput mode — use Elastic throughput (rather than Bursting) for inference workloads so model loads aren't throttled by burst credits. Alternatively, use EBS gp3 volumes if you have a single replica (ReadWriteOnce) or are willing to manage node-local storage.
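
The PVC above references storageClassName: efs-sc, which has to exist before the claim can bind. A minimal sketch for the AWS EFS CSI driver, with a placeholder file system ID:

yaml
# Sketch: StorageClass for dynamic EFS provisioning via the AWS EFS CSI driver.
# The fileSystemId is a placeholder for your EFS file system.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap       # one EFS access point per volume
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"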


Step 4: Deploying vLLM

vLLM is the right choice for most production LLM serving workloads in 2026. Its PagedAttention algorithm manages the KV cache like OS virtual memory, dramatically reducing memory fragmentation compared to static allocation. The result: significantly higher throughput at the same VRAM budget.

It also ships an OpenAI-compatible API — POST /v1/chat/completions works exactly like the OpenAI API, so clients written for OpenAI work without modification.

The Deployment

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  replicas: 1  # KEDA will manage this
  selector:
    matchLabels:
      app: vllm
      model: llama3-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama3-8b
    spec:
      # Tolerate the GPU taint
      tolerations:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule

      # Schedule only on GPU nodes
      nodeSelector:
        workload-type: gpu-inference

      # Don't start until model weights are mounted
      initContainers:
        - name: check-model
          image: busybox:latest
          command: ["sh", "-c", "test -d /models/llama-3-8b && echo 'Model found' || exit 1"]
          volumeMounts:
            - name: model-weights
              mountPath: /models

      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.0
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3-8b
            - --served-model-name=llama-3-8b
            - --host=0.0.0.0
            - --port=8000
            - --gpu-memory-utilization=0.90      # Reserve 10% for CUDA overhead
            - --max-model-len=8192               # Cap context to control KV cache size
            - --max-num-seqs=64                  # Max concurrent sequences
            - --dtype=half                       # FP16 — the right default for A10G
            - --enforce-eager                    # Disable CUDA graph capture if you see OOM on startup
          ports:
            - containerPort: 8000
              name: http

          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "20Gi"
              nvidia.com/gpu: "1"  # Always set GPU limit = request (GPU is not compressible)

          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120  # Model loading takes time — give it space
            periodSeconds: 15
            failureThreshold: 20

          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            failureThreshold: 3

          volumeMounts:
            - name: model-weights
              mountPath: /models

          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token

      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights

      # Give in-flight requests time to complete on shutdown
      terminationGracePeriodSeconds: 300
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model: llama3-8b
  ports:
    - port: 80
      targetPort: 8000
      name: http
Key Configuration Parameters Explained

--gpu-memory-utilization=0.90: vLLM pre-allocates this fraction of GPU VRAM for the KV cache. Leave at least 10% for CUDA kernels, model weight loading buffers, and overhead. Setting this to 1.0 will cause OOM during startup.

--max-model-len=8192: The maximum context window in tokens. KV cache size scales linearly with context length (and attention compute scales quadratically). A 128k context window on a 24GB GPU is only possible with quantized models. If you're using FP16 Llama 3 8B, 8192-16384 is the practical range on a g5.xlarge.

--max-num-seqs=64: Maximum concurrent sequences (active requests being processed). Higher values increase throughput but also peak memory usage.

--dtype=half: A10G GPUs (and the Ampere generation in general) run FP16 efficiently. Older GPUs such as the T4 lack bfloat16 support, so half is the safe choice there as well. On newer hardware (H100), bfloat16 is preferred.

readinessProbe.initialDelaySeconds=120: Llama 3 8B in FP16 takes 60-90 seconds to load from an EFS volume. Set your probe delays generously. A readiness probe that fails during loading only keeps the pod out of rotation, but a liveness probe that fires before the model finishes loading kills and restarts the container in a loop — you'll spend 5 minutes debugging a non-problem.
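
A startupProbe is an alternative to padding initialDelaySeconds: liveness and readiness checks are held off until it passes, so a slow model load never triggers a restart. A sketch reusing the same /health endpoint:

yaml
# Sketch: a startupProbe gives model loading a generous budget without
# delaying liveness checks once the server is up.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 15
  failureThreshold: 30     # up to 7.5 minutes for weights to load
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  failureThreshold: 3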

Resource Limits: GPU Is Not Compressible

CPU and memory have soft and hard limits in Kubernetes. CPU can be throttled; memory triggers OOMKill. GPUs have no throttling mechanism. A GPU resource limit in Kubernetes is a scheduling constraint — it controls which pods can schedule onto GPU nodes — but it doesn't prevent a running pod from using more VRAM than it requested.

This means:

  • Always set nvidia.com/gpu request == limit (fractional GPU requests are not supported by the standard device plugin)
  • GPU OOM is not the same as CPU OOM — a GPU out-of-memory error in CUDA causes a CUDA error in the model server process, not a kernel OOMKill. vLLM will catch this and either fail the request or crash the process
  • Monitor DCGM_FI_DEV_FB_USED (framebuffer used) and DCGM_FI_DEV_FB_FREE via the DCGM exporter — set an alert when VRAM usage exceeds 90% of capacity

Step 5: KEDA Autoscaling

Standard HPA is the wrong tool for LLM autoscaling. Here's why:

HPA scales on CPU/memory utilization. During an LLM request, the GPU is maxed but CPU and memory might be moderate. More importantly, HPA reacts to current resource usage — by the time CPU spikes from a queue buildup, you're already degrading. New GPU pods take 3-5 minutes to start (including node provisioning by Karpenter), so you need to scale ahead of saturation, not in response to it.

KEDA's solution: scale on request queue depth using a Prometheus metric that vLLM exports natively — vllm:num_requests_waiting.

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama3-8b

  # Min/max replicas
  minReplicaCount: 1
  maxReplicaCount: 5

  # How long after the last active trigger before scaling back to zero
  # (only applies when minReplicaCount is 0)
  cooldownPeriod: 300  # 5 minutes — GPU pods are expensive, don't yo-yo

  # How often KEDA evaluates the trigger (seconds)
  pollingInterval: 15

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: vllm_requests_waiting
        query: |
          sum(vllm:num_requests_waiting{namespace="llm-serving", model="llama3-8b"})
        threshold: "5"  # Scale up when 5+ requests are waiting for a GPU
        activationThreshold: "1"  # Activate scaling when >= 1 request is waiting

When vllm:num_requests_waiting >= 5, KEDA adds another replica. Each new replica triggers Karpenter to provision a new GPU node (since there's likely no spare capacity — GPU nodes are tainted and not shared). The new pod won't serve traffic for 4-6 minutes while the node provisions and the model loads. Plan your threshold accordingly: scale early, not late.

Scale-to-zero consideration: minReplicaCount: 1 keeps one pod warm at all times. Setting minReplicaCount: 0 enables scale-to-zero — the cluster has zero GPU nodes running when idle, maximizing cost savings. The tradeoff: the first request after a quiet period waits 5+ minutes for a cold start. For development clusters or low-latency-tolerant workloads, scale-to-zero is worth it. For production chatbots, keep at least one replica warm.
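
If you do want scale-to-zero for dev or staging, only a couple of fields change from the production ScaledObject above. A minimal sketch; the values are illustrative:

yaml
# Sketch: dev/staging overrides for scale-to-zero. With no active trigger,
# KEDA scales the Deployment to 0 and Karpenter removes the idle GPU node.
spec:
  minReplicaCount: 0
  cooldownPeriod: 600   # wait 10 minutes of inactivity before dropping to zero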


Step 6: Monitoring GPU Workloads

DCGM Exporter (installed by the GPU Operator) exports these metrics to Prometheus:

  • DCGM_FI_DEV_GPU_UTIL — GPU compute utilization % (0-100)
  • DCGM_FI_DEV_FB_USED — VRAM used (bytes)
  • DCGM_FI_DEV_FB_FREE — VRAM free (bytes)
  • DCGM_FI_DEV_SM_CLOCK — Streaming multiprocessor clock speed
  • DCGM_FI_DEV_POWER_USAGE — GPU power draw (watts)
  • DCGM_FI_DEV_GPU_TEMP — GPU temperature

Alerts to configure:

yaml
# Alert: VRAM nearly full (will cause GPU OOM on next large request)
- alert: GPUVRAMCritical
  expr: |
    DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.92
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "GPU VRAM usage above 92% on {{ $labels.instance }}"

# Alert: GPU utilization sustained at 100% (probably a stuck request)
- alert: GPUSustainedFullUtilization
  expr: DCGM_FI_DEV_GPU_UTIL > 99
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU at 100% utilization for 10+ minutes — check for stuck requests"

# Alert: GPU temperature critical (throttling imminent)
- alert: GPUTempCritical
  expr: DCGM_FI_DEV_GPU_TEMP > 85
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU temperature critical on {{ $labels.instance }}"

vLLM also exports its own Prometheus metrics at :8000/metrics:

  • vllm:num_requests_running — requests currently being processed
  • vllm:num_requests_waiting — requests in queue (the KEDA trigger metric)
  • vllm:gpu_cache_usage_perc — KV cache utilization (high values = PagedAttention under pressure)
  • vllm:generation_tokens_total — total tokens generated (useful for billing/cost attribution)
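
These metrics only reach Prometheus (and therefore KEDA) if the vLLM pods are actually scraped. With the Prometheus Operator, a PodMonitor is enough; a minimal sketch that assumes the pod labels and the http port name from the Deployment above:

yaml
# Sketch: scrape vLLM's /metrics endpoint with the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm
  namespace: llm-serving
spec:
  selector:
    matchLabels:
      app: vllm
  podMetricsEndpoints:
    - port: http          # the named containerPort from the vLLM Deployment
      path: /metrics
      interval: 15s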

Step 7: Cost Controls

GPU instances are expensive. A g5.xlarge in us-east-1 runs ~$1.00/hour on-demand. At 720 hours/month, that's $720/month per always-on node. Controls that matter:

Scale to zero in non-production: Use minReplicaCount: 0 in dev/staging KEDA configs. Accept the cold-start latency; developers can wait 5 minutes.

Spot instances for non-critical workloads: g5 spot instances run at a 60-70% discount. The risk is interruption (2-minute warning). vLLM handles SIGTERM gracefully — in-flight requests complete, then the pod shuts down. Configure Karpenter's NodePool with capacity-type: spot for workloads where interruption risk is acceptable.

yaml
# Karpenter NodePool addition for spot GPU
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]  # Try spot first, fall back to on-demand

Right-size for the model: Don't pay for g5.2xlarge if g5.xlarge works. Verify VRAM utilization under load before committing to a larger instance type.

Token-based cost attribution: vllm:generation_tokens_total labeled by model and namespace lets you build per-team cost allocation in Kubecost or a custom dashboard. At $0.002-0.01 per 1000 tokens for internal serving (rough GPU cost), this adds up at scale.
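
One way to wire this up is a Prometheus recording rule that converts the counter into hourly token throughput per model, which a Kubecost or Grafana panel can multiply by your cost-per-token estimate. A sketch; label names depend on how your scrape config relabels vLLM metrics, so treat them as placeholders:

yaml
# Sketch: hourly generated-token throughput per model and namespace.
# Rule name and label names are illustrative; adjust to your relabeling.
groups:
  - name: llm-cost-attribution
    rules:
      - record: llm:generation_tokens:per_hour
        expr: sum by (model, namespace) (rate(vllm:generation_tokens_total[1h])) * 3600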

For broader Kubernetes cost optimization patterns, see Kubernetes Cost Optimization on AWS.


Debugging Common Failures

Insufficient nvidia.com/gpu

Pod stuck in Pending with this event:

0/5 nodes are available: 5 Insufficient nvidia.com/gpu

Causes:

  1. No GPU nodes in the cluster yet (if using scale-to-zero, Karpenter needs a moment to provision)
  2. GPU device plugin not running — check kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-device-plugin-daemonset
  3. Pod doesn't tolerate the nvidia.com/gpu: NoSchedule taint — check tolerations in your Deployment
  4. GPU nodes exist but are cordoned or have the wrong labels for the nodeSelector

GPU OOM During Serving (Not Pod OOMKill)

The pod stays running but requests return HTTP 500 with a CUDA error. This is a GPU out-of-memory error, not a Kubernetes OOM event.

Fix: Reduce --gpu-memory-utilization (try 0.85), reduce --max-model-len (shorter context = smaller KV cache), or reduce --max-num-seqs.

Pod OOMKill (CPU Memory, Not GPU)

The pod is killed and restarted with OOMKilled reason. This is the CPU memory limit, not GPU VRAM.

vLLM uses CPU memory for tokenization, request batching, and intermediate buffers. A memory: 16Gi limit is usually sufficient for 7B models, but large-context requests with many concurrent users can push CPU memory. Increase the memory limit or reduce --max-num-seqs.

Slow Cold Start (Pod Takes 8+ Minutes to Become Ready)

Two causes:

  1. Model downloading at startup — switch to the pre-populated PVC pattern (Option C above)
  2. CUDA graph capture — vLLM captures CUDA graphs at startup for inference optimization. This takes 1-3 minutes for large models. If you see OOM during this phase, add --enforce-eager to skip CUDA graph capture (reduces throughput ~10% but eliminates the OOM risk during startup)

Pod Stuck Pending Because Node Not Provisioned

If you're using Karpenter and a new pod is stuck Pending, check whether Karpenter is actually trying to provision:

bash
kubectl get events --field-selector reason=ProvisioningFailed -n kube-system
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100

Common cause: EC2 capacity limits in your availability zone. Add g5.2xlarge and g5.4xlarge as fallback instance types in your NodePool to increase the capacity pool Karpenter draws from.
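
In NodePool terms, that means widening the instance-type requirement. A sketch of the relevant fragment:

yaml
# Sketch: broaden the NodePool's instance-type requirement so Karpenter can
# fall back to larger g5 sizes when g5.xlarge capacity is unavailable.
requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["g5.xlarge", "g5.2xlarge", "g5.4xlarge"]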


Security Considerations

Network policy: LLM inference endpoints should not be publicly accessible. Restrict access to known namespaces or service accounts:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-vllm-access
  namespace: llm-serving
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway  # Only the gateway namespace can call vLLM
      ports:
        - port: 8000

API key authentication: vLLM supports an API key via --api-key. Mount the key from a Kubernetes Secret. Your ingress/gateway should validate the key before traffic reaches vLLM.
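
A minimal sketch of the wiring, assuming a Secret named vllm-api-key that you create separately; Kubernetes expands $(VLLM_API_KEY) in args because the variable is declared under env:

yaml
# Sketch: inject the vLLM API key from a Secret. Secret name and key are placeholders.
env:
  - name: VLLM_API_KEY
    valueFrom:
      secretKeyRef:
        name: vllm-api-key
        key: api-key
args:
  - --api-key=$(VLLM_API_KEY)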

Model integrity: Verify model checksum after download. A corrupted model can produce coherent-looking but incorrect output — harder to detect than a crash.
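
One way to do this is a verification Job that runs after the downloader Job, assuming you publish a sha256 manifest alongside the weights (the checksums.sha256 file name here is hypothetical):

yaml
# Sketch: verify model weights on the shared PVC against a sha256 manifest.
# Assumes a checksums.sha256 file was uploaded with the model (hypothetical).
apiVersion: batch/v1
kind: Job
metadata:
  name: model-verifier
  namespace: llm-serving
spec:
  template:
    spec:
      containers:
        - name: verify
          image: busybox:latest
          command: ["sh", "-c", "cd /models/llama-3-8b && sha256sum -c checksums.sha256"]
          volumeMounts:
            - name: model-weights
              mountPath: /models
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights
      restartPolicy: Never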


Further Reading

For scaling strategies beyond simple KEDA Prometheus triggers, KEDA: Event-Driven Autoscaling for Kubernetes covers queue-based scaling patterns in depth.

For cost controls at the Kubernetes level beyond just GPU workloads, Kubernetes Cost Optimization on AWS covers right-sizing, spot strategies, and Karpenter configuration.

For multi-cluster patterns when you need GPU inference spread across regions, Multi-Cluster Kubernetes Patterns and Pitfalls covers the architecture and failure modes.


GPU workloads on Kubernetes have a higher operational ceiling than typical web services, but the patterns are learnable. If you're deploying LLMs for the first time and want a second opinion on your architecture — node sizing, model serving choice, or autoscaling configuration — reach out via the contact page.


Frequently Asked Questions

Which GPU is best for LLM inference on EKS?

The NVIDIA A10G (found in AWS g5 instances) is currently the best price-to-performance choice for models like Llama 3 8B. For 70B+ models, you'll need the higher VRAM and bandwidth of NVIDIA A100 (p4 instances) or H100 (p5 instances).

Why use vLLM over other model servers?

vLLM's PagedAttention algorithm is the primary reason. It allows you to run higher batch sizes and longer context lengths by managing GPU memory with near-zero fragmentation. Other servers (Triton, TGI) have their strengths, but vLLM is often the fastest to deploy and most memory-efficient for LLM inference.

How much can we save with scale-to-zero?

For a single g5.xlarge node at ~$1.00/hour, scale-to-zero saves roughly $360/month if the node would otherwise sit idle half the time, and up to ~$720/month for a node that is idle almost all the time. At scale, this can be tens of thousands of dollars. The only tradeoff is the ~5-minute "cold start" wait for the first request.

Is Kubernetes enough for LLM security?

Standard Kubernetes security (RBAC and Network Policy) is a good start, but consider eBPF-based runtime security like Tetragon (from the Cilium project) to detect and prevent unauthorized shell execution or unusual syscalls inside your GPU pods.

Related Topics

Kubernetes
LLM
GPU
AI
vLLM
KEDA
NVIDIA
Platform Engineering
Machine Learning
