eBPF for Platform Engineers: Cilium, Hubble, and Tetragon Without the Hype
eBPF lets you run sandboxed programs in the Linux kernel without writing kernel modules — here's what that actually means for your networking, observability, and security stack.

Every few years a piece of infrastructure technology gets declared revolutionary before most teams have a reason to care about it. eBPF is one of those — and unlike most, the hype is at least partially justified.
If you're already comparing Cilium vs Calico on AKS, you've run into eBPF whether you realized it or not. But "eBPF is amazing" isn't actionable. Let me tell you what it actually is, what Cilium and Tetragon do with it, and whether you should care right now.
What eBPF Actually Is
eBPF stands for "extended Berkeley Packet Filter." The name is historical baggage — modern eBPF has little to do with packet filtering. What it actually is: a way to run sandboxed programs inside the Linux kernel without writing kernel modules.
Kernel modules are dangerous. They run with full kernel privileges, a bug in one can panic the entire system, and they have to be built against each kernel version. eBPF programs are different: the kernel's verifier checks every program before it's allowed to run, ensuring no unbounded loops, no invalid memory access, no crash paths. If your program doesn't pass verification, it doesn't load.
Once loaded, eBPF programs attach to hooks — network events, system calls, kernel functions, tracepoints. They run in response to those events with near-zero overhead because they execute in kernel space, bypassing the cost of context-switching to userspace.
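You can see this machinery on any modern Linux box. Assuming bpftool is installed (it ships with the kernel tools packages on most distros):
# List every eBPF program currently loaded in the kernel,
# with its type, attach point, and verified instruction count
sudo bpftool prog show

# List the BPF maps those programs use for state and configuration
sudo bpftool map show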
From a platform engineering perspective, eBPF enables three things that previously required either heavyweight agents or kernel modules:
- Network data plane — intercept and process packets at kernel level without iptables
- Observability — collect syscall, network, and process telemetry with minimal overhead
- Security enforcement — block syscalls, network connections, and process executions in kernel space
Why kube-proxy Is a Problem Worth Solving
The default Kubernetes network implementation uses kube-proxy, which programs iptables rules on every node to implement Service routing. At small scale, this is fine. At large scale, it becomes a genuine operational problem.
iptables was not designed for high-frequency dynamic updates. Rules are evaluated linearly, so per-connection lookup cost grows with the total rule count, and every Service adds rules for each of its endpoints. With 1000 Services and 5 pods each, a connection can traverse 5000+ rules. kube-proxy also watches the API server and rewrites the full iptables ruleset on every change, which can cause connection drops under churn.
The numbers I've seen in production: clusters with 500+ Services start showing measurable latency from iptables rule evaluation. At 2000+ Services, kube-proxy churning the ruleset becomes a CPU and connection-drop problem.
eBPF-based networking eliminates this entirely. Instead of iptables rules, packet processing happens via BPF maps — O(1) hash table lookups in kernel space. Adding a Service doesn't add rules to evaluate; it adds an entry to a map.
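You can inspect those maps directly. Inside any Cilium agent pod, the bundled cilium CLI dumps the Service load-balancing table (the exact command and output format vary slightly by version):
# Dump the eBPF load-balancing map: each frontend -> backend entry
# here replaces what would otherwise be a chain of iptables rules
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list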
Cilium: Replacing kube-proxy and Extending NetworkPolicy
Cilium is a CNI plugin built on eBPF. It does three things: pod networking, Service load balancing (replacing kube-proxy), and network policy enforcement.
Installation
Installing Cilium with kube-proxy replacement on EKS:
helm repo add cilium https://helm.cilium.io/
helm repo update

helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=<API_SERVER_ENDPOINT> \
  --set k8sServicePort=443 \
  --set ipam.mode=eni \
  --set eni.enabled=true \
  --set egressMasqueradeInterfaces=eth0

The kubeProxyReplacement=strict flag tells Cilium it must fully replace kube-proxy; it will fail to start if it can't. Don't use partial in production; you'll end up with a split brain between iptables and BPF. (Note: Cilium 1.14+ deprecates strict and partial in favor of kubeProxyReplacement=true.)
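Once the install settles (and kube-proxy is out of the way; see the EKS note below), it's worth verifying the data path before moving workloads. Assuming the cilium CLI is installed on your workstation:
# Wait until the agent, operator, and cluster health checks all report OK
cilium status --wait

# Run the built-in end-to-end connectivity test suite
# (deploys test pods into the cluster; takes several minutes)
cilium connectivity test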
If you're on EKS, make sure to disable kube-proxy before installing Cilium:
kubectl -n kube-system patch daemonset kube-proxy \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-existing":"true"}}}}}'

L7 Network Policy
Standard Kubernetes NetworkPolicy operates at L3/L4 — IP addresses and ports. Cilium extends this to L7. You can write policies that allow HTTP GET to /api/v1/users but deny POST, or allow DNS queries only to specific domains.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-api-read-only
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: /api/v1/.*

This is something you simply cannot do with standard NetworkPolicy. The enforcement happens in the eBPF data plane; no sidecar proxy required.
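The DNS case mentioned above works the same way. Here's a sketch (the app label and domain are illustrative) that lets pods resolve and reach only *.github.com:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-github-only
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  egress:
  # Allow DNS lookups via kube-dns, but only for *.github.com
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*.github.com"
  # Allow HTTPS only to addresses resolved from those names
  - toFQDNs:
    - matchPattern: "*.github.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP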
Hubble: Network Observability Without the Overhead
One of the chronic problems with Kubernetes networking is visibility. When a connection fails, you often have no idea whether the packet was dropped at the CNI level, rejected by a NetworkPolicy, or never sent at all.
Hubble is Cilium's observability layer. It hooks into the same eBPF programs that handle packet forwarding and extracts flow data — source/destination, protocol, verdict (forwarded/dropped), and the policy rule that made the decision.
Enable it during Cilium installation:
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"

Then use the CLI:
# Install hubble CLI
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --fail --remote-name-all \
  "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
tar xzvf hubble-linux-amd64.tar.gz
sudo mv hubble /usr/local/bin

# Port-forward relay
kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &

# Watch flows
hubble observe --namespace production --verdict DROPPED
hubble observe --namespace production --pod backend-7d9f5b4c8-xk2mn

The --verdict DROPPED filter is immediately useful for debugging NetworkPolicy issues. Instead of adding temporary allow-all policies and hoping the problem goes away, you can see exactly which policy dropped the packet.
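A few other filters I reach for regularly (flag names may vary by Hubble version; check hubble observe --help on yours):
# Only L7 HTTP flows (requires HTTP visibility to be enabled)
hubble observe --namespace production --protocol http

# Last 50 flows for one app, as JSON for scripting
hubble observe --namespace production --label app=backend --last 50 -o json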
Hubble UI provides a service map — a real-time graph of which pods are talking to which, with flow rates and error rates. This is generated from actual observed traffic, not service discovery metadata, so it reflects what's actually happening.
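Reaching the UI is another port-forward (the cilium CLI's cilium hubble ui subcommand does the same thing):
kubectl port-forward -n kube-system svc/hubble-ui 12000:80
# then open http://localhost:12000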
Tetragon: Runtime Security Without Sidecars
Tetragon is the security enforcement piece (which complements your Kubernetes RBAC strategy). It uses eBPF to observe and optionally block syscalls, process executions, file access, and network connections at the kernel level.
The key difference from something like Falco (which I respect and have used) is enforcement. Falco detects and alerts. Tetragon can block — a BPF program attached to the syscall hook can return EPERM before the operation completes. No process, no network call, no file write ever happens.
Install Tetragon:
helm repo add cilium https://helm.cilium.io/
helm install tetragon cilium/tetragon \
  --namespace kube-system \
  --set tetragon.enableProcessCred=true \
  --set tetragon.enableProcessNs=true

TracingPolicy: Detecting Crypto Mining
Here's a real-world TracingPolicy that detects and blocks crypto mining by watching for network connections to common mining pool ports:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: block-crypto-mining
spec:
  kprobes:
  - call: tcp_connect
    syscall: false
    args:
    - index: 0
      type: sock
    selectors:
    - matchArgs:
      - index: 0
        operator: DPort
        values:
        # Common stratum mining-pool ports; tune this list for your environment
        - "3333"
        - "4444"
        - "14444"
      matchActions:
      - action: Signal
        argSig: 9

A more targeted policy, detecting shell execution inside containers (a common indicator of compromise):
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-shell-exec
spec:
  kprobes:
  - call: sys_execve
    syscall: true
    args:
    - index: 0
      type: string
    selectors:
    - matchArgs:
      - index: 0
        operator: Postfix
        values:
        - /sh
        - /bash
        - /dash
      matchNamespaces:
      - namespace: Pid
        operator: NotIn
        values:
        - "host_ns"
      matchActions:
      - action: Post

This emits a security event (without blocking) whenever a shell is exec'd inside a container. You can escalate to action: Signal with argSig: 9 (SIGKILL) once you're confident in your policy.
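To see what these policies actually emit, stream events from the Tetragon DaemonSet with the bundled tetra CLI:
# Compact, human-readable event stream from the Tetragon agent
kubectl exec -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact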
Overhead Reality Check
The concern I hear most often: "eBPF programs run in kernel space on every packet/syscall — won't that tank performance?"
In practice, this kind of in-kernel collection is cheaper than comparable userspace agents. Cilium's own benchmarks show 5-10% network throughput overhead vs raw networking; iptables-based setups can lose 20-30% at scale. For Tetragon specifically, the overhead is proportional to how selective your TracingPolicies are. Watching every execve on every container is expensive. Watching only specific binaries in specific namespaces is nearly free; the sketch below shows what that selectivity looks like.
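A matchBinaries selector, for example, filters in kernel space before anything reaches userspace. A minimal sketch (binary paths are illustrative; this matches exec calls made by processes running those binaries):
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: selective-exec-watch
spec:
  kprobes:
  - call: sys_execve
    syscall: true
    selectors:
    # In-kernel filter: only processes running these binaries
    # generate events; everything else is dropped before it
    # ever reaches userspace
    - matchBinaries:
      - operator: In
        values:
        - /bin/bash
        - /bin/sh
      matchActions:
      - action: Post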
Migrating from Flannel or Calico
I've done this migration twice — once from Flannel, once from Calico. The honest answer is: there is no in-place migration. You're replacing the CNI, which means replacing the overlay network all pods run on.
The approach I recommend:
Option 1: Blue-green node groups (preferred for EKS)
- Provision a new node group with Cilium-compatible configuration
- Cordon all old nodes: kubectl cordon <node>
- Drain pods to new nodes: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
- Evicted pods are rescheduled onto the new node group, where Cilium is running
- Terminate old nodes
Expect 20-40 minutes per node if your drains are slow. The upside: it's zero-downtime, provided your workloads have proper PodDisruptionBudgets, as shown below.
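If you don't already have PDBs in place, a minimal one looks like this (the name and labels are illustrative):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
  namespace: production
spec:
  # Keep at least 2 replicas running through any voluntary
  # disruption, including node drains during the migration
  minAvailable: 2
  selector:
    matchLabels:
      app: backend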
Option 2: Full cluster replacement
Provision a parallel cluster with Cilium, migrate workloads via GitOps, cut DNS over. More work, but a cleaner outcome and a useful exercise for your disaster recovery process.
What you can't do: install Cilium alongside Flannel and expect pods to communicate. Two CNIs fighting over the same network namespace is a bad time.
Should You Do This Now?
Here's my honest take: if you're running fewer than 200 Services and you don't have specific network security requirements, you probably don't need Cilium today. kube-proxy works fine at small scale, and adding Cilium introduces operational complexity.
The cases where I'd push for Cilium:
- You're hitting kube-proxy scaling problems (500+ Services, high churn)
- You need L7 network policy (HTTP method filtering, gRPC service filtering)
- You want runtime security enforcement without deploying sidecar agents to every pod
- You're building a multi-tenant platform where network isolation guarantees matter
Hubble is genuinely useful in its own right, but it isn't a standalone add-on: you need Cilium as the CNI to get it.
The technology is production-ready. Cilium is CNCF Graduated. Tetragon is CNCF Incubating. Major cloud providers and large-scale operators run this in production. The question isn't whether it works — it's whether the complexity overhead is justified for your current scale.
Frequently Asked Questions
Does eBPF require a specific Linux kernel version?
Yes. Most modern Cilium/Tetragon features require at least Linux kernel 4.19, with 5.10+ being the recommended baseline for advanced features like BPF-to-BPF calls and BTF (BPF Type Format) support. Most managed Kubernetes services (EKS, GKE, AKS) currently ship with eBPF-compatible kernels.
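A quick way to check a node directly (BTF-enabled kernels expose /sys/kernel/btf/vmlinux):
# Kernel version: 4.19 minimum, 5.10+ recommended
uname -r

# BTF availability: this file exists on kernels built with CONFIG_DEBUG_INFO_BTF
ls /sys/kernel/btf/vmlinux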
Can I use eBPF for observability without changing my CNI?
Yes! Tools like Pixie or Parca use eBPF to collect observability data (distributed tracing, profiling) without requiring you to replace your existing networking plugin (like Flannel or Calico). However, you won't get the networking performance benefits of Cilium.
Is eBPF safe to run in production?
eBPF is designed to be safer than traditional kernel modules because every program is verified before it's allowed to run. The verifier ensures the program cannot crash the kernel or access unauthorized memory. Major tech companies like Meta, Google, and Netflix have been running eBPF at massive scale for years.
How does Tetragon differ from Falco?
Falco primarily works by monitoring syscalls from userspace. Tetragon runs its enforcement logic directly in the kernel using eBPF. This allows Tetragon to not just detect, but actually block malicious activity (like an unauthorized shell execution) before it even starts.
Running Cilium in production and hitting edge cases with your network policies or Tetragon enforcement? Talk to us at Coding Protocols. We help platform teams implement eBPF-based networking and security that actually holds up under production load.


