Cloud Engineering
5 min read · February 10, 2026

The Silent Cost: How Docker Pulls Through NAT Gateway Can Bankrupt You

We found a quiet AWS environment racking up huge bills. The culprit? Repeated Docker image pulls through a NAT Gateway. Here is the case study and how we fixed it.

Ajeet Yadav
Platform & Cloud Engineer

A client once asked me: "We barely run anything in our AWS environment. Why is it so expensive?"

Even if you're already optimizing other parts of your cloud stack with guides like Choosing the Right DevOps Tools, this is one of those hidden expenses that slips through.

At first glance, it didn't make sense. Their dashboard showed low traffic, few active services, and no heavy compute workloads. It was a "ghost town" environment, yet the AWS bill told the story of a high-traffic enterprise application. After digging in, I found the real issue wasn't compute power, aggressive auto-scaling, or expensive provisioned database IOPS. It was something much quieter: repeated Docker image pulls.

Here is what was happening, why it cost so much, and how we fixed it.

The "Infinite" Loop

The environment was running ARM-based ECS tasks. One ECS service was configured with a rather large Docker image, about 1GB in size.

The problem started with the application itself. The container was failing its startup health checks.

  1. ECS attempts to start the task.
  2. The application fails to respond to the health check within the grace period.
  3. ECS kills the unhealthy task.
  4. ECS immediately provisions a new task to maintain the desired service count.
  5. Repeat.

This retry loop is standard behavior for orchestrators like ECS or Kubernetes. It's usually annoying because your service is down, but it's rarely expensive in terms of infrastructure costs... unless your network plumbing is wrong.
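
You can confirm a loop like this from the ECS API itself. Here's a minimal boto3 sketch (the cluster and service names are hypothetical placeholders) that prints the stop reason for recently stopped tasks:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical names for illustration; substitute your own.
CLUSTER = "dev-cluster"
SERVICE = "ghost-town-service"

# A crash loop shows up as a long stream of short-lived STOPPED tasks.
stopped = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE, desiredStatus="STOPPED")

if stopped["taskArns"]:
    tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=stopped["taskArns"])
    for task in tasks["tasks"]:
        # stoppedReason carries messages like
        # "Task failed ELB health checks in (target-group ...)".
        print(task["taskArn"].split("/")[-1], "->", task.get("stoppedReason"))
```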

The Cost Driver: NAT Gateway

The real cost driver wasn't the compute hours (since the tasks were dying quickly). It was the Data Transfer.

These tasks were running in Private Subnets for security purposes. This is a best practice. However, the VPC configuration had a critical gap:

  • No VPC Endpoints for ECR (Elastic Container Registry).
  • No VPC Endpoints for S3 (where the image layers are actually stored).
  • No Direct Private Path to AWS services.
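One quick way to spot this gap is to list the endpoints attached to the VPC and check for the ECR and S3 services. A minimal sketch (the VPC ID is a hypothetical placeholder):

```python
import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"  # hypothetical placeholder

# List every endpoint attached to the VPC. If ecr.api, ecr.dkr, and s3
# are missing, image pulls from private subnets must use the NAT Gateway.
resp = ec2.describe_vpc_endpoints(Filters=[{"Name": "vpc-id", "Values": [VPC_ID]}])
found = {ep["ServiceName"] for ep in resp["VpcEndpoints"]}

for suffix in ("ecr.api", "ecr.dkr", "s3"):
    present = any(name.endswith(suffix) for name in found)
    print(f"{suffix}: {'present' if present else 'MISSING'}")
```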

Because the subnets were private, the only way for the ECS nodes to reach ECR to pull the Docker image was through the NAT Gateway.
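
You can see this traffic directly in CloudWatch. A minimal sketch (the NAT Gateway ID is a hypothetical placeholder) that sums the bytes the gateway pulled in from the internet, hour by hour, over the last day:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
NAT_GW_ID = "nat-0123456789abcdef0"  # hypothetical placeholder

# BytesInFromDestination = traffic coming back from the internet
# (image layers from ECR, in this case) through the NAT Gateway.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/NATGateway",
    MetricName="BytesInFromDestination",
    Dimensions=[{"Name": "NatGatewayId", "Value": NAT_GW_ID}],
    StartTime=end - timedelta(hours=24),
    EndTime=end,
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Sum'] / 1e9:6.1f} GB")
```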

Why This Matters

AWS charges for data processed by the NAT Gateway.

Every time the task restarted (which was happening constantly due to the health check failure), the underlying infrastructure had to pull that 1GB Docker image again.

In a standard EC2-based ECS cluster, you might benefit from Docker image caching on the host. But this client was using AWS Fargate.

In Fargate, each task runs in its own isolated environment. There is no shared node-level Docker image cache that survives between task restarts. Every restart meant a fresh 1GB download.

The Math of the Mistake

Let's look at the math that generated the bill:

  • Image Size: ~1 GB
  • Loop Frequency: The task was failing and restarting roughly every 2 minutes.
  • Downloads per hour: ~30.
  • Data Transfer per hour: 30 GB.
  • Data Transfer per day: 720 GB.

That is roughly 21.6 TB of data transfer per month for a service that had zero users and zero actual traffic.

All of this data was flowing through the NAT Gateway, which charges a per-GB data-processing fee on top of its hourly rate. Even where intra-region transfer itself is cheap or free, the NAT processing fees alone can be significant.
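
To put a rough price on it, assume the common $0.045/GB NAT data-processing rate (us-east-1 at the time of writing; your region may differ):

```python
# Back-of-the-envelope cost of the retry loop.
image_gb = 1.0
restarts_per_hour = 30        # one failed task roughly every 2 minutes
nat_rate_per_gb = 0.045       # assumed USD rate for NAT data processing

gb_per_month = image_gb * restarts_per_hour * 24 * 30   # 21,600 GB ~ 21.6 TB
print(f"NAT processing alone: ${gb_per_month * nat_rate_per_gb:,.0f}/month")
# -> NAT processing alone: $972/month (before hourly NAT Gateway charges)
```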

The bill was skyrocketing purely from internal retry loops.

The Fix

Once we identified the flow of traffic, the fix was a multi-step process involving both application and infrastructure changes.

1. Infrastructure: VPC Endpoints

We created VPC Interface Endpoints for ECR (com.amazonaws.region.ecr.dkr and com.amazonaws.region.ecr.api) and a Gateway Endpoint for S3.

This routes the traffic internally within the AWS network, bypassing the NAT Gateway entirely.

  • Result: Data transfer costs for pulling images dropped to near zero. S3 Gateway Endpoints are free, and Interface Endpoints carry a small hourly and per-GB charge that is trivial next to the NAT traffic volume they replace.
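
Here's a minimal boto3 sketch of that setup; the region, VPC, subnet, security group, and route table IDs are hypothetical placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

REGION = "us-east-1"                       # hypothetical placeholders throughout
VPC_ID = "vpc-0123456789abcdef0"
SUBNET_IDS = ["subnet-0123456789abcdef0"]
SG_IDS = ["sg-0123456789abcdef0"]
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]

# Interface endpoints for the two ECR APIs.
for service in (f"com.amazonaws.{REGION}.ecr.api",
                f"com.amazonaws.{REGION}.ecr.dkr"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=VPC_ID,
        ServiceName=service,
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=SG_IDS,
        PrivateDnsEnabled=True,  # lets the standard ECR hostnames resolve privately
    )

# Gateway endpoint for S3 (free), where the image layers actually live.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName=f"com.amazonaws.{REGION}.s3",
    RouteTableIds=ROUTE_TABLE_IDS,
)
```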

2. Application: Fix the Health Check

We analyzed the application logs and found it was timing out connecting to a database that was essentially "sleeping" in this dev environment.

  • We increased the health check grace period on the ECS service to give the application more time to initialize before failed checks counted against it.
  • We fixed the database connection string configuration.
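
As a sketch (cluster and service names hypothetical), the grace period is a single parameter on the service:

```python
import boto3

ecs = boto3.client("ecs")

# Give the container 5 minutes to warm up before failed health checks
# count against it (the default grace period is 0 seconds).
ecs.update_service(
    cluster="dev-cluster",
    service="ghost-town-service",
    healthCheckGracePeriodSeconds=300,
)
```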

3. Optimization: Shrink the Image

A 1GB image for a simple microservice is massive. We moved the base image from a heavy general-purpose Linux distro to Alpine or Distroless and implemented multi-stage builds.

  • Result: The image size dropped to ~150MB. Even if a loop occurred again, the blast radius would be 85% smaller.

The Takeaway

Blockbuster cloud bills aren’t always caused by blockbuster workloads.

Sometimes, they come from the quietest corners of your infrastructure:

  • Misconfigured health checks causing infinite loops.
  • Oversized Docker images amplifying bandwidth usage.
  • Missing VPC Endpoints forcing internal traffic through expensive NAT Gateways.

If your bill seems high but your CPU usage is low, look at your Data Transfer. Check your NAT Gateways. The plumbing of your cloud environment is just as important as the code running on top of it.
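
If Cost Explorer is enabled on the account, one way to surface these charges is to group last month's spend by usage type and look for the NAT Gateway lines. A minimal sketch:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer API

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2026-01-01", "End": "2026-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

# Usage types such as "...NatGateway-Bytes" are the data-processing fees.
for group in resp["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]
    if "NatGateway" in usage_type:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{usage_type}: ${cost:,.2f}")
```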


Frequently Asked Questions

Are S3 Gateway Endpoints free?

Yes. Unlike Interface Endpoints, AWS does not charge for S3 Gateway Endpoints. This is one of the easiest "Quick Wins" for any AWS environment: enable S3 Gateway Endpoints to route S3 traffic (including ECR image layers) through the private AWS network for zero cost.

Why not just use Public Subnets for Fargate?

While public subnets avoid the need for a NAT Gateway, they are generally discouraged for workloads that do not need to be accessible from the internet. Putting your application servers in a private subnet and using VPC Endpoints for external access is the industry-standard "secure by default" pattern.

How much can VPC Endpoints save me?

It depends on your traffic volume. In our case study, they reduced a multi-thousand dollar bill to less than $100/month. For most production clusters, the savings will far outweigh the $0.01/hr cost of the interface endpoints.

Do VPC Endpoints work across different AWS accounts?

By default, VPC Endpoints are local to your VPC. However, you can use AWS PrivateLink to share services across accounts. This is common in enterprise environments where the shared service (like an ECR registry or a core database) lives in a separate "infrastructure" account.


Is your AWS bill telling a story you can't read? Contact us at Coding Protocols. We specialize in finding the "silent killers" in your cloud costs.

Related Topics

AWS
Cost Optimization
Docker
ECS
NAT Gateway
FinOps
Troubleshooting
