
Why your AWS bill exploded overnight and how to actually fix it

A practical guide to debugging surprise cloud bills, tracing NAT gateway traffic with VPC Flow Logs, and setting up guardrails to prevent recurrence.

Alan West
Authon Team

The 3 AM Slack message every developer dreads

Last month I got pinged at 3 AM because our cloud bill had tripled in 24 hours. No new deployments. No traffic spike. Just a number that climbed while everyone slept.

If you've spent any time on a major cloud platform, you've probably been here. The dashboard shows green, the app runs fine, but somewhere a service is quietly burning money. After debugging this on three different projects in the last year, I've found the patterns are almost always the same.

Let me walk you through how I track these down and what I do to prevent them.

The root cause is almost never what you think

Here's the frustrating truth: surprise cloud bills are rarely from the obvious culprits. It's not your main compute instances. It's not your database. Those costs are predictable.

The real killers are usually one of these:

  • NAT gateway data transfer — every byte through a NAT costs money, and chatty services rack this up fast
  • Cross-AZ traffic — services in different availability zones talking to each other constantly
  • Unused load balancers and elastic IPs — they keep billing even when nothing uses them
  • Log ingestion — debug logging left on in production, multiplied by millions of requests
  • Snapshot retention — old EBS snapshots accumulating for years

The pattern I see most often? A misconfigured service inside a private subnet pulling gigabytes through a NAT gateway because someone forgot to set up a VPC endpoint.

Step 1: Find what changed

Before touching anything, figure out what's different. I always start with billing data grouped by service and usage type.

If you're using the AWS CLI, the Cost Explorer API is your friend:

aws ce get-cost-and-usage \
  --time-period Start=2026-05-01,End=2026-05-11 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE

The USAGE_TYPE grouping is the key. SERVICE will tell you EC2 is expensive — well, no kidding. But USAGE_TYPE will tell you it's specifically DataTransfer-Regional-Bytes or NatGateway-Bytes, which actually points you somewhere.
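
If you'd rather script it, here's a rough boto3 equivalent narrowed to a single usage type, which makes the day the cost jumped obvious. The usage type value below is a placeholder; real values carry a region prefix, so paste in whatever the grouped query surfaces:

import boto3

ce = boto3.client("ce")

# Daily cost for one usage type, so the day it jumped stands out.
# "USE1-NatGateway-Bytes" is a placeholder value.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2026-05-01", "End": "2026-05-11"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {"Key": "USAGE_TYPE", "Values": ["USE1-NatGateway-Bytes"]}
    },
)

for day in resp["ResultsByTime"]:
    cost = float(day["Total"]["UnblendedCost"]["Amount"])
    print(day["TimePeriod"]["Start"], f"${cost:.2f}")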

Once you know the usage type, you can dig deeper. For NAT gateway issues, VPC Flow Logs will show you exactly which instances are responsible.

Step 2: Trace the traffic

This is where most people get stuck. You know NAT traffic is high, but which service is causing it?

Enable VPC Flow Logs to CloudWatch or S3, then query them. Here's an Athena query I've used a dozen times:

SELECT
  srcaddr,
  dstaddr,
  SUM(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE day BETWEEN '2026/05/01' AND '2026/05/10'
  -- Accepted outbound web traffic; to isolate NAT traffic specifically,
  -- also filter on your NAT gateway's ENI (interface_id = 'eni-...')
  AND action = 'ACCEPT'
  AND dstport IN (443, 80)
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 50;

The top results almost always tell the story. Last week this query showed me one ECS task pulling 400GB from S3 through the NAT gateway every day. Through the NAT. To get to S3. In the same region.

That's the kind of thing that hides for months until someone audits it.
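
One caveat before moving on: none of this works if flow logs were never enabled. A minimal boto3 sketch for turning them on and shipping to S3 (the VPC ID and bucket ARN are placeholders, and defining the Athena table over that bucket is a separate step):

import boto3

ec2 = boto3.client("ec2")

# Capture all traffic for the VPC and deliver it to an existing S3 bucket.
# vpc-REPLACE_ME and the bucket ARN are placeholders.
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-REPLACE_ME"],
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket",
)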

Step 3: Fix the actual problem

For the S3-via-NAT issue, the fix is a gateway VPC endpoint. It's free, takes about two minutes to create, and stops the bleeding immediately:

# Terraform example for a gateway endpoint
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"

  # Attach to your private route tables so traffic to S3
  # bypasses the NAT gateway entirely
  route_table_ids = [
    aws_route_table.private_a.id,
    aws_route_table.private_b.id,
  ]

  vpc_endpoint_type = "Gateway"
}

For cross-AZ chatter, you have a few options depending on the workload:

  • Use topology-aware service discovery so clients prefer same-AZ targets
  • For Kafka or similar, configure rack awareness
  • For databases, run read replicas in each AZ and route reads locally

For log ingestion costs, audit your log levels. I once found a service logging the entire request body at INFO level. After dropping it to DEBUG and sampling 1% of requests, our log bill dropped 80%.
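
If you still need the occasional body dump, sampling beats deleting. Here's a generic Python sketch, not our production code; the request_body flag and the 1% rate are illustrative:

import logging
import random

class SampledBodyFilter(logging.Filter):
    """Let through roughly 1% of records flagged as request-body dumps."""

    def __init__(self, sample_rate: float = 0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        # Only sample records explicitly marked as body dumps;
        # everything else passes through untouched.
        if getattr(record, "request_body", False):
            return random.random() < self.sample_rate
        return True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("api")
logger.addFilter(SampledBodyFilter())

# The noisy call site opts in via `extra`, so the filter can sample it
payload = '{"example": "request body"}'
logger.debug("request payload: %s", payload, extra={"request_body": True})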

Step 4: Set up guardrails so it doesn't happen again

The fix is only half the job. Without monitoring, you'll be back here in six months for a different reason.

I set up budget alerts at multiple thresholds, but more importantly, I set up anomaly detection on usage types. A 50% increase in NAT gateway bytes overnight is the kind of signal you want a page for, not a monthly summary.
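
Budget alerts themselves are scriptable. A rough sketch with the boto3 Budgets client, with the dollar limit and the email address as placeholders:

import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Monthly cost budget that notifies when actual spend passes 80% of the limit.
# The $5,000 limit and the email address are placeholders.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "monthly-cost-guardrail",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "oncall@example.com"}
            ],
        }
    ],
)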

Here's a CloudWatch alarm pattern I use:

# Pseudo-code for the alarm logic
if current_hour_nat_bytes > (baseline_avg * 2):
    # Page on-call, not just email.
    # The bill is already being generated.
    trigger_pagerduty_alert()
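
A concrete version of that logic is a static-threshold CloudWatch alarm on the NAT gateway's own metrics. This is a sketch rather than our exact setup; the gateway ID, the 100 GB/hour threshold, and the SNS topic are placeholders you'd replace with values derived from your baseline:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when hourly bytes out of the NAT gateway cross a fixed threshold.
cloudwatch.put_metric_alarm(
    AlarmName="nat-gateway-bytes-spike",
    Namespace="AWS/NATGateway",
    MetricName="BytesOutToDestination",
    Dimensions=[{"Name": "NatGatewayId", "Value": "nat-REPLACE_ME"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=100 * 1024 ** 3,  # ~100 GB/hour, placeholder baseline
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # Placeholder SNS topic wired to PagerDuty
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
)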

I also run a weekly cron job that lists every load balancer, elastic IP, and EBS volume in the account, cross-references against what's actually in use, and posts a report to Slack. It takes about 50 lines of Python and has caught at least four forgotten resources in the last year.
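
A trimmed-down sketch of that audit, covering just the elastic IP and EBS checks (the load balancer pass and the Slack posting are left out):

import boto3

ec2 = boto3.client("ec2")

# Elastic IPs with no association still bill hourly
unused_eips = [
    a["PublicIp"]
    for a in ec2.describe_addresses()["Addresses"]
    if "AssociationId" not in a
]

# EBS volumes in the "available" state are attached to nothing
orphan_volumes = [
    v["VolumeId"]
    for v in ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
]

print("Unused EIPs:", unused_eips)
print("Unattached EBS volumes:", orphan_volumes)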

Prevention tips that actually work

A few things I've learned the expensive way:

  • Tag everything at creation time. If you can't answer "who owns this resource?" in 10 seconds, you can't manage costs. I enforce this with SCPs that block resource creation without specific tags.
  • Treat NAT gateways as expensive by default. Any service that talks to AWS APIs should go through a VPC endpoint. S3, DynamoDB, SQS, Secrets Manager — all of them have endpoints.
  • Set log retention explicitly. The default "never expire" is a silent budget killer. 30 days is fine for most things; if you need longer, archive to S3 with lifecycle rules to Glacier. There's a quick enforcement sketch after this list.
  • Review your reserved capacity quarterly. Workloads shift. Reservations you bought 18 months ago might not match current usage at all.
  • Run a cost game day. Once a quarter, pretend the bill doubled and trace where it could have come from. You'll find problems before they become real.
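
On the log retention point, here's a short boto3 sketch that walks every CloudWatch log group and caps retention where none is set. The 30-day value is just a default; adjust per workload:

import boto3

logs = boto3.client("logs")

# Log groups without retentionInDays are set to "never expire",
# which is the silent budget killer mentioned above.
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=30,
            )
            print(f"Set 30-day retention on {group['logGroupName']}")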

The bigger lesson

Cloud costs aren't really an infrastructure problem — they're an observability problem. You can't fix what you can't see, and most teams aren't watching the right signals.

The services that cost you money quietly are the ones designed to scale invisibly. That's a feature most of the time. But when it breaks, it breaks expensively. Build the visibility before you need it, and these 3 AM pages get a lot less common.
