We saved ₹50 lakhs per year on AWS - and it was all our own stupidity

We cut our AWS bill from ₹13L/month to ₹4L/month by rightsizing EC2, downgrading RDS, removing unused volumes, replacing ElastiCache with DynamoDB, and cleaning VPC/NAT usage. Total annual savings: ₹50L+.


It was 11:30 PM on a Tuesday when the CFO slacked the engineering channel:

"Why is our AWS bill ₹13 lakhs this month? We budgeted for ₹4 lakhs."

We were 18 months in. Series A money was burning faster than we expected. Engineers were shipping features, not checking bills - and, apparently, breaking the bank. Nobody owned costs. Honestly, nobody even knew where to look.

No one knew why the bill was this high.
No one had time to look.
Until we had no choice.

What We Were Running

Frontend → ALB → ECS (14× m5.4xlarge) → RDS
Async jobs → Redis cluster

Everything deployed across 3 AZs, whether it needed to be or not.

We were building for traffic we didn’t have:
provisioned for 100× our load,
using barely 10% of it,
and paying for every idle hour.

We assumed “scaling for growth” meant leaving everything at max capacity forever.

Opening Cost Explorer felt like opening a fridge you forgot for a year:
Everything rotten, stinking, and expensive.


Architecture Before Cleanup

Our setup looked great on paper but was extremely wasteful in practice:
spread across three AZs, oversized instances everywhere, and services running 24/7 for no reason.


EC2 - ₹8.19 lakh/month

14× m5.4xlarge
16 vCPU, 64GB RAM each

  • Average CPU utilization: 12%
  • Biggest sale-day peak: 28%

We were provisioning for Black Friday… every single day.
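The back-of-the-envelope math we eventually did, as a Python sketch. The fleet size and peak utilization are from our CloudWatch data; the 2× headroom factor is a judgment call, not a formula:

```python
import math

# How much compute do we actually need to absorb the observed peak,
# with a safety multiplier on top?
def required_vcpus(fleet_vcpus: int, peak_utilization: float, headroom: float = 2.0) -> int:
    """vCPUs needed to cover the worst observed peak, times a headroom factor."""
    return math.ceil(fleet_vcpus * peak_utilization * headroom)

fleet = 14 * 16                       # 14x m5.4xlarge, 16 vCPU each = 224 vCPUs
need = required_vcpus(fleet, 0.28)    # biggest sale-day peak: 28%
print(need)                           # 126 vCPUs covers the worst day twice over
```

Even with double headroom on our worst day ever, we needed roughly half the vCPUs we were paying for.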


RDS - ₹1.26 lakh/month

db.r5.4xlarge, Multi-AZ
16 vCPU, 128GB RAM

  • Actual DB size: 50GB
  • IOPS: <500

This could’ve run on a laptop.


Data Transfer - ₹55k/month

Microservices were chatting across AZs like long-lost lovers.

  • No VPC endpoints
  • S3 calls going over the public internet
  • Health checks costing real money

The Graveyard of Waste


Once we opened Cost Explorer and started digging, we discovered an entire graveyard of forgotten AWS resources:

  • 8 EBS volumes unattached to any instance - but happily charging us rent.
  • 3 ALBs routing traffic to absolutely no one.
  • Snapshots kept for 90 days when we only needed 7.
  • A fully-provisioned Redis cluster used like a key-value dictionary.
  • 900 CloudWatch custom metrics - half duplicates, all expensive.
  • “Temporary” testing resources from developers… that mysteriously lived forever.
  • No tagging, no lifecycle rules, no schedules, no guardrails.
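Finding the first item on that list is a one-liner once you know that an unattached EBS volume reports state "available" (attached ones are "in-use"). A sketch over boto3-shaped records; in real life the list comes from ec2.describe_volumes(), and the volume IDs here are made up:

```python
# An unattached EBS volume has State == "available" -- it stores nothing
# anyone reads, but still bills every month.
def unattached_volumes(volumes: list[dict]) -> list[str]:
    """Return IDs of volumes not attached to any instance."""
    return [v["VolumeId"] for v in volumes if v["State"] == "available"]

inventory = [
    {"VolumeId": "vol-0a1", "State": "in-use"},
    {"VolumeId": "vol-0b2", "State": "available"},   # orphan, charging rent
    {"VolumeId": "vol-0c3", "State": "available"},   # orphan, charging rent
]
print(unattached_volumes(inventory))   # ['vol-0b2', 'vol-0c3']
```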

And then the punchline: p99 EC2 utilization: 15%.

We weren’t running an AWS environment.
We were running an abandoned theme park - with the power still on.


We knew this wouldn’t be a one-day fix.
It took 3 weeks, lots of inventory work, testing, gradual migration…
and more chai than sleep.


Architecture After Optimization

After 3 weeks of cleanup and right-sizing, we ended up with a lean, scalable, 85% cheaper architecture.


Week 1: The Easy Wins

We removed the obvious waste:

  • 8 unattached EBS volumes → deleted
  • 3 unused ALBs → removed
  • Snapshot retention → cut from 90 days to 7
  • S3 logs → moved to Glacier after 30 days

Savings: ₹71k/month

These savings were lying on the floor.
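The Glacier move is just a lifecycle rule. A sketch of the S3 side, in the shape boto3's put_bucket_lifecycle_configuration expects; the "logs/" prefix, rule ID, and 365-day expiry are illustrative, not our exact values:

```python
# S3 lifecycle rule: transition logs to Glacier after 30 days, expire
# after a year. You'd pass this dict to:
#   s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle)
lifecycle = {
    "Rules": [
        {
            "ID": "logs-to-glacier",           # illustrative rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},     # illustrative prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},       # illustrative retention
        }
    ]
}
print(lifecycle["Rules"][0]["Transitions"][0])
```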


Week 2: The Scary Part - EC2 Right-Sizing

We didn’t trust auto-scaling yet.
So we analyzed 30 days of CloudWatch metrics and load-tested on t3.medium.

Auto Scaling Group with Mixed Instances:

  • Base capacity: t3.medium (On-Demand)
  • Burst: up to 8× Spot instances

We shifted traffic slowly, 10% at a time.
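In boto3 terms this is a MixedInstancesPolicy on the Auto Scaling group. A sketch of the shape; the launch template ID is a placeholder, and the base capacity, second instance type, and allocation strategy are our choices, not AWS defaults:

```python
# MixedInstancesPolicy for autoscaling.create_auto_scaling_group():
# a small On-Demand floor, everything above it on Spot.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-xxxxxxxx",   # placeholder
            "Version": "$Latest",
        },
        "Overrides": [
            {"InstanceType": "t3.medium"},
            {"InstanceType": "t3a.medium"},      # a second type widens the Spot pools
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 2,                 # always-on floor
        "OnDemandPercentageAboveBaseCapacity": 0,  # all burst capacity is Spot
        "SpotAllocationStrategy": "capacity-optimized",
    },
}
```

Capacity Rebalancing (which we leaned on later) is a separate flag on the group itself, not part of this policy.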

Before: 14× m5.4xlarge = ₹8.19L
After: ASG = ₹1.45L
Savings: ₹6.74L/month

What broke:

  • Spot interruptions → mitigated with Capacity Rebalancing
  • One microservice had an 8GB memory leak → pain at 9 PM

Cost: 2 late nights
Pain: Medium
Worth it: Absolutely


Week 2.5: RDS Downgrade

Snapshot → downgrade to db.t3.medium → test 3 days → smooth.

Bought a 1-year RDS Reserved Instance (21% discount) - Compute Savings Plans don't cover RDS.

Before: ₹1.26L
After: ₹24k
Savings: ₹1.02L/month

Issues:

  • Reboot took 8 mins (done at 2 AM)
  • Connection pools needed tuning

Week 3: Architecture Cleanup

NAT Gateway Consolidation

We had 3 NAT GWs “for HA.”
But our SLA was 99.5%, not 99.99%.

Before: ₹20.7k
After: ₹6.9k
Savings: ₹13.8k
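The HA math that justified dropping two gateways, as a quick sketch: how much downtime each availability target actually budgets for in a 30-day month.

```python
# Downtime budget per 30-day month for a given availability target.
def downtime_minutes(sla: float, month_minutes: float = 30 * 24 * 60) -> float:
    return round(month_minutes * (1 - sla), 2)

print(downtime_minutes(0.995))    # 216.0 min -> our SLA tolerates ~3.6 hours
print(downtime_minutes(0.9999))   # 4.32 min  -> what three NAT gateways were buying
```

We were paying triple for a downtime budget 50× tighter than anything we had promised anyone.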


Redis → DynamoDB Migration

Moved simple key-value caching into DynamoDB + DAX.

Before: Redis = ₹38k
After: DynamoDB = ₹9k
Savings: ₹29k

Pain points:

  • Rewriting GET/SET logic
  • DAX limitations
  • 5 days of staging tests
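The rewrite was mostly mechanical: Redis GET/SET becomes DynamoDB get_item/put_item, with an expire_at attribute for the table's TTL reaper, since DynamoDB has no per-read expiry. A runnable sketch; FakeTable stands in for a boto3 Table resource, and the key names are illustrative:

```python
import time

class CacheOverDynamo:
    """Redis-style get/set on top of a DynamoDB-shaped table."""
    def __init__(self, table, ttl_seconds: int = 300):
        self.table = table
        self.ttl = ttl_seconds

    def set(self, key: str, value: str) -> None:
        # put_item replaces SET; expire_at drives DynamoDB's TTL reaper
        self.table.put_item(Item={"pk": key, "value": value,
                                  "expire_at": int(time.time()) + self.ttl})

    def get(self, key: str):
        # get_item replaces GET; treat expired-but-unreaped items as misses
        item = self.table.get_item(Key={"pk": key}).get("Item")
        if item and item["expire_at"] > time.time():
            return item["value"]
        return None

class FakeTable:
    """In-memory stand-in so the sketch runs without AWS credentials."""
    def __init__(self):
        self.rows = {}
    def put_item(self, Item):
        self.rows[Item["pk"]] = Item
    def get_item(self, Key):
        pk = Key["pk"]
        return {"Item": self.rows[pk]} if pk in self.rows else {}

cache = CacheOverDynamo(FakeTable())
cache.set("session:42", "alice")
print(cache.get("session:42"))   # alice
print(cache.get("missing"))      # None
```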

VPC Endpoints for S3 & DynamoDB

Removed cross-AZ + internet traffic.

Before: ₹55k
After: ₹5k
Savings: ₹50k
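The why, in numbers: S3 traffic leaving through a NAT gateway pays a per-GB processing fee, while gateway VPC endpoints for S3 and DynamoDB are free. A sketch with the USD list price we assumed and an illustrative 10 TB/month of traffic; verify both against current pricing for your region:

```python
# Rough monthly cost of S3 traffic via NAT vs via a gateway endpoint.
NAT_PROCESSING_PER_GB = 0.045   # USD, assumed us-east-1 list price
ENDPOINT_PER_GB = 0.0           # S3/DynamoDB gateway endpoints are free

def monthly_transfer_cost(gb_per_month: float, per_gb: float) -> float:
    return round(gb_per_month * per_gb, 2)

via_nat = monthly_transfer_cost(10_000, NAT_PROCESSING_PER_GB)   # illustrative 10 TB
via_endpoint = monthly_transfer_cost(10_000, ENDPOINT_PER_GB)
print(via_nat, via_endpoint)   # 450.0 vs 0.0 USD
```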


CloudWatch Cleanup

900 custom metrics → deleted 450.

Before: ₹54k
After: ₹8k
Savings: ₹46k
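Most of our duplicates were the same metric published under two namespaces by old and new code paths. A sketch of how we flagged candidates; the metric records are shaped like boto3 list_metrics output, and the names here are made up:

```python
from collections import defaultdict

def duplicate_candidates(metrics: list[dict]) -> dict:
    """Group metrics by (name, dimensions); flag keys seen in multiple namespaces."""
    groups = defaultdict(set)
    for m in metrics:
        dims = tuple(sorted((d["Name"], d["Value"]) for d in m.get("Dimensions", [])))
        groups[(m["MetricName"], dims)].add(m["Namespace"])
    return {k: ns for k, ns in groups.items() if len(ns) > 1}

metrics = [
    {"Namespace": "App/Legacy", "MetricName": "OrderLatency",
     "Dimensions": [{"Name": "Service", "Value": "checkout"}]},
    {"Namespace": "App/V2", "MetricName": "OrderLatency",
     "Dimensions": [{"Name": "Service", "Value": "checkout"}]},
    {"Namespace": "App/V2", "MetricName": "OrderErrors", "Dimensions": []},
]
dupes = duplicate_candidates(metrics)
print(dupes)   # OrderLatency published from two namespaces
```

Every flagged pair still needed a human to confirm it was genuinely redundant before deletion.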


Guardrails We Added

To avoid repeating this mess:

  1. Budget alerts at 50%, 80%, 100% of ₹4L
  2. Cost Anomaly Detection → emails for 20% daily spikes
  3. Lambda scheduler → shut down non-prod at 7 PM
  4. Tagging policy enforced via SCPs
  5. Terraform only - no more ClickOps

Non-prod shutdown alone saved ₹18k/month.
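The scheduler's decision logic, factored out so it's testable without AWS. The 8 AM/7 PM window and weekend rule are our choices; in the real Lambda, an EventBridge cron fires this and the handler calls ec2 stop_instances/start_instances on anything tagged non-prod:

```python
from datetime import datetime, time

# Working hours for non-prod: our choices, tune to taste.
WORK_START = time(8, 0)    # 8 AM
WORK_END = time(19, 0)     # 7 PM

def desired_action(now: datetime) -> str:
    """'start' during weekday working hours, 'stop' otherwise."""
    if now.weekday() >= 5:   # Saturday/Sunday: stay down
        return "stop"
    return "start" if WORK_START <= now.time() < WORK_END else "stop"

print(desired_action(datetime(2024, 3, 5, 14, 0)))  # Tuesday 2 PM  -> start
print(desired_action(datetime(2024, 3, 5, 21, 0)))  # Tuesday 9 PM  -> stop
print(desired_action(datetime(2024, 3, 9, 12, 0)))  # Saturday noon -> stop
```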


Final Tally

Service              Before (₹/month)   After (₹/month)   Savings (₹/month)
EC2                  8,19,000           1,45,000          6,74,000
RDS                  1,26,000           24,000            1,02,000
NAT Gateway          20,700             6,900             13,800
Redis → DynamoDB     38,000             9,000             29,000
CloudWatch           54,000             8,000             46,000
ALB (deleted)        42,000             0                 42,000
EBS (deleted)        12,000             0                 12,000
S3 Lifecycle         9,800              2,100             7,700
Data Transfer        55,000             5,000             50,000
Snapshots            17,000             1,200             15,800
Monthly Total        12,93,000          2,03,000          10,90,000

Annual savings: ₹1.30 crore

Cost Comparison - Before vs After


The Lesson

If you don’t watch your AWS bill weekly, you’re burning money quietly.
Until someone finally looks - and then it’s too late.

Most startups don’t die from bad code. They die from burning cash on idle infrastructure.

Honestly? The lesson is embarrassing. We're supposed to be engineers. We knew about auto-scaling, tagging, budget alerts. We just... didn't do it. Because shipping features felt more important than watching bills. Until it wasn't.

And remember:

Three weeks, ₹50L saved, and zero downtime (well, one 8-minute RDS reboot at 2 AM). Would I do it again? Yes. Would I rather have never needed to? Also yes. Don't be us. Check your bill before your CFO does.

Found this useful?

If you liked this post, share it with your team or follow us for more engineering war stories. And if you have your own AWS cost-saving tips (or horror stories), drop them in the comments below.
