1. Introduction: When Redis Became the Bottleneck
Redis is well known for its ultra-low latency and high throughput, making it an ideal fit for use cases such as caching, session state management, real-time analytics, and pub/sub messaging. In our system, Redis played a critical role and was deployed on AWS EC2 for cost control and flexibility: an r6g.xlarge instance backed by a 200 GB gp3 EBS volume provisioned at 3,000 IOPS.
This setup worked well for many months, until a major traffic spike exposed a serious weakness in our deployment:
- Background persistence (BGSAVE) operations began overlapping and saturating disk I/O.
- The Redis process became slow to respond, causing application latency to increase.
- Eventually, we observed persistence errors (BGSAVE failures) and potential data loss.
In this case study, I'll walk through:
- How we diagnosed the issue using Grafana and Redis Exporter
- How we reproduced the problem using controlled load tests
- Why we migrated to Amazon ElastiCache (OSS Redis 7.1.0)
- How we used Redis-Shake and a dual-write strategy to minimize downtime
2. The Problem: Redis Struggled with Persistence and Latency
2.1 Initial Setup
- EC2 instance: r6g.xlarge (4 vCPUs, 32 GB RAM)
- Disk: 200 GB gp3 EBS (3000 baseline IOPS)
- Redis version: 7.1.0
- AOF: Disabled
- Eviction policy: No eviction
- Snapshot save policy:
- save 1800 1 (snapshot if at least 1 key changed in 30 minutes)
- save 900 2000 (at least 2,000 keys changed in 15 minutes)
- save 300 10000 (at least 10,000 keys changed in 5 minutes)
With millions of operations per minute, these thresholds were constantly exceeded, triggering frequent snapshots. Each snapshot (BGSAVE) forks the Redis process, which is CPU- and disk-intensive, especially with roughly 200 GB of memory in use.
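For context, here is a minimal redis-py sketch (the connection details are placeholders, not our production setup) that reads the active save thresholds back from a live instance:

```python
import redis

# Placeholder connection details for the EC2-hosted Redis.
r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

# CONFIG GET save returns space-separated "<seconds> <changes>" pairs,
# e.g. "1800 1 900 2000 300 10000" for the thresholds listed above.
pairs = r.config_get("save")["save"].split()
for seconds, changes in zip(pairs[0::2], pairs[1::2]):
    print(f"snapshot if at least {changes} keys changed within {seconds} s")
```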
2.2 Symptoms During Production Spike
- Memory usage exceeded 199 GB
- Redis was stuck in BGSAVE state, failing to complete saves
- Logs showed snapshot failures
Key metrics:
- used_memory: 214.5 GB
- used_memory_dataset_perc: 98.6%
- rdb_bgsave_in_progress: 1
- rdb_last_bgsave_status: err
- rdb_last_bgsave_time_sec: 706
This indicated that save operations were taking close to 12 minutes and often failing. At the same time, Redis continued serving live traffic, leaving the system unstable and vulnerable to data loss.
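These values all come from INFO persistence, which Redis Exporter scrapes for Grafana; below is a simplified sketch of the same check done directly with redis-py (host, polling interval, and thresholds are illustrative):

```python
import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)  # placeholder host

while True:
    p = r.info("persistence")
    if p["rdb_last_bgsave_status"] != "ok":
        print("ALERT: last BGSAVE failed")
    if p["rdb_bgsave_in_progress"] and p["rdb_last_bgsave_time_sec"] > 300:
        print(f"WARN: BGSAVE in progress; previous save took {p['rdb_last_bgsave_time_sec']} s")
    time.sleep(30)
```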
Key technical observation:
- Redis snapshotting (RDB) uses fork(), and when memory exceeds 100 GB, copy-on-write (COW) cost becomes significant.
- BGSAVE drives up memory pressure through copy-on-write pages and causes disk write spikes, and gp3 EBS volumes are not always fast enough to keep up.
(Reference: https://redis.io/docs/management/persistence/#snapshotting)
3. Reproducing the Issue via Load Tests
To validate our suspicions and quantify the impact, we cloned the EC2 Redis setup and simulated real-world traffic via controlled load tests.
Load Test Configuration:
- 200 Virtual Users (VUs)
- 100,000 to 150,000 requests per minute
- Duration: 5 minutes
- Mix: 80% SET, 20% GET operations
- Redis Exporter + Grafana to monitor live performance
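The exact load-testing harness matters less than the traffic shape; the sketch below approximates the 80/20 SET/GET mix with plain redis-py and threads standing in for virtual users (the key space and payload size are made up for illustration):

```python
import random
import threading
import redis

pool = redis.ConnectionPool(host="127.0.0.1", port=6379)  # placeholder target

def virtual_user(ops: int) -> None:
    r = redis.Redis(connection_pool=pool)
    for _ in range(ops):
        key = f"load:{random.randint(0, 1_000_000)}"
        if random.random() < 0.8:   # 80% SET
            r.set(key, "x" * 512)
        else:                       # 20% GET
            r.get(key)

# 200 "virtual users", each firing a fixed number of commands.
users = [threading.Thread(target=virtual_user, args=(5_000,)) for _ in range(200)]
for t in users:
    t.start()
for t in users:
    t.join()
```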
What we observed:
- The EC2 Redis instance began struggling once memory usage exceeded ~180 GB.
- Snapshot times increased significantly, often overlapping with ongoing writes.
- Ops/sec fell, latency spiked, and the cache hit ratio dropped.
4. Moving to Amazon ElastiCache: Performance Without the Headaches 🚀
After validating that our EC2-hosted Redis was failing under pressure, we evaluated three options:
- Tuning the EC2 Redis save policy
- Vertical scaling (bigger instance) and faster disk
- Switching to Amazon ElastiCache (Redis OSS mode)
We chose Option 3: ElastiCache, because it offered:
- A fully managed, highly available Redis environment
- Lower operational overhead (no OS/disk tuning needed)
- Tuned persistence behavior with better memory handling
- Seamless integration with CloudWatch and AWS IAM
4.1 ElastiCache Configuration
- Engine: Redis OSS 7.1.0
- Instance Type: cache.r6g.xlarge (same class as EC2)
- Replication: None (single node)
- Cluster Mode: Disabled
- AOF: Disabled (same as EC2 Redis)
- Eviction Policy: No eviction
- Parameter Group: Default
- Snapshotting: Enabled with optimized save policy
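For illustration, here is a boto3 sketch that would provision a node matching this configuration (the cluster ID and region are assumptions, and the sketch omits the subnet group and security groups you would need in practice):

```python
import boto3

elasticache = boto3.client("elasticache", region_name="ap-south-1")  # region assumed

# Single-node, non-clustered Redis OSS 7.1 on cache.r6g.xlarge.
elasticache.create_cache_cluster(
    CacheClusterId="storefront-redis",   # placeholder identifier
    Engine="redis",
    EngineVersion="7.1",
    CacheNodeType="cache.r6g.xlarge",
    NumCacheNodes=1,
    SnapshotRetentionLimit=1,            # keep automated snapshots around
)
```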
4.2 Updated Save Policy
To strike a better balance between data durability and performance, we changed the Redis save directive to:
- save 900 1
- save 300 10
- save 60 10000
This configuration reduces snapshot frequency while still ensuring regular persistence of changes.
4.3 ElastiCache Load Test Results
Running the same 5-minute load test (200 VUs, up to 150k requests/min) showed clear improvements:
- Commands/sec doubled to ~8,000
- RDB snapshot time dropped from ~160 seconds to ~20 seconds
- No errors or background save interruptions
- CPU and memory pressure remained well within limits
| Metric | EC2 Redis | ElastiCache |
| --- | --- | --- |
| Avg ops/sec | ~4,000 | ~8,000 |
| Cache hit ratio | 0.75 | 0.90 |
| Snapshot time | ~160 sec | ~20 sec |
| 99th percentile latency | >300 ms | <50 ms |
This confirmed that ElastiCache's optimizations (dedicated hardware, memory-tuned OS, faster disks) made a significant difference.
Key points:
- Redis BGSAVE is safe but becomes expensive with large datasets
- EC2 lacks the tuning and stability of a managed service
- ElastiCache provided 2x the throughput and much better tail latency
(Reference: https://redis.io/docs/reference/optimization/persistence/#latency-during-bgsave)
Redis on EC2 vs ElastiCache Redis performance comparison
Conclusion: The results suggest that ElastiCache is considerably better optimized for high-traffic scenarios than EC2 Redis:
- Performance: ElastiCache Redis outperforms EC2 Redis, handling 2x more commands per second (8K vs. 4K).
- Efficiency: ElastiCache shows a higher cache hit ratio (0.9 vs. 0.75), improving data retrieval speed and reducing database load.
- Optimization: AWS ElastiCache is optimized for Redis, benefiting from better performance, memory tuning, and managed optimizations.
- Operational Overhead: ElastiCache offers fully managed services with built-in failover and scaling, reducing manual effort compared to EC2.
Why ElastiCache Performed Better
- Memory overcommit handling is optimized in AWS Redis nodes
- Kernel-level tuning reduces latency during fork
- IO is better isolated from noisy neighbors (dedicated infra)
- AWS engineers fine-tune Redis process lifecycles under the hood
This confirmed that ElastiCache was the right target for migration.
5. Seamless Migration with Redis-Shake and Dual Write
A key goal for us was to avoid downtime and ensure rollback capability during migration. We adopted a two-pronged strategy:
- One-shot key migration using Redis-Shake
- Application-level dual-write until confident in ElastiCache
5.1 Redis-Shake: One-Time Data Migration
We used Redis-Shake, an open-source tool by Alibaba, to migrate existing keys from EC2 Redis to ElastiCache.
Redis-Shake Features:
- Supports full + incremental sync
- High performance (multi-threaded)
- Easy to set up for standalone Redis instances and clusters
Redis-Shake Setup:
```toml
[sync_reader]
type = "sync"
address = "127.0.0.1:6379"
cluster = false

[filter]
allow_db = [2, 5]  # Only migrating DBs 2 and 5

[redis_writer]
type = "standalone"
address = "storefront-redis.unt198.ng.0001.aps1.cache.amazonaws.com:6379"

[advanced]
log_level = "info"
log_file = "redis-shake.log"
```
Execution Flow:
```bash
git clone https://github.com/tair-opensource/RedisShake
cd RedisShake
sh build.sh
cd bin
./redis-shake config.toml
```
This copied all keys from EC2 to ElastiCache within a few minutes and without errors. Logs were monitored in real time in a screen session.
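Before trusting the copy, we spot-checked it; here is a minimal sketch of the kind of comparison (endpoints are placeholders, and only DBs 2 and 5 were migrated):

```python
import redis

SOURCE_HOST = "127.0.0.1"  # EC2 Redis (placeholder)
TARGET_HOST = "storefront-redis.unt198.ng.0001.aps1.cache.amazonaws.com"

# Only DBs 2 and 5 were migrated, so compare key counts in those.
for db in (2, 5):
    src = redis.Redis(host=SOURCE_HOST, port=6379, db=db)
    dst = redis.Redis(host=TARGET_HOST, port=6379, db=db)
    print(f"db{db}: source={src.dbsize()} keys, target={dst.dbsize()} keys")
```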
5.2 Dual-Write Logic in Application
To protect against data inconsistency or hidden bugs, we enabled dual-write in the app layer:
- All SET/DEL operations were sent to both Redis nodes (EC2 and ElastiCache)
- All GET operations were redirected to ElastiCache
- Write failures were logged, but did not block reads
- Redis health checks were added to catch silent failures
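In code, this boils down to a thin wrapper around two clients; the sketch below is a simplification, not our actual application code:

```python
import logging
import redis

log = logging.getLogger("cache")

class DualWriteCache:
    """Reads come from ElastiCache; writes go to both stores."""

    def __init__(self, primary: redis.Redis, legacy: redis.Redis):
        self.primary = primary   # ElastiCache
        self.legacy = legacy     # EC2 Redis, kept warm for rollback

    def set(self, key, value, ex=None):
        self.primary.set(key, value, ex=ex)
        try:
            self.legacy.set(key, value, ex=ex)
        except redis.RedisError:
            # A failed legacy write is logged but never blocks the request path.
            log.warning("legacy SET failed for %s", key)

    def delete(self, key):
        self.primary.delete(key)
        try:
            self.legacy.delete(key)
        except redis.RedisError:
            log.warning("legacy DEL failed for %s", key)

    def get(self, key):
        return self.primary.get(key)
```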
This gave us time to:
- Validate live data behavior
- Compare key count, memory use, and TTLs
- Ensure functional parity
We ran in dual-write mode for several days, monitoring metrics and verifying correctness through:
- Redis memory usage (info memory)
- Ops/sec on both Redis servers
- Alerts via Grafana/Prometheus for any anomalies
Dual-write monitoring dashboard showing Redis performance metrics
5.3 Final Cutover and Decommissioning
Once we were confident:
- Dual-write logic was removed from the app
- All writes and reads switched fully to ElastiCache
- EC2 Redis instance was gracefully stopped after backup
- DNS/pipeline references were cleaned up
Conclusion: Future-Proofing Our Redis Architecture
Our journey from a self-managed Redis instance on EC2 to a fully managed Amazon ElastiCache solution was both enlightening and rewarding. We learned firsthand how operational bottlenecks—especially around persistence and resource limits—can threaten the reliability of even the most robust caching layers.
By leveraging ElastiCache, we not only resolved our immediate performance and stability issues but also unlocked a platform that scales effortlessly with our needs. The migration process, powered by tools like Redis-Shake and a careful dual-write strategy, ensured a seamless transition with zero downtime and no data loss.
Key Takeaways:
- Proactive monitoring and load testing are essential for uncovering hidden bottlenecks.
- Managed services like ElastiCache offer significant operational and performance benefits for mission-critical workloads.
- A well-planned migration strategy minimizes risk and ensures business continuity.
As our systems continue to grow, we're confident that our new Redis architecture will keep pace—delivering the speed, reliability, and peace of mind our users expect. If you're facing similar challenges, don't hesitate to explore managed solutions and invest in robust migration practices. Your future self—and your customers—will thank you!