An engineer's diary of sleuthing, sleepless nights, and a very happy finance team.
It was just another Tuesday morning when the alerts whispered that one of our Java services had doubled its memory overnight. I shrugged it off: traffic was climbing for the upcoming sale, so maybe the surge was natural. Besides, my dashboard still showed a comfortable 4 GiB memory request. What could go wrong?
Screenshot 1: Prometheus graph of steady 2 GiB RSS before the incident
Over coffee, I remembered a neat JVM trick: "Let the heap size adjust automatically based on container resources." So I added:
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=75.0
-XX:InitialRAMPercentage=50.0
and redeployed, trusting the JVM to read my Kubernetes requests. Thirty minutes later, Grafana lit up like a Christmas tree. Memory at 4 GiB… 6 GiB… 48 GiB.
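In hindsight, a five-minute sanity check would have told me exactly what ceiling those flags produce. Something like the line below, run against the container, prints the MaxHeapSize the JVM actually settles on (the pod name is a placeholder, and the value comes back in bytes):

kubectl exec <pod-name> -- java -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 \
  -XX:+PrintFlagsFinal -version | grep MaxHeapSize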
"Relax," I told myself. "It's sale traffic."
Spoiler: it wasn't.
Screenshot 2: Memory climbs right after enabling the flags
The dev team pinged:
"We did merge an in-memory cache last week. Could that be it?"
Sounded plausible. To stay safe during peak hours, I cranked the pod limits to 48 GiB. Yes, forty-eight. It felt wrong, but business comes first.
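For the record, the emergency bump was nothing more exotic than this; only the memory limit changed, the rest of the spec stayed as it was:

resources:
  limits:
    memory: 48Gi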
Memory kept climbing. Even after the cache was ripped out, nothing changed. Somewhere, my wallet whimpered.
Screenshot 3: Node Count spike after node autoscaling
Time to break out the heavy tools. I attached JProfiler to a single pod late at night and watched the heap build mountains: 2 GiB… 25 GiB… boom—GC—back to 2 GiB. One giant cliff every minute. It was oddly hypnotic, like tides on fast-forward.
Screenshot 4: JProfiler "saw-tooth" heap graph
The garbage collector was leisurely—just one major sweep per minute. Plenty of time for the heap to swallow whole nodes.
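If attaching a profiler to production isn't your idea of a relaxing evening, unified GC logging paints the same saw-tooth. A flag along these lines (the log path is just an example) records every collection with timestamps and before/after heap sizes:

-Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags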
I reread the JVM docs, this time slowly. A single line felt like it was written in neon:
"Percentages are calculated from the container's memory limit, not its request."
I slapped my forehead. Of course!
75% of the 8 GiB limit = 6 GiB of heap, not the 3 GiB (75% of my 4 GiB request) I had expected.
After my panicked scale-up, 75% of 48 GiB = 36 GiB heap.
The JVM was only following orders. Bad orders.
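Don't take my word for it: a JDK 10+ JVM will tell you exactly which number it read from the cgroup. A line like this (pod name is a placeholder again) prints the memory limit the JVM detected before sizing its heap:

kubectl exec <pod-name> -- java -Xlog:os+container=info -version | grep -i 'memory limit'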
I rolled out a brave little canary pod:
resources:
  requests:
    cpu: 600m
    memory: 4Gi
  limits:
    cpu: 2000m
    memory: 6Gi
and tweaked the flags:
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=50.0
-XX:InitialRAMPercentage=40.0
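Stitched together, the canary's container spec looked roughly like this. The names and image are placeholders, and I'm passing the flags through JAVA_TOOL_OPTIONS purely for illustration; any mechanism that gets them onto the java command line works. With a 6 GiB limit, 50% gives the JVM a ceiling of about 3 GiB:

containers:
  - name: java-service                            # placeholder name
    image: registry.example.com/service:canary    # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS                   # the JVM picks this up automatically
        value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=50.0 -XX:InitialRAMPercentage=40.0"
    resources:
      requests:
        cpu: 600m
        memory: 4Gi
      limits:
        cpu: 2000m
        memory: 6Gi

One nice side effect of the percentage flags over a hard -Xmx: if the limit ever changes again, the heap ceiling follows it instead of silently drifting out of sync.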
Then I watched. An hour passed. Heap steady at ~3 GiB. Two hours. Still steady. Latency? Identical to its chunkier siblings on the old config. I left the office that night with cautious optimism.
Screenshot 5a: Latency with updated config
Screenshot 5b: Latency with updated config
Screenshot 5c: Latency with old config
Screenshot 5d: Latency with old config
Twenty-four hours later, still no hiccups. We flipped traffic: 10% → 50% → 100%. Idle cluster nodes were scaled away; the AWS bill dropped like a stone. The finance team bought donuts.
Screenshot 6: Node Count "cliff" after rollout
If this tale feels familiar, that's because every engineer has a version of it. May yours cost less.