The Challenge: Why Production Kubernetes Fails
You deploy your microservice to Kubernetes. It works great in the lab. Then production happens.
Nodes crash. Your pods die. A cluster upgrade causes cascading failures. You wake up to alerts at 3am. Your SLA is in the trash.
This isn't a Kubernetes problem. It's an architecture problem. Most teams don't know how to build for reliability in Kubernetes, and the same gaps show up again and again:
- No health checks: Kubernetes keeps routing to dead containers
- Resource starvation: Pods get killed unexpectedly during node pressure
- Single points of failure: Entire services go down during maintenance
- No disruption budgets: Node drains can evict all replicas simultaneously
Result: Uptime = 97%. SLA = broken. Customers angry. Your team exhausted.
Why 99.9% Uptime Is Hard in Kubernetes
Challenge 1: The Health Check Gap
Your pod can be running but not healthy: a network timeout, a deadlock, a memory leak slowly degrading performance. Without probes, Kubernetes only knows that the container process is still running.
Result: Traffic still routes to the zombie pod. Users get timeouts.
Challenge 2: Resource Contention
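Multiple pods share a node. One pod becomes resource-hungry and the node runs short of memory. The kubelet starts evicting pods to reclaim resources, and pods that use more than they requested (or requested nothing) are evicted first.
Without proper resource requests and limits, your critical pods can be among the casualties.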
Challenge 3: Maintenance Without Downtime
You need to upgrade a node, patch the OS, or update Kubernetes. How do you do this without stopping your service?
If you have 3 replicas and evict all 3 at once, your service is down.
Pattern 1: Comprehensive Health Checks
Health checks come in three flavors:
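- Liveness: Is the container alive? If not, the kubelet kills and restarts it.
- Readiness: Is the pod ready to accept traffic? If not, it is removed from the Service endpoints until it recovers.
- Startup: Is the application still initializing? Liveness and readiness checks are held off until the startup probe succeeds, so a slow start doesn't get the container killed.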
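The three probe types above might be configured as in this minimal sketch. The image name, port 8080, and the `/healthz` and `/ready` paths are hypothetical placeholders, and the timings are illustrative values, not recommendations for every workload.

```yaml
# Pod template fragment (goes under spec.template.spec of a Deployment).
# Image, port, and endpoint paths are hypothetical placeholders.
containers:
  - name: my-service
    image: registry.example.com/my-service:1.0.0
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30   # allow up to 30 x 10s = 300s for startup
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3    # restart after about 30s of consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3    # stop routing after about 15s of consecutive failures
```

A common design choice is to keep the liveness endpoint cheap and independent of downstream dependencies, and let the readiness endpoint reflect whether the pod can actually serve requests. Otherwise a database outage can trigger restarts that fix nothing.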
Pattern 2: Resource Management
Define requests and limits for every container. This prevents the cascading failures that kill reliability.
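What does this mean in practice? Take a container that requests 256Mi of memory and 100m of CPU, with limits of 512Mi and 500m: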
| Resource | Requests | Limits | Impact |
|---|---|---|---|
| Memory | 256Mi guaranteed | Max 512Mi | Container is OOM-killed and restarted if usage exceeds 512Mi |
| CPU | 100m guaranteed | Throttled at 500m | Container is throttled, not killed; performance degrades when it needs more than 500m |
Pro Tip: Set requests conservatively but realistically. Kubernetes uses requests for scheduling. Too high = wasted nodes. Too low = your pod lands on an overcommitted node, gets starved, and is among the first to be evicted under node pressure.
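The request and limit values from the table could be declared as in the sketch below. The container and image names are hypothetical; the numbers are the ones used in the table.

```yaml
# Pod template fragment (goes under spec.template.spec of a Deployment).
# Container and image names are hypothetical placeholders.
containers:
  - name: my-service
    image: registry.example.com/my-service:1.0.0
    resources:
      requests:
        memory: "256Mi"   # used by the scheduler to place the pod
        cpu: "100m"
      limits:
        memory: "512Mi"   # exceeding this gets the container OOM-killed
        cpu: "500m"       # exceeding this gets the container throttled
```

Because the requests are lower than the limits, this pod falls into the Burstable QoS class. Setting requests equal to limits would make it Guaranteed, which trades some bin-packing efficiency for more predictable behavior under node pressure.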
Pattern 3: Pod Disruption Budgets
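When pods are evicted voluntarily (a node drain for maintenance or a cluster upgrade), Kubernetes respects Pod Disruption Budgets (PDBs). PDBs do not protect against involuntary disruptions such as node crashes or evictions caused by resource pressure.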
Without PDBs: All your replicas can be evicted simultaneously. Your service goes down during cluster upgrades.
With PDBs: Kubernetes maintains a minimum level of availability during voluntary disruptions.
A PDB with minAvailable: 2 for my-service says: "When evicting pods, keep at least 2 healthy replicas of my-service running."
Result: With 3 replicas, only one pod can be evicted at a time. The next eviction is refused until a replacement pod is Ready.
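The budget described above could be written as the following manifest. The object name and the `app: my-service` label are assumptions; the selector must match whatever labels your pods actually carry.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb      # hypothetical name
spec:
  minAvailable: 2           # never drain below 2 healthy replicas
  selector:
    matchLabels:
      app: my-service       # assumed pod label
```

Be careful not to set minAvailable equal to the replica count. That leaves no room for any eviction, and node drains will block until someone intervenes.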
Pattern 4: Multi-Zone Deployment
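99.9% uptime allows only about 43 minutes of downtime per 30-day month (0.1% of 43,200 minutes). A single zone outage can use up that budget on its own, so a service confined to one zone can't reliably meet the target.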
Spread your replicas across availability zones:
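A podAntiAffinity rule keyed on the zone label (topology.kubernetes.io/zone) tells the scheduler: "Avoid putting two replicas of my-service in the same zone." Use the preferred form of the rule. A required rule would allow only one replica per zone and leave the remaining replicas Pending.
With 6 replicas spread across 3 zones (2 per zone), 4 replicas keep serving traffic even if one zone goes down. Size the replicas so that those 4 can carry peak load.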
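A Deployment that spreads 6 replicas across zones might look like the sketch below. The image and the `app: my-service` label are assumptions. A preferred anti-affinity rule alone does not guarantee an even 2-per-zone split, so this sketch also adds a topologySpreadConstraints entry, a separate scheduler feature that explicitly limits the imbalance between zones.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer zones that do not already run a replica.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    app: my-service
      topologySpreadConstraints:
        # Keep the per-zone replica counts within 1 of each other.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-service
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.0.0   # hypothetical image
```

With `whenUnsatisfiable: ScheduleAnyway` the scheduler still places pods when a zone is unavailable, which is usually what you want during a zone outage. `DoNotSchedule` enforces the spread strictly but can leave pods Pending.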
Pattern 5: Monitoring and Alerting
You can't maintain 99.9% uptime without visibility. Monitor these metrics:
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Pod Restarts | How many times pods crash and restart | Alert if > 5 in 5 minutes |
| Unready Pods | Pods failing readiness checks | Alert if any pod unready > 1 minute |
| Memory Usage | Memory in use as % of the container's memory limit | Alert if > 80% of limit |
| Request Latency | p99 response time | Alert if > SLA threshold |
| Request Errors | 5xx error rate | Alert if > 1% error rate |
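If you use Prometheus, the thresholds in the table could be expressed as alerting rules along these lines. This is a sketch under several assumptions: kube-state-metrics and cAdvisor metrics are being scraped, `http_requests_total` and `http_request_duration_seconds` are hypothetical application metrics with a `status` label, and the 0.5 second latency threshold is a placeholder for your own SLA value.

```yaml
# Prometheus rules file (a sketch, not a tuned production rule set).
groups:
  - name: my-service-availability
    rules:
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 5
        labels:
          severity: page
      - alert: PodUnready
        expr: kube_pod_status_ready{condition="false"} == 1
        for: 1m
        labels:
          severity: page
      - alert: MemoryNearLimit
        expr: |
          sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            /
          sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.8
        for: 5m
        labels:
          severity: warn
      # Hypothetical application histogram; 0.5s is a placeholder threshold.
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: page
      # Hypothetical application counter with a "status" label.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
            > 0.01
        for: 5m
        labels:
          severity: page
```

The `for:` clauses keep short blips from paging anyone. The restart rule has none because the 5-minute window is already built into the expression.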
Putting It Together: The 99.9% Architecture
✓ Liveness + Readiness + Startup probes configured
✓ Resource requests and limits defined
✓ Pod Disruption Budgets in place
✓ Replicas spread across zones with anti-affinity
✓ Monitoring alerts configured
✓ Graceful shutdown (SIGTERM handled within terminationGracePeriodSeconds: 30)
✓ Auto-scaling configured for traffic spikes
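The last two checklist items have no example elsewhere in this article, so here is a minimal sketch of each. The preStop sleep, the replica bounds, and the 70% CPU target are illustrative assumptions, and the `sleep` command requires that binary to exist in the container image.

```yaml
# Pod template fragment (goes under spec.template.spec of a Deployment).
terminationGracePeriodSeconds: 30
containers:
  - name: my-service
    image: registry.example.com/my-service:1.0.0   # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Brief pause so endpoint removal can propagate before SIGTERM.
          command: ["sleep", "10"]
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 6            # matches the multi-zone baseline
  maxReplicas: 12           # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
```

CPU utilization here is measured against the container's CPU request, so the autoscaler only behaves sensibly when requests are set realistically.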
With these patterns in place and probe intervals tuned to match, you can expect:
- Hung or crashed containers detected and restarted within about 30 seconds
- Unhealthy pods removed from Service endpoints within about 15 seconds
- No downtime during planned cluster upgrades and maintenance
- Automatic recovery from zone failures
- 99.9% uptime SLA delivered consistently
Key Takeaways
✓ Health checks detect issues in seconds
✓ Resource management prevents cascading failures
✓ Pod disruption budgets enable safe maintenance
✓ Multi-zone deployment survives zone failures
✓ Continuous monitoring catches problems early