Kubernetes Production Readiness Checklist
A comprehensive pre-flight checklist for running Kubernetes in production — covering resource management, security hardening, observability, deployment strategies, and disaster recovery.
By VVVHQ Team
Before You Go Live on Kubernetes
Kubernetes is powerful, but deploying it in production without proper preparation is a recipe for 3 AM pages and weekend firefighting. This checklist covers the critical areas teams often overlook when moving from dev/staging to production Kubernetes.
Use this as a pre-flight checklist. Every item here has caused real production incidents at organizations that skipped it.
Resource Management
Requests and Limits
- [ ] Every container has CPU and memory requests set — without requests, the scheduler can't make intelligent placement decisions
- [ ] Memory limits are set and tested — containers without memory limits can OOM-kill neighboring pods
- [ ] CPU limits are set thoughtfully — overly aggressive CPU limits cause throttling; consider removing CPU limits if you use requests correctly
- [ ] Resource requests are based on actual usage data, not guesses — run load tests and analyze metrics before setting production values
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    # CPU limit optional - evaluate based on workload
```
Pro tip: Use Vertical Pod Autoscaler (VPA) in recommendation mode to right-size your requests based on actual usage.
Autoscaling
- [ ] Horizontal Pod Autoscaler (HPA) configured for stateless workloads — scale on CPU, memory, or custom metrics
- [ ] Minimum replicas set to 2+ for production services — never run a single replica of anything critical
- [ ] Cluster Autoscaler enabled — ensure the cluster can add nodes when pods can't be scheduled
- [ ] Pod Disruption Budgets (PDBs) defined — prevent voluntary disruptions from taking down too many replicas
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: web
```
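The HPA item above can be sketched as a minimal `autoscaling/v2` manifest that scales on CPU utilization. The `web` Deployment name, replica bounds, and 70% target are illustrative, not prescriptive; tune them against your own load-test data:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web           # illustrative target Deployment
  minReplicas: 2        # never fall below 2 replicas in production
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requested CPU
```

Note that CPU utilization here is measured against the container's CPU *request*, which is another reason requests must reflect real usage.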
Health Checks
- [ ] Liveness probes configured — detect and restart containers that are stuck (deadlocked, hung)
- [ ] Readiness probes configured — prevent traffic from reaching pods that aren't ready to serve
- [ ] Startup probes configured for slow-starting apps — prevent liveness probes from killing pods during initialization
- [ ] Probe timeouts and thresholds are tuned — default values are often too aggressive
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
```
Common mistake: Using the same endpoint for liveness and readiness. Liveness should check if the process is alive. Readiness should check if it can handle traffic (database connected, caches warm, etc.).
Networking and Traffic
- [ ] Network Policies applied — restrict pod-to-pod traffic to only what's needed (default-deny, then allow)
- [ ] Ingress controller configured with TLS — terminate TLS at the ingress, not in application pods
- [ ] Service mesh evaluated (Istio, Linkerd) — provides mTLS, observability, and traffic management without app changes
- [ ] DNS configuration tested — verify pods can resolve internal and external DNS reliably
- [ ] Graceful shutdown implemented — handle SIGTERM, drain connections, and respect `terminationGracePeriodSeconds`
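The default-deny pattern from the first item above might look like the following sketch (the `production` namespace name is illustrative). Once applied, all traffic to and from pods in the namespace is blocked until you add explicit allow policies — DNS egress to kube-dns is usually the first one you'll need:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production   # illustrative namespace
spec:
  podSelector: {}         # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```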
Security
- [ ] Containers run as non-root — set `runAsNonRoot: true` and `runAsUser` in the security context
- [ ] Read-only root filesystem where possible — `readOnlyRootFilesystem: true`
- [ ] No privileged containers — `privileged: false`, drop all capabilities, add back only what's needed
- [ ] Secrets stored in an external secret manager — use External Secrets Operator with Vault, AWS Secrets Manager, or GCP Secret Manager
- [ ] RBAC configured with least privilege — service accounts get the minimum required permissions
- [ ] Pod Security Standards enforced — use the `restricted` profile for production namespaces
- [ ] Image scanning in CI/CD — scan all images for CVEs before they reach the cluster
- [ ] Private container registry — don't pull images from public registries in production
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
```
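As a sketch of the least-privilege RBAC item, a namespaced Role that grants a service account read-only access to ConfigMaps and nothing else (the `web` service account and `production` namespace are hypothetical names):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: config-reader
  namespace: production
rules:
  - apiGroups: [""]            # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list"]     # read-only; no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: web                  # hypothetical workload service account
    namespace: production
roleRef:
  kind: Role
  name: config-reader
  apiGroup: rbac.authorization.k8s.io
```

Prefer namespaced Roles over ClusterRoles wherever possible, and avoid binding workloads to the `default` service account.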
Observability
- [ ] Metrics collection active — Prometheus or cloud-native monitoring (Datadog, New Relic)
- [ ] Logging aggregated — ship logs to a central system (Grafana Loki, ELK, CloudWatch)
- [ ] Distributed tracing enabled — OpenTelemetry for request tracing across services
- [ ] Dashboards created — RED metrics (Rate, Errors, Duration) for every service
- [ ] Alerts configured — PagerDuty/OpsGenie integration for critical conditions such as crash loops and elevated error rates
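As one example of a crash-loop alert, assuming the Prometheus Operator and kube-state-metrics are installed (both are assumptions, not requirements of this checklist), a PrometheusRule might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts
spec:
  groups:
    - name: workload
      rules:
        - alert: PodCrashLooping
          # restarts counted by kube-state-metrics; threshold is illustrative
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting repeatedly"
```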
Deployment Strategy
- [ ] Rolling update strategy configured — set `maxSurge` and `maxUnavailable` appropriately
- [ ] Rollback tested — verify `kubectl rollout undo` works for your deployments
- [ ] Image tags are immutable — never use `latest`; always use specific version tags or SHA digests
- [ ] `imagePullPolicy: IfNotPresent` for tagged images (not `Always` unless using mutable tags)
- [ ] Canary or blue-green deployment evaluated for critical services
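A conservative rolling-update fragment for a Deployment spec might look like this; the values are a starting point, not a prescription:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # extra pods allowed above the desired count during rollout
      maxUnavailable: 0   # never drop below desired capacity while updating
```

With `maxUnavailable: 0`, new pods must pass their readiness probes before old pods are terminated, which is why the readiness checks above matter for deployments too.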
Storage and State
- [ ] Persistent volumes use appropriate storage classes — SSD for databases, standard for logs
- [ ] Backup strategy for stateful workloads — Velero or provider-native snapshots
- [ ] StatefulSets used for stateful workloads — not Deployments with PVCs
- [ ] Volume expansion enabled — `allowVolumeExpansion: true` on storage classes
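A StorageClass with expansion enabled might look like this sketch. It assumes the AWS EBS CSI driver; swap the provisioner and parameters for your platform:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com       # assumption: AWS EBS CSI driver
parameters:
  type: gp3                        # SSD-backed volumes for databases
allowVolumeExpansion: true         # lets PVCs grow without recreation
volumeBindingMode: WaitForFirstConsumer
```

`WaitForFirstConsumer` delays volume provisioning until a pod is scheduled, so the volume lands in the same availability zone as the pod.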
Disaster Recovery
- [ ] etcd backups scheduled — automated, tested, stored off-cluster
- [ ] Cluster recreation documented and tested — can you rebuild from scratch in under 1 hour?
- [ ] GitOps for all manifests — every resource in the cluster is defined in git (ArgoCD, Flux)
- [ ] Multi-AZ node distribution — use topology spread constraints or pod anti-affinity
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
```
Cost Optimization
- [ ] Resource quotas per namespace — prevent any team from consuming unbounded resources
- [ ] Spot/preemptible nodes for non-critical workloads — 60-90% savings on node costs
- [ ] Kubecost or OpenCost deployed — track per-team, per-service costs
- [ ] Right-sized node pools — don't use `m5.2xlarge` for pods requesting 256Mi of memory
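The per-namespace quota item above can be sketched as a ResourceQuota; the namespace name and limits here are illustrative and should come from each team's actual capacity plan:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a       # illustrative team namespace
spec:
  hard:
    requests.cpu: "20"    # total CPU requests across all pods
    requests.memory: 40Gi
    limits.memory: 80Gi
    pods: "50"            # cap on pod count in the namespace
```

Once a quota is active, every pod in the namespace must declare requests and limits, which conveniently enforces the first checklist item on this page.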
The 80/20 Rule
If you can only do five things from this list, do these:
- Set resource requests and limits on every container
- Configure health checks (liveness + readiness probes)
- Run 2+ replicas with Pod Disruption Budgets
- Enable monitoring and alerting for crash loops and error rates
- Run containers as non-root with read-only filesystems
These five items prevent the vast majority of Kubernetes production incidents.
Need a production readiness review? VVVHQ's Kubernetes engineers audit clusters for security, reliability, and performance gaps. Schedule a free assessment.