Kubernetes Production Readiness Checklist

A comprehensive pre-flight checklist for running Kubernetes in production — covering resource management, security hardening, observability, deployment strategies, and disaster recovery.

By VVVHQ Team

Before You Go Live on Kubernetes

Kubernetes is powerful, but deploying it in production without proper preparation is a recipe for 3 AM pages and weekend firefighting. This checklist covers the critical areas teams often overlook when moving from dev/staging to production Kubernetes.

Use this as a pre-flight checklist. Every item here has caused real production incidents at organizations that skipped it.

Resource Management

Requests and Limits

  • [ ] Every container has CPU and memory requests set — without requests, the scheduler can't make intelligent placement decisions
  • [ ] Memory limits are set and tested — containers without memory limits can OOM-kill neighboring pods
  • [ ] CPU limits are set thoughtfully — overly aggressive CPU limits cause throttling; consider removing CPU limits if you use requests correctly
  • [ ] Resource requests are based on actual usage data, not guesses — run load tests and analyze metrics before setting production values
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    # CPU limit optional - evaluate based on workload

Pro tip: Use Vertical Pod Autoscaler (VPA) in recommendation mode to right-size your requests based on actual usage.
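
The VPA tip above can be sketched as a manifest. This is a minimal example, assuming the VPA operator is installed and a Deployment named `web` exists; `updateMode: "Off"` keeps it in recommendation-only mode so it never evicts pods:

```yaml
# Hypothetical VPA in recommendation-only mode.
# Recommendations show up in the object's status, not as pod changes.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # assumed Deployment name
  updatePolicy:
    updateMode: "Off"    # recommend only; never restart pods
```

Read the recommendations with `kubectl describe vpa web-vpa` and copy them into your resource requests.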

Autoscaling

  • [ ] Horizontal Pod Autoscaler (HPA) configured for stateless workloads — scale on CPU, memory, or custom metrics
  • [ ] Minimum replicas set to 2+ for production services — never run a single replica of anything critical
  • [ ] Cluster Autoscaler enabled — ensure the cluster can add nodes when pods can't be scheduled
  • [ ] Pod Disruption Budgets (PDBs) defined — prevent voluntary disruptions from taking down too many replicas
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: web
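
The HPA item above can be sketched like this, assuming a Deployment named `web` and the metrics server running; targets and replica bounds are illustrative, not prescriptive:

```yaml
# Hypothetical HPA scaling a Deployment on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # assumed Deployment name
  minReplicas: 2         # never drop to a single replica
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requested CPU
```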

Health Checks

  • [ ] Liveness probes configured — detect and restart containers that are stuck (deadlocked, hung)
  • [ ] Readiness probes configured — prevent traffic from reaching pods that aren't ready to serve
  • [ ] Startup probes configured for slow-starting apps — prevent liveness probes from killing pods during initialization
  • [ ] Probe timeouts and thresholds are tuned — default values are often too aggressive
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

Common mistake: Using the same endpoint for liveness and readiness. Liveness should check if the process is alive. Readiness should check if it can handle traffic (database connected, caches warm, etc.).
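
For the slow-starting apps mentioned above, a startup probe can be sketched as follows; the numbers are illustrative and should be tuned to your app's real startup time:

```yaml
# Hypothetical startup probe: liveness checks are suspended until this
# probe succeeds, so slow initialization doesn't trigger restarts.
startupProbe:
  httpGet:
    path: /healthz       # assumed health endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # allows up to 150s of startup before giving up
```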

Networking and Traffic

  • [ ] Network Policies applied — restrict pod-to-pod traffic to only what's needed (default-deny, then allow)
  • [ ] Ingress controller configured with TLS — terminate TLS at the ingress, not in application pods
  • [ ] Service mesh evaluated (Istio, Linkerd) — provides mTLS, observability, and traffic management without app changes
  • [ ] DNS configuration tested — verify pods can resolve internal and external DNS reliably
  • [ ] Graceful shutdown implemented — handle SIGTERM, drain connections, and respect terminationGracePeriodSeconds
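
The default-deny pattern from the Network Policies item can be sketched as a single manifest applied per namespace; specific allow rules are then layered on top for each workload:

```yaml
# Hypothetical default-deny policy: the empty podSelector matches every
# pod in the namespace, and listing both policy types blocks all
# ingress and egress until explicit allow rules are added.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```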

Security

  • [ ] Containers run as non-root — set runAsNonRoot: true and runAsUser in security context
  • [ ] Read-only root filesystem where possible — readOnlyRootFilesystem: true
  • [ ] No privileged containers — privileged: false, drop all capabilities, add only what's needed
  • [ ] Secrets stored in external secret manager — use External Secrets Operator with Vault, AWS Secrets Manager, or GCP Secret Manager
  • [ ] RBAC configured with least privilege — service accounts get minimum required permissions
  • [ ] Pod Security Standards enforced — use restricted profile for production namespaces
  • [ ] Image scanning in CI/CD — scan all images for CVEs before they reach the cluster
  • [ ] Private container registry — don't pull images from public registries in production
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
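
The least-privilege RBAC item can be sketched as a namespaced Role and RoleBinding; the `prod` namespace, `web` service account, and ConfigMap-only access are assumptions for illustration:

```yaml
# Hypothetical least-privilege RBAC: the service account can only read
# ConfigMaps in its own namespace, nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: web-config-reader
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-config-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: web            # assumed service account name
    namespace: prod
roleRef:
  kind: Role
  name: web-config-reader
  apiGroup: rbac.authorization.k8s.io
```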

Observability

  • [ ] Metrics collection active — Prometheus or cloud-native monitoring (Datadog, New Relic)
  • [ ] Logging aggregated — ship logs to a central system (Grafana Loki, ELK, CloudWatch)
  • [ ] Distributed tracing enabled — OpenTelemetry for request tracing across services
  • [ ] Dashboards created — RED metrics (Rate, Errors, Duration) for every service
  • [ ] Alerts configured — PagerDuty/OpsGenie integration for critical conditions:
  • Pod crash loops
  • High error rates (>1% 5xx)
  • Resource saturation (>80% memory/CPU)
  • Certificate expiration (<14 days)
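
If you run the Prometheus Operator, the crash-loop alert can be sketched as a PrometheusRule; the restart threshold and windows are assumptions to tune, and the metric requires kube-state-metrics:

```yaml
# Hypothetical crash-loop alert using the kube-state-metrics
# restart counter.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crash-loop-alerts
spec:
  groups:
    - name: pods
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
```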

Deployment Strategy

  • [ ] Rolling update strategy configured — set maxSurge and maxUnavailable appropriately
  • [ ] Rollback tested — verify kubectl rollout undo works for your deployments
  • [ ] Image tags are immutable — never use latest; always use specific version tags or SHA digests
  • [ ] imagePullPolicy: IfNotPresent for tagged images (not Always unless using mutable tags)
  • [ ] Canary or blue-green deployment evaluated for critical services
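
The rolling update item can be sketched inside a Deployment spec; the values shown are one conservative choice, not the only valid one:

```yaml
# Hypothetical rolling update settings: surge extra pods during the
# rollout but never drop below the desired replica count.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # extra pods allowed above the desired count
    maxUnavailable: 0    # never reduce serving capacity mid-rollout
```

Verify the rollback path with `kubectl rollout undo deployment/web` (substituting your deployment name) before you need it in an incident.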

Storage and State

  • [ ] Persistent volumes use appropriate storage classes — SSD for databases, standard for logs
  • [ ] Backup strategy for stateful workloads — Velero or provider-native snapshots
  • [ ] StatefulSets used for stateful workloads — not Deployments with PVCs
  • [ ] Volume expansion enabled — allowVolumeExpansion: true on storage classes
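
The storage-class items above can be sketched together; the EBS CSI provisioner and `gp3` type are AWS-specific assumptions — substitute your provider's driver:

```yaml
# Hypothetical SSD storage class with expansion enabled and volumes
# retained after PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # provider-specific CSI driver (assumption)
parameters:
  type: gp3
allowVolumeExpansion: true     # lets you grow a PVC in place
reclaimPolicy: Retain          # keep data if the PVC is deleted
```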

Disaster Recovery

  • [ ] etcd backups scheduled — automated, tested, stored off-cluster
  • [ ] Cluster recreation documented and tested — can you rebuild from scratch in under 1 hour?
  • [ ] GitOps for all manifests — every resource in the cluster is defined in git (ArgoCD, Flux)
  • [ ] Multi-AZ node distribution — use topology spread constraints or pod anti-affinity
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web

Cost Optimization

  • [ ] Resource quotas per namespace — prevent any team from consuming unbounded resources
  • [ ] Spot/preemptible nodes for non-critical workloads — 60-90% savings on node costs
  • [ ] Kubecost or OpenCost deployed — track per-team, per-service costs
  • [ ] Right-sized node pools — don't use m5.2xlarge for pods requesting 256Mi memory
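
The per-namespace quota item can be sketched like this; the namespace name and limits are illustrative placeholders to size against your teams' real usage:

```yaml
# Hypothetical namespace quota capping aggregate requests, limits,
# and pod count for one team.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a      # assumed team namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.memory: 80Gi
    pods: "100"
```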

The 80/20 Rule

If you can only do five things from this list, do these:

  1. Set resource requests and limits on every container
  2. Configure health checks (liveness + readiness probes)
  3. Run 2+ replicas with Pod Disruption Budgets
  4. Enable monitoring and alerting for crash loops and error rates
  5. Run containers as non-root with read-only filesystems

These five items prevent the vast majority of Kubernetes production incidents.

Need a production readiness review? VVVHQ's Kubernetes engineers audit clusters for security, reliability, and performance gaps. Schedule a free assessment.

Tags: kubernetes production, kubernetes checklist, k8s best practices, kubernetes security, kubernetes deployment