Kubernetes at Scale: Lessons from 50+ Clusters
Real-world lessons learned from managing large-scale Kubernetes fleets. Governance, security, and developer experience.

It Works on My Machine... But What About Prod?
Kubernetes (K8s) has indisputably won the container orchestration war. It is the operating system of the cloud. But while spinning up a minikube cluster on a laptop is easy, managing a fleet of 50+ production clusters across 3 regions is a different beast entirely.
At Sentrix, we manage Kubernetes infrastructure for large enterprises. Here are the hard-won lessons we've learned in the trenches of Day 2 operations.
Lesson 1: Governance is Not Optional
In a small cluster, you might trust everyone with cluster-admin. At scale, this is a recipe for disaster: a single mistaken command can affect every team's workloads.
- Policy as Code: We use OPA Gatekeeper or Kyverno to enforce rules programmatically.
- Rule: No container can run as root.
- Rule: All images must come from the internal trusted registry (no random Docker Hub images).
- Rule: All pods must have CPU/Memory limits defined.
If a developer tries to apply a manifest that violates these rules, the cluster rejects it with a helpful error message.
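To make the three rules concrete, here is a minimal Python sketch of the checks an admission policy performs. It is not Gatekeeper or Kyverno syntax, only an illustration of the logic those policies encode. The registry prefix, the function name, and the sample pods are assumptions made for the example, and the root check is deliberately simplified.

```python
# Hypothetical registry prefix; substitute your own internal registry.
TRUSTED_REGISTRY = "registry.internal.example.com/"


def find_violations(pod_spec: dict) -> list[str]:
    """Return human-readable violations for a pod spec given as a plain dict."""
    violations = []
    pod_ctx = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # Container-level securityContext settings override pod-level ones.
        ctx = {**pod_ctx, **container.get("securityContext", {})}
        # Simplification: accept runAsNonRoot=true or an explicit non-zero UID.
        non_root = ctx.get("runAsNonRoot") is True or ctx.get("runAsUser", 0) != 0
        if not non_root:
            violations.append(f"{name}: container may run as root")
        if not container.get("image", "").startswith(TRUSTED_REGISTRY):
            violations.append(f"{name}: image is not from {TRUSTED_REGISTRY}")
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: missing {resource} limit")
    return violations


# Sample pods (hypothetical): one that breaks every rule, one that passes.
bad_pod = {"containers": [{"name": "web", "image": "nginx:latest"}]}
good_pod = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [{
        "name": "web",
        "image": TRUSTED_REGISTRY + "web:1.4.2",
        "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
    }],
}

for message in find_violations(bad_pod):
    print("DENIED:", message)
```

Returning every violation at once, rather than stopping at the first, is what makes the rejection message helpful: the developer can fix the whole manifest in one pass.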
Lesson 2: GitOps is the Only Way
If you are running kubectl apply -f from your laptop in production, you are doing it wrong. The state of your cluster is now dependent on your laptop's filesystem.
We use ArgoCD for 100% of our deployments.
- Developer merges code to main.
- CI builds the image.
- CI updates the deployment.yaml in the Git repo.
- ArgoCD detects the change in Git and syncs the cluster.
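The "CI updates the manifest" step is often just a small script that rewrites the image tag and commits the result. The sketch below shows one way to do it in Python with a regular expression. The function name, registry, and manifest fragment are hypothetical, and it assumes plain repo:tag image references (no digests or quoted values).

```python
import re


def set_image_tag(manifest: str, image_repo: str, new_tag: str) -> str:
    """Rewrite the tag of `image_repo` in a manifest, leaving other lines untouched."""
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]+)?image:[ \t]*)" + re.escape(image_repo) + r":\S+[ \t]*$",
        re.MULTILINE,
    )
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{image_repo}:{new_tag}", manifest
    )
    if count == 0:
        # Fail loudly so CI never commits an unchanged manifest by accident.
        raise ValueError(f"no image line found for {image_repo}")
    return updated


# Trimmed, hypothetical fragment of a deployment.yaml (not a complete manifest).
manifest = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  template:
    spec:
      containers:
        - name: web
          image: registry.internal.example.com/web:1.4.2
"""

print(set_image_tag(manifest, "registry.internal.example.com/web", "1.4.3"))
```

Because the only thing CI touches is a file in Git, the cluster change and the commit that caused it are always the same object.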
This provides an audit trail for every change. "Who changed the replica count?" Check the Git log. It also makes rollback fast: git revert the offending commit, and ArgoCD syncs the cluster back to the previous state.
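Conceptually, a GitOps controller keeps comparing the desired state in Git with the live state in the cluster. The toy model below is not ArgoCD's implementation, and the data structures are invented for illustration, but it shows why a revert commit is enough to trigger a rollback.

```python
def plan_sync(desired: dict, live: dict) -> list[str]:
    """Compare desired state (from Git) with live state and list the needed actions."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(f"create {name}")
        elif live[name] != spec:
            actions.append(f"update {name}")
    for name in sorted(live):
        if name not in desired:
            # A real controller may require an explicit opt-in before deleting.
            actions.append(f"prune {name}")
    return actions


# Each entry stands in for the manifests at one Git commit (hypothetical data).
history = [
    {"web": {"replicas": 3}},
    {"web": {"replicas": 30}},  # the accidental change
]
live = dict(history[-1])  # the cluster has already synced the bad commit

# `git revert` adds a new commit whose content matches the earlier state.
history.append(dict(history[0]))

print(plan_sync(history[-1], live))  # ['update web']
```

The rollback is an ordinary forward commit, so it shows up in the same audit trail as every other change.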
Lesson 3: The "Goldilocks" Problem of Resources
Developers notoriously struggle to set correct resource requests and limits. Set memory limits too low? OOMKilled (Out of Memory). Set CPU limits too low? The container gets throttled. Set requests too high? You are wasting money on reserved capacity that sits idle.
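A quick way to see which side of the problem a workload is on is to compare its settings with observed usage. The following sketch is one possible heuristic, not a standard formula: the 90th percentile, the 50% idle threshold, and the sample data are all assumptions chosen for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sizing_report(request_mib, limit_mib, usage_mib, idle_threshold=0.5):
    """Classify a memory request/limit pair against observed usage (in MiB)."""
    peak = max(usage_mib)
    typical = percentile(usage_mib, 90)
    # Share of the reserved memory that typical usage never touches.
    idle_fraction = max(0.0, 1 - typical / request_mib)
    if peak > limit_mib:
        verdict = "limit too low: OOMKilled risk"
    elif idle_fraction > idle_threshold:
        verdict = "request too high: paying for idle capacity"
    else:
        verdict = "ok"
    return {"peak": peak, "typical": typical,
            "idle_fraction": round(idle_fraction, 2), "verdict": verdict}


# Hypothetical memory usage samples for one container, in MiB.
usage = [180, 200, 210, 220, 230, 240, 250, 260, 300, 410]

for request, limit in [(256, 384), (2048, 4096), (384, 512)]:
    print(request, limit, sizing_report(request, limit, usage)["verdict"])
```

With these samples, the first pair is too tight, the second reserves far more than the workload uses, and the third lands in between.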
- VPA (Vertical Pod Autoscaler): We run VPA in "recommendation mode" (updateMode: "Off") to analyze usage over time and suggest optimal requests without evicting or modifying running pods.
- Karpenter: We replaced the standard Cluster Autoscaler with Karpenter. It provisions "just-in-time" nodes sized to fit the pending pods, rather than using rigid node groups. This reduced our compute bills by ~20%.
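The idea behind just-in-time provisioning can be sketched in a few lines: add up what the pending pods request, then pick the cheapest node shape that fits. This is a deliberately naive model and not Karpenter's actual algorithm, which also handles scheduling constraints, multiple nodes, and system overhead. The node names, sizes, and prices below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    cpu_millis: int
    memory_mib: int
    hourly_cost: float  # made-up numbers, not real cloud prices


# Hypothetical catalog of node shapes.
CATALOG = (
    NodeType("small", 2000, 4096, 0.05),
    NodeType("medium", 4000, 8192, 0.10),
    NodeType("large", 8000, 16384, 0.20),
)


def cheapest_fit(pending, catalog=CATALOG):
    """Pick the cheapest single node that fits all pending (cpu_millis, memory_mib) requests."""
    cpu = sum(p[0] for p in pending)
    mem = sum(p[1] for p in pending)
    candidates = [n for n in catalog if n.cpu_millis >= cpu and n.memory_mib >= mem]
    return min(candidates, key=lambda n: n.hourly_cost) if candidates else None


# Three pending pods that together need 1750m CPU and 3584 MiB of memory.
pending = [(500, 1024), (1000, 2048), (250, 512)]
choice = cheapest_fit(pending)
print(choice.name if choice else "no single node fits")
```

A fixed node group built only from the "large" shape would run the same three pods on a node four times the price in this toy catalog, which is where the savings come from.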
Lesson 4: Developer Experience (Internal Platform)
Kubernetes is complex. Ingress, ServiceAccount, PersistentVolumeClaim—developers shouldn't need to be K8s certified to ship a feature.
We advocate for building an Internal Developer Platform (IDP) using tools like Backstage. The developer clicks "Create New Service". The platform:
- Scaffolds a Repo.
- Sets up CI/CD pipelines.
- Provisions a namespace with default quotas.
- Grants the team access.
This is "Platform Engineering"—treating your infrastructure as a product, and your developers as the customers.
Summary
Scaling Kubernetes is less about technology and more about process. It is about building guardrails that make it easy to do the right thing and hard to break the system. Automation, Policy as Code, and GitOps are the keys to sleeping soundly at night while running thousands of pods.