Build the operating system for clusters
We build infrastructure for a world where clusters are personal, sovereign, and developer-controlled.
Open Roles
We're looking for a frontend engineer to own the interface through which the world will interact with distributed systems. This isn't just another dashboard — it's an entire OS. Who invented the mouse and cursor? Who thought of the "x" button to close a window? The personal cluster revolution will require frontend innovation on the same level! These sparks of inspiration reframe our relationship with the hardware from something scary and complex into something accessible and useful to everyone, not just a select few highly technical users.
- Design and build the ClusterdOS interface, including cluster provisioning flows, natural language interfaces, and real-time proactive system insights.
- Create visualization systems for distributed state that make complex infrastructure legible at a glance.
- Work directly with Kubernetes APIs, GitOps controllers, gRPC/cRPC backends, and distributed system primitives to surface the right information at the right time. (No prior experience required—just curiosity and a willingness to learn.)
- Help design elegant protocols in collaboration with backend and platform teams for use in the frontend codebase.
- Build real-time reactive UIs using Convex that work equally well for individual tinkerers and large teams shipping to production.
- Prototype new interaction patterns for infrastructure management, testing assumptions quickly and iterating based on feedback.
- Establish frontend architecture patterns and best practices as an early engineering team member.
- You write maintainable, well-structured frontend code.
- You have basic experience with or knowledge of Kubernetes and understand pods, deployments, services, and how distributed systems behave.
- You have a deep understanding of performance, accessibility, and frontend best practices.
- You are comfortable working with complex state management in systems that reflect real-world distributed infrastructure.
- You can translate technical complexity into clear user interfaces without oversimplifying.
- You have experience working with APIs and backend services, with bonus points if you’re comfortable contributing to backend logic when needed.
- You have strong communication skills and the ability to collaborate across product, design, and backend teams.
- You are self-directed with strong prioritization instincts and can identify what matters and execute accordingly.
- You are able to build an end-to-end Next.js application using React, Tailwind, and shadcn/ui.
- Background in systems programming or distributed systems.
- Experience working with Convex.
- Experience with PostHog or similar analytics tooling.
- Experience with observability tools such as Grafana or other monitoring and visualization platforms.
- Experience designing and implementing elegant, compute-efficient UI animations.
- A track record of shipping user-facing infrastructure or developer tools.
- Experience with GitOps workflows or infrastructure-as-code tooling.
We're looking for a Platform Engineer who will be instrumental in building and evolving ClusterdOS. You'll work directly with our founding team to design systems that abstract away Kubernetes complexity while preserving its power. You'll design GitOps workflows, build Kubernetes operators and controllers, and create the automation that makes cluster management invisible to end users.
- Build and extend ClusterdOS core features using Go, including custom Kubernetes operators and controllers
- Design and implement GitOps workflows with ArgoCD that make continuous deployment feel automatic
- Develop infrastructure-as-code patterns using Terraform and Helm that provision and manage clusters seamlessly
- Work on distributed storage solutions using Ceph and WEKA for high-performance, scalable cluster storage
- Create observability and monitoring systems using Prometheus and Grafana to surface cluster health and performance
- Build and optimize container networking with Cilium for network security and observability
- Design and implement federated Kubernetes architectures for multi-cluster management
- Build automation tooling that reduces operational overhead for developers running production workloads
- Contribute to open source components
- Establish platform architecture patterns and practices as an early engineering team member
- Go — Strong proficiency building system-level tooling, controllers, and distributed systems
- ArgoCD — Deep experience with GitOps workflows, declarative infrastructure, and continuous deployment patterns
- Kubernetes — Production experience with Kubernetes internals, custom resources (CRDs), operators, and cluster architecture, plus etcd clusters, including backup/restore procedures and troubleshooting cluster health issues
- Infrastructure as Code — Experience with Terraform, Helm, or similar tools for automating infrastructure provisioning. Experience with Ansible, Kubespray, or similar orchestration frameworks
- Observability — Familiarity with monitoring and logging systems like Prometheus, Grafana, or similar platforms
- Strong understanding of containerization, networking, and cloud-native architectures
- Experience with CI/CD pipelines and automation workflows
- Ability to design systems that are both powerful and simple to use
- Self-directed with good prioritization instincts — you can identify what matters and execute accordingly
- Experience building developer tools or infrastructure products
- Background with service mesh technologies (Istio, Linkerd) or CNI plugins
- Contributions to open source projects or Kubernetes ecosystem tools
- Familiarity with cloud platforms (AWS, GCP, Azure) and their managed Kubernetes offerings
- Experience with policy-as-code and security tooling for Kubernetes
- Track record shipping infrastructure products that developers love
- Understanding of FinOps and infrastructure cost optimization
- You stay calm under pressure and know when to escalate versus dig deeper yourself
- You communicate clearly with both engineers and non-technical stakeholders, especially when things break
- You're curious about root causes and genuinely interested in preventing problems before they happen
- You work with low ego
We're looking for a Senior Infrastructure/SRE Engineer to join our on-call team for enterprise clients. You'll be the technical expert our clients rely on when things go wrong with extremely valuable clusters—diagnosing complex infrastructure issues, resolving production incidents, and ensuring zero downtime for critical AI workloads. This role requires deep technical knowledge, excellent troubleshooting skills, and the ability to stay calm under pressure.
- Respond to and resolve production incidents across our clients' infrastructure, including Kubernetes clusters, Ceph storage systems, and bare metal servers
- Diagnose complex issues ranging from pod scheduling problems and CNI networking failures to distributed storage performance degradation and hardware issues
- Handle escalations requiring deep expertise in distributed systems, including etcd cluster problems, Ceph RGW authentication issues, and custom networking setups with Cilium and bare metal load balancers
- Work directly with enterprise clients during incidents, providing clear communication about status, timeline, and resolution steps
- Participate in a follow-the-sun on-call rotation with engineers across time zones
- Document incidents thoroughly and improve runbooks based on recurring patterns
- Collaborate with our infrastructure team on long-term reliability improvements and automation to reduce incident frequency
- 5+ years of production experience with Kubernetes in enterprise environments, including deep knowledge of cluster operations, troubleshooting bare metal k8s issues, working with admission controllers, and understanding the control plane architecture
- Strong experience with distributed storage systems. Ceph experience is highly preferred, but deep experience with comparable systems such as WEKA or VAST is also highly valuable
- Very solid modern Linux systems administration skills
- Experience with bare metal infrastructure management and construction, not just cloud environments. You should be comfortable with IPMI, hardware troubleshooting, networking, etc.
- Experience with at least one CNI plugin in production, preferably Cilium or Calico
- Strong troubleshooting methodology—you know how to systematically narrow down issues across complex distributed systems and can work effectively under pressure during outages
- Production experience with etcd clusters, including backup/restore procedures and troubleshooting cluster health issues
- Experience with GPU infrastructure for AI/ML workloads, including the NVIDIA GPU Operator for Kubernetes and the surrounding GPU stack
- Knowledge of infrastructure as code tools like Ansible, Kubespray, or similar orchestration frameworks
- Experience with high-availability load balancers, service mesh technologies, or IPAM systems
- Background working with AI inference or training companies and understanding their unique infrastructure requirements
- You're calm and methodical during incidents, with good judgment about when to escalate versus when to dig deeper yourself
- You communicate clearly with both technical and non-technical audiences, especially during high-stakes situations
- You have intellectual curiosity about root causes and want to prevent problems in advance
About Aranya
We're building ClusterdOS to make production-grade Kubernetes accessible to anyone. Our GitOps-native distributed OS removes the operational complexity so developers and teams can run serious infrastructure without a dedicated platform engineering team.
At Aranya, different perspectives aren't just welcome, they're essential. The best infrastructure comes from people with diverse experiences and ideas.