About clusterdOS

ClusterdOS is an open-source distributed operating system for GPU infrastructure. Spin up 1000+ node bare-metal GPU clusterswith minimal setup - no hyperscaler tax, no babysitting, just production-ready Kubernetes out of the box. The stack does what it needs to: Kubernetes at the core, ArgoCD for GitOps, Helm for deployments, Ceph for storage, and Cilium for networking.

Tailscale keeps clusters connected securely, and it plugs straightinto vLLM for inference. Full control from the hardware up -nothing between you and the metal.

K8s (Kubernetes)

The industry-standard system for running applications across multiple computers.

What it does in ClusterdOS

It’s the foundation ClusterdOS is builton — the “engine under the hood.”Like how an operating system manages files and programs on one computer, Kubernetes manages applications across many computers.

A zero-configuration VPN that securely connects all machines in a cluster.

What it does in ClusterdOS

hink of it as a private, encrypted tunnel that lets all servers talk to each other as if they were on the same local network, no matter where they physically are. Used for secure remote cluster access and mesh networking.

tool that lets you create and manage cloud resources (servers, databases, networks) using the same system you manage applications.

What it does in ClusterdOS

Instead of logging into AWS, GCP, or Azure consoles separately, Crossplane lets ClusterdOS control everything from one place. Think of it as a universal remote for cloud infrastructure.

A modern traffic router (reverse proxy and load balancer).

What it does in ClusterdOS

It automatically directs incoming web requests to the right application and handles HTTPS certificates. The “front door” of any web service running on ClusterdOS — it decides which requests go where.

The GitOps engine that automates infrastructure deployment.

What it does in ClusterdOS

It watches a Git repository for changes and automatically syncs the cluster to match. If someone changes an infrastructure file in Git, ArgoCD applies it to the cluster. If something drifts from the desired state, ArgoCD corrects it. The “autopilot” for infrastructure deployments.

A system that lets you run traditional virtual machines inside Kubernetes alongside containers.

What it does in ClusterdOS

For workloads that can’t be containerized (legacy applications, certain OS-level tasks), KubeVirt bridges the gap. Like running a Windows game on linux, but for servers.

An orchestrator that brings storage systems into Kubernetes.

What it does in ClusterdOS

It automates deploying, configuring, and managing storage. Think of Rook as the “installer and manager” for the storage system — it makes Ceph easy to run on Kubernetes.

A distributed storage system that turns a pool of hard drives across many machines into a single, reliable storage layer.

What it does in ClusterdOS

Like iCloud or Google Drive but self-hosted — files are automatically replicated across machines for safety. Rook deploys Ceph; together they provide block storage, file storage, and S3-compatible object storage.

A programming language created by Google,known for performance and simplicity.

What it does in ClusterdOS

Used to build ClusterdOS’s backend services and CLI tools. The language of choice for most cloud-native infrastructure software because it’s fast, efficient, and compiles to a single binary.

A programming language created by Google,known for performance and simplicity.

What it does in ClusterdOS

Used to build ClusterdOS’s backend services and CLI tools. The language of choice for most cloud-native infrastructure software because it’s fast, efficient, and compiles to a single binary.

A modern framework for building APIs that services use to communicate with each other.

What it does in ClusterdOS

Think of it as the standardized “language” ClusterdOS’s internal services use to talk to each other efficiently. Built on top of gRPC/Protocol Buffers for speed and type safety.

A platform for running machine learning workflows on Kubernetes.

What it does in ClusterdOS

It handles the full ML lifecycle: data preparation, model training, and serving predictions. ClusterdOS includes it to support AI/ML workloads natively — users can train and deploy models on their clusters without managing complex ML infrastructure.

Grafana’s Kubernetes monitoring stack bundles Grafana, Prometheus-compatible metrics, Loki (logs), Tempo (tracing), and Mimir (long-term metrics) into a single Helm deployment.

What it does in ClusterdOS

ClusterdOS deploys the LGTM stack as one unit. Grafana acts as the cluster’s dashboard while Alloy automatically collects metrics and logs and sends them to Grafana Cloud or a self-hosted backend.

A monitoring system that collects, stores, and queries time-series metrics from applications and infrastructure.

What it does in ClusterdOS

Collects real-time metrics from clusters and services to monitor system health. ClusterdOS uses Prometheus to track infrastructure performance, resource usage, and service reliability across the cluster.

A backend platform that provides reactive databases, server functions, and real-time data synchronization for modern web applications.

What it does in ClusterdOS

Used for application state, backend logic, and real-time data synchronization. Convex powers parts of the ClusterdOS interface that require reactive data updates and persistent backend functionality.

A developer platform that provides APIs and tools for adding enterprise features like authentication, SSO, and directory sync to applications.

What it does in ClusterdOS

Provides enterprise-grade authentication such as SSO, directory sync, and identity management, allowing organizations to securely integrate ClusterdOS with their existing identity providers.

clusterdOS Tech Stack Reference

ClusterdOS

Opinionated GUI

Spot Clusters

Cluster Autoscaling

k8s

k8s

k8s

Federated K8s

Unified Control
Plane

Managed Gitops

Hybrid Cloud

Compute recycling

k8s

k8s

Multi-Cluster

a

B

C

D

E

F

Independent
clusters

Kube-State-Metrics

An exporter that generates metrics about the state of Kubernetes objects (deployments, pods, nodes, etc.).

What it does in ClusterdOS

While Prometheus scrapes runtime metrics like CPU and memory, Kube-State-Metrics provides a complementary view: how many pods are running, are deployments healthy, are jobs completing? It’s the “inventory report” that feeds into Grafana dashboards.

A high-performance, vendor-neutral data pipeline for logs, metrics, and traces.

What it does in ClusterdOS

Vector runs as both an Agent (on every node, collecting logs/metrics at the source) and an Aggregator (centralizing and transforming data before sending it to storage). Think of it as the cluster’s “postal service” — it picks up observability data from every machine and delivers it where it needs to go.

A Kubernetes add-on that automates the creation, renewal, and management of TLS certificates.

What it does in ClusterdOS

It automatically obtains HTTPS certificates (e.g., from Let’s Encrypt) and renews them before they expire. Without cert-manager, you’d have to manually manage certificates for every service — it’s the “locksmith” that keeps all the cluster’s HTTPS locks up to date.

NVIDIA GPU Operator

An operator that automates the management of all NVIDIA software components needed to run GPU workloads on Kubernetes.

What it does in ClusterdOS

It installs and manages GPU drivers, container runtimes, device plugins, and monitoring tools automatically. Instead of manually configuring each node with GPU drivers, the GPU Operator handles it all — like a “plug and play” system for GPUs in the cluster.

Node Feature Discovery (NFD)

A Kubernetes add-on that detects hardware features and capabilities on each node (GPUs, CPU flags, storage types, etc.).

What it does in ClusterdOS

It automatically labels nodes with their hardware capabilities so the scheduler can place workloads on the right machines. For example, it ensures GPU workloads land on GPU-equipped nodes. The “hardware census taker” for the cluster.

A high-throughput, memory-efficient inference engine for large language models (LLMs).

What it does in ClusterdOS

It serves LLM models (like MiniMax-M2) with an OpenAI-compatible API, using advanced techniques like PagedAttention for efficient GPU memory usage. ClusterdOS deploys it as a ready-to-use inference service — users can run their own AI models on-cluster without managing the complexity of model serving.

A full-stack PostgreSQL distribution and operator for Kubernetes.

What it does in ClusterdOS

It manages the complete lifecycle of PostgreSQL databases: provisioning, backups, high availability, monitoring, and connection pooling. Instead of manually operating database servers, StackGres handles it — like a “managed database service” but running on your own cluster. Includes TimescaleDB for time-series data by default.

A CSI (Container Storage Interface) driver for Weka’s high-performance parallel file system.

What it does in ClusterdOS

For workloads that need extreme I/O performance (AI training, HPC, large datasets), WekaFS provides a POSIX-compliant file system with much higher throughput than traditional storage. It integrates with Kubernetes via CSI so pods can mount Weka volumes like any other storage. The “sports car” storage option for I/O-intensive workloads.

A lightweight, in-cluster component that provides resource usage metrics (CPU and memory) for nodes and pods.

What it does in ClusterdOS

It powers Kubernetes features like kubectl top, Horizontal Pod Autoscaling (HPA), and Vertical Pod Autoscaling (VPA). Without it, the cluster can’t automatically scale workloads based on real-time resource usage. The “speedometer” that autoscalers read from.