About ClusterdOS

ClusterdOS is an open-source distributed operating system for GPU infrastructure. Spin up 1000+ node bare-metal GPU clusterswith minimal setup - no hyperscaler tax, no babysitting, just production-ready Kubernetes out of the box. The stack does what it needs to: Kubernetes at the core, ArgoCD for GitOps, Helm for deployments, Ceph for storage, and Cilium for networking.

Tailscale keeps clusters connected securely, and it plugs straightinto vLLM for inference. Full control from the hardware up -nothing between you and the metal.

ClusterdOS Tech Stack Reference

ClusterdOS
Opinionated GUI
Spot Clusters
Cluster Autoscaling
k8s
k8s
k8s
Federated K8s
Unified Control
Plane
Managed Gitops
Hybrid Cloud
Compute recycling
k8s
k8s
Multi-Cluster
a
B
C
D
E
F
Independent
clusters
Kube-State-Metrics
13.
What it is

An exporter that generates metrics about the state of Kubernetes objects (deployments, pods, nodes, etc.).

What it does in ClusterdOS

While Prometheus scrapes runtime metrics like CPU and memory, Kube-State-Metrics provides a complementary view: how many pods are running, are deployments healthy, are jobs completing? It’s the “inventory report” that feeds into Grafana dashboards.

Vector
14.
What it is

A high-performance, vendor-neutral data pipeline for logs, metrics, and traces.

What it does in ClusterdOS

Vector runs as both an Agent (on every node, collecting logs/metrics at the source) and an Aggregator (centralizing and transforming data before sending it to storage). Think of it as the cluster’s “postal service” — it picks up observability data from every machine and delivers it where it needs to go.

Cert-Manager
15.
What it is

A Kubernetes add-on that automates the creation, renewal, and management of TLS certificates.

What it does in ClusterdOS

It automatically obtains HTTPS certificates (e.g., from Let’s Encrypt) and renews them before they expire. Without cert-manager, you’d have to manually manage certificates for every service — it’s the “locksmith” that keeps all the cluster’s HTTPS locks up to date.

NVIDIA GPU Operator
16.
What it is

An operator that automates the management of all NVIDIA software components needed to run GPU workloads on Kubernetes.

What it does in ClusterdOS

It installs and manages GPU drivers, container runtimes, device plugins, and monitoring tools automatically. Instead of manually configuring each node with GPU drivers, the GPU Operator handles it all — like a “plug and play” system for GPUs in the cluster.

Node Feature Discovery (NFD)
17.
What it is

A Kubernetes add-on that detects hardware features and capabilities on each node (GPUs, CPU flags, storage types, etc.).

What it does in ClusterdOS

It automatically labels nodes with their hardware capabilities so the scheduler can place workloads on the right machines. For example, it ensures GPU workloads land on GPU-equipped nodes. The “hardware census taker” for the cluster.

vLLM
18.
What it is

A high-throughput, memory-efficient inference engine for large language models (LLMs).

What it does in ClusterdOS

It serves LLM models (like MiniMax-M2) with an OpenAI-compatible API, using advanced techniques like PagedAttention for efficient GPU memory usage. ClusterdOS deploys it as a ready-to-use inference service — users can run their own AI models on-cluster without managing the complexity of model serving.

StackGres
19.
What it is

A full-stack PostgreSQL distribution and operator for Kubernetes.

What it does in ClusterdOS

It manages the complete lifecycle of PostgreSQL databases: provisioning, backups, high availability, monitoring, and connection pooling. Instead of manually operating database servers, StackGres handles it — like a “managed database service” but running on your own cluster. Includes TimescaleDB for time-series data by default.

WekaFS
20.
What it is

A CSI (Container Storage Interface) driver for Weka’s high-performance parallel file system.

What it does in ClusterdOS

For workloads that need extreme I/O performance (AI training, HPC, large datasets), WekaFS provides a POSIX-compliant file system with much higher throughput than traditional storage. It integrates with Kubernetes via CSI so pods can mount Weka volumes like any other storage. The “sports car” storage option for I/O-intensive workloads.

Metrics Server
21.
What it is

A lightweight, in-cluster component that provides resource usage metrics (CPU and memory) for nodes and pods.

What it does in ClusterdOS

It powers Kubernetes features like kubectl top, Horizontal Pod Autoscaling (HPA), and Vertical Pod Autoscaling (VPA). Without it, the cluster can’t automatically scale workloads based on real-time resource usage. The “speedometer” that autoscalers read from.