How to Run Multiple GPU KAI Schedulers in Kubernetes Using vCluster

In today's cloud-native landscape, GPU workloads are becoming increasingly critical. From training large language models to running inference APIs, organizations are investing heavily in GPU infrastructure. But with this investment comes a challenge: how do you safely test and deploy new GPU schedulers without risking your entire production environment?

Related talks: Watch my SREDay Paris Q4 2025 talk on this topic. Also presenting at Conf42 Kube Native 2025. Check the talks page for more details.

The GPU Scheduling Challenge

Let me paint a picture of what most teams face today. You're running a Kubernetes cluster with precious GPU resources. Multiple teams depend on these GPUs for everything from model training to real-time inference. Your current scheduler works, but you've heard about NVIDIA's KAI Scheduler and its promise of fractional GPU allocation and better resource utilization.

The problem? Testing a new scheduler in production is like performing surgery on yourself - one mistake and everything stops working.

Understanding GPU Workloads

Before we dive into the solution, let's understand what actually runs on GPUs in modern infrastructure:

| Workload Type | Real-World Examples | GPU Utilization |
|---|---|---|
| Model Training | Fine-tuning LLMs, Deep Learning | 100% for hours/days |
| Stable Diffusion | Image generation services | ~50% GPU |
| LLM Inference | ChatGPT-like APIs | 25-75% depending on model |
| Video Processing | Transcoding, streaming | Variable 20-80% |
| CUDA Development | Jupyter notebooks, testing | Often < 20% |
| Batch Processing | Scientific computing | Spikes to 100% |

Notice something? Most workloads don't use 100% of a GPU all the time. Yet traditional Kubernetes scheduling treats GPUs as indivisible resources. This is where KAI Scheduler shines - but how do you test it safely?
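
To make the gap concrete, here is a minimal Python sketch that compares whole-GPU allocation with fractional packing. The per-workload fractions are illustrative values loosely based on the table above, and the first-fit-decreasing packer is a simplification chosen for readability, not KAI Scheduler's actual algorithm.

```python
from typing import Dict, List

# Hypothetical per-workload GPU fractions, loosely based on the table above.
WORKLOADS: Dict[str, float] = {
    "model-training": 1.0,
    "stable-diffusion": 0.5,
    "llm-inference": 0.5,
    "video-processing": 0.5,
    "cuda-notebook": 0.2,
}


def whole_gpu_count(workloads: Dict[str, float]) -> int:
    """Indivisible GPUs: every workload occupies a full device."""
    return len(workloads)


def fractional_gpu_count(workloads: Dict[str, float]) -> int:
    """First-fit-decreasing packing of fractions onto GPUs of capacity 1.0."""
    free: List[float] = []  # remaining capacity of each GPU already in use
    for fraction in sorted(workloads.values(), reverse=True):
        for i, remaining in enumerate(free):
            if fraction <= remaining + 1e-9:  # tolerance for float rounding
                free[i] = remaining - fraction
                break
        else:
            # No GPU in use has room, so start a new one.
            free.append(1.0 - fraction)
    return len(free)


if __name__ == "__main__":
    print("whole-GPU allocation:", whole_gpu_count(WORKLOADS))
    print("fractional packing:  ", fractional_gpu_count(WORKLOADS))
```

With these assumed fractions, five workloads fit on three GPUs instead of five. Real packing also depends on GPU memory, which this sketch ignores.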

What is NVIDIA KAI Scheduler?

In April 2025, NVIDIA open-sourced their KAI (Kubernetes AI) Scheduler, bringing enterprise-grade GPU management to the community. It's an advanced Kubernetes scheduler designed specifically for GPU workload optimization.

Key capabilities:

  • Fractional GPU allocation: share a single GPU between workloads
  • Queue-based scheduling: hierarchical resource management
  • Topology awareness: optimize placement for the hardware layout
  • Fair sharing: prevent resource monopolization

Think of KAI as a traffic controller for your GPUs: it aims to keep utilization high while preventing workloads from colliding over the same device.
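
As a rough illustration of what a fractional request can look like, the Python sketch below builds a Pod manifest as a plain dictionary. The `gpu-fraction` annotation, the `kai.scheduler/queue` label, and the `kai-scheduler` scheduler name reflect my reading of the KAI Scheduler documentation; the queue label key in particular has changed between releases, so treat all three as assumptions to verify against the version you install. The pod name, queue name, and image are hypothetical.

```python
import json
from typing import Any, Dict


def fractional_gpu_pod(name: str, queue: str, fraction: float, image: str) -> Dict[str, Any]:
    """Build a Pod manifest that asks for a fraction of one GPU."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Assumed label key; older KAI releases may use a different one.
            "labels": {"kai.scheduler/queue": queue},
            # Assumed GPU-sharing annotation; annotation values must be strings.
            "annotations": {"gpu-fraction": str(fraction)},
        },
        "spec": {
            # Hand the pod to KAI instead of the default scheduler (assumed name).
            "schedulerName": "kai-scheduler",
            "containers": [{"name": "main", "image": image}],
        },
    }


if __name__ == "__main__":
    pod = fractional_gpu_pod("notebook", "test", 0.5, "example.com/cuda-notebook:latest")
    # kubectl accepts JSON, so the output can be piped into `kubectl apply -f -`.
    print(json.dumps(pod, indent=2))
```

Note that the fraction is expressed as metadata rather than as a resource limit, which is why a pod like this only makes sense when a scheduler that understands the annotation is in charge.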

The Production Scheduler Dilemma

Here's the reality of upgrading schedulers in production:

Current challenges:

  • Single scheduler controls entire cluster
  • Any changes affect all workloads
  • No isolation between teams
  • Rollback procedures take hours

The impact:

| Failure Mode | Impact | Recovery Time | Business Cost |
|---|---|---|---|
| Scheduler bug | All pods pending | 2-4 hours | High |
| CRD conflicts | Namespace corruption | 6+ hours | Critical |
| Version mismatch | Random pod failures | 1-2 days | Very High |
| Resource leak | GPU exhaustion | 4-8 hours | Critical |

According to New Relic's 2024 data, enterprise downtime can cost anywhere from $100k to more than $1M per hour. Can you afford to take that risk?
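
A quick back-of-the-envelope calculation shows what those recovery times mean in money. The sketch below multiplies the recovery ranges from the table by the quoted hourly range. It treats $1M as the upper hourly figure, "6+ hours" as exactly 6, and a day as 24 hours, all of which are simplifying assumptions.

```python
from typing import Dict, Tuple

# Quoted hourly downtime cost range in USD (upper bound capped at $1M).
HOURLY_COST_USD: Tuple[int, int] = (100_000, 1_000_000)

# Recovery time ranges in hours, taken from the failure-mode table.
RECOVERY_HOURS: Dict[str, Tuple[int, int]] = {
    "Scheduler bug": (2, 4),
    "CRD conflicts": (6, 6),       # "6+ hours": lower bound only
    "Version mismatch": (24, 48),  # 1-2 days
    "Resource leak": (4, 8),
}


def cost_range(hours: Tuple[int, int]) -> Tuple[int, int]:
    """Best case: shortest outage at the lowest rate. Worst case: the opposite."""
    return hours[0] * HOURLY_COST_USD[0], hours[1] * HOURLY_COST_USD[1]


if __name__ == "__main__":
    for failure, hours in RECOVERY_HOURS.items():
        low, high = cost_range(hours)
        print(f"{failure}: ${low:,} to ${high:,}")
```

Even the mildest row, a scheduler bug with a 2-4 hour recovery, lands between $200,000 and $4,000,000 under these assumptions.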

Solution: vCluster for Isolated Testing

vCluster creates a fully functional Kubernetes cluster inside a namespace of your existing cluster. It's not a new EKS cluster or GKE cluster - it's a virtual cluster running inside your current infrastructure.

The architecture consists of these components:

  • API Server: Handles all Kubernetes API calls independently
  • Syncer: Bi-directional resource synchronization with host
  • SQLite/etcd: Complete state isolation
  • Virtual Scheduler: Independent scheduling decisions

This architecture lets you run a Kubernetes cluster inside Kubernetes: each virtual cluster has its own isolated control plane, while the underlying nodes and GPUs remain shared with the host.

The Syncer: vCluster's Core Component

The syncer is the component that makes vCluster work seamlessly. It's responsible for:

  • Synchronizing resources between virtual and host cluster
  • Translating virtual resources to host resources
  • Managing resource lifecycle
  • Ensuring isolation boundaries

This means your GPU workloads scheduled by KAI inside the vCluster actually run on real GPU nodes in your host cluster, but all scheduling decisions are isolated.
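
One visible effect of this translation is naming: a pod created inside the virtual cluster shows up in the host namespace under a rewritten name. The Python sketch below approximates that mapping. The `<pod>-x-<namespace>-x-<vcluster>` pattern is my understanding of vCluster's default behavior, and the hash-based shortening for long names is an assumption; verify both against the vCluster version you run.

```python
import hashlib

MAX_NAME_LEN = 63  # common Kubernetes name length limit


def host_pod_name(pod: str, namespace: str, vcluster: str) -> str:
    """Approximate the host-side name of a pod synced from a virtual cluster."""
    full = f"{pod}-x-{namespace}-x-{vcluster}"
    if len(full) <= MAX_NAME_LEN:
        return full
    # Assumed shortening scheme: keep a 52-character prefix and append a
    # 10-character hash so long names stay unique within the length limit.
    digest = hashlib.sha256(full.encode()).hexdigest()[:10]
    return f"{full[:52]}-{digest}"


if __name__ == "__main__":
    # A pod named "trainer" in the virtual "default" namespace of vCluster "team-ml".
    print(host_pod_name("trainer", "default", "team-ml"))
```

This is handy when debugging: if a pod is pending inside the virtual cluster, searching the host namespace for the translated name tells you whether the syncer created it on the host at all.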

The Solution: Isolated Testing with vCluster

Here's how you can safely test KAI Scheduler without risking production:

The workflow:

  1. Create a vCluster with virtual scheduler enabled
  2. Install KAI Scheduler inside the vCluster
  3. Deploy test workloads with fractional GPU requests
  4. Observe behavior in complete isolation
  5. If something fails, delete the vCluster; teardown takes about 30 seconds
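
The steps above can be sketched as a dry-run command plan in Python. The script only prints commands; it does not execute them. It assumes that `vcluster create` switches your kube-context to the new virtual cluster (its default behavior as far as I know), so the `helm` and `kubectl` calls that follow target the vCluster. The chart reference, version string, file names, and cluster names in the usage example are placeholders, not values from a verified setup.

```python
import shlex
from typing import List


def workflow_commands(name: str, namespace: str, values_file: str,
                      kai_chart: str, kai_version: str,
                      workload_file: str) -> List[List[str]]:
    """Return the CLI calls for the workflow steps, in order."""
    return [
        # 1. Create the vCluster; the values file is expected to enable
        #    the virtual scheduler (its contents are not shown here).
        ["vcluster", "create", name, "--namespace", namespace, "--values", values_file],
        # 2. Install KAI Scheduler inside the vCluster.
        ["helm", "upgrade", "--install", "kai-scheduler", kai_chart,
         "--namespace", "kai-scheduler", "--create-namespace", "--version", kai_version],
        # 3. Deploy test workloads with fractional GPU requests.
        ["kubectl", "apply", "-f", workload_file],
        # 4. Observe where the pods land.
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "wide"],
        # 5. Switch back to the host context, then delete the vCluster.
        ["vcluster", "disconnect"],
        ["vcluster", "delete", name, "--namespace", namespace],
    ]


if __name__ == "__main__":
    plan = workflow_commands(
        name="kai-test",                      # hypothetical vCluster name
        namespace="team-ml",                  # hypothetical host namespace
        values_file="vcluster.yaml",          # hypothetical values file
        kai_chart="oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler",  # assumed chart location
        kai_version="v0.9.3",
        workload_file="gpu-workloads.yaml",   # hypothetical manifest
    )
    for command in plan:
        print(shlex.join(command))
```

Keeping the plan as data makes it easy to review before running anything, and the final two commands are the whole rollback procedure.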

Benefits achieved:

| Capability | Time Saved | Risk Reduced |
|---|---|---|
| Test scheduler upgrades | 4 hours → 5 min | 100% → 0% |
| Rollback bad changes | 2 hours → 30 sec | Critical → None |
| A/B test versions | Not possible → Easy | High → Zero |
| Per-team schedulers | Days → Minutes | Complex → Simple |
| GPU sharing validation | Weeks → Hours | High → None |

Supporting Multiple Teams

Consider this scenario: Your ML team wants to test KAI v0.9.3 for its new features, while your Research team requires the stable v0.7.11 version. With traditional approaches, teams must coordinate, wait, and compromise on a single version.

With vCluster, each team operates their own virtual cluster with their own KAI scheduler version, providing complete autonomy without interference.

Parallel scheduler deployments:

| Team | Scheduler Version | Purpose |
|---|---|---|
| team-ml | KAI v0.9.3 | Testing new features |
| team-research | KAI v0.7.11 | Stable version |
| team-dev | Default scheduler | Standard workloads |

Architecture benefits:

  • Virtual Scheduler: ENABLED in each vCluster
  • KAI Location: Inside each vCluster
  • Scheduling: Independent per team
  • Host Impact: NONE
  • Isolation: COMPLETE

Each team can iterate at their own pace, test different configurations, and only promote to production when they're confident.
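
The per-team layout can be captured as a small piece of data, which is useful once you start scripting vCluster creation. The Python sketch below models the three teams from this section. The `vcluster-<team>` namespace convention is hypothetical; the only real requirement it illustrates is that each team maps to its own host namespace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TeamCluster:
    team: str
    kai_version: Optional[str]  # None means: keep the default scheduler

    @property
    def namespace(self) -> str:
        # Hypothetical convention: one host namespace per team vCluster.
        return f"vcluster-{self.team}"


TEAMS: List[TeamCluster] = [
    TeamCluster("team-ml", "v0.9.3"),
    TeamCluster("team-research", "v0.7.11"),
    TeamCluster("team-dev", None),
]


def describe(cluster: TeamCluster) -> str:
    """One-line summary of a team's virtual cluster."""
    scheduler = f"KAI {cluster.kai_version}" if cluster.kai_version else "default scheduler"
    return f"{cluster.team}: namespace={cluster.namespace}, scheduler={scheduler}"


def validate(clusters: List[TeamCluster]) -> None:
    """Reject layouts where two teams would share a host namespace."""
    namespaces = [c.namespace for c in clusters]
    if len(namespaces) != len(set(namespaces)):
        raise ValueError("two teams would share a host namespace")


if __name__ == "__main__":
    validate(TEAMS)
    for cluster in TEAMS:
        print(describe(cluster))
```

Promoting a version to production then becomes a one-line change to this list rather than a cross-team negotiation.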

Real-World Impact

Based on typical enterprise deployment scenarios, here's what you can achieve:

Time savings:

  • Setup to first test: 5 minutes instead of 4+ hours
  • Version switching: 30 seconds instead of 2+ hours
  • Team onboarding: Minutes instead of days

Risk reduction:

  • Blast radius: Single namespace instead of entire cluster
  • Rollback complexity: Delete command instead of complex procedures
  • Testing freedom: Complete instead of severely limited

Getting Started

Want to try this approach? I've created a complete hands-on guide with all the technical details, configurations, and scripts you need:

The guide includes:

  • vCluster configuration with virtual scheduler
  • KAI Scheduler installation
  • Sample GPU workloads with fractional allocation
  • Multi-team setup examples
  • Troubleshooting tips

Closing Thoughts

Combining vCluster with NVIDIA KAI Scheduler changes how you can approach GPU workload management in Kubernetes. Instead of choosing between innovation and stability, you can have both.

vCluster provides the safety net that enables rapid experimentation. KAI Scheduler provides the advanced GPU management capabilities modern workloads demand. Together, they enable you to:

  • Test scheduler upgrades without fear
  • Give teams autonomy over their GPU scheduling
  • Maximize GPU utilization through fractional allocation
  • Reduce operational complexity and risk

The question isn't whether you should adopt this approach - it's what you'll build once you're no longer held back by fear of breaking production.

What GPU scheduling challenges are you facing? How could vCluster help your team move faster?