Your Kubernetes Cluster is Unbalanced (And the Scheduler Won't Fix It)

The Kubernetes scheduler is lazy. It places pods when they're created, picks the best node at that moment, and never thinks about it again. Weeks later, one node is at 75% memory while another sits at 40%. The scheduler doesn't care - its job was done the moment the pod started.
Descheduler fixes this. It runs every few minutes, finds imbalanced nodes, and evicts pods so the scheduler gets another chance to place them better. Set it up once and the cluster rebalances itself automatically.
Here's how to configure it for a homelab. Takes about 20 minutes to get right.
Why This Matters
Kubernetes scheduling is a point-in-time decision. When a pod is created, the scheduler picks the best available node based on current conditions. But clusters are dynamic - nodes get added, pods come and go, resource requests change with updates.
After a few weeks, you end up with hot nodes and cold nodes. Here's what my homelab looked like:
NAME           CPU(cores)   MEMORY%
kube-worker1   438m         75%    <-- carrying the load
kube-worker4   266m         41%    <-- sitting idle
kube-worker6   171m         42%    <-- sitting idle
Worker1 was handling most of the work while worker4 and worker6 did nothing.
Requests vs Actual Usage
Before configuring descheduler, you need to understand what it actually measures. This is where most people get it wrong.
Descheduler uses REQUESTED resources, not actual usage.
| Metric | Command | Descheduler Uses? |
|---|---|---|
| Actual CPU/Memory | kubectl top nodes | No |
| Requested CPU/Memory | kubectl describe node | Yes |
A node might show 5% actual CPU usage but 70% requested. From a scheduling perspective, that node is "full" - the scheduler reserved that capacity even if the pods aren't using it.
Check what descheduler sees:
kubectl describe node kube-worker1 | grep -A5 "Allocated resources"

Allocated resources:
  Resource   Requests          Limits
  --------   --------          ------
  cpu        2368m (69%)       5490m (161%)
  memory     4769334Ki (65%)   9343689984 (126%)
Those percentages (69% CPU, 65% memory) are what descheduler uses. Not kubectl top nodes.
Install Descheduler
Descheduler runs as a CronJob. The Helm chart is the easiest way:
# values.yaml
kind: CronJob
schedule: "*/5 * * * *"
deschedulerPolicy:
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: false
        - name: LowNodeUtilization
          args:
            thresholds:
              cpu: 55
              memory: 30
              pods: 30
            targetThresholds:
              cpu: 70
              memory: 70
              pods: 50
      plugins:
        balance:
          enabled:
            - LowNodeUtilization
Deploy:
helm install descheduler descheduler \
--repo https://kubernetes-sigs.github.io/descheduler/ \
--namespace descheduler \
--create-namespace \
-f values.yaml
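Assuming the release name `descheduler` from the command above, you can confirm the CronJob exists and trigger a run immediately instead of waiting for the next five-minute tick (the `manual-run` job name is arbitrary):

```shell
# Confirm the chart created the CronJob (release name "descheduler" assumed)
kubectl get cronjob -n descheduler

# Kick off a one-off run right away instead of waiting for the schedule
kubectl create job manual-run --from=cronjob/descheduler -n descheduler

# Follow its logs to see what it decided to evict
kubectl logs -n descheduler job/manual-run -f
```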
Understanding Thresholds
This is where I wasted an hour. LowNodeUtilization has two threshold sets that work differently:
`thresholds` - defines UNDERUTILIZED nodes
- A node is underutilized when ALL metrics are BELOW these values
- These nodes receive evicted pods

`targetThresholds` - defines OVERUTILIZED nodes
- A node is overutilized when ANY metric is ABOVE these values
- Pods get evicted FROM these nodes
              thresholds            targetThresholds
                  |                        |
  UNDERUTILIZED   |       BALANCED        |   OVERUTILIZED
  (receives pods) |                       |   (evicts pods)
<-----------------|-----------------------|------------------->
0%               55%                     70%               100%
Descheduler evicts pods from overutilized nodes. The scheduler then places them on underutilized nodes.
Critical: If no nodes qualify as underutilized, descheduler does nothing. If no nodes qualify as overutilized, descheduler does nothing. Both conditions must be true.
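To make the two-sided rule concrete, here's a minimal shell sketch of the classification (the real logic lives in descheduler's Go code; the 55/30/30 and 70/70/50 numbers are the thresholds from the values.yaml above, and the sample inputs are hypothetical request percentages):

```shell
# Simplified sketch of LowNodeUtilization's node classification.
# Arguments are request percentages: cpu, memory, pods.
classify_node() {
  local cpu=$1 mem=$2 pods=$3
  # Underutilized: ALL metrics strictly below `thresholds` (55 / 30 / 30)
  if [ "$cpu" -lt 55 ] && [ "$mem" -lt 30 ] && [ "$pods" -lt 30 ]; then
    echo underutilized; return
  fi
  # Overutilized: ANY metric strictly above `targetThresholds` (70 / 70 / 50)
  if [ "$cpu" -gt 70 ] || [ "$mem" -gt 70 ] || [ "$pods" -gt 50 ]; then
    echo overutilized; return
  fi
  echo balanced
}

classify_node 40 25 20   # prints "underutilized" - all below thresholds
classify_node 75 65 40   # prints "overutilized" - cpu above 70
classify_node 70 70 50   # prints "balanced" - 70 is not greater than 70
```

Note the strict comparisons: a node sitting exactly on a threshold falls into neither bucket, which matters when tuning values to match your cluster.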
Tuning for Your Cluster
The default thresholds (cpu: 20, memory: 20) assume your nodes have low resource requests. Most real clusters have higher utilization - mine certainly did.
Check your actual request percentages first:
for node in $(kubectl get nodes -o name | cut -d/ -f2); do
echo "=== $node ==="
kubectl describe node $node | grep -A3 "Allocated resources" | grep -E "cpu|memory"
done
Then set thresholds based on what you see:
- If your coldest node is at 45% CPU requests, set `thresholds.cpu: 55` (above it)
- If your hottest node is at 75% memory requests, set `targetThresholds.memory: 70` (below it)
I started with the defaults and descheduler did nothing. My coldest node had 45% CPU requests - above the 20% threshold. No node qualified as "underutilized" so there was nowhere to put evicted pods.
After checking my actual cluster state:
thresholds:
  cpu: 55      # nodes below 55% CPU requests are underutilized
  memory: 30   # nodes below 30% memory requests are underutilized
  pods: 30
targetThresholds:
  cpu: 70      # nodes above 70% CPU requests are overutilized
  memory: 70   # nodes above 70% memory requests are overutilized
  pods: 50
Protecting PVC Pods
Not all pods should be evicted. This is important:
- name: DefaultEvictor
  args:
    ignorePvcPods: true            # Don't evict pods with PVCs
    evictLocalStoragePods: false   # Don't evict pods with emptyDir
Why protect PVC pods? Longhorn and similar storage use ReadWriteOnce volumes. Evicting a pod means:
- Old pod terminates
- Volume detaches from old node
- New pod schedules on different node
- Volume attaches to new node
- Pod starts
If step 2 doesn't complete before step 4, you get Multi-Attach errors. The new pod hangs waiting for the volume while the old node still has it attached.
I had `ignorePvcPods: false` initially and got several Multi-Attach errors before I figured this out.
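Beyond the evictor flags, descheduler evicts through the standard eviction API, which means it also honors PodDisruptionBudgets. A PDB is a belt-and-suspenders guard for anything stateful; the name and label below are hypothetical - match them to your own workload:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: longhorn-workload-pdb   # hypothetical name
spec:
  minAvailable: 1               # never allow eviction below one running replica
  selector:
    matchLabels:
      app: my-stateful-app      # hypothetical label - match your workload
```

For a single-replica workload this blocks descheduler evictions entirely, which is usually what you want for a ReadWriteOnce volume.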
Verify It's Working
After deploying, watch for LowNodeUtilization events:
kubectl get events -A | grep -i "LowNodeUtilization"

prometheus   Normal   LowNodeUtilization   pod/kube-state-metrics-7c8b8bf58c-88qqq
             pod eviction from kube-worker1 node by sigs.k8s.io/descheduler
vpa          Normal   LowNodeUtilization   pod/vpa-updater-f59cccc88-fp2t6
             pod eviction from kube-worker1 node by sigs.k8s.io/descheduler
Check node balance:
kubectl top nodes
Results
| Node | Before (CPU / Memory requests) | After (CPU / Memory requests) |
|---|---|---|
| worker1 (hot) | 75% / 77% | 69% / 65% |
| worker4 (cold) | 50% / 25% | 50% / 25% |
| worker5 | 45% / 30% | 48% / 35% |
Worker1's CPU requests dropped from 75% to 69% and its memory requests from 77% to 65%. The cluster is balanced now, and descheduler runs every 5 minutes to keep it that way.
Gotchas
- Thresholds are percentages of REQUESTS - don't use `kubectl top nodes` to set thresholds. Use `kubectl describe node` to see request percentages.
- ALL metrics must be below `thresholds` for underutilized - if you set `cpu: 20` but your coldest node has 45% CPU requests, no node qualifies as underutilized and descheduler does nothing.
- The comparison is greater-than, not greater-than-or-equal - a node at exactly 70% CPU with `targetThresholds.cpu: 70` is NOT overutilized. 70 is not greater than 70.
- Jobs get auto-deleted - CronJob default history limits are low. If debugging, set `successfulJobsHistoryLimit: 3` to keep completed jobs around for log inspection.
- DaemonSet pods can't be evicted - don't count them when calculating node utilization. A node with 8 DaemonSet pods still has room for workloads.