Kubernetes CPU Limits: Under the hood

Tania Duggal

Founder

February 25, 2026

You have seen this before: your Kubernetes cluster looks perfect. Pods are running, CPU is low, memory is fine, and there are no alerts. But your application feels slower. Latency is higher. Users start noticing.

You check your dashboards again - no red flags. The node is half idle. Still, requests are taking longer than usual.

We faced this same problem once. Our APIs were healthy, but P99 latency had doubled. We checked everything - load balancer, GC, network, autoscaler. Nothing worked. In the end, we found the real problem right in front of us: how Kubernetes handles CPU time.

Most teams set CPU and memory resources by default, thinking they’re protecting stability. But under the hood, Kubernetes doesn’t behave the way many assume. Sometimes, the same settings that look safe are actually slowing your containers down.

TL;DR ⚡

Kubernetes CPU limits can throttle your containers. Even when the node has free CPU, the kernel may pause your pods once they hit their CPU quota.

Kubernetes controls CPU use with the Linux CFS scheduler. If a container uses all its allowed CPU time too quickly, the kernel stops it until the next CFS period.

What You’ll Learn

How requests and limits actually work inside Kubernetes
Why CPU behaves differently from memory
How CPU throttling happens and how to see it in metrics
Best practices to avoid performance traps

What Are CPU & Memory Requests and Limits?

Let's start with the basics, which many people think they know better than they actually do. In Kubernetes, every container can request and cap its use of two key resources: CPU and memory. Let's discuss these:

Requests

A request tells Kubernetes how much of a resource your container needs to run smoothly.

When you deploy a Pod, the scheduler checks the CPU and memory requests and finds a node that can handle them. If your Pod requests 500m CPU (0.5 core) and 512Mi memory, Kubernetes won’t place it on a node without that much free capacity.

Think of it as your minimum guaranteed budget - the scheduler ensures your Pod is placed on a node that can provide at least this much.

CPU Request: determines CPU weight at runtime and scheduling placement.

Memory Request: determines how much memory must be available on the node before scheduling.

Note: 1 CPU = 1 vCPU or 1 core, 1000m (millicores) = 1 CPU, 512Mi = 0.5 GiB.
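To make the units in the note concrete, here is a minimal sketch that converts the quantity strings used in this article into numbers. The function names are hypothetical and the parser only handles the suffixes shown here (m, Ki, Mi, Gi), not the full Kubernetes quantity syntax.

```python
# Illustrative helpers (hypothetical names); they cover only the
# suffixes used in this article, not every Kubernetes quantity format.

def cpu_to_millicores(value: str) -> int:
    """Convert a CPU quantity such as "500m", "1" or "0.5" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity with a binary suffix (Ki, Mi, Gi) to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # no suffix: plain bytes

print(cpu_to_millicores("500m"))          # 500
print(cpu_to_millicores("1"))             # 1000
print(memory_to_bytes("512Mi") / 2**30)   # 0.5 (GiB)
```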

Limits

A limit is different. It doesn’t affect scheduling; it controls what happens at runtime. The kubelet passes the limit to Linux cgroups. For CPU, the kernel enforces it with the quota mechanism of CFS (the Completely Fair Scheduler, the default Linux CPU scheduler, which distributes CPU time proportionally among runnable processes).

CPU limit: If the container hits its CPU limit, it is throttled. We go under the hood in a later section.

Memory limit: Memory is not throttled. If your container crosses its limit, the kernel’s OOM killer terminates it. This is a hard stop.

Example: Requests and Limits in YAML

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: web
      image: nginx
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "512Mi"

In this example:

Kubernetes schedules the pod as if it needs 0.5 CPU and 512Mi.
At runtime, it cannot use more than 1 CPU.
Memory is capped at 512Mi; exceeding this triggers an OOM kill.

QoS Classes - How Kubernetes Treats Different Pods?

Kubernetes groups pods into Quality-of-Service (QoS) classes based on how you set requests and limits. Kubernetes relies on this classification to decide which Pods to evict when there are not enough available resources on a Node.

QoS Class	Condition	Behavior
Guaranteed	Every container has request = limit for both CPU and memory	Lowest eviction risk (evicted last).
Burstable	At least one container has a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria	Can burst above its CPU request when the node has free CPU. Evicted before Guaranteed pods.
BestEffort	No CPU or memory requests/limits on any container	Lowest priority. Evicted first. Gets CPU only if others leave the CPU unused.

Note: QoS does not affect throttling - throttling still depends on CPU limits.
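The classification rules in the table can be sketched as a small function. This is an illustration, not the kubelet's implementation: `qos_class` is a hypothetical name, each container is represented by its `resources` dict, and quantities are compared as raw strings (so "1" and "1000m" would count as different here).

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' `resources` dicts (CPU and memory only)."""
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(name, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# The demo-app pod from the YAML example: CPU request (500m) differs from limit (1).
demo_app = [{
    "requests": {"cpu": "500m", "memory": "512Mi"},
    "limits": {"cpu": "1", "memory": "512Mi"},
}]
print(qos_class(demo_app))  # Burstable
```

Under these rules, a pod that follows the "memory request = limit, no CPU limit" pattern is Burstable, because the CPU limit is missing.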

How Kubernetes Schedules and Enforces Resources Under the Hood?

When we talk about “resources” in Kubernetes, we are talking about two layers working together:

Kubernetes itself - decides where a Pod should run.
The Linux kernel - decides how much CPU or memory each container actually gets once it’s running.

Let’s open that black box step by step.

1. The Scheduler’s Role

When you create a Pod, the scheduler looks at all available nodes and checks:

The CPU and memory each node has available for Pods (its allocatable capacity, after reserving some for system processes).
The sum of the requests of the Pods already scheduled there.
Whether the node still has enough free capacity to fit your Pod's requests.

The scheduler binds the Pod to that node if it fits. No CPU or memory is given yet, only planned. At this point, limits don't mean anything. The scheduler only looks at requests.
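The resource check described above boils down to simple arithmetic on requests. The sketch below is a simplification with hypothetical names and made-up numbers (CPU in millicores, memory in MiB); the real scheduler applies many other filters and scoring steps that are not modeled here.

```python
def fits(node_allocatable: dict, scheduled_requests: list, pod_request: dict) -> bool:
    """Return True if the pod's requests fit into the node's remaining allocatable capacity."""
    for resource, allocatable in node_allocatable.items():
        already_requested = sum(r.get(resource, 0) for r in scheduled_requests)
        if already_requested + pod_request.get(resource, 0) > allocatable:
            return False
    return True

# Hypothetical node: 2 CPUs and 4 GiB allocatable, two pods already placed.
node = {"cpu_m": 2000, "memory_mi": 4096}
placed = [
    {"cpu_m": 1000, "memory_mi": 2048},
    {"cpu_m": 600, "memory_mi": 1024},
]

print(fits(node, placed, {"cpu_m": 500, "memory_mi": 512}))  # False: 2100m > 2000m
print(fits(node, placed, {"cpu_m": 300, "memory_mi": 512}))  # True
```

Limits do not appear anywhere in this check, which is the point: only requests decide placement.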

2. From Scheduler to Kubelet

Once a node is selected, the kubelet (the agent running on that node) takes over. It’s responsible for actually starting the Pod’s containers and applying the resource constraints.

Here’s what happens:

The kubelet receives your Pod spec.
It calls the container runtime (e.g. containerd) through the Container Runtime Interface (CRI) to start containers.
The runtime creates Linux cgroups for each container.
The following values are written to those cgroups (cgroup v2 file names):
- cpu.weight ➜ from the container’s CPU request
- cpu.max ➜ from the container’s CPU limit
- memory.max ➜ from the container’s memory limit

At this point, Kubernetes gives control to the Linux kernel.

3. Linux cgroups

Once your Pod starts running, Kubernetes steps aside and the Linux kernel takes over. This is where cgroups come in. They decide how much CPU and memory your container can actually use.

Let's look at how CPU and memory work in different ways:

CPU

There are two cgroup settings that manage how much CPU is used:

cpu.weight: controls fairness (the relative weight when multiple containers want CPU). When the node is under CPU pressure, workloads with larger CPU requests get more CPU time than workloads with smaller requests, because the request sets the weight.

Important: cpu.weight does not reserve CPU. It only defines a relative share under contention.
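As a rough illustration of weight-based sharing, the sketch below splits a node's CPU among containers in proportion to their weights. It assumes every container is continuously runnable, none has a CPU limit, and the weights are simply proportional to the CPU requests; the function name and the numbers are made up for the example.

```python
def cpu_shares_under_contention(weights: dict, capacity_m: float) -> dict:
    """Split `capacity_m` millicores among busy containers in proportion to their weights."""
    total = sum(weights.values())
    return {name: capacity_m * weight / total for name, weight in weights.items()}

# Two containers with equal weight split the CPU evenly.
print(cpu_shares_under_contention({"a": 100, "b": 100}, 1000))
# {'a': 500.0, 'b': 500.0}

# A container with three times the weight gets three times the CPU time.
print(cpu_shares_under_contention({"a": 300, "b": 100}, 1000))
# {'a': 750.0, 'b': 250.0}
```

If container "b" goes idle, "a" can use the whole node; the weight only matters while both are competing.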

cpu.max: controls the CPU limit (a hard cap inside a time window). In cgroup v2, the setting has this format:

cpu.max = <quota> <period>

Quota: This is the total amount of CPU time the cgroup can use within a given period (in microseconds).
Example: if you set a CPU limit of 200m (0.2 CPU) and your period is 100000µs (100ms), the quota will be 20000µs (20ms) per 100ms period, so:
cpu.max = 20000 100000

Period: This is the duration of that “CPU period” window (in microseconds). A very common default is 100000µs (100ms). This period is the time frame within which the CPU quota is enforced.
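The quota arithmetic above can be written as a short helper. This is a sketch with a hypothetical function name; it assumes the 100000µs default period mentioned above and represents "no limit" the way cgroup v2 does, with the literal word max as the quota.

```python
from typing import Optional

DEFAULT_PERIOD_US = 100_000  # 100ms, the common default period

def cpu_max(limit_millicores: Optional[int], period_us: int = DEFAULT_PERIOD_US) -> str:
    """Build the "<quota> <period>" string for a CPU limit given in millicores."""
    if limit_millicores is None:
        # No CPU limit: cgroup v2 uses "max" to mean "no quota".
        return f"max {period_us}"
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"

print(cpu_max(200))    # 20000 100000  (0.2 CPU -> 20ms per 100ms)
print(cpu_max(1000))   # 100000 100000 (1 CPU)
print(cpu_max(2500))   # 250000 100000 (2.5 CPUs: quota can exceed the period)
print(cpu_max(None))   # max 100000
```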

If the container consumes that 20ms early in the 100ms window, it will be paused for the rest of the window and can run again only when the next 100ms period starts, even when the node has free CPU available. This pause is known as throttling, and that is why CPU limits often create unpredictable latency.

Note for Multi-Threaded Workloads: The example above assumes a single-threaded workload. In a multi-threaded workload, several threads run at the same time and draw from the same quota, so they exhaust it faster. Throttling shows up more often and latency gets worse.
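To see how throttling turns into latency, here is a simplified model of the behavior described above. It assumes the work starts at a period boundary with a fresh quota, that the CPU is otherwise free, and that the kernel enforces the quota exactly; the function names are hypothetical and real scheduling is less tidy.

```python
import math

def wall_time_ms(cpu_needed_ms: float, quota_ms: float, period_ms: float = 100.0) -> float:
    """Wall-clock time for a single thread to finish `cpu_needed_ms` of CPU work under a quota."""
    # Number of periods in which the quota runs out before the work is done.
    exhausted_periods = math.ceil(cpu_needed_ms / quota_ms) - 1
    remaining_cpu_ms = cpu_needed_ms - exhausted_periods * quota_ms
    return exhausted_periods * period_ms + remaining_cpu_ms

def ms_until_throttled(quota_ms: float, threads: int) -> float:
    """Wall-clock time until `threads` busy threads use up the quota in one period."""
    return quota_ms / threads

# A request that needs 50ms of CPU with a 200m limit (20ms per 100ms period):
# run 20ms, wait 80ms, run 20ms, wait 80ms, run the last 10ms.
print(wall_time_ms(50, 20))       # 210.0
# The same request when the work fits in the quota:
print(wall_time_ms(20, 20))       # 20.0
# Four busy threads burn the same 20ms quota in 5ms of wall-clock time.
print(ms_until_throttled(20, 4))  # 5.0
```

In this model, 50ms of work takes about four times longer than it would without the limit, even though the node has idle CPU the whole time.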

When no CPU limit is set, CFS skips the quota mechanism entirely.

CPU request → cpu.weight → controls fairness
CPU limit   → cpu.max    → enforces a hard cap (throttling)

Memory

Memory doesn’t work like CPU. There’s no “quota window” - it’s just a cap.

The cgroup file memory.max defines the absolute limit. If a container tries to go over it, the kernel doesn’t throttle - it kills the process using the OOM (Out Of Memory) killer.

That’s why memory limits are strict and must match your application’s actual usage pattern. Unlike CPU, memory can’t be “borrowed” from idle pods.

Note: with MemoryQoS, the kernel can also apply reclaim pressure / throttling behavior using memory.high before you hit memory.max.

                               THE WHOLE FLOW

                     +----------------------------------+
                     |          Kubernetes API          |
                     +----------------------------------+
                                   |
                                   ▼
                     +----------------------------------+
                     |            Scheduler             |
                     |  - Uses resource requests        |
                     |  - Selects a node                |
                     +----------------------------------+
                                   |
                                   ▼
                     +----------------------------------+
                     |             Kubelet              |
                     |  - Starts containers             |
                     |  - Creates cgroups               |
                     |  - Passes CPU/memory settings    |
                     |    to container runtime          |
                     +----------------------------------+
                                   |
                                   ▼
                     +----------------------------------+
                     |          Linux cgroups           |
                     |  - cpu.weight  (fair share)      |
                     |  - cpu.max     (CPU limit)       |
                     |  - memory.max  (hard cap)        |
                     +----------------------------------+
                                   |
                                   ▼
                     +----------------------------------+
                     |            Container             |
                     |  - Gets fair CPU time            |
                     |  - Throttled if CPU exceeded     |
                     |  - OOMKilled if memory exceeded  |
                     +----------------------------------+

Why “Limit = Request” Is the Rule for Memory?

This rule stops most common production problems because of how Kubernetes works internally and how the kernel handles memory.

1. It prevents scheduling mistakes

If request < limit, the scheduler assumes your pod needs only the request amount. Later, when the pod grows toward the limit, the node may not have enough memory, which leads to OOM kills. Keeping the request equal to the limit fixes this mismatch.
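The mismatch is easy to quantify. The sketch below uses made-up numbers and a hypothetical helper to compare what the scheduler accounts for (the sum of requests) with what the pods are allowed to use (the sum of limits).

```python
def memory_overcommit(pods: list, allocatable_mi: int) -> dict:
    """Compare summed memory requests and limits (in MiB) with a node's allocatable memory."""
    requested = sum(p["request_mi"] for p in pods)
    limited = sum(p["limit_mi"] for p in pods)
    return {
        "requested_mi": requested,
        "limit_mi": limited,
        "fits_by_request": requested <= allocatable_mi,  # what the scheduler checks
        "can_exceed_node": limited > allocatable_mi,      # what can happen at runtime
    }

# Four pods with request < limit on a hypothetical node with 4096 MiB allocatable.
risky = [{"request_mi": 512, "limit_mi": 2048}] * 4
print(memory_overcommit(risky, 4096))
# {'requested_mi': 2048, 'limit_mi': 8192, 'fits_by_request': True, 'can_exceed_node': True}

# The same pods with request = limit.
safe = [{"request_mi": 1024, "limit_mi": 1024}] * 4
print(memory_overcommit(safe, 4096))
# {'requested_mi': 4096, 'limit_mi': 4096, 'fits_by_request': True, 'can_exceed_node': False}
```

In the first case the scheduler sees a half-empty node while the pods are allowed to use twice the memory that exists.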

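2. Guaranteed QoS (most stable class)

Equal memory request and limit is a prerequisite for Guaranteed QoS. Kubernetes classifies a pod as Guaranteed only when every container has request = limit for both CPU and memory; a pod with equal memory values but no CPU limit is still Burstable.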

Benefits:

lowest eviction risk
more stable memory behavior
ideal for production services
fewer surprises during node pressure

3. Predictable memory usage

You know exactly how much memory the pod can use.
This makes:

capacity planning easier
node sizing cleaner
debugging memory issues faster

No unexpected memory spikes beyond the configured limit, so failures become predictable.

Why do CPU limits backfire in real clusters?

We will walk through an experiment phase by phase and observe what actually happens.

Phase 1 - When Two BestEffort Pods Run Alone?

We started with the simplest possible setup with two BestEffort pods - no CPU request, no CPU limit.

CPU Distribution

Once both pods were running, we applied continuous CPU load inside them.

[Figure: CPU usage per pod]

If you look at the CPU usage graph, the behavior is very clear: be-1 and be-2 each stayed around 450-460 millicores.

Together, they consumed almost the entire CPU capacity available on the node. The split was nearly equal.

Why?

This happens because neither pod defined a CPU request. Therefore, Kubernetes assigned both of them the lowest possible cpu.weight. Since both weights were equal, the Linux scheduler divided CPU (or CPU time) proportionally, which in this case meant evenly.

There was no reservation and no prioritization. Just fair sharing.

Note for Throttling: When no limit is set in Kubernetes, kubelet does not configure a cpu.max quota for the container’s cgroup. Without cpu.max, the kernel has no quota window to enforce. And without a quota, there is nothing to throttle.

Phase 2 - What Happens When We Add Two Guaranteed Pods?

After observing how BestEffort pods behave alone, we introduced two Guaranteed pods - request = 200m, limit = 200m.

CPU Redistribution

The moment the Guaranteed pods started running under load, the CPU usage graph shifted. Before this, both BestEffort pods were sitting around 450–460m each. After adding g-1 and g-2, the changes are clear:

g-1 and g-2 stayed almost exactly at 200m
be-1 and be-2 dropped to around 350–360m each

Why?

This happens because the Guaranteed pods requested 200m, while the BestEffort pods requested nothing. So the Linux CPU scheduler gave the Guaranteed pods their proportional share first, and the remaining CPU was distributed to the BestEffort pods.

Throttling

Now comes the critical observation.

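As you can see, both Guaranteed pods showed consistent throttling: g-1 and g-2 had about 10 throttled periods per second, while the BestEffort pods showed none. With the default 100ms period there are only 10 periods per second, so the Guaranteed pods were throttled in nearly every period.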

Why?

This happens because Guaranteed pods define a CPU limit, and CPU limits are enforced by CPU throttling as we discussed before.

The Backfire Moment

[Figure: Node CPU usage]

Now look at the node CPU usage graph. The node running these workloads was hovering around 35–42% CPU usage.

Not 90%. Not 100%. Around 40%.

And yet, the Guaranteed pods were being throttled.

The node still had idle CPU available, but the Guaranteed pods could not use it because their CPU limit enforced a strict quota through cpu.max.

This is where CPU limits start to backfire.

Phase 3 - What Happens When We Add Two Burstable Pods?

In the final step of the experiment, we added two Burstable pods - request = 300m, no CPU limit.

CPU Redistribution

As soon as the Burstable pods started running under load, the CPU usage graph shifted dramatically.

b-2 climbed to around 690–700m and b-1 stayed around 620–650m
g-1 and g-2 stayed around ~200m
be-1 and be-2 dropped to almost nothing (~10m each)

As you can see, the Burstable pods were now dominating the CPU.

Why?

This happens because Burstable pods requested 300m - higher than the 200m requested by the Guaranteed pods. Under contention, higher weight means a larger proportional share of CPU.

That is why Burstable pods moved to the top, Guaranteed pods stayed in the middle, and BestEffort pods were pushed to the bottom.

Note for throttling: The Burstable pods are consuming far more CPU than their request, yet they are not throttled - because they have no limits.cpu configured.

Best Practices

These are the best practices to follow in production:

1. Always define CPU requests

If you skip CPU requests, your containers get the lowest cpu.weight, and a pod with no requests or limits at all becomes BestEffort. Setting requests makes your app more stable and avoids surprise slowdowns, because CPU requests translate into cpu.weight, which decides how much CPU your pod gets when multiple pods want CPU at the same time.

2. Avoid CPU limits unless you truly need them

CPU limits often slow applications because CFS throttles the container’s cgroup when it hits the limit, even when the node has plenty of free CPU. Most production workloads run better with CPU requests only, without limits.

3. Memory: Always set request = limit

Equal memory request and limit make your pod predictable and stable. It lowers the risk of OOM kills caused by scheduler misplacement. This also simplifies debugging and capacity planning. See the section above on why “Limit = Request” is the rule for memory.

4. Monitor real CPU behavior (especially if limits are used)

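Monitoring helps you detect throttling before users notice it. If you collect cAdvisor metrics, watch the ratio of container_cpu_cfs_throttled_periods_total to container_cpu_cfs_periods_total for each container, and alert when it stays high. This keeps resource settings aligned with real workload needs, and a small amount of observation saves hours of debugging later.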
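The kernel also exposes throttling counters directly in the cgroup's cpu.stat file (on cgroup v2 this is typically readable inside the container at /sys/fs/cgroup/cpu.stat, though the path depends on your setup). The sketch below parses that file's key/value format and computes the share of periods that were throttled. The function name is hypothetical and the sample text is made up for illustration, not taken from the experiment.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Return nr_throttled / nr_periods from the contents of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no quota configured, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods

# Made-up sample in the cpu.stat format.
sample = """\
usage_usec 81234567
user_usec 70000000
system_usec 11234567
nr_periods 1000
nr_throttled 250
throttled_usec 18000000
"""

print(throttle_ratio(sample))  # 0.25, i.e. throttled in 25% of periods
```

A ratio that stays well above zero while the node has idle CPU is the signature of the problem described in this article.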

5. Use policies only when the cluster needs boundaries

Policies like LimitRange or ResourceQuota are useful only in shared environments. They prevent teams or workloads from consuming too many resources. If your cluster is used by a single team, you may not need these guardrails.

When CPU Limits Are Still Useful?

There are a few situations where CPU limits genuinely make sense:

1. Multi-tenant clusters (shared by many teams)

CPU limits help protect teams from each other when many workloads share the same cluster. Limits prevent one team’s pod from consuming all CPU and starving others.

2. Benchmarking or performance-testing environments

Limits create a stable, predictable CPU cap during tests. This helps you repeat the same test with the same conditions every time. Without limits, results change based on how busy the node is, making tests inconsistent.

3. Workloads that require strict CPU budgets (internal billing / cost control)

Some companies track CPU usage per team or per customer. In these cases, limits enforce hard boundaries so workloads never exceed their assigned budget. This makes cost planning predictable, although it does not guarantee consistent performance.

4. Product platforms offering fixed CPU plans (SaaS / customer compute tiers)

If your product sells fixed CPU tiers (like a 1 vCPU plan or a 2 vCPU plan), you need CPU limits. This ensures customers only use the CPU they paid for. Limits cap usage per plan; they do not by themselves guarantee consistent performance, so set matching requests if each tier must also receive its CPU under contention.

5. Managed clusters that force limits (AutoPilot, enterprise governed clusters)

Some managed Kubernetes platforms enforce CPU limits automatically. They do this to guarantee predictable resource isolation and cost behavior, not for performance reasons.

Conclusion

Tim Hockin (Kubernetes co-creator) summed it up perfectly in this famous X reply:

In the end, I would say CPU limits can quietly become a performance bottleneck if we don’t understand how they work under the hood.

What looks like a safe configuration can sometimes introduce hidden throttling and unexpected latency. When we understand the mechanics behind it, we stop guessing and start making deliberate, performance-aware decisions for our clusters.
