The Data Engineer’s guide to optimizing Kubernetes
18.04.2025 • Niels Claeys
At Conveyor, we’ve spent over five years building and operating a batch data platform on top of Kubernetes.
While Kubernetes provides a solid foundation, its default configuration isn’t well-suited for handling batch workloads.
In this blog post, I’ll walk through several adjustments we have made to improve performance and reduce the cost of our Kubernetes clusters. These techniques apply at any scale, but their impact grows as the number of jobs increases. For context, we currently support customers running more than 500,000 core hours per month, which is equivalent to running more than 170 of our standard nodes non-stop for a month. If you’re operating or planning to support batch workloads at a similar scale, this blog post is for you.

What makes batch workloads different?
Batch workloads are inherently different from the long-running services that Kubernetes was historically designed for. The main characteristics that set batch workloads apart are:
A high volume of short-lived jobs (e.g. > 50,000 jobs/day) that last from less than a minute up to a couple of hours. This results in far more scheduling decisions being made.
The resource requirements of batch jobs vary greatly. At our customers we see jobs using anywhere between 200 MB and 500 GB of memory.
Batch jobs are triggered on a schedule and in most organisations primarily run during the night (e.g. midnight until 7 am), with a lower load during the day.
These differences make it necessary to adjust how Kubernetes is configured. In the sections below, I’ll cover three areas where tuning is beneficial and explain how we tackled these challenges.
Maximizing node efficiency in your cluster
In Kubernetes you pay for entire nodes rather than just the workloads they host (unlike services such as ECS or Fargate). As your platform grows and you begin tracking cluster efficiency, it quickly becomes clear that a big portion of the available compute resources is wasted. Our initial results showed that about 50% of the resources were not being used. Before improving this efficiency, we need to understand how the resources of a Kubernetes node are divided:
Node overhead: the resources reserved for the OS and the kubelet, as well as the eviction thresholds.
DaemonSet overhead: the resources required for the Kubernetes components that have to run on each node (e.g. CNI, CSI, monitoring components).
Actual workload: the resources available for running your job.

The Kubernetes overhead is the sum of the node overhead and the DaemonSet overhead. You should minimize this overhead as much as possible so that more space is available for the workloads running on your node. On some cloud providers (e.g. AWS), the node overhead can be tweaked when you opt for self-managed nodes. Make sure to test any configuration changes under high load to verify that your cluster still behaves as expected.
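As a minimal sketch of what this tuning looks like on self-managed nodes, the snippet below renders a KubeletConfiguration fragment that reserves resources for the OS and the kubelet and sets eviction thresholds. The values are illustrative placeholders rather than our production settings.

```python
import yaml

# Sketch: a KubeletConfiguration fragment for self-managed nodes that reserves
# resources for system daemons and sets hard eviction thresholds.
# The values below are placeholders, not tuned recommendations.
kubelet_config = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    # Reserved for Kubernetes system daemons (kubelet, container runtime).
    "kubeReserved": {"cpu": "100m", "memory": "512Mi"},
    # Reserved for OS system daemons (systemd, sshd, ...).
    "systemReserved": {"cpu": "100m", "memory": "256Mi"},
    # Node-pressure eviction thresholds.
    "evictionHard": {"memory.available": "200Mi", "nodefs.available": "5%"},
}

# Render the YAML that would be passed to the kubelet via its --config file.
print(yaml.safe_dump(kubelet_config, sort_keys=False))
```

The smaller you can safely make these reservations, the more of each node is left for actual workloads.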
The DaemonSet overhead can be minimized by limiting the number of components installed on every node as well as restricting the resources they require. This requires your platform team to decide what is important instead of blindly implementing every request from the use-case teams. Secondly, we thoroughly tweaked the resources required by every DaemonSet, which resulted in us switching from Fluentd to Fluent Bit for logging as well as running a custom monitoring agent on Azure instead of the (legacy) OMS agent.
Choosing the right node size
When running multiple jobs on one node you have to choose the right size for your nodes:
Small nodes have a higher ratio of overhead to workload resources, which decreases their efficiency. For this reason, Kubernetes nodes with less than 8 GB of RAM and 2 vCPUs are not useful.
Big nodes increase the likelihood of wasting resources, because they are harder to fill completely, and thus also decrease efficiency.
Because of both issues, there is no perfect node size and you have to look at what works best for your workloads. Most companies end up in a mixed-node scenario: you pick a standard size (e.g. 4 vCPU and 16 GB RAM) suitable for most of your workloads, but also add a couple of NodePools with bigger nodes for the workloads that need them.
Use the available resources efficiently
The previous section covered the theoretical optimal configuration for your nodes. Once the appropriate NodePools are in place, the next step is to measure how efficiently they handle real workloads. In our case, we found that our standard NodePools achieved only 40–50% efficiency, wasting almost half of the node resources. Further investigation revealed that the default Kubernetes scheduler was the main cause, as it isn’t well-suited for batch processing workloads.
By default, the Kubernetes scheduler prefers to schedule pods on the nodes that have the most capacity available and thus spreads out workloads as much as possible across nodes. This is problematic when you combine it with spiky workloads because it prevents the cluster from scaling down quickly. After adding 50 nodes for a big job, the scheduler will distribute new jobs evenly over those 50 nodes, as illustrated in the figure below. This results in all nodes being (partially) used and thus wastes a lot of resources.
For batch workloads it is optimal to schedule jobs on one node until the node’s capacity is reached and only then move to the next node. Kubernetes supports this by changing the scoring strategy from LeastAllocated to MostAllocated. Unfortunately, many cloud providers (e.g. AWS, Azure) do not support changing the scoring strategy of the default scheduler. The solution is to run a second Kubernetes scheduler on the data plane next to the default scheduler. You can specify which scheduler to use for a given workload by setting its name in the schedulerName field of your Job, Pod, or Deployment object.
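As a sketch, assuming we call the second scheduler binpacking-scheduler (the name is our own placeholder), the snippet below renders a KubeSchedulerConfiguration that switches the scoring strategy to MostAllocated. Workloads opt in by setting that name in their schedulerName field, while everything else keeps using the default scheduler.

```python
import yaml

# Sketch: configuration for a second scheduler that bin-packs pods by scoring
# nodes with MostAllocated instead of the default LeastAllocated.
# "binpacking-scheduler" is a placeholder name, not a built-in scheduler.
scheduler_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1",
    "kind": "KubeSchedulerConfiguration",
    "profiles": [
        {
            "schedulerName": "binpacking-scheduler",
            "pluginConfig": [
                {
                    "name": "NodeResourcesFit",
                    "args": {
                        "scoringStrategy": {
                            "type": "MostAllocated",
                            "resources": [
                                {"name": "cpu", "weight": 1},
                                {"name": "memory", "weight": 1},
                            ],
                        }
                    },
                }
            ],
        }
    ],
}

# Render the config file for the second scheduler; batch pods then set
# spec.schedulerName to "binpacking-scheduler" to be bin-packed.
print(yaml.safe_dump(scheduler_config, sort_keys=False))
```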

After migrating to the new scheduler, we increased efficiency by 15–20%, as shown in the following image. This corresponds to a 15–20% reduction in the number of nodes running for those NodePools.

Illustration of the increase in node efficiency after turning on bin packing.
Strategies for scaling spiky workloads
Batch workloads naturally fluctuate throughout the day. At many of our customers, we observe that production workloads primarily run at night, while daytime usage is limited to developers testing their work. This leads to a 10x difference between the cluster’s peak and minimum utilization, which is visualized in the graph below.

In order to handle these fluctuations, you need a component that can increase or decrease the number of nodes in your cluster. The two most popular components for doing this are:
Cluster autoscaler: which has existed for years and is supported on almost all big cloud providers. Its main downside is that scaling up and down is slow.
Karpenter: which was released 2 years ago by AWS and drastically increases the speed of scaling up and down. At the moment it is only production ready on AWS but they are making progress on an Azure implementation.
We use the cluster autoscaler on Azure clusters and configure it as follows:
We want to scale up and down quickly, which is why we lower the scale-down-unneeded-time to 5 minutes and set the scale-down-delay-after-… settings to 0 minutes (see the sketch after this list).
We configure status taints for the Azure scheduled events (e.g. freeze, preempt, …) so that scaling up a NodePool from 0 nodes works.
We manage our own cluster-autoscaler because Azure does not support specifying startup taints. We need them because we use Cilium as our CNI, and Cilium taints a node until its agent is ready.
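For reference, here is a sketch of the corresponding cluster-autoscaler container arguments. The taint keys are examples from our setup, and the exact flag spellings for startup and status taints vary between cluster-autoscaler releases, so verify them against your version.

```python
# Sketch: the cluster-autoscaler arguments behind the settings listed above.
# Flag names for startup/status taints differ between releases; treat them as
# assumptions and check `cluster-autoscaler --help` for your version.
autoscaler_args = [
    "--scale-down-unneeded-time=5m",       # remove idle nodes after 5 minutes
    "--scale-down-delay-after-add=0m",     # the scale-down-delay-after-... family
    "--scale-down-delay-after-delete=0m",
    "--scale-down-delay-after-failure=0m",
    # Cilium taints nodes until its agent is ready; declaring it as a startup
    # taint keeps scale-up from 0 nodes working.
    "--startup-taint=node.cilium.io/agent-not-ready",
    # Azure scheduled-event taints (freeze, preempt, ...) declared as status taints.
    "--status-taint=<azure-scheduled-event-taint>",
]

print("\n".join(autoscaler_args))
```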
For AWS clusters we run Karpenter and use the following configuration:
Because we do not want to interrupt running workloads, as this would fail the job or require an expensive recomputation, we only consolidate nodes when they are empty.
We set the consolidation delay to 5 minutes in order to scale down quickly (see the sketch after this list).
Our NodePools use a custom AMI in order to speed up spinning up new nodes. These AMIs package the custom components required to run our workloads.
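A minimal sketch of these disruption settings on a Karpenter NodePool is shown below. Only the relevant fragment is included (a real NodePool also needs a template with requirements and a nodeClassRef, which is also where the custom AMI is configured), and the NodePool name is a placeholder.

```python
import yaml

# Sketch: the disruption fragment of a Karpenter NodePool (karpenter.sh/v1).
# "batch-standard" is a placeholder name.
nodepool_fragment = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "batch-standard"},
    "spec": {
        "disruption": {
            # Never consolidate nodes that still run jobs; only reclaim empty ones.
            "consolidationPolicy": "WhenEmpty",
            # Remove empty nodes after 5 minutes.
            "consolidateAfter": "5m",
        }
    },
}

print(yaml.safe_dump(nodepool_fragment, sort_keys=False))
```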
While these settings work well in our setup, it’s important to measure their effect on your workloads before adopting them in production.
Creating reliable workloads on unreliable instances
A third factor to consider for batch workloads is the use of spot instances. These instances offer cost savings of up to 70% compared to on-demand nodes, but they can be reclaimed by the cloud provider at any time. Given the large number of batch jobs and their typically short running times, spot instances present a practical and cost-effective solution. The only catch is that you need to be able to deal with unreliable instances.
We observe that our customers rely on spot instances in over 97% of cases. I believe this high adoption rate is largely due to the improvements we’ve made in handling spot interruptions, which result in only 0.01–0.2% of jobs being affected by spot terminations. We achieve this by following a three-step approach:
Reduce the likelihood of spot interruptions
The first step is to minimize the number of spot interruptions. From our experience, there are three characteristics of your job/environment that influence this:
The job duration: short jobs have a lower likelihood of being interrupted than long jobs. If your job takes less than an hour, we see a low chance of it being interrupted.
The resource requirements: the higher the resource requirements of a job, and thus of the node it runs on, the more likely it is to be interrupted. In this respect, a job requesting 4 vCPU and 16 GB RAM is better off than one requesting 32 vCPU and 128 GB RAM.
Region popularity: popular regions have more spot capacity and thus a lower likelihood of interruptions. In a more niche region we see 10–15 times more spot interruptions than in a popular region.
Additionally, AWS provides the Spot Placement Score API, which estimates the likelihood of spot interruptions for a given resource request. In our experience, if the score is below 6, interruptions occur frequently. We use this API to reduce the likelihood of spot interruptions in our Spark jobs. Each Spark job on Conveyor is scheduled in one of the three possible availability zones (AZ) of the Kubernetes cluster. To select the most suitable zone, we query the Spot Placement Score API and assign the driver and all executors for that job to the selected AZ.
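A minimal sketch of such a lookup with boto3 is shown below; the instance types, region, and target capacity are placeholders rather than our actual values.

```python
import boto3

# Sketch: pick the availability zone with the best spot placement score for a
# Spark job's resource footprint. Instance types and capacity are placeholders.
ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.get_spot_placement_scores(
    InstanceTypes=["m5.xlarge", "m5a.xlarge", "m6i.xlarge"],  # candidate node types
    TargetCapacity=32,                # e.g. total vCPUs needed by driver + executors
    TargetCapacityUnitType="vcpu",
    SingleAvailabilityZone=True,      # score individual AZs instead of whole regions
    RegionNames=["eu-west-1"],
)

# Scores range from 1 to 10; below roughly 6 we see frequent interruptions.
best = max(response["SpotPlacementScores"], key=lambda score: score["Score"])
print(best["AvailabilityZoneId"], best["Score"])

# Note: the API returns AZ IDs (e.g. euw1-az1); map them to zone names with
# ec2.describe_availability_zones() before pinning the driver and executors.
```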
Minimize the impact of an interruption
The second step in handling spot interruptions is to minimize their impact when they occur. One way we achieve this is by running the Spark driver on an on-demand instance and the executors on spot instances. As long as the driver remains running, Spark can deal with failing executors, as it automatically recreates them. Additionally, since the driver typically requires significantly fewer resources than the executors, this setup allows us to retain most of the cost benefit of using spot instances.
Another way to reduce the impact of spot interruptions on Spark jobs is by enabling the Spark decommissioning feature. This feature transfers the shuffle state from an executor that’s about to be terminated to another active executor, avoiding the need to recompute intermediate results.
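Assuming Spark on Kubernetes (3.3 or later for the per-role node selectors) and Karpenter’s capacity-type node label, a sketch combining both techniques in a PySpark session could look like this; on other setups, substitute your own capacity-type label.

```python
from pyspark.sql import SparkSession

# Sketch: pin the Spark driver to on-demand capacity, the executors to spot
# capacity, and enable graceful executor decommissioning.
# The karpenter.sh/capacity-type label assumes Karpenter-managed nodes.
spark = (
    SparkSession.builder.appName("batch-job")
    # Driver on on-demand, executors on spot (requires Spark >= 3.3).
    .config("spark.kubernetes.driver.node.selector.karpenter.sh/capacity-type", "on-demand")
    .config("spark.kubernetes.executor.node.selector.karpenter.sh/capacity-type", "spot")
    # Gracefully decommission executors on spot interruption instead of losing
    # their shuffle and cached data.
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    .getOrCreate()
)
```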
Make spot interruptions visible to users
When both previous approaches fail, your last resort is to make spot interruptions visible to your users. This allows them to (automatically) retry these jobs, as the job will most likely succeed the next time round.
This can be implemented using the AWS or Azure node termination handler. This component runs on each node and applies a taint when a spot interruption notification is received from the cloud provider. These notifications are typically sent 1–2 minutes before the instance is terminated, allowing time for a clean shutdown. If a job fails with a SIGKILL, we check whether the node was tainted, and if so, classify the job failure as caused by a spot interruption.
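A sketch of that check with the Kubernetes Python client is shown below; the taint key prefix is an assumption and depends on how your termination handler is configured (the AWS handler, for example, applies taints under aws-node-termination-handler/ when node tainting is enabled).

```python
from kubernetes import client, config

# Sketch: after a job dies with a SIGKILL, check whether the node it ran on was
# tainted by the node termination handler. The prefix below is an assumption;
# adjust it to whatever taints your termination handler applies.
TERMINATION_TAINT_PREFIXES = ("aws-node-termination-handler/",)


def failed_due_to_spot_interruption(node_name: str) -> bool:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    node = client.CoreV1Api().read_node(node_name)
    taints = node.spec.taints or []
    return any(taint.key.startswith(TERMINATION_TAINT_PREFIXES) for taint in taints)
```

If the node is already gone by the time you check, the lookup fails, so in practice you may want to cache the node’s taint information while the job is still running.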
Conclusion
In this blog post, I highlighted three important aspects of using Kubernetes for batch workloads:
Optimize your node efficiency in order to minimize your compute costs
Quickly scale your cluster up and down in order to increase performance and reduce compute costs. This can be achieved by tweaking the Karpenter or Cluster Autoscaler settings.
Spot instances can be a great way to reduce the cost of your workloads if you know how to deal with spot interruptions. I showed three ways to do this: lowering the chances of them occurring, limiting their impact, and making them visible to users.
Try these improvements on your own workloads and let me know how it goes. Clap if you found this post insightful and leave a comment if you think of other improvements.