Monitoring thousands of Spark applications without losing your cool

12.12.2024

Niels Claeys

Monitor Spark apps at scale with CPU efficiency to cut costs. Use Dataflint for insights and track potential monthly savings.

Apache Spark continues to be the go-to framework for big data processing. As Spark usage grows within your organization, so does the diversity of people writing Spark code. Alongside seasoned engineers, you now have junior engineers, data analysts, and data scientists working with Spark. Despite significant advancements over the past decade to make Spark more accessible (e.g., Spark SQL), mastering it remains a challenge. A key skill is the ability to effectively diagnose why certain applications aren’t performing as expected.

In this blog post, I will discuss the available tools for analyzing individual Spark applications. Additionally, I will introduce custom tooling that we’ve developed to monitor and improve a large number of Spark applications. Improving these applications is mainly driven by cost considerations.

Midjourney-generated image of an engineer tackling Spark.

One Spark at a Time: Investigating a single application

When building Spark applications, you will encounter various challenges, such as:

  • Spark executors failing due to out-of-memory errors.

  • A Spark job taking twice as long after adding an extra join.

  • A Spark task running for hours without making meaningful progress.

Beyond these obvious issues, there are also subtler problems where a Spark job completes successfully but uses resources inefficiently. These inefficiencies can lead to increased costs, making it important to minimize them for a more performant execution platform.

When facing an issue, the first step is to ensure your job runs on at least Spark 3.2.0. In this version, Adaptive Query Execution (AQE) is enabled by default. AQE addresses many common issues in Spark code by dynamically changing the execution plan, for example by repartitioning your data after a join or by converting a standard join into a broadcast join.
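If you want to double-check these settings in your own job, the sketch below shows the standard AQE configuration keys in PySpark. It is purely illustrative: on Spark 3.2.0+ these defaults are already in place.

```python
from pyspark.sql import SparkSession

# Illustrative only: on Spark 3.2.0+ AQE is enabled by default, so these
# configs mostly matter when you want to confirm or override the defaults.
spark = (
    SparkSession.builder
    .appName("aqe-check")
    .config("spark.sql.adaptive.enabled", "true")                     # AQE on (default since 3.2.0)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
    .getOrCreate()
)

# Confirm what your job is actually running with.
print(spark.version)
print(spark.conf.get("spark.sql.adaptive.enabled"))
```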

Spark history server

If you’re already using Spark 3.2.0 or higher, the next step is to analyze your Spark application through the Spark History UI. This web interface provides detailed insights into your application, including information about executors, the execution plan (e.g., how your code was broken down into jobs, stages, and tasks), and the memory, disk, and CPU usage for each task.

The overview of a Spark application using the Spark history server
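Note that the History Server can only show applications that write event logs. A minimal PySpark sketch of the required settings follows; the log directory is a placeholder and should point to the location your History Server reads from.

```python
from pyspark.sql import SparkSession

# Sketch: write event logs so the application shows up in the History Server.
# The directory below is a placeholder; use the location your History Server
# is configured to read (spark.history.fs.logDirectory).
spark = (
    SparkSession.builder
    .appName("history-enabled-job")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3a://my-bucket/spark-events")  # placeholder path
    .getOrCreate()
)
```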

While the Spark History Server offers all the available information about your job, the challenge lies in pinpointing the exact issue: the proverbial needle in a haystack. The tool requires you to look for clues to identify what’s going wrong, rather than offering direct hints.

When your job fails, a good starting point is examining the failing tasks. If no obvious failures are present, identifying issues relies heavily on the user’s experience to differentiate between normal and abnormal behavior.

Dataflint

Dataflint addresses these challenges. It’s a library that integrates with the Spark History Server, adding a new tab to the Spark UI. Dataflint offers a more user-friendly visualization while also providing alerts for common issues in Spark applications, such as wasted CPU, excessive small files, and partition skew.
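Wiring Dataflint into an application is mostly a matter of adding the library and registering its Spark plugin. Below is a minimal PySpark sketch based on my reading of the Dataflint documentation; double-check the exact artifact coordinates and plugin class for your Spark and Scala version.

```python
from pyspark.sql import SparkSession

# Sketch: add the Dataflint library and register its plugin so the extra tab
# appears in the Spark UI / History Server. Replace <version> with a real
# release and verify the coordinates against the Dataflint docs.
spark = (
    SparkSession.builder
    .appName("dataflint-example")
    .config("spark.jars.packages", "io.dataflint:spark_2.12:<version>")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    .getOrCreate()
)
```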

The overview of a Spark application using Dataflint

These alerts are actionable, offering specific insights into what might be going wrong in your job. Over the past year, Dataflint has added more alerts (e.g. related to table formats), making it increasingly valuable. Furthermore, since the library is open source, you can also add custom alerts tailored to your needs.

Alerts identified by Dataflint for a Spark application

However, when managing a large number of Spark applications, it’s impractical to manually review each one for potential issues. A common approach is to focus on analyzing the largest and most expensive jobs, as optimizing these is likely to yield the biggest cost savings.

While this strategy is a good starting point, in practice, it often proves suboptimal because:

  • We’ve observed that up to 10% of Spark jobs have straightforward configuration issues (e.g. requesting too many resources).

  • The heaviest jobs are often written by the most experienced engineers and are therefore already quite efficient.

As a consequence, we wanted to analyze all Spark jobs automatically and detect whether they are running correctly.

Keeping the Fire Under Control: Bulk analysis made easy

Our goal is to flag jobs that are running suboptimally and require further analysis. We focus on efficient resource usage rather than optimizing for performance (wall-clock time). For example, adding more executors may reduce a job’s overall execution time, but it will likely also increase costs.

As a Spark execution platform, our primary focus is on optimizing resource usage, since performance issues are typically addressed by the use-case teams (e.g., if a job fails to meet the business SLA). We run Spark on Kubernetes, spinning up resources such as drivers and executors for each application individually. By reducing the resources used by our applications, we can directly lower the overall cost of running Spark for our customers.

After analyzing thousands of Spark jobs and reviewing best practices from others, we decided to calculate CPU efficiency for each job as our key indicator for “efficient” resource utilization.

CPU efficiency = (time tasks are executed on the executors) / (total time the executors were running)

This produces a metric ranging from 0 to 1, where 0 indicates the executors did no work, and 1 means they were consistently performing useful tasks. We then categorized the jobs into three groups:

  • 0%-35%: bad

  • 35%-70%: mediocre

  • above 70%: good
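As an illustration, the sketch below computes such an efficiency score from the Spark History Server REST API. The URL is a placeholder, the ExecutorSummary field names (totalDuration, totalCores, addTime, removeTime) are to the best of my knowledge, and it normalizes by the number of executor cores so the ratio stays between 0 and 1. Treat it as an illustration of the idea rather than our exact implementation.

```python
import requests
from datetime import datetime

HISTORY_SERVER = "http://localhost:18080"  # placeholder URL for your History Server

def parse_ts(value: str) -> datetime:
    # The REST API returns timestamps like "2024-12-12T10:15:30.000GMT".
    return datetime.strptime(value.replace("GMT", "+0000"), "%Y-%m-%dT%H:%M:%S.%f%z")

def cpu_efficiency(app_id: str) -> float:
    """Rough CPU efficiency: task time / (executor uptime * cores)."""
    app = requests.get(f"{HISTORY_SERVER}/api/v1/applications/{app_id}").json()
    app_end = parse_ts(app["attempts"][-1]["endTime"])

    executors = requests.get(
        f"{HISTORY_SERVER}/api/v1/applications/{app_id}/allexecutors"
    ).json()

    task_time_ms = 0.0
    core_time_ms = 0.0
    for ex in executors:
        if ex["id"] == "driver":  # only executors run tasks
            continue
        task_time_ms += ex["totalDuration"]  # total time spent running tasks (ms)
        start = parse_ts(ex["addTime"])
        end = parse_ts(ex["removeTime"]) if "removeTime" in ex else app_end
        core_time_ms += (end - start).total_seconds() * 1000 * ex["totalCores"]

    return task_time_ms / core_time_ms if core_time_ms else 0.0
```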

Spark efficiency is a good indicator because most issues in Spark jobs result in wasted CPU cycles. When a job is flagged as inefficient, you can use the Spark History UI or the Dataflint UI to investigate the root cause. In addition to displaying the efficiency of the most recent run, we also track how the efficiency has evolved over the past 30 days. This helps identify which changes had a positive or negative impact on efficiency.

This approach is effective for identifying “obvious” configuration issues (e.g., when someone copies and pastes code from a previous job and forgets to update its settings). However, it often wasn’t sufficient to motivate people to address more complex issues.

Sparking business enthusiasm

Ideally, we would fix Spark issues automatically, but that seemed unrealistic. The next best approach was to estimate how long it would take to resolve an issue and how much money could be saved. Since estimating both proved challenging, we focused on the potential cost savings. With that estimate, you can time-box the investigation of your application.

We calculate the average monthly cost of a Spark application by multiplying the number of runs by the cost of the resources used. In practice, the calculation is a bit more involved, as we only consider runs of the latest version of a given application. This way, the impact of any improvements made in the latest version shows up immediately.

Illustration of Conveyor Data indicating monthly cost and efficiency of Spark applications


By combining the monthly cost of an application with its average efficiency, we can estimate potential cost savings. For example, if a job costs €1000 per month and has an efficiency of 25%, improving the efficiency to 75% (a realistic target for most jobs) could save around €500 per month. While this isn’t perfectly accurate, it has proven to be a good rule of thumb.
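Written out as a small sketch, this is just the rule of thumb from the example above, not an exact cost model.

```python
def estimated_monthly_savings(monthly_cost: float,
                              current_efficiency: float,
                              target_efficiency: float = 0.75) -> float:
    """Rule of thumb: savings scale with the gap between the current and the
    target efficiency. A €1000/month job at 25% efficiency with a 75% target
    yields roughly €500/month of potential savings."""
    return monthly_cost * max(0.0, target_efficiency - current_efficiency)

print(estimated_monthly_savings(1000, 0.25))  # -> 500.0
```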

Conclusion

In this blog post, I explored various approaches for monitoring Spark applications. We began by discussing how to monitor individual applications using the Spark History UI and Dataflint. Personally, I prefer Dataflint because it offers actionable insights and is easier to understand.

Next, we looked at how to automate this analysis when scaling to 1000+ Spark applications using CPU efficiency as our key indicator. The main driver for analyzing suboptimal applications is to reduce costs. To help convince the business to allocate time for improving a Spark job, it’s also important to provide an estimate of the potential monthly cost savings.

Update! On March 20th, 2025 I will be speaking about optimizing batch processing on Kubernetes at a free online event; you can register here.
