Are your AKS logging costs too high? Here’s how to reduce them
10.03.2025
•
Niels Claeys
At Conveyor, we’ve used Azure Log Analytics for 3+ years to store logs from our Kubernetes workloads, both batch and long-running apps.
One of the challenges we see at customers is the high cost of Analytics tables in Azure. In some cases, logging costs account for up to 20% of their total cloud bill.
In this blog post, we’ll walk through the improvements we’ve implemented to optimize log storage costs.

Reduce the amount of logs
Begin by identifying which namespaces and applications produce the highest volume of logs. Since Azure logging costs are dominated by the cost of storing the data, the best way to reduce them is to log less.
You can quickly analyze which application logs the most using the summarize function in Azure Log Analytics as follows:
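A minimal sketch of such a query, assuming Container Insights' ContainerLogV2 table (substitute your own table and column names if you ship logs elsewhere):

```kusto
// Ingested volume per namespace and container over the last week.
// _BilledSize is the billed size of each record in bytes.
ContainerLogV2
| where TimeGenerated > ago(7d)
| summarize IngestedGB = sum(_BilledSize) / 1e9 by PodNamespace, ContainerName
| order by IngestedGB desc
```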
At Conveyor, we manage the Kubernetes infrastructure for our customers and are thus in control of the kube-system namespace. Initially, this namespace was responsible for roughly 50% of all logs at our customers. Two categories of applications accounted for most of these logs.

The first category consists of the components that we install on the cluster. These include custom-developed applications as well as third-party components under our control (e.g. cilium-agent, cluster-autoscaler, …). Reducing the logging for these components is trivial, as it only requires changing an environment variable.
The second category consists of daemonsets installed by Microsoft as part of the AKS cluster (e.g. microsoft-defender, cni, …). We found their logs too verbose and rarely relevant for troubleshooting. As a result, we chose not to push these logs to Azure Log Analytics: we no longer pay for them, but we can still inspect them when connecting to our Kubernetes cluster. We only did this after the AKS cluster had been running stably in production for more than a month.
Filtering out these components can be done using a grep filter in Fluentbit as follows:
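A sketch of such a filter in the classic configuration syntax; the match tag, record accessor path, and component names are illustrative and depend on how your Kubernetes metadata is enriched:

```
[FILTER]
    Name     grep
    Match    kube.*
    # Drop logs from the noisy Microsoft-managed daemonsets.
    Exclude  $kubernetes['container_name'] (microsoft-defender|azure-cns|cloud-node-manager)
```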
In addition to optimizing our own components, we also guided our customers on logging best practices:
use a logging framework instead of plain print statements
change the log level depending on your environment (see the sketch after this list)
only log information relevant for debugging
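As an illustration of the second point, a minimal Go sketch that derives the log level from an ENVIRONMENT variable (the variable name is just a convention):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to info in production, debug everywhere else.
	level := slog.LevelInfo
	if os.Getenv("ENVIRONMENT") != "production" {
		level = slog.LevelDebug
	}
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level})))

	slog.Debug("only emitted outside production")
	slog.Info("always emitted")
}
```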
With these simple changes we were able to reduce the logging cost of our customers by 30%.
The Azure logs ingestion API
Reducing how much you log should be your starting point for getting your logging costs under control. However, we observed that as the number of jobs increased over time, our customers were again spending 20% of their total cloud bill on logging. At that point a customer pays as much for logs as for virtual machines, which seems crazy to me.
Last year, Azure introduced the logs ingestion API, which will replace the Data Collector API for processing logs and metrics. The new API requires you to explicitly define the schema for ingested logs, a significant improvement over the previous dynamic schema inference. In our custom log table, we have several unnecessary columns simply because a handful of logs included extra fields.
A second benefit of the new API is that it supports Basic tables, which are roughly five times cheaper to store than Analytics tables. For the full details, look here. The migration path towards the new API is well described in the docs. Unfortunately for us, several resources are not yet supported in Terraform:
Defining a schema when creating custom tables; the status is tracked in the following GitHub issue.
Creating a DCR (data collection rule) for a custom log table.
We can work around these issues by using azurerm_resource_group_template_deployment in Terraform until they are fixed, as sketched below.
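A sketch of that workaround for the custom table; the table name, columns, and referenced Terraform resources are illustrative:

```hcl
resource "azurerm_resource_group_template_deployment" "container_logs_table" {
  name                = "container-logs-table"
  resource_group_name = azurerm_resource_group.logs.name
  deployment_mode     = "Incremental"

  # Inline ARM template that creates a custom table with an explicit schema
  # and the cheaper Basic plan.
  template_content = jsonencode({
    "$schema"      = "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#"
    contentVersion = "1.0.0.0"
    resources = [{
      type       = "Microsoft.OperationalInsights/workspaces/tables"
      apiVersion = "2022-10-01"
      name       = "${azurerm_log_analytics_workspace.main.name}/ContainerLogs_CL"
      properties = {
        plan = "Basic"
        schema = {
          name = "ContainerLogs_CL"
          columns = [
            { name = "TimeGenerated", type = "datetime" },
            { name = "Message", type = "string" },
          ]
        }
      }
    }]
  })
}
```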
How to work with the new API
In order to use the new logs ingestion API, we need to replace the Fluentbit output plugin, switching from the azure output plugin to the azurelogsingestion plugin. Unfortunately, the new plugin only supports static credentials for applications registered in Microsoft Entra ID.
This is problematic because:
Relying on a static client_id and client_secret pair instead of temporary credentials is bad practice. It is also unnecessary, since Azure workload identity was introduced 3 years ago.
In order to register applications in Microsoft Entra ID, you need permissions on the Entra ID tenant, which we (rightfully) do not have at our customers.
Because of these issues, we cannot use the azurelogsingestion output plugin; if your situation or concerns differ from ours, it may still work for you. For us, the following options remained:
Use Fluentd instead of Fluentbit, as it has an output plugin that does work with Azure Workload Identity. We did not go for this approach: we migrated away from Fluentd a couple of years ago because it requires more resources to run and is less stable.
Contribute to Fluentbit and implement the missing functionality. I would love to do this if I were proficient in writing C.
Write a Fluentbit output plugin in Golang. Since all our code is in Go, I chose this option.
Writing a custom Fluentbit output plugin in Go
Fluentbit supports writing output plugins in Go and exposing them as C shared libraries. The documentation explains quite well how to write your own plugin. The main gotcha is that your output plugin must live in package main and contain an empty main function.
Writing the output plugin is not hard; the main difficulty is understanding the shape of the data Fluentbit hands you. The values you receive are always typed as interface{}, which is not very helpful in the beginning. After a while I noticed that some properties come in as byte arrays and others as strings. The interaction with the new Azure ingestion API is also straightforward. All in all, the output plugin contains roughly 300 lines of code and is available here.
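A condensed sketch of that structure, using the fluent-bit-go library; the registered plugin name and config key are illustrative:

```go
package main

import (
	"C"
	"unsafe"

	"github.com/fluent/fluent-bit-go/output"
)

//export FLBPluginRegister
func FLBPluginRegister(def unsafe.Pointer) int {
	// The name registered here is what fluent-bit.conf refers to.
	return output.FLBPluginRegister(def, "azure_logs_ingestion", "Push logs to the Azure logs ingestion API")
}

//export FLBPluginInit
func FLBPluginInit(plugin unsafe.Pointer) int {
	// Read settings from the [OUTPUT] section, e.g. the ingestion endpoint.
	_ = output.FLBPluginConfigKey(plugin, "dce_endpoint")
	return output.FLB_OK
}

//export FLBPluginFlush
func FLBPluginFlush(data unsafe.Pointer, length C.int, tag *C.char) int {
	dec := output.NewDecoder(data, int(length))
	for {
		ret, _, record := output.GetRecord(dec)
		if ret != 0 {
			break // no more records in this chunk
		}
		// Values are typed as interface{}; string fields often arrive as []byte.
		for key, value := range record {
			if raw, ok := value.([]byte); ok {
				record[key] = string(raw)
			}
		}
		// ... batch the records and push them to the ingestion API here ...
	}
	return output.FLB_OK
}

// The plugin must live in package main and define an empty main function so it
// can be compiled with -buildmode=c-shared.
func main() {}
```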
The plugin uses Azure Workload Identity to retrieve temporary credentials for the logs ingestion API. In the code we use the DefaultAzureCredential chain, which selects the right mechanism for retrieving temporary credentials. To let Fluentbit push logs to the new API, we need to assign it the Monitoring Metrics Publisher role, scoped to the data collection rule (DCR). To configure workload identity for our container, we need to specify the following environment variables:
AZURE_CLIENT_ID: the client id of the Fluentbit user managed identity
AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
AZURE_TENANT_ID: the id of the Entra ID tenant of your organisation
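With those variables in place, the credential and upload flow inside the plugin looks roughly like this (a sketch using the azidentity and azlogs SDK packages; the endpoint, DCR immutable id, and stream name are placeholders):

```go
package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/monitor/ingest/azlogs"
)

func main() {
	// On AKS with workload identity, DefaultAzureCredential picks up
	// AZURE_CLIENT_ID, AZURE_TENANT_ID and the federated token file
	// mounted into the pod.
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	client, err := azlogs.NewClient("https://my-dce.westeurope-1.ingest.monitor.azure.com", cred, nil)
	if err != nil {
		log.Fatal(err)
	}

	// The payload is a JSON array matching the schema of the custom table.
	payload := []byte(`[{"TimeGenerated":"2025-03-10T10:00:00Z","Message":"hello from fluentbit"}]`)
	if _, err := client.Upload(context.Background(), "dcr-00000000000000000000000000000000", "Custom-ContainerLogs_CL", payload, nil); err != nil {
		log.Fatal(err)
	}
}
```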
To generate a binary file (.so) of the output plugin that exposes the Go functions as a C-style API, you need to compile your Go code with -buildmode=c-shared. The resulting binary is quite large, as it packages the Go runtime and dependent libraries as well.
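For example (the library filename is just our convention):

```bash
go build -buildmode=c-shared -o out_azure_logs_ingestion.so .
```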
Deploying our custom output plugin
Before testing the output plugin, we need to set up the necessary infrastructure. Since Terraform doesn't yet support all the required resources, I created a not-so-simple Bash script to handle the setup; its main steps are sketched below.
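A condensed sketch of those steps; resource names are placeholders, and the data-collection commands require the monitor-control-service extension of the Azure CLI:

```bash
#!/usr/bin/env bash
set -euo pipefail

RG="logs-rg"
LOCATION="westeurope"
WORKSPACE="logs-workspace"

# 1. Custom table with an explicit schema and the cheaper Basic plan.
az monitor log-analytics workspace table create \
  --resource-group "$RG" --workspace-name "$WORKSPACE" \
  --name "ContainerLogs_CL" --plan "Basic" \
  --columns TimeGenerated=datetime Message=string

# 2. Data collection endpoint for the logs ingestion API.
az monitor data-collection endpoint create \
  --resource-group "$RG" --name "logs-dce" --location "$LOCATION" \
  --public-network-access "Enabled"

# 3. Data collection rule mapping the input stream to the custom table
#    (the rule definition lives in a separate JSON file).
az monitor data-collection rule create \
  --resource-group "$RG" --name "logs-dcr" --location "$LOCATION" \
  --rule-file "dcr.json"
```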
Besides specifying the environment variables needed to authenticate with Azure, two additional configuration files are required:
plugins.conf: This file specifies the path to the shared library containing our output plugin
output section in fluent-bit.conf: here you configure the additional settings of our output plugin (see the examples below)
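A sketch of both files; the paths, the plugin name, and the configuration keys are those of our plugin and are illustrative:

```
# plugins.conf
[PLUGINS]
    Path /fluent-bit/bin/out_azure_logs_ingestion.so
```

```
# fluent-bit.conf (output section)
[OUTPUT]
    Name          azure_logs_ingestion
    Match         kube.*
    dce_endpoint  https://my-dce.westeurope-1.ingest.monitor.azure.com
    dcr_id        dcr-00000000000000000000000000000000
    stream_name   Custom-ContainerLogs_CL
```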
Installation instructions for our custom Fluentbit plugin on Kubernetes can be found on my GitHub. The custom Docker image of Fluentbit is available on Docker Hub. The plugin is still a work in progress, and we're actively testing it, so we can't provide definitive cost-reduction figures yet.
Conclusion
In this blog post we explored strategies for reducing your logging costs on Azure. The first step is ensuring that your applications only log what's necessary. As a starting point, you can analyze which applications produce the most logs and see whether this can be optimized.
Secondly, we discussed Azure's new logs ingestion API and how it can improve both workflow efficiency and cost reduction. Lastly, we explained how we implemented a custom Fluentbit output plugin in Golang in order to work with the new API.
—
I will be covering key strategies for tweaking Kubernetes to run short-lived applications during our upcoming live webinar on March 20th 2025; register here if you'd like to join.