When writing SQL isn't enough: debugging PostgreSQL in production
Jul 22, 2025
•
Niels Claeys
SQL alone won’t fix broken data. Debugging pipelines requires context, lineage, and collaboration, not just queries.
At Conveyor, PostgreSQL has been the backbone for persisting data. We chose PostgreSQL because of its reliability, rich feature set, and high performance. While writing efficient SQL queries is essential, it is not enough to operate a database at scale.
As your data grows, query patterns shift and edge cases emerge. To solve them you need to understand database internals such as indexes, query planning, and how data is represented on disk. I will illustrate some of these internals using three production issues. Whether you are responsible for keeping a SQL database running at scale or simply curious about the kinds of problems that can arise, this blogpost is for you.
For context, our large tables contain over 5 million records, are roughly 10 GB in size, and receive multiple updates every second. At that point you need to be careful about how you express your queries in order to keep the user interface responsive.

A human migration error
When making changes to one of our large tables, we must be careful that a schema migration only changes the table schema (DDL) and does not update every record (DML). At the beginning of this year we rolled out what looked like an innocent change: instead of only showing an execution_timestamp, we wanted to show the execution interval with a start and end date. Our migration script looked as follows:
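In outline (the column names execution_start and execution_end are illustrative, not the exact ones we used):

```sql
-- Add the new interval columns: DDL only, fast
ALTER TABLE application_run ADD COLUMN execution_start timestamptz;
ALTER TABLE application_run ADD COLUMN execution_end timestamptz;

-- Backfill every existing row from the old column: DML, rewrites the whole table
UPDATE application_run
SET execution_start = execution_timestamp,
    execution_end   = execution_timestamp;

-- Enforce the new columns
ALTER TABLE application_run ALTER COLUMN execution_start SET NOT NULL;
ALTER TABLE application_run ALTER COLUMN execution_end SET NOT NULL;
```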
The root cause
While rolling out this change, we noticed that our new API did not come up because it was stuck on the migration. When troubleshooting the issue, we saw that our application_run table had doubled in size, was locked, and no longer handled any changes.
We quickly figured out that the migration was problematic because of the update statements: they touch every row of the table and are applied in a single transaction. Since the table contained 5-10 million rows, this operation took a long time.
Mitigating the impact
We cancelled the rollout of the latest version of our API. Because we use a blue-green strategy when rolling out new versions, the old version of our API was still running. Removing the latest release ensured that the lock on the application_run table was released and the old API could continue serving requests.
The fix
We removed the update statements as well as the not null constraints from the migration. Both changes could be handled in application code without issues. After making these changes we rolled out a hotfix release.
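In outline, the trimmed-down migration only touched the schema (again with illustrative column names):

```sql
-- DDL only: no backfill, no NOT NULL; both are handled in application code instead
ALTER TABLE application_run ADD COLUMN execution_start timestamptz;
ALTER TABLE application_run ADD COLUMN execution_end timestamptz;
```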
Updating/deleting rows is slow
This case started with a routine check of our slow query logs. The person on support periodically checks these logs to see whether any queries can be optimized. We noticed that operations on our nodes table were taking several seconds where they used to take 50-100 ms.
The root cause
After a quick analysis, we identified that the create and update operations were slow because there was no index on the tenant_id field. As both operations always filter on tenant_id, a simple index ensures that they run a lot quicker.
As we are changing a large table, we need to create the index without locking the full table for a long time. Luckily PostgreSQL allows you to create an index concurrently using the following syntax:
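Applied to our case (the index name is illustrative, the table and column names are the ones above):

```sql
-- Build the index without holding an exclusive lock on the table
CREATE INDEX CONCURRENTLY idx_nodes_tenant_id ON nodes (tenant_id);
```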
💡 Creating an index concurrently is only allowed outside a transaction.
Bulk delete operations take a long time
Another problematic query was being used by our cleanup job. The query removes all nodes that are older than a year and looks as follows:
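Roughly (assuming the created column holds the node's creation timestamp):

```sql
-- Remove every node older than one year
DELETE FROM nodes
WHERE created < now() - interval '1 year';
```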
As a first step, we again added an index, this time on the created timestamp. Unfortunately, adding the index did not improve the bulk delete operation because the index was not used. This can be validated by running explain on the SQL statement. The result looked something like:
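Something along these lines, with illustrative costs and row counts:

```
Delete on nodes  (cost=0.00..312500.00 rows=1500000 width=6)
  ->  Seq Scan on nodes  (cost=0.00..312500.00 rows=1500000 width=6)
        Filter: (created < (now() - '1 year'::interval))
```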
The root cause
Apparently an unbounded query does not use the index, but a bounded one will. To test this, we converted the query as follows:
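A sketch of the bounded variant, assuming rows older than two years were already removed by previous cleanup runs so the lower bound is safe:

```sql
-- Bounded on both sides so the planner can use the index on created
DELETE FROM nodes
WHERE created BETWEEN now() - interval '2 years'
                  AND now() - interval '1 year';
```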
Running the new query on our table reduced the query time from 25 seconds to 500ms.
EBS byte balance drops to zero
This metric is specific to RDS and means that you are reading and writing more than your database instance's quota allows. We encountered this drop once a day around 1 o’clock at night, which is around the time we see the highest load on our database. As a result, database operations during that hour were slower.
The root cause
When looking at the RDS metrics as well as the performance insights, we noticed that it was caused by the autovacuum process in PostgreSQL.

The autovacuum process is responsible for freeing up dead tuples, which are old versions of updated rows or rows marked for deletion. This is a PostgreSQL-specific process and a consequence of PostgreSQL implementing an update as inserting a new row version and marking the old one as dead. Since our two biggest tables receive frequent updates on every row, the autovacuum process has to remove a lot of dead tuples.
Note: If you do not have RDS performance insights enabled, you can use the following query to find out when the vacuum process ran on your table:
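One option is the pg_stat_user_tables view (the table name is illustrative):

```sql
-- Last manual and automatic vacuum/analyze timestamps for a table
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'nodes';
```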
The fix
After understanding the vacuum process, we believed it would be beneficial to run it more frequently (e.g. once every hour). We implemented this as a simple cron job that runs:
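In essence (table names taken from the examples above):

```sql
-- Vacuum the busiest tables and refresh the planner statistics
VACUUM ANALYZE application_run;
VACUUM ANALYZE nodes;
```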
The analyze statement inspects the table and collects statistics on the distribution of values for every column. Running vacuum more frequently gives us the following benefits:
Spread out the vacuum process over the day instead of only during peak hours.
Reduce the disk space and I/O operations required for the vacuum process
Increase the query performance of our table as it will be smaller
Conclusion
In this post, I walked through three real-world PostgreSQL issues we encountered over the past year. These incidents reinforced a few key lessons:
Robust monitoring is essential: it’s the difference between early detection and user-facing impact.
Deep database knowledge matters: understanding PostgreSQL’s internals is critical for diagnosing and resolving (performance) issues effectively.
PostgreSQL remains my favorite database: it is reliable, easy to manage, and easily scales to tables with 10+ million rows.
Building a reliable product isn’t just about writing good code; it’s about putting the right guardrails, tools, and processes in place to support it in production. The more experience I gain with PostgreSQL, the more confident I am that it’s the right tool for the job.