Stop loading bad quality data
May 19, 2025
•
Kris Peeters
Ingesting all data without quality checks leads to recurring issues. Prioritize data quality upfront to prevent downstream problems.
Rule number one of having good quality data: Stop loading bad quality data. Really, it’s that simple. I see so many companies make this mistake, and then wonder why their data quality is so poor.
The Fast Food of Data: Capture All the Data!
“Ingest first, ask questions later.” It’s basically the slogan of Big Data from a decade ago. In a way, it’s even worse than fast food, because with fast food, at least you know how many calories you’re ingesting — even if you choose to ignore it. With this strategy, you simply don’t know what’s in the data. That’s the whole idea. Ingest first, ask questions later.

Ingest all the data, push it to end users. Happy days.
It makes a lot of sense that companies start off their data initiatives by simply ingesting all they can. Many companies have been starved for data for decades. And no, the data warehouses of the 90s and 00s don’t count. That’s the equivalent of a North Korean starvation regime where only the happy few got access to the data and they decided how it was being used by whom. So when cloud and big data technologies came around, it made a lot of sense to use them to the fullest extent possible. Just like with fast food, this approach creates instant gratification. Your first use case is live in no time.
But the negative effects come later. The root cause of most data quality issues is that source systems change without notice. Their schemas change, their underlying technologies change, the infrastructure on which they are deployed changes, their security settings change, the whole purpose of their existence changes… all the time. Each outage takes a couple of hours to troubleshoot, debug and fix if you’re lucky. Even stable sources that only change once per year cause plenty of outages. It’s not unusual for data departments to ingest 1,000+ tables. If each table changes once a year on average, that means you have 1,000 changes per year, or 2–3 changes every day. The result is data teams spending the majority of their time in break-fix mode.

Any change in an upstream source will fail your pipelines
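To make that failure mode concrete, here is a minimal sketch of a downstream transformation hitting an unannounced upstream rename. It assumes a hypothetical pandas-based pipeline; the lake path, table and column names are all made up for illustration.

```python
# A minimal sketch of why an unannounced source change breaks the pipeline.
# The lake path, table and column names are hypothetical.
import pandas as pd

# Yesterday the ingested table had a column called "cust"; today the source
# team renamed it without telling anyone.
raw = pd.read_parquet("s3://data-lake/raw/crm/orders.parquet")

# Tonight's transformation still assumes yesterday's schema...
revenue_per_customer = raw.groupby("cust")["amount"].sum()
# ...and dies with a KeyError, and someone spends the next morning figuring out why.
```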
Don’t Just Treat the Symptoms!
Once it becomes painfully obvious that data quality is a problem, data teams usually take the initiative to fix things. Here are some things that only combat the symptoms, but don’t tackle the root cause:
Schema evolution: Ah, the source changes? No worries, we reflect all the changes in our data lake. Or better yet, we are doing data vaulting, so we can just add more satellites to capture the changed data. This doesn’t work because, however smartly you design your data pipeline, it can never be ready for future changes. A currency column changes from Text to Decimal, the cust column is renamed to customerid, the country field is dropped, or worse, is just left empty, or entire tables vanish into thin air.
Install (data quality) monitoring: Great that you know when things have crashed, but monitoring can’t prevent the crash in the first place. Data quality monitoring has its place, but it doesn’t solve bad quality data. It only makes it more visible (see the sketch after this list). More about that later!
Install data governance tooling: You know what, if we identify who owns each source, what’s in that source, and we ask them to document everything, then they will become more accountable! No, this generates a ton of (digital) paperwork that nobody reads. Like data quality monitoring, data governance tooling has its place, but it doesn’t solve this problem.
Install a change advisory board: We’re going to fix the process! Everyone who wants to bring any change to production must first announce it at the feared CAB, or Change Advisory Board, so it can be discussed with everyone who depends on the system. We go back to 4 big releases per year to keep it manageable. This doesn’t work because the source teams often don’t — or barely — know that their data is being used in analytics. It’s an afterthought. And even when the data team is made aware of the change in time, and gets the right new test data to prepare the migration of their pipelines, it still often goes wrong, because test data never resembles the actual production data. And you’re still working overtime to recover from the outages.
GenAI will solve all my problems: This should really go without saying, but I have to say it anyway, as some companies are 100% convinced this is the way: no, you can’t solve this issue with GenAI. I know data departments that basically stopped their data governance initiatives because “now you can just ask a chatbot any question about your data and it will answer it.” Oh, you will get an answer. But based on what? If your data is such a mess that an expert human can’t make sense of it, ChatGPT won’t either.
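To illustrate the monitoring point above: data quality monitoring typically runs after the data has already landed. Here is a minimal sketch of a detection-only check; the paths, columns and thresholds are hypothetical and this is not any specific monitoring tool.

```python
# A detection-only quality check: it runs after the load, so by the time the
# alert fires, the bad data is already in the lake and in the dashboards.
# Paths, columns and thresholds are hypothetical.
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    issues = []
    if df["country"].isna().mean() > 0.05:
        issues.append("country is empty in more than 5% of rows")
    if not pd.api.types.is_numeric_dtype(df["amount"]):
        issues.append("amount is no longer numeric")
    return issues

loaded = pd.read_parquet("s3://data-lake/raw/crm/orders.parquet")  # already ingested
for issue in check_orders(loaded):
    print(f"ALERT: {issue}")  # visibility, yes; prevention, no
```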
You Need an Interface to Your Source Systems
Software engineers have this one trick that they’ve used for decades. It’s called an API. Whenever software teams want to interact with each other, whether inside the company or with the outside world, they communicate through those APIs. What a team does behind each of those APIs is none of your business. They might change technologies, they might change underlying infrastructure or security settings. They might even change schemas. As long as they respect their APIs, you can consume them in your app.
But what about breaking changes? Well, they have a process for that. It’s called deprecating an API. It’s still annoying as hell. But change does happen, and some of it is breaking. When you’re on v1 of an API, the team can release a v2 and give you six months to a year to move over to v2, when you’re ready for it. Is it perfect? No. But it works.
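As a rough, framework-free sketch of what that looks like: both versions are served side by side, and v1 responses announce their retirement. The endpoint paths, payload shapes and the sunset date below are all hypothetical.

```python
# A toy sketch of API versioning with a deprecation window.
# Endpoint paths, payload shapes and the sunset date are hypothetical.
from datetime import date

V1_SUNSET = date(2026, 6, 1)  # hypothetical date after which v1 disappears

def get_orders_v1():
    body = {"orders": []}  # the old response shape, still served unchanged
    headers = {"Deprecation": "true", "Sunset": V1_SUNSET.isoformat()}
    return body, headers

def get_orders_v2():
    body = {"orders": [], "currency": "EUR"}  # the new, breaking response shape
    return body, {}

ROUTES = {
    "/v1/orders": get_orders_v1,  # consumers get months, not hours, to migrate
    "/v2/orders": get_orders_v2,
}
```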
Does that mean we all need to start hosting APIs now in the data world? Not at all. But we can start agreeing on interfaces with source systems. That means they prepare a set of tables that are a good representation of the data in their systems, without exposing every internal table. The source team decides on the format. They can make that schema available in their own database, or they can push the data to a common database/lakehouse that the whole company uses for analytics. If they ever have breaking changes, they can publish a v2 of those tables in a different schema, while still publishing v1 in parallel. Once all consumers have migrated to v2, you can kill the v1 interface.
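In a lakehouse that speaks SQL, parallel versions can be as simple as two schemas. A minimal sketch follows; all schema, table and column names (internal, crm_v1, crm_v2, orders) are hypothetical.

```python
# A minimal sketch of publishing a breaking change as v2 while v1 stays available.
# All schema, table and column names are hypothetical; the statement would be
# executed against whatever database or lakehouse hosts the interface.
PUBLISH_V2_IN_PARALLEL = """
CREATE SCHEMA IF NOT EXISTS crm_v2;

-- The new, breaking shape of the interface: amount becomes a DECIMAL and gains a currency.
CREATE OR REPLACE VIEW crm_v2.orders AS
SELECT order_id,
       customer_id,
       CAST(amount AS DECIMAL(12, 2)) AS amount,
       currency
FROM   internal.orders;

-- crm_v1.orders keeps being published unchanged; it is only dropped
-- once every consumer has migrated to crm_v2.
"""
```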
If you do this well, you design your interface so the data is easy to consume. Operational systems typically store data in normalised form, so you need to do 300 exotic joins to get any kind of insight. Instead, you can prepare your data for consumption by publishing 4 wide tables instead of 100s of smaller ones. It keeps your interface simple and understandable.
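And the interface itself can hide the normalised internals behind a few wide, consumer-friendly views. A minimal sketch with hypothetical source tables:

```python
# A sketch of one wide interface view that hides the normalised internals.
# Schema, table and column names are hypothetical.
ORDERS_WIDE_V1 = """
CREATE OR REPLACE VIEW crm_v1.orders_wide AS
SELECT o.order_id,
       o.order_date,
       c.customer_id,
       c.country,
       p.product_name,
       o.amount
FROM   internal.orders    o
JOIN   internal.customers c ON c.customer_id = o.customer_id
JOIN   internal.products  p ON p.product_id  = o.product_id;
-- Consumers query one wide view instead of reconstructing the joins themselves.
"""
```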

Clean separation of source systems and data products
Congratulations, you have just made your first Source-Aligned Data Product. We blogged about it before here: https://medium.com/datamindedbe/source-aligned-data-products-the-foundation-of-a-scalable-data-mesh-228528720bd1
Obstacles: Why Is This Not Commonplace Yet?
Virtually everyone in the data department that you talk to thinks this is a fantastic idea. And it has a clear precedent of success in the software world. So why are we not all doing it all the time? There are a few obstacles:
It causes extra work and responsibility for the teams operating the source systems: Those are different budgets, different backlogs, different departments… Why would they invest in this? They don’t really have an incentive for others to use their data. And if they want to do analytics themselves, they know their data best, so they have no issue connecting to it directly and running some queries.
It’s more difficult to set up than just copying the source data: There is nothing simpler than a JDBC connection and a SELECT * command that runs every night (see the sketch after this list). If you’re lucky, you don’t even need to talk to the engineer on the other side. You can copy all the data you want, when you want it, in the format you like best. You feel super productive. Some companies even have metrics on how many data sources have already been ingested into the lake.
These ideas are relatively new: While commonplace in software, this is relatively new in data. The concept of a data mesh was launched only a few years ago. Data product thinking is still a young concept. Data contracts mostly live in blogs and LinkedIn posts — not so much in production.
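For reference, the “simple” path looks something like this: a nightly full copy of whatever the source happens to contain. A minimal sketch of the anti-pattern; the connection string, table names and lake path are hypothetical.

```python
# The "ingest first, ask questions later" job: copy every table with SELECT *,
# with no schema check and no agreement with the source team.
# Connection string, table names and lake path are hypothetical.
import pandas as pd
from sqlalchemy import create_engine, inspect

source = create_engine("postgresql://user:pass@crm-db:5432/crm")

def nightly_full_copy() -> None:
    for table in inspect(source).get_table_names():            # whatever the source exposes today
        df = pd.read_sql(f"SELECT * FROM {table}", source)      # no contract, no questions asked
        df.to_parquet(f"s3://data-lake/raw/crm/{table}.parquet")
    # Works great on day one; breaks the day a column is renamed, retyped or dropped.
```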
How to Start Adopting Data Product Thinking for Source Data?
First Comes Value
When you don’t yet have a data use case in production that delivers business value, do that first. You will not get any more budget, or convince any outside team to do something for you, until you’ve proven significant value to the organisation. So hack the first use case together. Ingest data from the source directly. Forget everything I wrote above. Create value. How to do that is a subject for another post.
Work on Business-Critical Use Cases
The hard truth is that businesses generally don’t change unless they have a real reason to change. You might be creating tons of value with a few simple Tableau dashboards that need to be updated every month. Congratulations. This is a good place to be in. This probably means you only have a few data sources that are important, and they break once in a while. No worries, you fix it next week. Once you start doing active trading based on the data that you’ve ingested, or you build a customer-facing AI bot using data from operational systems, data quality issues can start to cost a lot of money, reputation, or both. Believe me, now you have the ear of senior leaders in the organisation, which is great, because you need it for the next step.
Get Leadership Buy-In
Jeff Bezos famously wrote an email to the whole of Amazon, basically decreeing the use of APIs to talk to each other. Chances are that your leadership is less tech-savvy than Jeff Bezos. Still, now is the time to ask for their help. Because, remember the first obstacle? This will cause work for operational teams. They will need budgets to do this. And besides budgets, they need to see the point. If you’ve done steps 1 and 2, it becomes easy to convince them of the value of the work you’re doing. But even then, there will be departments that are eager to jump on board, and departments that resist every change. Don’t try to change the whole organisation at once. Make your biggest believers successful. If other business leaders see this, they will soon turn around as well. And for the few laggards that inevitably stay behind, the company leadership can simply overrule them. The order is important here. Don’t start off as the data police, forcing everyone to do work for you before value is delivered. The onus is on you to convince them of the value you bring.
Install the Right Tooling to Monitor Progress and Bring Governance
If you want to give more responsibilities to the domains, you need to put some things in place. Diving into the specifics would be material for yet another blog, but here are the usual suspects:
Data quality tooling: Yes, you do need it. Because even with the best intentions, you will still create poor quality data. You need to detect those issues early, and ideally stop the spread of poor quality data downstream.
Data contracts: What APIs are to the software world, data contracts are to the data world. You need to agree on which data you make available for downstream consumption. A data contract can be a simple documentation page, but it’s better if it’s a solution that actively enforces those contracts, so you don’t accidentally break your own contract (a minimal sketch follows after this list).
Data catalog: It’s also helpful if you document at least your interfaces in a data catalog, so people can find out what each dataset means and who owns it.
Data product marketplace: If you work in a decentralised way, multiple teams will create data products. Where a data warehouse is similar to centralised communist production, data products are the equivalent of a free market. You want teams to be able to publish their data products and document their purpose, contents, and owners. You want other teams to be able to consume data products in a governed way. You want to see a high-level lineage of how your data is being used in the organisation. Check out our very own Data Product Marketplace here: https://github.com/conveyordata/data-product-portal. It’s open source and it’s being used by companies that are already mature in their data product journey.
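To make the data contract idea concrete, here is a minimal sketch of a contract enforced at publish time. The orders_v1 table, its columns and the publish step are hypothetical, and real setups would typically lean on a dedicated contract or schema tool rather than hand-rolled checks.

```python
# A minimal, illustrative data contract for a hypothetical orders_v1 interface table,
# enforced before publishing so a broken contract stops the producer, not the consumers.
import pandas as pd

ORDERS_V1_CONTRACT = {
    "order_id": "int64",             # expected column -> expected dtype
    "customer_id": "int64",
    "order_date": "datetime64[ns]",
    "amount_eur": "float64",
}

def enforce_contract(df: pd.DataFrame, contract: dict[str, str]) -> pd.DataFrame:
    missing = set(contract) - set(df.columns)
    if missing:
        raise ValueError(f"Contract violation: missing columns {sorted(missing)}")
    wrong_types = {
        col: str(df[col].dtype)
        for col, expected in contract.items()
        if col in df.columns and str(df[col].dtype) != expected
    }
    if wrong_types:
        raise ValueError(f"Contract violation: unexpected types {wrong_types}")
    return df[list(contract)]  # publish only the agreed columns, in the agreed order

# publish(enforce_contract(orders_df, ORDERS_V1_CONTRACT))  # hypothetical publish step
```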
Do the Legwork for the Federated Teams
Don’t forget they need help — even if they have the best intentions. One approach that has worked for us is that the data department takes on this responsibility for the most important data sources. That means the central data team creates the source-aligned data products. In those cases, it’s important that someone from the source domain becomes the Data Product Owner. Even if it means nothing more than a weekly meeting with the team, it’s critical that the business ownership of the data that is published is with someone who understands the data. And surprisingly for some, that is almost never the data team. We usually have no clue what the data means that we’re ingesting. And we shouldn’t have to. We can’t fit the data of a dozen different departments in our heads.
Gradually Push More Responsibilities to the Departments
You can’t truly scale data adoption in an organisation without each department taking responsibility for its own data. I’ve met teams that tried to do everything centrally, and they ran into inevitable bottlenecks. In one case I met a genius data analyst — let’s call him Frank. Frank is genuinely one of the smartest people I have ever met. He worked in the data department of the company for over a decade. He knew the data of the data warehouse inside and out. And he was diligent in documenting his findings, which resulted in literally 100s of pages of data descriptions. In meetings, he could answer in 5 minutes questions that a junior analyst would need 3 months to figure out. “Ah, you want this insight? Then you need to join Table A from database X with Tables B and C from the mainframe. Careful to ignore the custid column. I know it’s still in there, but since the change to a new system 4 years ago, that field can have inaccuracies if you try to do what you want to do. It’s better to get the custid from database Y. But my memory is a bit rusty; I would have to check the documentation again to see exactly how to do that.” “Wow, thanks Frank, we would never have figured this out without you. Not in a million years.” The thing is, I know only a few organisations that have a Frank. And even for the ones that do: Frank needs holidays as well. And Frank might be getting close to retirement. Also, as genius as Frank is, he can’t keep all the company data in his head. Frank is usually the first to admit this.
So sooner or later, responsibilities should move to the departments. For power-hungry data leaders, this might be a bitter pill to swallow, because you have less direct control. You need to ask yourself the question: am I here to implement all the data use cases? And the answer to that question is no. Your company also doesn’t have an email department that sends out all the emails. You are there to enable the organisation to use data to the fullest. The enabling work should be done centrally, in the data team. The more you can push the individual use cases to the departments, the more impact you can create. Think of it this way: instead of a central data team of 30 people, you now have a distributed data team of 300 people. And the best part is, while they all work to create value from data, only 10% of them are doing so on your budget.

Domain ownership of data products
Conclusion
Solving data quality issues can’t be done with tooling alone. It needs organisational change. Here’s a path that has been successful for us:
First comes value
Work on business-critical use cases
Get leadership buy-in
Install the right tooling to monitor progress and bring governance
Do the legwork for the federated teams
Gradually push more responsibilities to the departments