Everyone to the data dance floor: a story of trust

23.01.2024

Wannes Rosiers

Every company nowadays voices the ambition to become data-driven, and who am I to argue? In fact, I’ve had the privilege of guiding some companies towards this very ambition.

However, when this narrative is complemented with the claim of “making data available for everyone”, I dare to challenge it. What often gets made available are pre-assembled insights, marking the era of insights democratization, but not quite that of data democratization.

No worries: the era of data democratization will arrive soon. Recent developments are all about empowering individuals to interact with data. Efforts are ongoing to lower the technical barrier, from the rise of self-service BI tools, to the shift towards SQL with dbt, and even the advent of no-code/low-code ETL tools. The latest hype around AI/LLMs might just eliminate most technical skill requirements. However, as data becomes truly available to everyone, a new challenge arises: how to govern these widespread data interactions effectively.

What we need to govern

I believe it is fairly clear why broader access to raw data needs to be governed. I’m not talking about raw data as the foundational layer of your analytical data landscape. When I say raw data, I mean the asset previously utilized to create those pre-packaged insights, even when that asset is actually a data product. But let’s pause for a moment before diving into how to govern, and discuss what actually needs governing.

Governance requirements, in this context, include ensuring that data is utilized for a rightful purpose by someone entitled to fulfill that purpose. Beyond legal and compliance considerations, companies must also take the ethical aspect into account. To make this happen, data must not only be made available but also discoverable. And when people use it, you need to reassure those users that it remains trustworthy.

For those familiar with typical data language: you need to enable purpose-based access control, you need to establish a data catalog, and you need to embrace data observability. Today I want to elaborate upon trustworthy data and its relationship with data observability.

When does data earn trust?

The quest for a definition will lead us towards the concept of reliable data. Data should be complete, consistent, and accurate: reflecting the actual state without gaps, and remaining stable over time. But how do we know and measure this? And is reliable data synonymous with trustworthy data? Trustworthy data builds confidence in its users, encouraging them to rely on it for extracting insights and making decisions. Trust and confidence are soft concepts, and fortunately so: there are other ways to build a trust relationship than measuring accuracy, which is close to impossible.

  • Know the publisher: When you investigate something, you check your sources. An article with an author is more trustworthy; a headshot adds a human touch and you might even be inclined to trust the article more. (Note: this psychological behavior might change with the latest generative AI possibilities.)

  • Trust the publisher: Even if you know the author, you want assurance that they are knowledgeable on the subject. Most football coaches are ex-players for a reason: it’s easier to trust them because they’ve been there, they know how it works.

  • Trust the delivery guy: Ever received an envelope sealed with adhesive tape? I guess your first thought was: “Why was my mail opened?” If you don’t trust the delivery process, you won’t trust the content either: it could have been tampered with by someone you don’t know!



This is not so different from the quest for complete, consistent, and accurate data. You want to know the data owner and be convinced that they can create an accurate data product reflecting the business process they know well; the business expert here is akin to the ex-professional football player. And you need assurance about the delivery process: your publisher sends the complete dataset repeatedly, but are you sure you receive everything?

I’ve written multiple articles on federated data ownership, mostly in the context of data mesh. This aligns with knowing and trusting the publisher. But how do you establish trust in the delivery guy?

The data delivery dance

In the setting of data, it boils down to trusting that your business expert created their data product themselves (hello again, data democratization) and that it remains unaltered before you interact with it. This process must be repeatable.

Rather than crafting the most intelligent data quality measures, one can partially cover this by monitoring access and changes: when was data last updated, and by whom? In the context of repeatable processes for data products, that ‘whom’ is likely a data pipeline. So this translates to: which version are we on, and who upgraded to this version? It’s quite similar to verifying whether your business expert created the data product themselves.

Don’t get me wrong: implicitly I am referring to a latency check as a data quality measure, overlooking completeness checks and many others. There is definitely value in those, yet as a starting point I would prefer to monitor changes in the process. Adherence to the process is far easier to define than pinpointing stellar data quality measures. A minimal latency check could look like the sketch below.
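To make that concrete, here is a minimal sketch of such a latency (freshness) check. The orders table and its updated_at column are made-up names, and SQLite merely stands in for whatever warehouse you actually query:

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def is_fresh(conn, table: str, column: str, max_age: timedelta) -> bool:
    """Check whether the newest row in `table` is younger than `max_age`."""
    latest = conn.execute(f"SELECT MAX({column}) FROM {table}").fetchone()[0]
    if latest is None:
        return False  # an empty table can never be considered fresh
    age = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
    return age <= max_age


# Demo against an in-memory database standing in for your warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (1, ?)",
    (datetime.now(timezone.utc).isoformat(),),
)
print(is_fresh(conn, "orders", "updated_at", max_age=timedelta(hours=24)))  # True
```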

The key things to monitor initially are hence:

  • When did the pipeline run?

  • Which version of the pipeline was run?

  • Who built that version?

Let’s walk through it from top to bottom. Every orchestration tool includes monitoring of pipeline runs among its core functionalities. Think of Airflow, Dagster, Prefect: they all offer an overview of pipeline runs. Versioning, however, is not yet a core functionality. Take, for example, this response from the Airflow Core Committers team: it shows that the team deems versioning valuable within an orchestrator. Next to this, orchestrators lack an overarching concept on top of DAGs: you are either confined to putting the entire pipeline in a single DAG, or you miss a full version overview.
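In the meantime, you can stamp the version onto the pipeline yourself. Below is a minimal sketch of one such workaround, not an Airflow feature: it assumes your deployment process injects the git commit SHA and its author as environment variables (GIT_SHA and GIT_AUTHOR are hypothetical names, as is the DAG itself). The DAG surfaces the version as a tag in the UI and logs it on every run:

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical variables, assumed to be injected at deploy time by your CI pipeline.
PIPELINE_VERSION = os.getenv("GIT_SHA", "unknown")
PIPELINE_AUTHOR = os.getenv("GIT_AUTHOR", "unknown")


def log_provenance(**context):
    """Log which version of the pipeline ran and who built that version."""
    print(
        f"dag={context['dag'].dag_id} run_id={context['run_id']} "
        f"version={PIPELINE_VERSION} built_by={PIPELINE_AUTHOR}"
    )


with DAG(
    dag_id="orders_data_product",          # made-up pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    tags=[f"version:{PIPELINE_VERSION}"],  # surfaces the version in the Airflow UI
) as dag:
    PythonOperator(task_id="log_provenance", python_callable=log_provenance)
```

Dagster and Prefect offer similar tagging mechanisms, so the same trick carries over to those orchestrators.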

At the heart of trusting the data delivery process hence lies the ability of business experts to build their own data products, and a heavier focus on data pipeline observability rather than data observability.

Wrapping it up

The democratization of insights has laid the foundation, but now true data democratization is knocking on the door. As the gates to data slowly open, a critical concern emerges: how do we govern this newfound data democracy effectively? The answer lies not just in enabling access but in fostering trust, in both the origin and the delivery process. It’s a dance with data, where knowing the dance partners (publishers and pipelines) becomes as crucial as the dance itself.


Rather than focusing on data observability, a good starting point might be to introduce pipeline observability. Let’s keep an eye on the core principles: when did the pipeline run, which version marked its stride, and who crafted that version? This will allow you to close your eyes and be guided by your dance partners. We are not completely there yet, but soon those lacking confidence on the dance floor will be completing the most beautiful data tango.
