Data Modelling In A Data Product World
Dec 7, 2024 • Kris Peeters
Many organisations are hitting the limits of data warehousing, especially as they grow in size.
They often see adopting data products as a solution. But just declaring data products does not give them the answers they are looking for. In this blog we dive into what the problem really is, and how organisations can work towards a solution.
Central data warehouses reach limits as they grow
A DWH usually starts small. It has one fact table and a few dimension tables. It serves a few reports. And life is good. However, as new reports are added, the DWH grows as well. New dimensions and facts are added. Some refactoring is needed because reality turned out to be more complex than initially assumed. And the IT landscape has changed, so we need to rewrite a few pipelines. Changes to the DWH take more and more time, up to the point that the business loses its patience.

Here are a few signs your DWH is grinding to a halt:
Onboarding a new engineer takes 3–6 months
The person who originally designed the warehouse has left and nobody really understands some design decisions that were made
Nobody dares to touch existing pipelines because they are afraid to break things
Adding a new item takes months of discussion before development has even started
Most modeling is done AFTER the DWH, in PowerBI or Tableau, or in Excel sheets on SharePoint.
Sound familiar? Well, then I’m sure you’ve heard the following excuses as well, which never resulted in any real change.
Superficial fixes don’t solve the underlying complexity
I’ve heard many bullshit reasons why DWHs are slow. Let’s first debunk those myths:
We need to migrate to shiny new tech X because the current stack has reached its limits. This is sometimes true on the surface. But when you look behind the scenes, DWH jobs are often horrible layers upon layers of fixes and patches which do fundamentally simple things, resulting in bloated execution. This complexity comes from adding new elements and changing elements over time.
We need to rewrite it from scratch. The previous team (never me) has failed in following good DWH principles: Fact tables are the wrong granularity, dimensions are too fragmented, the data quality is poor, … Why is this a bullshit reason? Because everyone gives this as a reason. If the failure rate to design a good DWH is so high, maybe it’s not about the previous team, but it’s about the complexity of building large data warehouses.
We need to add a Data Vault: It is true that Kimball star schemas are denormalised by nature. And as your data warehouse grows, adding a source means touching many fact and dimension tables. But Data Vaulting, although it has its merits, really only adds another layer of complexity. And you need a team of hyper-specialised experts to deal with that complexity. Even if you succeed, which I have yet to see in real life, that team immediately becomes the bottleneck in growing the DWH.
It’s the fault of the business. They changed requirements: I have news for you. Business will ALWAYS change their requirements. It’s a fact of life. Engineers will write bugs. Systems will fail. Analysts will miss key insights. Business will change their mind. This complexity will be present in any system. Your solution needs to be resilient to this kind of complexity.

See the common pattern? Complexity. Where is this complexity really coming from?
The root cause is that business is more complex than the centralised nature of a DWH
In a data warehouse, it’s important to align on conformed dimensions. So if we talk about a Customer, that is the same Customer across different fact tables. That’s where much of the reuse comes from. But the business reality is often more complex than that. If you are working in an organisation of any reasonable size, you realise that the business, by its nature, is federated. Here are some real-life discussions I’ve witnessed:
A supermarket chain not agreeing on what a cash register ticket is
An airline struggling to define what a flight is
A train company having different definitions of a train
A large education institution arguing over what a school is
An energy company having 3 definitions of Customer
Each of these discussions took months and often never saw a clear resolution. What is obviously true for one department is not necessarily true for another department. Take that train example. Sales could sell two trains from A to B to two different clients, while Operations runs them as one physical train. The business sold 2 trains, but operated 1 train.
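To make the train example concrete, here is a minimal Python sketch of how Sales and Operations can both be right about the same physical train. The class and field names are purely illustrative, not taken from any real system.

```python
from dataclasses import dataclass

# Operations models the physical vehicle that runs from A to B.
@dataclass
class OperatedTrain:
    train_number: str   # e.g. "IC 1234"
    departure: str
    arrival: str

# Sales models what was sold: a contract for a run, per client.
@dataclass
class SoldTrain:
    contract_id: str
    client: str
    operated_train_number: str  # reference to the physical train

# One physical train...
physical = OperatedTrain("IC 1234", "A", "B")

# ...sold twice, to two different clients.
sold = [
    SoldTrain("C-001", "Client X", physical.train_number),
    SoldTrain("C-002", "Client Y", physical.train_number),
]

# Operations reports 1 train, Sales reports 2. Both are correct
# within their own definition of "a train".
print(f"Operated trains: 1, sold trains: {len(sold)}")
```

Neither definition is wrong; the problem only appears when a central model is forced to pick one of them.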
The complexity of understanding these nuances and modeling them appropriately in the DWH, taking into account 100 similar nuances, makes it so that DWH development grinds to a halt at some point. There is not a single person or team that can keep all of that complexity in their head.
So the only sustainable solution is to somehow federate the data warehouse. But it’s hard to split a DWH across multiple teams, because there is tight coupling between its different components.
Data products to the rescue
Wait, federating a data warehouse? I can do that. I’ve heard all the data mesh hype. Let’s just build a bunch of federated data products, so we remove the central bottleneck. Mission accomplished, right?

Well, we haven’t done any modeling yet. Just declaring that you do data products hasn’t brought us any closer to a solution.
Modeling pitfalls of data products
A classic mistake is that we don’t do any data modeling whatsoever. The so-called “source-aligned data products” are just a copy of the source systems, and dashboards are built directly on top of them. Congrats, we are now “agile” because the business can do whatever they want, however they want.

This is bad because any change in the source can break any dashboard. There is zero reusability of logic and all the work has to be done in PowerBI.
The next iteration typically introduces actual source-aligned data products, which serve as an interface to the source data. On top of these data products, you can build data marts that are purpose-built for each dashboard. This is a big step forward for many companies, because there is now a separation between source and dashboard. But there is still not a lot of room for reuse: as the number of dashboards grows, the same logic has to be implemented over and over again.
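As an illustration, here is a rough sketch of what that separation could look like, assuming a Spark-based stack. The table names and transformations are invented for the example; the point is that the mart reads from the data product interface, never from the raw source.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source-aligned data product: a stable interface over the raw source.
# It renames cryptic source columns and applies basic cleaning, nothing more.
def customer_source_product():
    raw = spark.read.table("crm.raw_customers")   # hypothetical raw source table
    return (raw
            .withColumnRenamed("cust_cd", "customer_id")
            .withColumnRenamed("cust_nm", "customer_name")
            .filter(F.col("is_deleted") == False))

# Consumer-aligned data mart: purpose-built for one dashboard, reading only
# from data product interfaces, never from the raw sources themselves.
def churn_dashboard_mart():
    customers = customer_source_product()
    orders = spark.read.table("sales.orders_product")   # another hypothetical data product
    return (customers
            .join(orders, "customer_id", "left")
            .groupBy("customer_id", "customer_name")
            .agg(F.count("order_id").alias("order_count")))

churn_dashboard_mart().write.mode("overwrite").saveAsTable("marts.churn_dashboard")
```

If the CRM team renames a source column, only `customer_source_product` has to change; the dashboards on top keep working.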

Let’s add in a reusable layer then!

Oops, back to square one. As your data landscape evolves, pushing everything through a central data warehouse will become the bottleneck.
Build a decentralised data product landscape
The key is to create reusable data products while not having any central bottlenecks. For each use case, you can decide if you want to reuse a data product or not.
A Customer360 is a great example of a reusable data product.
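As a minimal sketch, again assuming a Spark-based stack and with invented table names, a Customer360 product could combine several upstream data products once and publish the result for many consumers to reuse:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical upstream data products, each owned by a different domain.
crm     = spark.read.table("crm.customers_product")
orders  = spark.read.table("sales.orders_summary_product")
tickets = spark.read.table("support.tickets_summary_product")

# The Customer360 data product: built once, reused by marketing dashboards,
# churn models, customer service tooling, ...
customer_360 = (crm
                .join(orders, "customer_id", "left")
                .join(tickets, "customer_id", "left"))

customer_360.write.mode("overwrite").saveAsTable("customer.customer_360_product")
```

Each use case can then decide to consume this product, or to go to the upstream products directly when the Customer360 definition doesn’t fit.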

There are many nuances in how to design data products. The smart people from Thoughtworks recently published a blog on exactly this topic and it is well worth the read: https://martinfowler.com/articles/designing-data-products.html
Visualise your data product landscape
One of the main drawbacks of a growing data warehouse is that everything starts to depend on everything, and it becomes very hard to understand which changes have an impact on which data marts.
You want to avoid this same mistake in the data product world. You want a good overview of which data products are being created, who owns them, what their status is, and how they depend on each other.
In its most basic form, you can do that by keeping a good Confluence page up-to-date. Make sure you use business terms so that business people can find their way in this landscape as well.
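If you prefer something machine-readable over a wiki page, even a small script can capture the essentials. The sketch below is purely illustrative: a few catalogue entries with an owner, a status and dependencies, plus a helper that answers “what is impacted downstream if this product changes?”

```python
from dataclasses import dataclass, field

# A minimal, illustrative catalogue of data products.
@dataclass
class DataProduct:
    name: str
    owner: str                  # business owner, described in business terms
    status: str                 # e.g. "draft", "live", "deprecated"
    depends_on: list = field(default_factory=list)

catalogue = {
    "crm_customers":  DataProduct("crm_customers", "CRM team", "live"),
    "orders_summary": DataProduct("orders_summary", "Sales ops", "live",
                                  depends_on=["crm_customers"]),
    "customer_360":   DataProduct("customer_360", "Marketing", "live",
                                  depends_on=["crm_customers", "orders_summary"]),
}

def downstream_of(name, catalogue):
    """Which data products are impacted if `name` changes?"""
    impacted = set()
    for product in catalogue.values():
        if name in product.depends_on:
            impacted.add(product.name)
            impacted |= downstream_of(product.name, catalogue)
    return impacted

print(downstream_of("crm_customers", catalogue))
# {'orders_summary', 'customer_360'}
```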
We’ve developed an open-source tool to do exactly that. The Data Product Portal creates visibility on all data products in your landscape, regardless of the technology stack used. On top of visualising your data products, it can also automate them: create data products following a certain template, manage access between data products, add users to data products, provision tooling like Snowflake or Databricks, …

Join us in building the Data Product Portal! Visit our product page; check out our open repo and be our next star on GitHub; join the community conversation on our Slack.
Monitor and evolve your data product landscape
Your business needs will change over time. So you have to adapt as well. Having a modular architecture allows you to do so. It’s important to keep identifying opportunities for reuse and for optimisation. Don’t build complex artifacts for a single dashboard. Abstract common logic into separate data products only when you see the need.
Over time, hopefully you can build more advanced data products as well:
You can move on from simple dashboards to AI use cases. This will require you to include more advanced technologies. But the fundamentals should remain the same. Make sure you have a common storage layer, using Open Table Formats such as Iceberg.
Once you’re ready to deliver business-critical use cases, you can start thinking about enforceable data contracts and SLAs (see the sketch after this list).
If you want everyone in the business to be able to build data products, you’ll need to make your platform much more self-service.
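As a rough illustration of what “enforceable” can mean in practice, a data contract can be as simple as an agreed schema plus a freshness SLA that is checked before a data product is published. The column names and thresholds below are made up for the example; in a real setup this would typically live in your platform tooling rather than in an ad-hoc script.

```python
from datetime import datetime, timedelta, timezone

# Illustrative contract: required columns with their types, and a freshness SLA.
CONTRACT = {
    "required_columns": {"customer_id": "string", "order_count": "bigint"},
    "max_staleness": timedelta(hours=24),
}

def validate_contract(schema: dict, last_updated: datetime) -> list:
    """Return a list of contract violations; an empty list means the contract holds."""
    violations = []
    for column, dtype in CONTRACT["required_columns"].items():
        if schema.get(column) != dtype:
            violations.append(f"column '{column}' missing or not of type {dtype}")
    if datetime.now(timezone.utc) - last_updated > CONTRACT["max_staleness"]:
        violations.append("data is staler than the agreed SLA")
    return violations

# Example check against a (fabricated) published schema:
violations = validate_contract(
    {"customer_id": "string", "order_count": "bigint"},
    last_updated=datetime.now(timezone.utc) - timedelta(hours=2),
)
print(violations or "contract satisfied")
```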
It’s important to keep in mind that you don’t try to take on all this complexity at once. You need to grow into it over time.

Did we solve the original problems with a DWH?
Looking back at what we said in the beginning, let’s see where we are. First let’s look at the excuses:
Migrating to a new tech: This is something you can do on a case-by-case basis. No need for a huge migration project. Of course, it’s important that you somewhat restrict the number of technologies you support.
Rewrite from scratch: Again, this can be decided on a case-by-case basis. Some of your data products will turn out to be horrible dragons. But that should have a much smaller impact on the overall landscape.
The need for a data vault: If you have a particular need for that in a particular part of the organisation, nobody stops you from modeling a data vault as an internal model within a data product. This is actually a good hybrid approach if you migrate from data warehousing to data products. However, my guess is that the need will go away together with the need to centralise everything.
Business changes requirements: Sure, and we can change as well. Better yet, some data products should be owned and operated by the business, which can then make the changes itself.
And what about the actual complexity: “The root cause is that business is more complex than the centralised nature of a DWH”. Did we take that complexity away? Well, yes and no. We don’t try to solve it all in one place. Instead, each domain and each data product can be explicit about what their definition of a certain object is. It is critical that this is made transparent to the consumer. Just letting every department do whatever they want is not the answer. This approach requires good governance and close alignment between parties.
Conclusion
Central data warehouses reach limits as they grow. Superficial fixes don’t solve the underlying complexity. The root cause is that business is more complex than the centralised nature of a DWH.
Data products come to the rescue: build a decentralised data product landscape, visualise it, and monitor and evolve it over time.