Portable by design: Rethinking data platforms in the age of digital sovereignty
01.07.2025
•
Niels Claeys
Build a portable, EU-compliant data platform and avoid vendor lock-in—discover our cloud-neutral stack in this deep-dive blog.
Recent geopolitical developments and past legal rulings on data privacy, such as Schrems II, have raised concerns about relying exclusively on US-based hyperscalers. At Dataminded, this prompted us to investigate how data platforms can be designed for portability across non-US cloud providers. In this blog post, we present a portable data platform and demonstrate that it can run on multiple EU-based cloud providers.
The risks of depending on US cloud providers
To begin, it is important to understand why some companies are considering a move away from US hyperscalers, despite their mature service offering. Increasing regulatory pressure and geopolitical uncertainty mean that data platform strategy is no longer only a technical concern but also a business concern.
The legal risk
Five years ago, the Schrems II ruling invalidated the Privacy Shield framework. In a nutshell, it means that for EU companies:
Sensitive EU data cannot be processed or stored by US-based cloud providers without significant legal risk.
Despite ongoing negotiations, no successor agreement provides true safety for EU data. At the same time, the second Trump presidency has given rise to more geopolitical uncertainty. The likelihood that US authorities could try to access EU data through one of the US surveillance programs has only increased.
Secondly, for financial institutions, the EU’s Digital Operational Resilience Act (DORA) came into effect at the beginning of 2025. It mandates stricter controls over ICT risks, with a focus on reducing dependency on a small number of third-party technology providers.
Risk of vendor lock-in
With the large number of convenient, easy-to-use services offered by US hyperscalers, we are increasingly locking ourselves into their ecosystems. In doing so, we seem to be overlooking the downsides of becoming too dependent on proprietary technologies. Have we forgotten the lessons of the past? In many ways, these hyperscalers are the modern, cloud-native version of Oracle, IBM, and Teradata in the 1990s.
The US hyperscalers have become more and more crucial in building and running data platforms within organisations. Can we build a modern data platform without the US hyperscalers, and without losing the benefits they brought (e.g. near-infinite scalability, fine-grained security, support for AI workloads)?
Designing for portability: fewer dependencies means greater flexibility
If we cannot rely on all the services provided by the US hyperscalers, the question becomes: what are the minimal building blocks needed for such a platform? Most of the time, the data platform team does not want to manage the complete stack, from hardware up to the platform services, as that makes its scope too broad. Ideally, these core services are managed by a third-party provider or another team. Additionally, we want to keep our development flow (e.g. using a cloud-native stack, distributed data products, and so on). In our view, you need the following core services for your data platform:
Blob Storage: The foundation of your data lake as it stores your structured and unstructured data.
Block Storage: A way to attach disks to your virtual machines and databases.
Kubernetes: A standardized orchestration layer for deploying and managing containerized workloads.
A Managed Database: For operational or metadata storage. If necessary, we could manage this ourselves, but as the data is critical, we prefer to have it supported out of the box.
Load Balancer: To ensure reliable access to our platform and support scaling our applications.
The good news is that these core services are supported by multiple providers: EU-based clouds (e.g. Scaleway, OVH), but also OpenShift, which can be used on-premise. By relying on only these services, we keep our options open to switch providers if we need to. Depending on the risk to your organisation, you could use this as a fallback scenario, to simplify a future migration, or even in a hybrid setup across US- and EU-based cloud providers.
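To make the portability idea concrete, here is a minimal, hypothetical Python sketch of what coding against this small contract could look like. The `CoreServices` fields and the `Provider` protocol are illustrative assumptions, not a real SDK; the point is that the platform layer depends on these five services and nothing else.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CoreServices:
    """The only endpoints the platform layer is allowed to depend on."""
    blob_storage_endpoint: str  # S3-compatible object storage
    block_storage_class: str    # disk class for VMs and databases
    kubeconfig_path: str        # credentials for the managed Kubernetes cluster
    database_dsn: str           # managed (e.g. PostgreSQL) database
    load_balancer_host: str     # entrypoint for user-facing traffic


class Provider(Protocol):
    """Contract that every cloud provider integration must fulfil."""

    def provision(self) -> CoreServices: ...


def deploy_platform(provider: Provider) -> None:
    # Trino, Airflow, etc. are configured purely from these endpoints,
    # so swapping providers never touches the platform code itself.
    services = provider.provision()
    print(f"Deploying platform behind {services.load_balancer_host}")
```

In practice, this contract lives in infrastructure code rather than Python, but the discipline is the same: each provider gets its own implementation, and everything above it stays untouched.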
Portable, queryable, and ready: our SQL-first data platform
With the foundational components in place, the next step is to build a modern, enterprise-grade data platform. Based on our experience in previous projects, a platform centered around scheduling SQL jobs can be highly effective. By exposing only two interfaces, namely job scheduling and a SQL endpoint, the platform is easy to secure. It also simplifies migrations: as long as these two interfaces stay stable, users will not notice anything. Our stack looks as follows:

A data platform that can easily run on US- or European-based clouds as well as on-premise.
The core components are:
Trino as our open-source distributed SQL engine. It is designed to handle both small and large datasets and, together with blob storage, acts as our cloud-agnostic data warehouse. This is one of the crucial components of our stack, as most data warehouses are tied to one specific provider (see the first sketch after this list for how users interact with it).
Apache Airflow as the de facto standard for scheduling jobs and building data pipelines. If you prefer another scheduler, you can swap Airflow out for it.
Open Policy Agent (OPA) as our general-purpose policy engine. We want to enforce fine-grained access control over who can read and write datasets. To do this, OPA integrates with Trino and evaluates the data access policies to decide whether a user or an application is allowed to execute a given query (see the second sketch after this list).
Lakekeeper as our data catalog. This is the central place where we store the metadata for our Iceberg datasets. A separate data catalog is required when using Trino because the built-in JDBC catalog implementation is not production-ready. The data catalog is the entrypoint to your data and therefore crucial to make portable across providers. We chose Lakekeeper due to its deep integration with Trino and OPA for fine-grained data access; more on this in a future blog post.
Traefik as the proxy in front of every user-facing component of our data platform. We like that Traefik is user-friendly and can discover Kubernetes services automatically, but you might as well replace it with Nginx or HAProxy.
Zitadel to handle authentication for our users as well as our applications, and to integrate with your company’s identity provider. We like that Zitadel is easy to use and has a good Terraform provider, but you could also use Keycloak.
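As a first sketch, this is how a user might talk to the platform’s SQL endpoint with the open-source `trino` Python client. The hostname, credentials, catalog, and table names are hypothetical placeholders for whatever your deployment exposes.

```python
import os

from trino.auth import BasicAuthentication
from trino.dbapi import connect

# Hypothetical endpoint: Traefik routes trino.platform.example.com to the
# Trino coordinator; the "lakehouse" catalog is assumed to be an Iceberg
# catalog backed by Lakekeeper.
conn = connect(
    host="trino.platform.example.com",
    port=443,
    http_scheme="https",
    user="analyst",
    auth=BasicAuthentication("analyst", os.environ["TRINO_PASSWORD"]),
    catalog="lakehouse",
    schema="analytics",
)

cur = conn.cursor()
cur.execute("SELECT order_date, sum(amount) FROM orders GROUP BY order_date")
for row in cur.fetchall():
    print(row)
```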
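And a second sketch of the access-control path: Trino’s OPA access-control plugin sends each authorization decision to OPA as a JSON document. The snippet below imitates such a request against OPA’s standard REST data API; the policy path and the exact input shape are simplified assumptions based on the plugin’s documented format.

```python
import requests

# OPA's data API: POST /v1/data/<policy path> with an "input" document.
# "trino/allow" is a hypothetical policy path for this platform.
OPA_URL = "http://opa.platform.svc:8181/v1/data/trino/allow"

decision = requests.post(
    OPA_URL,
    json={
        "input": {
            "context": {"identity": {"user": "analyst"}},
            "action": {
                "operation": "SelectFromColumns",
                "resource": {
                    "table": {
                        "catalogName": "lakehouse",
                        "schemaName": "analytics",
                        "tableName": "orders",
                    }
                },
            },
        }
    },
    timeout=5,
)

# OPA answers {"result": true} if the policy allows the query.
print(decision.json().get("result", False))
```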
From theory to practice: deploying our stack on multiple providers
As always “the proof of the pudding is in the eating”, which is why we deployed this stack to evaluate its portability in practice.
Because Dataminded is not a hardware company, we do not have any servers to test OpenShift on. Luckily, we can deploy our stack onto multiple EU cloud providers. If you are interested in the code or in replicating our setup, take a look at the following GitHub repository.
We successfully ran the proposed stack on OVH, Exoscale, Scaleway and UpCloud. Only on Hetzner were we unable to run our full stack out of the box, as it does not offer managed Kubernetes. After deploying our stack, we ran some basic end-to-end pipelines to verify that:
The stack is portable across different environments.
OPA can be used to restrict which datasets applications can access through Trino.
Users can write and deploy DAGs that execute dbt jobs.
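As an illustration of that last check, a minimal Airflow DAG along these lines could schedule a dbt build against the Trino-backed warehouse. The DAG id, paths, and schedule are hypothetical, and we assume Airflow 2.x with the dbt project and its Trino profile baked into the worker image.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_build",      # hypothetical DAG name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Runs dbt against the Trino-backed warehouse; assumes the dbt project
    # and its profiles.yml live at /opt/dbt inside the worker image.
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )
```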
Our initial experiences with the EU cloud providers are summarized below:

We know that running a proof of concept is only a first step; there are many aspects to investigate further, such as performance, tightening security, testing stability, and supporting non-SQL jobs.
In future blog posts we will share our experiences with the different EU cloud providers and discuss the next steps for our platform, so stay tuned.
Conclusion
We’ve demonstrated that our data platform can be deployed successfully across multiple EU-based cloud providers. This confirms that, by limiting external dependencies, it’s possible to retain full control over where your platform runs.
In upcoming posts, we’ll take a closer look at the EU cloud providers and explore current platform limitations: performance benchmarks, implementing fine-grained permissions for non-SQL jobs, and evaluating enterprise-level features.
If you’re working on similar challenges or have a different perspective, leave a comment or reach out directly. To explore the platform yourself, the code is available in our GitHub repository.