Locking down your data: fine-grained data access on EU Clouds
08.09.2025 • Niels Claeys
Secure Iceberg data on EU clouds with fine-grained access for SQL, Python & Spark using Lakekeeper, Zitadel, and remote signing.
This post builds on my earlier blog “Building a Portable Data Platform on Top of the EU Clouds”. In that blogpost, I focused on building a data platform that supports SQL for both ad-hoc analysis and scheduled jobs. The platform used Trino as the query engine and stored the data as Iceberg tables on S3-compatible storage.
In this blogpost, I want to go a step further and investigate how to restrict data access for both SQL and Spark/Python applications on EU clouds. Most enterprises run anywhere from tens to thousands of applications, which makes controlling data access for each application crucial. A common approach is to apply the principle of least privilege in order to:
Comply with regulatory requirements.
Reduce the risk of unintentional mistakes and deliberate misuse of data.
Fine-grained data access on AWS
When deploying applications on AWS, it’s common to attach a dedicated IAM role with only the required permissions. The main drawbacks are that this approach is AWS-specific and that it requires configuring application permissions in two places: Lakekeeper and AWS IAM.
An alternative, which is portable across clouds, is to treat the data catalog as the central authority for access control. As a first step, let’s explore how this works when the cloud provider supports STS (e.g. AWS, MinIO). STS allows you to generate temporary credentials that are scoped down to only allow access to the requested Iceberg tables.
Here’s what happens in practice: when an application requests access to an Iceberg table, the request first goes to Lakekeeper. It checks whether that user or application is actually allowed to access the Iceberg table. If the answer is no, the request stops right there.
If access is granted, Lakekeeper replies with a bunch of information about the Iceberg table: the path to the latest metadata file, statistics, snapshot information and most importantly the vended credentials.
These credentials are temporary STS tokens that the application can later use to fetch the metadata and read the underlying Parquet files from S3. An example response for S3 is shown below; the full specification is defined in the Iceberg REST API spec.
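The sketch below illustrates what such a response can look like, with the vended credentials in the config section. All values are placeholders and the exact fields Lakekeeper returns may differ slightly:

```python
# Abridged, illustrative LoadTableResult following the Iceberg REST API spec.
# All values below are placeholders.
load_table_response = {
    "metadata-location": "s3://lakehouse/sales/orders/metadata/00003-abc.metadata.json",
    "metadata": {},  # schemas, snapshots, statistics, ...
    "config": {
        # temporary STS credentials, scoped down to this table's S3 prefix
        "s3.access-key-id": "ASIAEXAMPLE",
        "s3.secret-access-key": "<redacted>",
        "s3.session-token": "<token embedding the base64-encoded sessionPolicy>",
        "s3.region": "eu-west-1",
    },
}
```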
The response includes a session token that is scoped down to the exact S3 path of the requested table, ensuring that the application can only access that table and nothing else. The access rules are captured in the sessionPolicy attribute of the session token, which is a base64-encoded string.
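For illustration, a decoded sessionPolicy could look roughly like the following. This is a minimal sketch; the exact statements Lakekeeper generates will differ:

```python
# Illustrative, decoded sessionPolicy restricting access to a single table's S3 prefix.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::lakehouse",
                "arn:aws:s3:::lakehouse/sales/orders/*",
            ],
        }
    ],
}
```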
By first authorizing the request in Lakekeeper and only then issuing the required credentials, we make sure that each application can access only the data it has the rights to. This mechanism works the same way whether you’re running SQL queries through Trino or working with Spark and Python applications.
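As an illustration of the Spark side, the snippet below configures a Spark session against Lakekeeper’s Iceberg REST catalog and asks it to vend scoped credentials. Catalog name, URIs, warehouse and credentials are placeholders, and you still need the matching iceberg-spark-runtime package on the classpath:

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session using the Iceberg REST catalog exposed by Lakekeeper.
# The "credential" is the client_id:client_secret pair registered for this application.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://lakekeeper.example.com/catalog")
    .config("spark.sql.catalog.lake.warehouse", "analytics")
    .config("spark.sql.catalog.lake.credential", "<client_id>:<client_secret>")
    # ask the catalog to vend scoped credentials for the tables we access
    .config("spark.sql.catalog.lake.header.X-Iceberg-Access-Delegation", "vended-credentials")
    .getOrCreate()
)

spark.sql("SELECT * FROM lake.sales.orders LIMIT 10").show()
```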
Fine-grained data access on EU clouds
The challenge increases when building a data platform on top of an EU cloud provider, since these providers typically lack support for STS or equivalent functionality. On top of that, their mechanisms for fine-grained access key permissions are often quite limited. This means we can’t rely on the mechanism presented in the previous section. To work around these limitations, we could:
Create dedicated credentials for every Python or Spark application and use the catalog to restrict SQL queries on Trino. Since EU cloud providers only allow coarse-grained access permissions, each application will still be able to read or write data it shouldn’t.
An even riskier option is to reuse the catalog’s S3 access credentials. That would give every Python or Spark application full bucket access, which is a clear no-go from a security perspective.
In my previous blog post, I kept things simple by exposing only a SQL endpoint. The limitations around S3 access applied there as well, but the impact was much smaller. Why? Because the S3 credentials lived entirely within Trino. We treat Trino as a trusted service because it is operated and controlled by the platform team and thus we are comfortable with Trino having full bucket access. The key point is that these credentials never leave Trino. Users never see them, which means they can’t tamper with or misuse them.
Managing access from untrusted clients
In our current setup, we can’t securely support Python or Spark applications on the platform. The issue is twofold: these applications don’t have built-in access control, and without STS in Lakekeeper, we aren’t able to restrict their S3 access.
A potential workaround is to let Spark or Python jobs write only into a dedicated bucket, separated from the processing buckets. While this reduces the blast radius, it doesn’t really solve the fundamental issue: applications still end up with broad access to nearly everything inside the allowed buckets.
Not satisfied with this limitation, I dug into the Iceberg specification and discovered an alternative to credential vending, namely remote signing. Instead of giving applications direct credentials, remote signing uses AWS SigV4 (Signature Version 4) to sign S3 requests on behalf of the client.
Here’s how it works: the application sends its request to a signing service, which is Lakekeeper in our case. This service authorizes the request and, if allowed, signs it with its own access credentials. The application can then use the signed request to query the data on S3. Crucially, the client never has access to the credentials used to sign the request. The following diagram shows the different steps:

Query an Iceberg table using an untrusted client.
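To make the flow concrete, here is a rough sketch of a single signing round trip, loosely modelled on the S3 signer endpoint of the Iceberg REST spec. The endpoint path and payload fields are assumptions for illustration only; in practice the Iceberg client library handles this exchange for you:

```python
import requests

# The request we want to execute against S3, sent to Lakekeeper for signing.
sign_request = {
    "method": "GET",
    "region": "eu-west-1",
    "uri": "https://s3.eu-provider.example/lakehouse/sales/orders/data/part-00000.parquet",
    "headers": {"Host": ["s3.eu-provider.example"]},
}

# Lakekeeper authorizes the caller (Zitadel-issued token) and, if allowed,
# returns SigV4-signed headers produced with its own credentials.
signed = requests.post(
    "https://lakekeeper.example.com/catalog/v1/aws/s3/sign",  # path is an assumption
    json=sign_request,
    headers={"Authorization": "Bearer <token issued by Zitadel>"},
).json()

# Replay the request with the signed headers; the signing credentials never reach the client.
data = requests.get(signed["uri"], headers={k: ",".join(v) for k, v in signed["headers"].items()})
```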
This approach was originally designed for services like Athena, which need to access S3 data without holding credentials themselves and therefore delegate the signing to, for example, AWS Glue.
For this to work, the EU cloud provider’s S3-compatible storage must support SigV4, which is the case for all providers we tested. In addition, the metastore (Lakekeeper in our case) and the Iceberg client libraries (e.g. PyIceberg) must support remote signing.
In fact, while testing, I noticed that we were already using remote signing in our EU cloud setup without even realizing it. This is the default in Lakekeeper when STS is disabled on a Lakekeeper warehouse. Here’s what the Iceberg table response looks like in this situation:
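A rough sketch of that response is shown below. Property names follow the Iceberg S3FileIO conventions, but the exact set Lakekeeper returns may differ:

```python
# Abridged, illustrative table response when remote signing is used instead of STS.
load_table_response = {
    "metadata-location": "s3://lakehouse/sales/orders/metadata/00003-abc.metadata.json",
    "metadata": {},  # schemas, snapshots, statistics, ...
    "config": {
        "s3.remote-signing-enabled": "true",
        "s3.signer.uri": "https://lakekeeper.example.com/catalog",  # where sign requests are sent
        "s3.region": "eu-west-1",
        # note: no access key, secret or session token is handed to the client
    },
}
```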
This mechanism solves our data access problem as long as we can identify the application and have given it the correct permissions in Lakekeeper.
Configure data access for your applications
To add a new user or application and set up data access on our platform, you’ll need to do two things:
Register the application in Zitadel using the service user concept. This allows you to configure the client credentials flow, identifying the application with a client_id and client_secret. If you also want to run Python jobs locally or in Jupyter notebooks, you can additionally configure the device code flow for the application.
Set up access permissions in Lakekeeper for your application. These permissions govern not only Python or Spark applications but also SQL queries executed via Trino, as Trino enforces them through the opa-bridge. A sketch of what this looks like from the application side is shown after this list.
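As an example of the application’s point of view, the PyIceberg configuration below authenticates against Zitadel with the client credentials flow and talks to Lakekeeper’s REST catalog. All URIs, names and secrets are placeholders; exact property values depend on your deployment:

```python
from pyiceberg.catalog import load_catalog

# Minimal sketch: a Python application authenticating via Zitadel (client credentials)
# and reading an Iceberg table through Lakekeeper's REST catalog.
catalog = load_catalog(
    "lakekeeper",
    **{
        "type": "rest",
        "uri": "https://lakekeeper.example.com/catalog",
        "warehouse": "analytics",
        "credential": "<client_id>:<client_secret>",
        "oauth2-server-uri": "https://zitadel.example.com/oauth/v2/token",
    },
)

table = catalog.load_table("sales.orders")
df = table.scan().to_arrow()  # access is limited to what Lakekeeper allows for this client
```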
Conclusion
In this blogpost, we explained how to secure your data on the S3-compatible storage of an EU cloud provider. We showed that the platform supports secure data access for SQL queries as well as for any other Iceberg client that supports remote signing.
By combining Zitadel for user/application authentication with Lakekeeper for data access authorization, you can achieve full control over tabular (Iceberg) data on EU cloud providers. However, for raw data or other data types (such as images), a different solution is still required.
If you are interested in more details, leave a comment or reach out to me. If you want to try it out yourself, you can start from the following GitHub repository.