Microsoft Fabric’s Migration Hurdles: My Experience

08.08.2024

Raghid Bsat

Migrating to Microsoft Fabric? My experience shows it is not ideal for modular platforms yet: limited flexibility, IaC gaps, and performance issues.

It has been more than a year since Microsoft announced their new all-mighty Fabric data platform. It offers a wide range of capabilities, including data science, data warehousing, real-time analytics, Lakehouse, data workflows, and Power BI integration. Since it became generally available in November 2023, IT managers and data enthusiasts have been asking the same question: should my organisation dive into the Fabric ecosystem, or should we hold off?

Before trying to answer this question, it is probably better to think about why we are considering Fabric in the first place. Do you already have a data platform that follows a modular approach and would like to integrate some of Fabric's capabilities? Or are you starting from scratch and want to adopt it as a whole? In this blog post, I will share my experience with Microsoft Fabric from a modular point of view: can we shift our tables from the Microsoft Synapse SQL engine to the Fabric SQL engine, and can we configure the Fabric control plane with our Infrastructure as Code (IaC) Terraform codebase?

Querying Unmanaged Tables:

I am currently working with a client on their data platform, where we use Synapse Serverless SQL Pool as our main SQL engine to read data from Delta tables. A Delta table consists of Parquet files, an open-source column-oriented data format, plus a delta log that records the transactions performed on those Parquet files along with column statistics.
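For illustration, here is roughly what such a Delta table looks like on storage (file names are representative):

path/to/delta_table/
├── _delta_log/
│   ├── 00000000000000000000.json   <- transactions and column statistics
│   └── 00000000000000000001.json
├── part-00000-<uuid>.snappy.parquet
└── part-00001-<uuid>.snappy.parquet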

In our use case, these tables are external and are being updated by a general-purpose data pipeline that pushes parquet files to Azure’s Data Lake Storage (ADLS). It is possible to allow Synapse to read from your storage by creating external views with its own managed identity:

SELECT *
FROM OPENROWSET(
    BULK 'path/to/delta_table',
    DATA_SOURCE = 'adls_data_source',
    FORMAT = 'DELTA'
) AS [result]

You achieve cost-effective queries compared to traditional SQL database models by storing your data in ADLS. Additionally, you gain the advantage of on-demand scaling, ensuring optimal performance without the need for manual resource management.

The reason we started looking into Fabric is that Microsoft decided to stop further development of Azure Synapse Analytics. This puts us in a position where we need to find a different SQL engine to query our tables.

Can you replicate this setup on Fabric? From an infrastructure and implementation standpoint, it looks like a straightforward migration. In practice, however, replicating a similar Synapse data architecture isn't feasible with Fabric. OPENROWSET() is not a supported feature in the Fabric ecosystem, and it is not part of the Fabric roadmap. While Fabric enables the creation of shortcuts that point to storage locations (e.g., ADLS), these ultimately manifest as managed tables, where Fabric controls both the schema metadata and the data files. This is a bottleneck, since it deviates from the idea of an external table, where you want your SQL engine to manage only the schema metadata.

With managed tables, Microsoft Fabric's Lakehouse and Warehouse employ a query engine that generates SQL execution plans. The query optimiser assesses candidate plans and estimates the cost of each operation. This estimation relies heavily on data insights derived from regularly updated statistics, which are typically recalculated whenever the delta log changes. In other words, every change to the delta log of your Delta table forces the statistics and query plan to be updated. Note that the statistics are only refreshed on the first query against the table after such a change.

If you have worked with Delta tables before, you might have noticed that the delta logs already contain statistical information about the columns. However, the Fabric engine calculates additional statistics to use in its query plans. This might be because Fabric runs a proprietary PySpark image within its environment. For tables updated through Fabric Data Factory or Fabric Data Pipelines, this optimisation might be essential. Moreover, this proprietary version of PySpark has a feature called V-Order, a Delta table optimisation known for dramatically speeding up query performance. In short, Fabric seems to offer strong support for tables that are updated through the Fabric environment, but not for those updated from external sources.
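As a side note, V-Order is exposed as a Spark session configuration inside Fabric. A minimal sketch, assuming a Fabric notebook where the spark session is predefined and df is some hypothetical DataFrame to be written; the property name is the one documented at the time of writing and may change:

# Enable V-Order for Delta writes in this Spark session (Fabric notebook).
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Any Delta write in this session now produces V-Ordered parquet files.
df.write.format("delta").mode("overwrite").save("Tables/demo_table")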

Since I could not find a proper explanation for this in the Microsoft documentation, here is what I believe is happening under the hood: Fabric keeps its own "manifest" file containing the statistics it calculates, and whenever the delta log changes from outside Fabric, the first query against the table triggers a recalculation of that manifest before results are returned.
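To make that concrete, here is a minimal Python sketch of the behaviour I suspect; this is purely my own mental model, not documented Fabric internals:

from dataclasses import dataclass

@dataclass
class ShortcutTable:
    delta_log_version: int  # advanced by the external pipeline
    manifest_version: int   # advanced by Fabric's statistics run

def query(table: ShortcutTable) -> str:
    # Hypothesis: the first query after an external update pays the cost of
    # recomputing Fabric's manifest/statistics before it can execute.
    if table.manifest_version != table.delta_log_version:
        table.manifest_version = table.delta_log_version  # slow: minutes on big tables
        return "slow first query (statistics recalculated)"
    return "fast query (manifest already up to date)"

t = ShortcutTable(delta_log_version=2, manifest_version=1)  # pipeline wrote new data
print(query(t))  # slow first query (statistics recalculated)
print(query(t))  # fast query (manifest already up to date)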

To test my theory, I added a table shortcut with an external connection to ADLS for a Delta table of roughly 4 billion rows. The initial query took 33 minutes to complete, after which queries were very quick (even quicker than on Synapse Analytics!). After our pipeline had updated the Delta table with almost 4 million new rows, the initial query took 20 minutes, after which the rest of the queries were quick again. What happened in those 20 minutes is that the delta log was no longer in sync with the "manifest" file that Fabric keeps with its calculated statistics: the first query against the table triggered a recalculation of the statistics to bring the manifest back up to date.
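A measurement like this is easy to reproduce. Below is a minimal sketch, with placeholder connection details and a hypothetical demo_table; the Fabric SQL endpoint speaks TDS, so a standard SQL Server ODBC driver works:

import time
import pyodbc

# Placeholders: take these values from your workspace's SQL connection string.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.datawarehouse.fabric.microsoft.com;"
    "Database=<lakehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
)

for attempt in (1, 2):
    start = time.time()
    conn.cursor().execute("SELECT COUNT(*) FROM demo_table").fetchall()
    print(f"query {attempt}: {time.time() - start:.1f}s")
# Expectation: the first run pays the statistics refresh; the second is fast.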

Is there a solution around this? Possibly, but not a recommended one. I spoke with Microsoft support engineers about whether we could disable the statistics calculations or control them from the external pipeline itself. They mentioned that it is possible with their help, but only at the workspace level. They also said that this is generally not recommended practice, as the feature is not meant to be controlled by customers for now.

Finally, if your table is not updated frequently, say daily, weekly, or monthly, it might make sense to try table shortcuts with Fabric: the downtime caused by the statistics calculation would be negligible compared to the interval between table updates.

Incomplete Infrastructure as Code — Yet

If you are a fan of streamlined engineering processes, you probably also prefer to automate these access controls, workspace configurations, managed identities, and associated infrastructure rules. It is quite impressive that the Microsoft Fabric team has already released their API (in preview), but there is still more work to be done before it can be adopted for IaC.

For starters, a Terraform provider for handling Fabric-related tasks is yet to be released. While there is a Python Fabric SDK developed by DaSenf1860, the Fabric API still lacks essential features. One significant limitation is the inability to manage connections, and to share them between users for access control, through the API. A connection is an authentication method that lets a lakehouse load the data files (through an organisational account, SAS token, or access key). Connections can be reused across tables and shared with groups, service principals, and users. Today this can only be done via the UI, which provides the connection ID that is then used to create shortcuts. Connections matter for IaC: they let you streamline how access to schemas, tables, and views is governed.

While interacting with the API, I noticed that the "connection_id" is exposed in the responses of some GET requests.

POST https://api.fabric.microsoft.com/v1/workspaces/workspace_id/items/lakehouse_id/shortcuts

body:
{
  "path": "Tables",
  "name": "demo_table2",
  "target": {
    "adlsGen2": {
      "connectionId": "connection_id", // needs to be created in the UI first
      "location": "https://<storageaccountname>.dfs.core.windows.net",
      "subpath": "/<containername>"
    }
  }
}

There is no POST request that can be sent to the API to create a new connection; I was only able to retrieve the ID of a connection that had already been created in the UI. This suggests that the Fabric API is not yet reliable enough to be used as part of an IaC codebase.
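For completeness, here is the same shortcut call expressed in Python. This is a minimal sketch using the requests library, assuming you already have a bearer token for the Fabric API and a connection ID that was created in the UI:

import requests

token = "<bearer token for https://api.fabric.microsoft.com>"
workspace_id = "<workspace_id>"
lakehouse_id = "<lakehouse_id>"

# Create an ADLS shortcut under Tables/ in the target lakehouse.
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "Tables",
        "name": "demo_table2",
        "target": {
            "adlsGen2": {
                "connectionId": "<connection_id created in the UI>",
                "location": "https://<storageaccountname>.dfs.core.windows.net",
                "subpath": "/<containername>",
            }
        },
    },
)
resp.raise_for_status()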

Bonus: Navigating Uncharted Errors

When I was testing the Fabric Lakehouse capabilities, I encountered numerous "Internal Errors". Given that Fabric is a relatively new product, ongoing bug discovery is to be expected, but it can impede the developer experience.

For example, I experienced extended query durations (>8 minutes) when using special characters like the pipe (|) symbol within a delta partition:

path/to/delta/partition=54|A/*.parquet.

This occurs because the delta log URL-encodes the special character as "%7C". Fabric then performs internal transformations that encode the "%" symbol again as "%25", resulting in the string "%25%7C":

path/to/delta/partition=54%25%7CA/*.parquet.
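The first half of that chain is easy to demonstrate with Python's standard library. Here is a small sketch of the double-encoding effect; the exact transformation Fabric applies internally may differ from this illustration:

from urllib.parse import quote

partition_value = "54|A"
once = quote(partition_value, safe="")  # '54%7CA': how the delta log stores it
twice = quote(once, safe="")            # '54%257CA': encoding again corrupts the path
print(once, twice)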

This causes the SQL engine to run for a long time before returning a result that should, in theory, take no more than a second. This was never a problem with our Synapse views, since that engine prunes partitions directly from the delta logs. Fabric, in contrast, places an extra manifest file in between and uses it for partition pruning, which leads to a mismatch of file names between the manifest and what is actually in the storage location.

This error is what led me to believe that the Fabric SQL engine no longer relies on the delta logs once it has performed its calculations and created its own manifest file. When it gets stuck in this process, it either falls back to the delta log after 10 minutes or throws an internal error.

Conclusion

In our case, looking at SQL engines other than Fabric's made the most sense. My general impression is that Fabric appeals to organisations that do not yet have a data platform and can adopt it as a whole. It also makes sense for organisations that rely heavily on Microsoft Data Factory as their cloud-based ETL solution.

For organisations that follow a modular approach in their data platform design, there may not be much benefit to Microsoft Fabric yet. By a modular architecture, I mean using Microsoft Fabric as a SaaS offering for selected functionalities while managing everything else centrally yourself.

In conclusion, my experience leads me to believe that the Fabric SQL engine might not be a good component to integrate into an existing data platform. Other Fabric features may integrate well, however; that remains to be explored.
