Conveyor vs Databricks

Databricks is one of the most popular tools for building and running SQL, Python, and R notebooks. It provides a great way to get started and experiment with your first data pipelines. Conveyor focuses primarily on delivering high-quality data products, which a notebook environment alone cannot provide.

Notebooks vs high-quality code

Notebooks are great for experimenting and creating a first version of your code thanks to their interactive environment, but they have several drawbacks for writing production code:

  • No modular code
    Difficult to share code and to navigate across multiple notebooks in the Databricks UI
     

  • No tests
    No easy way to write tests and thus protect against regression
     

  • Not reproducible
    The Python and Spark versions your code depends on are specified on the Databricks cluster, not in the notebook
     

  • No configuration parameters/files
    Managing configuration is difficult: Databricks only provides the dbutils package, which is not portable outside Databricks


Conveyor

  • Support for both notebooks and IDEs
    Use notebooks for experimentation and your IDE to write modular, easy-to-maintain code
     

  • Add tests to your code
     

  • Docker image
    Code is packaged with all dependencies to make it truly build once, deploy anywhere
     

  • Airflow DAGs
    Airflow configuration can pass environment variables and extra arguments to customize your code
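
To illustrate the points on configuration and tests, here is a minimal Python sketch of a job that reads its settings from an environment variable and an extra argument, and keeps its logic in a function that can be tested without a cluster. The names `JobConfig`, `load_config`, `output_path`, the `ENVIRONMENT` variable, the `--date` argument and the bucket name are assumptions made for this example; they are not part of Conveyor, Airflow or Databricks.

```python
import argparse
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class JobConfig:
    environment: str  # hypothetical setting, e.g. "dev" or "prd"
    date: str         # run date passed in as an extra argument


def load_config(
    argv: Optional[Sequence[str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> JobConfig:
    """Build the job configuration from arguments and environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Illustrative sample job")
    parser.add_argument("--date", required=True, help="Run date, e.g. 2024-01-31")
    args = parser.parse_args(argv)
    # Fall back to "dev" when the (hypothetical) ENVIRONMENT variable is unset.
    return JobConfig(environment=env.get("ENVIRONMENT", "dev"), date=args.date)


def output_path(config: JobConfig) -> str:
    """Pure function with no cluster dependency, so it is easy to unit test."""
    # The bucket name is a placeholder for this example.
    return f"s3://my-bucket-{config.environment}/sales/date={config.date}"


def test_output_path() -> None:
    """A regression test that runs locally, for instance with pytest."""
    config = load_config(["--date", "2024-01-31"], env={"ENVIRONMENT": "prd"})
    assert output_path(config) == "s3://my-bucket-prd/sales/date=2024-01-31"
```

Because the configuration is injected through arguments and environment variables rather than a notebook-specific utility, the same code can run from an IDE, in a test runner, or inside a container.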

Databricks and creating data pipelines

Databricks has poor support for creating data pipelines:

Databricks

  • Use latest source code in job
    A job in Databricks runs the latest version of your notebook, which is not necessarily the latest committed code

  • Job dependencies
    Limited support for expressing dependencies between notebooks

  • Notebook dependencies
    Many library and framework versions are defined at the cluster level instead of in your notebook

  • No overview of all jobs
    Difficult to monitor all jobs of a given day

Conveyor

  • Docker image
    All code and dependencies are packaged in a Docker image, so you know exactly which version your job is executing

  • Airflow
    Airflow has extensive support for defining complex DAGs as well as a UI with an overview of jobs

Databricks and project governance

Managing tens or even a hundred data projects with Databricks is both challenging and costly:

Databricks

  • Data access on a workspace
    All notebooks in the same workspace can access the same data. To separate them, you need to use different workspaces

  • Databricks clusters per team
    To make teams independent, each team needs its own cluster, as the cluster defines the Python, Spark, ... versions

  • Databricks licence fee of 50-80%
    On top of the raw compute cost of your cloud provider


Conveyor

  • Data access per project
    We support the principle of least privilege by linking data access to project/job
     

  • RBAC (role-based access control)
    Used to define permissions for users on projects and environments
     

  • Cost dashboards
    Give insights into costs in order to reduce them 
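
As a rough illustration of per-project data access combined with role-based permissions, the sketch below models both as plain lookups that deny by default. It is a conceptual example only: `AccessPolicy`, its fields, and the role names `admin` and `contributor` are assumptions for this sketch and do not describe Conveyor's actual API or role model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class AccessPolicy:
    # (user, project) -> role; hypothetical role names for this sketch.
    roles: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # project -> datasets that jobs of this project may read.
    project_data: Dict[str, Set[str]] = field(default_factory=dict)

    def can_deploy(self, user: str, project: str) -> bool:
        """Only users holding a sufficient role on this project may deploy it."""
        return self.roles.get((user, project)) in {"admin", "contributor"}

    def can_read(self, project: str, dataset: str) -> bool:
        """Least privilege: access is denied unless explicitly granted."""
        return dataset in self.project_data.get(project, set())


policy = AccessPolicy(
    roles={("alice", "sales"): "contributor"},
    project_data={"sales": {"raw/orders"}},
)
```

With this policy, the `sales` project can read `raw/orders` but nothing else, and `alice` has no rights on any other project, in contrast to a setup where everything in one workspace shares the same data access.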

Databricks notebooks and collaboration

Databricks added support for Repositories, which is a big improvement for managing multiple notebooks. However, two major issues remain when collaborating with multiple people:


Databricks

  • Git is not a first-class citizen
    Multiple people cannot make changes to the same file at the same time
     

  • Cannot develop notebooks from your IDE
    When working with multiple people, you need a premium cluster, which does not support connections from your IDE

Conveyor

  • Native Git integration
    Create feature branches and merge changes when working with multiple people on the same files