Conveyor vs EMR

EMR is the default way to run Spark jobs on AWS. It is a stable environment with a genuinely fast runtime. The main differences between EMR and Conveyor are:

Learning curve


Amazon EMR

  • Know and configure AWS infrastructure details
    Manually configure VPCs, IAM roles, etc.

  • No single way to use notebooks
    EMR Studio, EMR Notebooks, and SageMaker are all candidates

  • Configure a tool to manage workloads
    Use Managed Workflows for Apache Airflow, Step Functions, or a home-grown solution


Conveyor

  • Conveyor environments with managed Airflow
    Schedule containers or Spark jobs on the cluster

  • Notebooks for experimentation
    Explore data or test ML algorithms with one command
     

  • The Conveyor run command
    Start Spark or container jobs on the cluster from your local environment
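As an illustration, a job can be started on the cluster from a local checkout with a single command. This is a sketch only: the project and environment names are hypothetical placeholders, and the exact flags may differ per Conveyor version.

```shell
# Run a job on the cluster from your local machine.
# "samples" and "dev" are hypothetical project/environment names;
# check `conveyor run --help` for the flags your version supports.
conveyor run --project samples --env dev
```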

The management model

In EMR, you have to create and manage a cluster to schedule your jobs.
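For comparison, even a minimal EMR cluster requires an explicit create call. The sizing and release label below are illustrative values, and `--use-default-roles` assumes the default EMR IAM roles were created beforehand.

```shell
# Create a small transient EMR cluster with Spark installed.
# Sizing and release label are illustrative, not recommendations.
aws emr create-cluster \
  --name "spark-example" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --auto-terminate
```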


Amazon EMR

  • IAM roles are linked to a cluster
    Use one cluster per job if you want to use different IAM roles

  • Clusters do not autoscale by default
     

  • Clusters do not update automatically
     

  • Clusters are bound to Spark/Hadoop versions
    When sharing a cluster, all applications need to be updated at the same time

  • Creating a cluster takes up to 15 minutes
     

  • EMR on EKS
    You manage the EKS data plane and all its components yourself


Conveyor

  • Conveyor manages clusters for you
     

  • One cluster for all jobs
    Each job can use a separate IAM role
     

  • We run containers
    Run the same container locally or anywhere else
     

  • Mix Spark/Hadoop versions on one cluster
    All dependencies are packaged in Docker containers
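Because every job ships as an image, each project can pin its own Spark version and dependencies independently of the cluster. A minimal sketch, assuming a project with a `requirements.txt` and a `src/` directory (the base image and paths are illustrative, not Conveyor's actual base images):

```dockerfile
# Illustrative only: pin a Spark version per project via the image.
FROM apache/spark:3.5.0

# Install project-specific Python dependencies.
COPY requirements.txt /opt/app/requirements.txt
RUN pip install --no-cache-dir -r /opt/app/requirements.txt

# Add the job code itself.
COPY src/ /opt/app/src/
```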

Job types

Amazon EMR

  • Any job type existing in the Hadoop ecosystem
    Spark, Pig, Hive, etc. are supported

  • (Too) many ways to package code
    JARs, pyfiles, PEX distributions, containers, ...

  • No support for non-Hadoop jobs
    You cannot run plain Python code


Conveyor

  • Any container can be run non-distributed
    Use your favorite programming language
     

  • Use Spark for distributed jobs
    For processing large volumes of data
     

  • dbt for data warehouse transformations
    Lowers the barrier for data analysts to use and process data
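What keeps the barrier low is that a dbt model is just a SQL select with templated references. A minimal sketch, assuming a hypothetical upstream model called `raw_orders`:

```sql
-- models/daily_orders.sql (illustrative model name)
-- Aggregate raw orders into a daily summary table.
select
    order_date,
    count(*) as order_count,
    sum(amount) as total_amount
from {{ ref('raw_orders') }}
group by order_date
```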