Conveyor vs EMR
EMR is the default way to run Spark jobs on AWS. It's a stable environment with a genuinely fast runtime. The main differences compared with Conveyor are:
Learning curve

Amazon EMR
- Know and configure AWS infrastructure details: manually configure VPCs, IAM roles, etc.
- No single way to use notebooks: EMR Studio, EMR Notebooks, and SageMaker are all candidates.
- Configure a tool to manage workloads: use Amazon Managed Workflows for Apache Airflow, Step Functions, or a homebrew solution.

Conveyor
- Conveyor environments with managed Airflow: schedule containers or Spark jobs on the cluster.
- Notebooks for experimentation: explore data or test ML algorithms with one command.
- Conveyor run command: start Spark or container jobs on the cluster from your local environment.
The management model
In EMR, you have to create and manage a cluster to schedule your jobs.
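To give a feel for what creating a cluster involves, here is a minimal sketch of the request you would pass to boto3's EMR `run_job_flow` call. The helper names, cluster name, subnet ID, and instance sizes are illustrative assumptions; the role names are the usual EMR defaults and may differ in your account. Note that the release label and the IAM roles are set once, for the whole cluster.

```python
from typing import Any, Dict


def build_emr_cluster_request(
    name: str,
    release_label: str,
    subnet_id: str,
    service_role: str,
    instance_profile: str,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR `run_job_flow` call."""
    return {
        "Name": name,
        # The release label fixes the Spark/Hadoop versions for the whole cluster.
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            # Fixed instance counts: no scaling policy is configured in this sketch.
            "InstanceGroups": [
                {
                    "Name": "primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 2,
                },
            ],
            # Networking has to be chosen up front.
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # IAM roles are set at cluster level, so every job on it shares them.
        "ServiceRole": service_role,
        "JobFlowRole": instance_profile,
    }


def create_cluster(request: Dict[str, Any], region: str) -> str:
    """Submit the request to EMR; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical values, for illustration only.
request = build_emr_cluster_request(
    name="example-cluster",
    release_label="emr-6.15.0",
    subnet_id="subnet-0123456789abcdef0",
    service_role="EMR_DefaultRole",
    instance_profile="EMR_EC2_DefaultRole",
)
print(sorted(request))
```

Calling `create_cluster(request, region)` would actually start the cluster, so it is left out of the example run.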

Amazon EMR
- IAM roles are linked to a cluster: use one cluster per job if you want to use different IAM roles.
- Clusters do not autoscale by default.
- Clusters do not update automatically.
- Clusters are bound to Spark/Hadoop versions: when sharing a cluster, all applications need to be updated at the same time.
- Creating a cluster takes up to 15 minutes.
- EMR on EKS: manage the EKS data plane with all components yourself.

Conveyor
- Conveyor manages clusters for you.
- One cluster for all jobs: each job can use a separate IAM role.
- We run containers: run the same container locally or anywhere else.
- Mix Spark/Hadoop versions on one cluster: all dependencies are packaged in Docker containers.

Job types

Amazon EMR
- Any job type existing in the Hadoop ecosystem: Spark, Pig, Hive, etc. are supported.
- (Too) many ways to package code: JARs, Python files, PEX distributions, containers, ...
- No support for non-Hadoop jobs: cannot run simple Python code.
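As an illustration of just one of these packaging routes, the sketch below builds an EMR step that runs `spark-submit` through `command-runner.jar` and ships extra Python modules with `--py-files`. The function name, bucket, and file names are hypothetical; the other packaging options each need a different submission setup.

```python
from typing import Any, Dict, List, Optional


def build_spark_step(
    name: str,
    entrypoint: str,
    py_files: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one EMR step that runs spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    if py_files:
        # spark-submit expects extra Python files as one comma-separated value.
        args += ["--py-files", ",".join(py_files)]
    # The main script comes last, after all spark-submit options.
    args.append(entrypoint)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical S3 locations, for illustration only.
step = build_spark_step(
    name="example-pyspark-job",
    entrypoint="s3://example-bucket/jobs/main.py",
    py_files=["s3://example-bucket/jobs/deps.zip"],
)
print(" ".join(step["HadoopJarStep"]["Args"]))
```

A step like this would be passed to boto3's `add_job_flow_steps(JobFlowId=..., Steps=[step])` against an already running cluster.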

Conveyor
- Any container can be run non-distributed: use your favorite programming language.
- Use Spark for distributed jobs: for processing large volumes of data.
- dbt for data warehouse transformations: lower the barrier for data analysts to use and process data.