Conveyor Product & Features
Can we do better?
Cloud providers like AWS, Azure and GCP provide an overwhelming number of great building blocks. When delivering data projects for our customers, we noticed that the proposed stacks often:
require a lot of glue code, resulting in workable yet sub-optimal user experiences
do not actively encourage software engineering practices
require a steep learning curve to configure for your needs
In our opinion, one of the key dimensions of a better solution is a high affinity with software engineering best practices. These practices, together with automation that leads to smooth deliveries, are what make digital winners what they are today.
Using containers as the main medium to share software with the execution environments makes software engineering best practices and CI/CD easier to put in place. So we created a Kubernetes-based, multi-cloud, multi-cluster productivity tool built by developers for developers.
How does it work?
End-users interact with Conveyor through a command-line interface (CLI) and a web-based user interface (UI). The CLI is used throughout the building and deploying of data projects, while the UI is mostly used in the run phase.
The control plane is hosted by Data Minded. It acts as middleware between the users and the data plane, and is responsible for cross-cutting concerns such as user authentication and authorization, state management, and cost aggregation.
The data plane is the Kubernetes-based managed infrastructure that runs within your own cloud environment. It consists of multiple services that take care of provisioning and scaling the needed infrastructure, and of everything related to scheduling the execution of data projects.
Conveyor has two simple concepts: projects and environments. A project is code: a unit of deployment (an architectural quantum) that describes what needs to be executed and how.
Environments are isolated segments in the infrastructure where a data project can be deployed and executed.
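The relationship between the two concepts can be sketched as a small data model. This is an illustrative sketch only, not Conveyor's implementation: the class names, the `image` and `cluster` fields, and the `promote` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Project:
    """Hypothetical project: a unit of deployment packaged as a container image."""
    name: str
    image: str


@dataclass
class Environment:
    """Hypothetical environment: an isolated segment that holds deployed projects."""
    name: str
    cluster: str
    deployed: dict[str, Project] = field(default_factory=dict)

    def deploy(self, project: Project) -> None:
        # Deploying only affects this environment; others stay untouched.
        self.deployed[project.name] = project


def promote(project_name: str, source: Environment, target: Environment) -> Project:
    """Copy the exact project version running in `source` into `target`."""
    project = source.deployed[project_name]
    target.deploy(project)
    return project


dev = Environment(name="dev", cluster="cluster-a")
prod = Environment(name="prod", cluster="cluster-b")
dev.deploy(Project(name="sales", image="sales:1.0"))
promote("sales", source=dev, target=prod)
```

Because the project is an immutable value, promotion moves the same artifact between environments rather than rebuilding it, which is the property the container-based approach is meant to provide.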
Create a batch pipeline, often used for analytics, to periodically collect, transform and move data to a data warehouse according to business needs.
Templates for various technologies and use cases get you started with just a couple of keystrokes. Using the remote execution `run` command, you can execute your code remotely in the right context.
Create persistent or throw-away environments. Select a resource size and a security context. Deploy and promote data projects with ease.
Once your data project is deployed, you want to follow up on resource utilization and cost, and troubleshoot potential failures.
Multi-cluster & Multi-cloud
An environment links to a Kubernetes cluster deployed as part of the customer's data plane. Customers can have multiple clusters spread over homogeneous or heterogeneous cloud accounts.
Create paved roads, take care of boilerplate and encourage the use of best practices across teams. Clients can create their own templates based on their needs and the systems they integrate with.
Jupyter notebooks are well known for their ease of use in data exploration and experimentation. Data Minded cloud notebooks are built on the same containerization foundation as data project code. This enables other use cases, e.g. using notebooks to debug and to facilitate the iterative industrialization of experiments.
Each environment has a dedicated Apache Airflow instance for batch workload orchestration. Automated client-side DAG validation is included.
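To illustrate the kind of check a client-side DAG validation can perform, here is a minimal sketch that detects dependency cycles before anything is deployed. It uses only the Python standard library and is not Conveyor's or Airflow's actual validation logic; the `validate_dag` function and the task names are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the graph has a cycle.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle as a list of nodes.
        raise ValueError(f"cycle detected: {exc.args[1]}") from exc


# Hypothetical three-step batch pipeline: collect, then transform, then load.
pipeline = {
    "collect": set(),
    "transform": {"collect"},
    "load": {"transform"},
}
order = validate_dag(pipeline)
```

Running such a check locally means a broken pipeline definition is rejected on the developer's machine instead of failing later in the scheduler.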
Some projects need more power than can be provided by a single node. Data Minded Cloud offers Apache Spark (batch and streaming) as a first-class citizen.
Monitoring & Logging
To operate data projects, logs and metrics are centralized and available in real time. For some integrations, access to the technology-native UIs is available, e.g. the Apache Spark History Server.
Follow your cloud cost per project over time. Gain insight into cost distribution across projects and environments.
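As a rough illustration of how cost can be broken down per project or per environment, the sketch below sums cost records along a chosen dimension. The record layout (`project`, `environment`, `cost`) and the figures are invented for the example and do not reflect Conveyor's data model.

```python
from collections import defaultdict


def aggregate_costs(records: list[dict], key: str) -> dict[str, float]:
    """Sum the `cost` field of each record, grouped by the given key."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost"]
    return dict(totals)


# Hypothetical cost records; the values are made up for illustration.
records = [
    {"project": "sales", "environment": "dev", "cost": 1.5},
    {"project": "sales", "environment": "prod", "cost": 2.5},
    {"project": "churn", "environment": "prod", "cost": 4.0},
]
cost_per_project = aggregate_costs(records, "project")
cost_per_environment = aggregate_costs(records, "environment")
```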
Single Sign-on & RBAC
Authenticate users with your own identity provider. Control the actions any user can perform on projects and environments using role-based mechanisms.
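A role-based check of this kind can be sketched in a few lines. The role names, the permitted actions, and the `is_allowed` function are hypothetical and chosen only to show the mechanism; they are not Conveyor's actual roles or API.

```python
# Hypothetical roles and the actions each one grants.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "viewer": {"view"},
    "contributor": {"view", "run"},
    "admin": {"view", "run", "deploy"},
}


def is_allowed(
    bindings: dict[tuple[str, str], str],
    user: str,
    resource: str,
    action: str,
) -> bool:
    """Check whether `user` may perform `action` on a project or environment.

    `bindings` maps (user, resource) pairs to a role name. Anything that is
    not explicitly granted is denied.
    """
    role = bindings.get((user, resource))
    if role is None:
        return False
    return action in ROLE_PERMISSIONS.get(role, set())


# Hypothetical bindings: alice administers the prod environment, bob can only view it.
bindings = {
    ("alice", "env:prod"): "admin",
    ("bob", "env:prod"): "viewer",
}
```

For instance, `is_allowed(bindings, "bob", "env:prod", "deploy")` is false under these bindings, while the same call for `alice` is true.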
Data Access Management
Each project can be linked with cloud-specific IAM credentials. This link is enforced so that each job can access only the resources it was granted access to. Combined with the RBAC model on environments and projects, data access management is in your hands.