- The Finish Slime
- Data Orchestrators 101: Everything You Need To Know To Get Started
Your data team wants to start using a data orchestrator. Great! Here’s a quick primer to get you started, and to keep you from spending dozens of hours wrestling with orchestrator infrastructure you don’t need.
Let’s get the biggie out of the way up front: Every modern data stack should use a data orchestrator at some point!
But data orchestrators promise the world, making it hard to understand what they are and what they do for you.
So let’s get this straight. Data orchestrators do one thing exceptionally well: they let you manage (scheduled) directed acyclic graphs (DAGs) for your data pipelines.
What’s a DAG for data pipelines? This is one.
That’s how data teams work: lots of dependent things. Data ingestion happens first; afterward, you want to trigger notebooks, train ML models, or run dbt. After that, you might need to refresh some caches or some other models to get your final output.
If I put this on a schedule, it’s a scheduled DAG. If I also get monitoring and logging around it in one unified setting, we have a data orchestrator.
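To make the idea concrete, here is a minimal sketch of such a DAG in plain Python, using only the standard library. The task names and the `run_pipeline` helper are made up for illustration; a real orchestrator adds scheduling, retries, and a UI on top of this basic idea.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# The task names are hypothetical, chosen to mirror the steps described above.
PIPELINE = {
    "ingest": set(),
    "run_notebooks": {"ingest"},
    "train_model": {"ingest"},
    "dbt_run": {"ingest"},
    "refresh_cache": {"dbt_run", "train_model"},
}


def execution_order(dag):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def run_pipeline(dag, actions, log=print):
    """Run each task in dependency order, logging when it starts and finishes."""
    finished = []
    for task in execution_order(dag):
        log(f"start {task}")
        actions[task]()  # call the function registered for this task
        log(f"done {task}")
        finished.append(task)
    return finished
```

Calling `run_pipeline(PIPELINE, actions)` with a dictionary that maps each task name to a function runs ingestion first and the cache refresh only after both dbt and model training are done. If the graph contains a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which is the "acyclic" part of the DAG being enforced.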
You ask, but why? Why do you need this?
Let’s explain the why so you know when you want to consider a data orchestrator. Then I will point you to the most critical resources on data orchestrators so that at the end of reading through them, you’re ready to set up your PoCs and choose an orchestrator for your data team.
Why do you need scheduled DAGs?
If you’re part of a data team, you already have scheduled DAGs. They might hide inside dbt, cron, or GitHub Actions. Data orchestrators provide an accessible framework for efficiently working with these types of DAGs.
There are two points at which you want to switch from whatever you have to a consistent framework for managing your scheduled DAGs:
When you’re fed up with cron, or whatever other scheduler you use. Say another team member created a critical data pipeline whose schedule you need to change, but you don’t know where it is scheduled. Then you need something better to manage schedules.
When you need to create advanced DAGs. You can always chain up bash commands for cron or create simple bash scripts that fire linear “do X, do Y, do Z” things. But you’ll need something else once you want to branch or rerun.
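Rerunning is where linear scripts start to hurt: when one step fails, you want to rerun that step and everything downstream of it, and nothing else. Here is a rough sketch of that logic, assuming the DAG is stored as a plain dictionary. The task names and the `tasks_to_rerun` helper are hypothetical, not part of any particular orchestrator.

```python
# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "run_notebooks": {"ingest"},
    "train_model": {"ingest"},
    "dbt_run": {"ingest"},
    "refresh_cache": {"dbt_run", "train_model"},
}


def tasks_to_rerun(dag, failed):
    """Return the failed task plus every task that depends on it, directly or not."""
    # Invert the mapping: for each task, which tasks consume its output?
    dependents = {task: set() for task in dag}
    for task, upstream in dag.items():
        for parent in upstream:
            dependents.setdefault(parent, set()).add(task)

    # Walk downstream from the failed task, collecting everything we reach.
    to_rerun, stack = set(), [failed]
    while stack:
        current = stack.pop()
        if current not in to_rerun:
            to_rerun.add(current)
            stack.extend(dependents.get(current, ()))
    return to_rerun
```

If `dbt_run` fails here, only `dbt_run` and `refresh_cache` need to run again; the notebooks and the model training are left alone. Orchestrators typically give you this kind of selective rerun out of the box, which is hard to get from a chain of bash commands.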
There’s a huge BUT here: if you’re on dbt, you can potentially get away without an orchestrator for longer. So if you’re already running on dbt and are not yet fed up with your scheduling solution for it, ask yourself the following two questions:
Do you have data pipelines outside of dbt you also have to manage? Are there notebooks you run, or dashboarding tools whose cache you need to refresh after running dbt? If so, you should combine this into one unified data orchestrator.
Do you need to orchestrate dbt itself? Are you doing a “dbt seed” followed by a “dbt run” followed by a “dbt test”? Then you should also consider switching to a data orchestrator. These workflows only get more complicated with time, and it’s always great to have them in one place that collects logs and lets you monitor the execution.
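For a sense of what orchestrating dbt itself looks like before you adopt a tool, here is a hand-rolled sketch of the seed, run, test sequence with logging and stop-on-failure. The `run_steps` helper and its `runner` parameter are assumptions made for illustration; only the three dbt commands come from the text above.

```python
import logging
import subprocess

# The three dbt commands mentioned above, in the order they should run.
DBT_STEPS = [["dbt", "seed"], ["dbt", "run"], ["dbt", "test"]]


def run_steps(steps, runner=subprocess.run):
    """Run commands in order and stop at the first one that fails.

    Returns (completed_steps, failed_step); failed_step is None on success.
    `runner` is injectable so the logic can be exercised without dbt installed.
    """
    completed = []
    for cmd in steps:
        logging.info("running: %s", " ".join(cmd))
        result = runner(cmd)
        if result.returncode != 0:
            logging.error("failed: %s (exit code %s)", " ".join(cmd), result.returncode)
            return completed, cmd
        completed.append(cmd)
    return completed, None
```

Calling `run_steps(DBT_STEPS)` inside a dbt project runs the three commands and skips `dbt test` if `dbt run` fails. It works, but every addition (retries, alerts, a second pipeline) is more code you maintain yourself, which is the point at which a data orchestrator starts to pay off.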
With that in mind, let’s dive into the key articles around data orchestration.
What, Why, When Resources
The What, Why, And When of Data Orchestration: While this article is from 2021, it gives you a quick and dirty introduction to the world of data orchestrators, their usual capabilities, and their reason for existence. It’s a quick read.
Workflow Orchestration vs. Data Orchestration — Are Those Different?: This article was written by Anna Geller when she still worked for Prefect, a newer data orchestrator. It introduces you in broader terms to the different terminology used for data orchestrators, focusing on what Prefect does differently.
Why We Switched Our Data Orchestration Service: Before 2019, Spotify ran all of its pipelines on two data orchestration frameworks, Luigi and Flo. This article describes their setup and the reasons Spotify switched to another framework called Flyte. A must-read if you’re about to make a decision, not for the frameworks, but for the process and the thinking behind it.
About choosing tools
Data Orchestration — A Primer: This article is a must-read for everyone exploring data orchestration. It introduces you to the history of data orchestrators and explains the three generations of data orchestrators. Both are very helpful in understanding your requirements for orchestration.
5 Popular Open Source Data Pipeline Orchestration Tools in 2023: Atlan has a decent overview of open-source orchestrators; it’s a quick read again, so go over these. Then follow up with the next article.
9 Popular Data Pipeline Orchestration Tools in 2023: This is a basic overview of popular data orchestrators, including some pros and cons.
Interesting and advanced resources
Orchestration isn’t going anywhere: Nick Schrock, the founder of Dagster, another modern data orchestrator, recently wrote this article. He makes a good point on the need for orchestration vs. the orchestrator as a tool to achieve it. It’s more high-level but a must-read if you’re choosing your orchestrator.
Data Orchestration Trends: Simon from Airbyte provides a great introduction to the modern era of data orchestration, in particular, the shift towards the outputs of the data orchestrator over the pipeline itself.
Lessons from Scaling Apache Airflow: Apache Airflow is still the biggest player in town. So you should read about scaling Apache Airflow before putting it on your list (and yes, as the biggest player, it should almost always make it onto your list, unless you’re 100% on Java and are in need of some Flo).
Nick Schrock once said:
“Orchestration to a data engineer is an implementation detail to making a data asset.”
“Nobody cares about your data pipeline”
That mindset is still very valuable when selecting your data orchestrator. Read through all the resources, and you’ll have a good data orchestrator for your data team running within days.