Functional Data Engineering 101

The Finish Slime

Are you a data engineer looking to level up your game by bringing hot software engineering practices into your life? Then you’ve come to the right place. Functional Data Engineering (FDE), pioneered by Maxime Beauchemin, is a vast topic, and we’re here so you don’t waste any time getting started.

Are your data systems troubled by… a lack of reproducibility? No service continuity? Lacking testability? Or aren’t your data pipelines as disposable as you’d like?

Functional data engineering (FDE) might be your way out.

Functional data engineering is a set of practices curated by Maxime Beauchemin, creator of Apache Airflow and Apache Superset and CEO of Preset. He observed what the big tech companies with advanced data practices were doing to get reproducibility, and what others were missing.

He coined the term functional data engineering because reproducibility is one of the cornerstone promises of functional programming.

Enough preamble; let’s get started!

The Five-Second Version

Functional #DataEngineering Basics in 10 tweets #moderndatastack (inspired by @mistercrunch) : 1/10
— sbalnojan (@sbalnojan)
1:09 PM • May 3, 2022

Here’s a good infographic helping you to understand functional data engineering quickly:

Let's talk about functional data engineering ( @mistercrunch) again! Just created a short infographic. The idea was to enrich the perspective because the functional stuff brings us not just reproducibility but also service continuity, testability, and much more! #moderndatastack
— sbalnojan (@sbalnojan)
12:21 PM • Jun 23, 2022

What problems does FDE solve?

FDE only makes sense if you are an advanced data team with more than one senior data engineer (or if you’re alone, sorry!). It’s going to help you solve these typical problems:

(1) Reproducing an old result. A dashboard looks different today than yesterday, even though no new data was added. Then you need the ability to look at “yesterday's data version!”

(2) Service continuity. Your new data ingestion is corrupted, and as a result the dashboards/machine learning algorithms break. Then you need the ability to “roll back to yesterday's data”.

(3) Testability & Consistency. Do you want to test a new transformation process in isolation? Then your ingestion process must be “idempotent and without side effects.” Otherwise, you’re not even able to write a proper deterministic test.

(4) Disposability. Did your database get corrupted? Then you need to be able to “dispose of” it by deleting it and “spinning up a new version with the same data.”

Depending on your implementation, FDE will help you solve each of these problems.
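
To make the four problems above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from any of the resources below; the function names (`transform`, `write_partition`, `read_partition`), the `ds=YYYY-MM-DD` folder layout, and the sample data are all assumptions made for illustration. The idea it tries to show: a pure transformation plus a load step that always overwrites one whole daily partition.

```python
import json
import shutil
import tempfile
from pathlib import Path


def transform(rows):
    """Pure function: the output depends only on the input rows."""
    return [
        {"customer": r["customer"], "amount": r["amount_cents"] / 100}
        for r in rows
    ]


def write_partition(root, ds, raw_rows):
    """Idempotent load: rebuild the whole partition for day `ds`.

    The partition is overwritten, never appended to, so re-running the
    task for the same day leaves exactly the same data behind.
    """
    target = Path(root) / f"ds={ds}"
    staging = Path(root) / f"_staging_ds={ds}"
    shutil.rmtree(staging, ignore_errors=True)
    staging.mkdir(parents=True)
    payload = json.dumps(transform(raw_rows), sort_keys=True)
    (staging / "data.json").write_text(payload)
    shutil.rmtree(target, ignore_errors=True)  # drop the old version, if any
    staging.rename(target)                     # swap in the new version
    return target


def read_partition(root, ds):
    """Read one day's partition, e.g. yesterday's, to reproduce or roll back."""
    return json.loads((Path(root) / f"ds={ds}" / "data.json").read_text())


# Hypothetical raw input, keyed by day.
RAW = {
    "2022-05-02": [{"customer": "ann", "amount_cents": 1250}],
    "2022-05-03": [{"customer": "ann", "amount_cents": 1250},
                   {"customer": "bob", "amount_cents": 300}],
}

if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for ds, rows in RAW.items():
        write_partition(root, ds, rows)
    write_partition(root, "2022-05-03", RAW["2022-05-03"])  # re-run, no duplicates
    print(read_partition(root, "2022-05-02"))  # "yesterday's data version"
```

In this sketch, reproducing an old result and rolling back both come down to reading an older `ds` partition, a deterministic test only needs to call `write_partition` twice and compare, and disposing of the store means deleting the root folder and replaying the raw inputs. A real warehouse would do the same thing with partitioned tables rather than JSON files.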

The must-read resources

Since FDE is an advanced topic, not too many people have written about it. So we recommend going through all the material below; it’s a good mix of practice and theory, covering text, audio, and video.

Your first stop

Functional Data Engineering - A Modern Paradigm for Batch Data Processing

By Maxime. This is your bible and your first stop, with many examples and explanations of the big picture.

FDE - A Set of Best Practices

By Maxime. This is a presentation of the same material, and I highly recommend going through both pieces: read the article and watch the presentation.

Your second stop

GitHub-based tutorial easy functional data engineering

This tutorial is all about learning by doing. It runs inside a Jupyter notebook and goes from simple examples up to a complete pipeline. It’s all in plain Python, without anything fancy to distract you from learning.

FDE - Airflow Tutorial

Next up is an Airflow version of the tutorial, which focuses in particular on working with partitions.

Functional Data Engineering - A Blueprint

Over at DEW, Ananth did a great job providing a boiled-down blueprint; we suggest you read this next. It contains a couple of excellent new examples.

Implementing FDE

This niche Medium article discusses implementing FDE, again with a couple of examples. A good read afterward.

Optional

FDE with Sven Balnojan

Finally, you can listen to a 28-minute discussion on FDE if you fancy. It’s optional and not necessary to understand FDE.

💡Btw. We’re desperately looking for more resources! If you have some, reach out on LinkedIn or Twitter💡

Mini FAQ

Can I do this with dbt as well?

Yes! But dbt isn’t made for this out of the box. The Preset data team has a working version over at Preset. But once you have gone over the material here, you should be able to set up the basics yourself quickly.

A good practice would be to store the latest partition inside a table and then have a macro retrieving it so that you can always use WHERE partition =

Is this for the average data engineer?

The FDE practices need discipline and an understanding of software engineering. As such, we recommend them for senior teams.