Finish Slime #52

Ice Ice Baby! The Icehouse arrived!

Data Engineering, Analytics. No ML, no AI. The weekly dose of the data content you actually want to read!

I’m changing things up: from now on you get a monthly mail with the best recent stuff. It’ll contain a lot more links, but it’s also going to be (even) better filtered. So, let’s dive into the new format!

Great Recent Stuff

dbt best practices in action at Cal-ITP’s data-infra project - Let’s talk dbt again; Cal-ITP has an impressive project with almost 400 dbt models. Take a look at how they use templates and many other best practices in action.

3 types of delivery models for Data and Analytics teams - I like to keep reminding everyone that analytics teams come in a variety of shapes and forms. And you play a substantial part in choosing that form!

What is and what is NOT a Data Product - Love such articles. The most important consideration usually is what is NOT a data product (like most database tables, you might’ve figured, are not!). 

Good event data - event data and time series data are the future! Gotta read this.

The Best Way To Learn Data Engineering - data engineering is hard, and varies tons from job to job, so better be prepared to learn a lot!

Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape - as always, seeing the MAD landscape is simply overwhelming. It grows every single year. Blows my mind.

Eventify everything - Data modeling for event data - literally everything is an event, so better learn about them.

Practical

The Data ROI Pyramid: A Method for Measuring & Maximizing Your Data Team - measuring data teams is hard; the data ROI pyramid is a nice framing that makes data downtime a key component (of course, as it was developed by Monte Carlo).

Building a Streaming Lakehouse: Performance Comparison Between Paimon and Hudi - talking of streaming, this is a practical comparison of Paimon and Hudi from Alibaba.

Apache Hudi: From Zero To One (10/10) - and of course, following up, here’s a complete 10-part series getting you set up on Apache Hudi.

Best Data Observability tools 2024: RANKED - Monte Carlo may lead the enterprise data observability sector, but there’s a ton of tools out there!

How we Built a 19 PiB Logging Platform with ClickHouse and Saved Millions - OK, the title got me, but it’s a pretty in-depth article, worth a look.

Database schema changes: the key to holistic Data PM PRDs. - yes database schema changes matter for product managers, duh!

DuckDB as the New jq - more DuckDB fandom, now as a jq replacement.

Guides

What They Forgot to Teach You About R - this is like a small book about R wonders.

Semantic Layers: A Buyers Guide - semantic layers are still around, so I guess we better get used to understanding them in detail. The guide evaluates just two (dbt and Cube) but provides you with a good list of criteria to do your own evaluation.

5 real-time data processing and analytics technologies – and where you can implement them - real-time processing is always a hard topic to tackle, and yet IMHO it’s also one of the most promising applications of analytics. So I recommend you get some serious practice implementing all of those technologies.

Tools 

Going Meta - this is a great introduction to the dbt package dbt_project_evaluator giving you an idea on best practices, testing, documentation and performance of your project.

DuckDB Meets Apache Arrow - is an interesting look into how GoodData uses both DuckDB to execute SQL queries against caches and Apache Arrow to build an analytics lake.

Binjr - is a time series browser that is standalone and has no fluff. 

Introducing Beam YAML: Apache Beam's First No-code SDK - declarative definitions of data processing steps. Mmmmmh.

Analyze with data canvas - This is a new cool tool for BigQuery comparable to count.co, a kind of storytelling tool that pieces together different charts and plots.

Apache Superset 4.0 Release Notes - some big changes! However, none of them seems to make a big difference to performance. Tags, alerts, monitoring: good but boring stuff.

Sparrow - It looks like a fun open-source tool for data processing with ML and LLM.

Deep Thoughts

The Icehouse Manifesto: Building an Open Lakehouse - The Icehouse is an open lakehouse built on Trino, with Iceberg as the open table format. I like the direction!

Reflections on Strong Momentum and Category Leadership in Data Observability - I’m quite happy about the success the Monte Carlo team has had. They have been leading the data observability efforts for quite some time now, and this is a nice piece from CEO and co-founder Barr Moses reflecting on their journey so far.

What Does a Statistical Method Assume? - That's just what the title promises: a refreshing read on statistical thinking.

How is the state of analytics engineering? - Thoughts on the state of analytics engineering from the inventors of analytics engineering.