Finish Slime #52
Ice Ice Baby! The Icehouse arrived!
Data Engineering, Analytics. No ML, no AI. The weekly dose of the data content you actually want to read!
I’m changing things up: from now on, you’ll get a monthly email with the best recent stuff. It’ll contain a lot more links, but it’s also going to be (even) better filtered. So, let’s dive into the new format!
Great Recent Stuff
dbt best practices in action at Cal-ITP’s data-infra project - Let’s talk dbt again; Cal-ITP has an impressive project with almost 400 dbt models. Take a look at how they use templates and many other best practices in action.
3 types of delivery models for Data and Analytics teams - I like to keep reminding everyone that analytics teams come in a variety of shapes and forms. And you have a substantial part in choosing that form!
What is and what is NOT a Data Product - Love such articles. The most important consideration usually is what is NOT a data product (like most database tables, you might’ve figured, are not!).
Good event data - event data and time series data are the future! Gotta read this.
The Best Way To Learn Data Engineering - data engineering is hard, and varies tons from job to job, so better be prepared to learn a lot!
Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape - as always, seeing the MAD landscape is simply overwhelming. It grows every single year. Blows my mind.
Eventify everything - Data modeling for event data - literally everything is an event, so better learn about them.
Practical
The Data ROI Pyramid: A Method for Measuring & Maximizing Your Data Team - measuring data teams is hard; the data ROI pyramid is a nice framing that makes data downtime a key component (of course, as it was developed by Monte Carlo).
Building a Streaming Lakehouse: Performance Comparison Between Paimon and Hudi - talking of streaming, this is a practical comparison of Paimon and Hudi from Alibaba.
Apache Hudi: From Zero To One (10/10) - and of course, as a follow-up, here’s a complete 10-part series to get you set up on Apache Hudi.
Best Data Observability tools 2024: RANKED - Monte Carlo may lead the enterprise data observability sector, but there’s a ton of tools out there!
How we Built a 19 PiB Logging Platform with ClickHouse and Saved Millions - ok the title got me, but it’s a pretty in-depth article, worth a look.
Database schema changes: the key to holistic Data PM PRDs. - yes, database schema changes matter for product managers, duh!
DuckDB as the New jq - more DuckDB fandom, now as a jq replacement.
Guides
What They Forgot to Teach You About R - this is like a small book about R wonders.
Semantic Layers: A Buyers Guide - semantic layers are still around, so I guess we better get used to understanding them in detail. The guide evaluates just two (dbt and Cube) but provides you with a good list of criteria to do your own evaluation.
5 real-time data processing and analytics technologies – and where you can implement them - real-time processing is always a hard topic to tackle, and yet IMHO it’s also one of the most promising applications of analytics. So I recommend you get some serious practice implementing all of those technologies.
Tools
Going Meta - this is a great introduction to the dbt package dbt_project_evaluator, giving you an idea of how your project fares on best practices, testing, documentation and performance.
DuckDB Meets Apache Arrow - is an interesting look into how GoodData uses both DuckDB to execute SQL queries against caches and Apache Arrow to build an analytics lake.
Binjr - is a time series browser that is standalone and has no fluff.
Introducing Beam YAML: Apache Beam's First No-code SDK - declarative definitions of data processing steps. Mmmmmh.
Analyze with data canvas - This is a new cool tool for BigQuery comparable to count.co, a kind of storytelling tool that pieces together different charts and plots.
Apache Superset 4.0 Release Notes - some big changes! However, none of them seem to have much impact on performance. Tags, alerts, monitoring: good but boring stuff.
Sparrow - It looks like a fun OS tool for data processing with ML and LLM.
Deep Thoughts
The Icehouse Manifesto: Building an Open Lakehouse - The Icehouse is an open lakehouse built on Trino as the query engine and Iceberg as the open table format. I like the direction!
Reflections on Strong Momentum and Category Leadership in Data Observability - I’m quite happy about the success the Monte Carlo team has had so far. They have been leading the data observability efforts for quite some time now, and this is a nice piece from CEO and co-founder Barr Moses reflecting on their journey.
What Does a Statistical Method Assume? - That's just what the title promises: a refreshing read on statistical thinking.
How is the state of analytics engineering? - Thoughts on the state of analytics engineering from the inventors of analytics engineering.