Finish Slime #7

Malloy, PandasAI, CDC, WAP.

Sven Balnojan
June 07, 2023

Data Engineering, Analytics. No ML, no AI. The weekly dose of the data content you actually want to read!

Tutorials

Malloy - An Experimental Language for Data

Malloy is an experimental language for querying and modeling data that moves beyond the traditional rectangular (flat-table) paradigm. Queries compile to SQL and can return nested results, which lets data engineers write more natural, expressive code for a broader range of data problems.

Choosing a good file format for Pandas

When working with Pandas, the best file format depends on your needs. Parquet is a good option for large datasets because it is a compressed, columnar format, so a read can load only the columns it needs. CSV is a good option when you need to process data row by row or share it with other tools, because it is plain text and easy to parse.

Breaking Out Of Tutorial Hell

Tutorial hell is where you get stuck in an endless loop of watching tutorials and never actually building anything. To break out, start building your own projects. You learn by doing, and that makes you a more well-rounded data engineer.

Trust in Numbers: Building Reliable Reporting Through Cross-Functional Collaboration

Building reliable data pipelines is essential for any organization that relies on data to make decisions. Data engineers can do several things to keep their pipelines reliable, including…

Techniques

What Is Change Data Capture

Change Data Capture (CDC) is a method for tracking changes to data in a database. It is useful for keeping a history of changes, synchronizing data between systems in real time, and feeding downstream data pipelines.
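To make the idea concrete, here is a minimal sketch of trigger-based CDC using Python's built-in sqlite3 module. Production CDC tools more commonly read the database's transaction log instead of using triggers, so treat this as an illustration of the concept only. The `customers` table, the `customers_changes` log table, and the `read_changes` helper are hypothetical names made up for this example.

```python
import sqlite3

# In-memory database with a hypothetical source table and a change-log table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);

CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing offset
    op        TEXT NOT NULL,                      -- 'INSERT', 'UPDATE' or 'DELETE'
    id        INTEGER NOT NULL,
    old_email TEXT,
    new_email TEXT
);

-- One trigger per operation records what changed.
CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, id, old_email, new_email)
    VALUES ('INSERT', NEW.id, NULL, NEW.email);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, id, old_email, new_email)
    VALUES ('UPDATE', NEW.id, OLD.email, NEW.email);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, id, old_email, new_email)
    VALUES ('DELETE', OLD.id, OLD.email, NULL);
END;
""")


def read_changes(conn, after_change_id=0):
    """Return change rows newer than the given offset, oldest first."""
    return conn.execute(
        "SELECT change_id, op, id, old_email, new_email "
        "FROM customers_changes WHERE change_id > ? ORDER BY change_id",
        (after_change_id,),
    ).fetchall()


# Ordinary writes to the source table; the triggers log each one.
conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE customers SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")
conn.commit()

changes = read_changes(conn)
for change in changes:
    print(change)
```

A downstream consumer would remember the last `change_id` it processed and pass it as `after_change_id` on the next poll, so each change is picked up once.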

Data Engineering: How to Implement Write-Audit-Publish (WAP)

The Write-Audit-Publish pattern is a way to ensure data quality in data pipelines. It works by first writing data to a staging table that consumers do not read, then auditing it (checking for errors), and finally publishing it to the production table only if the audit passes. This pattern can be implemented using tools like lakeFS, Apache Spark, and Delta Lake.
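The three steps can be sketched in a few lines with Python's built-in sqlite3 module. This is a toy illustration under stated assumptions, not how lakeFS, Spark, or Delta Lake implement the pattern. The `orders` and `orders_staging` tables, the audit rules, and the `write_audit_publish` function are all hypothetical.

```python
import sqlite3


def audit(conn):
    """Return a list of problems found in the staging table (empty means pass)."""
    problems = []
    (total,) = conn.execute("SELECT COUNT(*) FROM orders_staging").fetchone()
    if total == 0:
        problems.append("staging table is empty")
    (invalid,) = conn.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount IS NULL OR amount < 0"
    ).fetchone()
    if invalid:
        problems.append(f"{invalid} row(s) with missing or negative amount")
    return problems


def write_audit_publish(conn, rows):
    """Stage rows, audit them, and publish only if the audit passes."""
    # Write: load the batch into staging, never directly into production.
    conn.execute("DELETE FROM orders_staging")
    conn.executemany("INSERT INTO orders_staging (id, amount) VALUES (?, ?)", rows)

    # Audit: run the quality checks against the staged data.
    problems = audit(conn)
    if problems:
        conn.rollback()  # discard the staged batch; production is untouched
        return problems

    # Publish: copy the audited batch into production and commit atomically.
    conn.execute("INSERT INTO orders SELECT id, amount FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")
    conn.commit()
    return []


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL);
CREATE TABLE orders_staging (id INTEGER, amount REAL);
""")

accepted = write_audit_publish(conn, [(1, 19.99), (2, 5.00)])
rejected = write_audit_publish(conn, [(3, -1.0), (4, None)])
published = conn.execute("SELECT id FROM orders ORDER BY id").fetchall()

print(accepted, rejected, published)
```

The first batch passes the audit and reaches `orders`. The second batch fails, so it is rolled back and readers of `orders` never see it.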

Tools & Resources

Introducing PandasAI: The Generative AI Python Library

PandasAI is a Python library that adds generative AI capabilities to Pandas, a popular data analysis and manipulation tool. It can help data engineers automate tasks, detect patterns, and make more informed decisions.

Quickstart SQLMesh

SQLMesh is a tool that helps data engineers build and deploy data pipelines. It's easy to use and can be integrated with various data sources and sinks.