Finish Slime - #3

FerretDB, Pandas, Spark, Polars, Kimball dimensional modeling, Parquet, Data Lakes, Extracting Data.

Simple. Data Engineering, Analytics, no ML, no AI, no opinions. Just great content, shared. 

News

FerretDB is now production-ready. It is an open-source MongoDB alternative built on PostgreSQL and released under the Apache 2.0 license. It allows users to run MongoDB workloads on PostgreSQL, and it is available for cloud-based projects and existing PostgreSQL infrastructures.

Tutorials & Showcases

There are three popular packages for handling tabular data: Pandas, Polars, and Spark. The article compares them and benchmarks their performance on randomly generated data.
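The same aggregation looks slightly different in each of the three libraries. A minimal sketch in Pandas, with the rough Polars and Spark equivalents noted in comments (the data here is made up for illustration):

```python
import pandas as pd

# Toy data standing in for the article's random benchmark data.
df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})

# Pandas: group and sum.
totals = df.groupby("city", as_index=False)["sales"].sum()

# Polars (roughly): pl.DataFrame(...).group_by("city").agg(pl.col("sales").sum())
# Spark (roughly):  spark_df.groupBy("city").agg(F.sum("sales"))
```

The APIs converge on the same split-apply-combine idea; the performance differences the article measures come from execution (eager single-threaded Pandas vs. multi-threaded Polars vs. distributed Spark).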

The article explains the benefits of dimensional modeling, including simpler data models, faster data retrieval, and close alignment with actual business processes, then provides a step-by-step guide to creating a dimensional model using dbt.
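The core of a dimensional model is a fact table joined to dimension tables on surrogate keys. A minimal star-schema sketch using SQLite (the tutorial itself uses dbt; the table and column names here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One dimension, one fact, joined on a surrogate key.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_orders  (order_id INTEGER, customer_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO fact_orders  VALUES (100, 1, 50.0), (101, 1, 25.0), (102, 2, 10.0);
""")

# Analytical queries become a simple join-and-aggregate over the star.
rows = con.execute("""
SELECT c.name, SUM(f.amount)
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.name
ORDER BY c.name
""").fetchall()
```

This is the "close alignment with business processes" in practice: the fact table records events (orders), and each dimension answers one business question (who, what, when).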

How to create a cost-effective data lake using AWS serverless services and DuckDB: Data lakes are centralized repositories that store and manage data of all types and sizes. Modern data lakes follow the separation of storage and compute, allowing organizations to scale each layer independently, significantly reduce costs, and improve flexibility and performance.

Tools & Resources

Parquet is a binary file format that is quicker to read and write, and smaller, than CSV. It has an explicit schema with type information, eliminating the need to infer types. Parquet also avoids character-encoding confusion and provides a single way to represent missing data. The format is partly row-oriented and partly column-oriented: data is broken into row groups, and within each row group into column chunks. The index sits at the end of the file, which makes streaming impossible, but the format has explicit support for splitting data across multiple files.

Data is hard. Getting data out of a system is just as hard. Pedram covers common use cases for extracting data from application databases and the two common loading strategies: full loads and incremental loads, including change data capture (CDC) and key-based replication.
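Key-based replication is the simplest incremental strategy: remember the highest key extracted last time, and pull only rows beyond it. A minimal sketch against SQLite (table and column names are hypothetical):

```python
import sqlite3

# Stand-in for the source application database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])

def extract_incremental(con, last_seen_id):
    # Key-based replication: fetch only rows whose key exceeds the
    # high-water mark recorded after the previous run.
    return con.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()

first_run = extract_incremental(src, 0)   # full load: all three rows
delta_run = extract_incremental(src, 2)   # incremental: only id 3
```

The trade-off versus CDC: key-based replication is trivial to implement but misses updates and deletes to already-extracted rows, which is exactly the gap CDC closes by tailing the database's change log.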