Finish Slime - #3

FerretDB, Pandas, Spark, Polars, Kimball dimensional modeling, Parquet, Data Lakes, Extracting Data.

Simple. Data Engineering, Analytics, no ML, no AI, no opinions. Just great content, shared. 

News

FerretDB is now production-ready. It is an open-source MongoDB alternative built on PostgreSQL and released under the Apache 2.0 license. It allows users to run MongoDB workloads on PostgreSQL, and it is available for cloud-based projects and existing PostgreSQL infrastructures.

Tutorials & Showcases

There are three popular packages for handling tabular data: Pandas, Polars, and Spark. The article compares them and benchmarks their performance on randomly generated data.
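The same aggregation looks slightly different in each of the three libraries. A minimal sketch in Pandas, with the rough Polars and Spark equivalents noted in comments (the data here is made up for illustration):

```python
import pandas as pd

# Toy data standing in for the article's random benchmark data.
df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})

# Pandas: group and sum.
totals = df.groupby("city", as_index=False)["sales"].sum()

# Polars (roughly): pl.DataFrame(...).group_by("city").agg(pl.col("sales").sum())
# Spark (roughly):  spark_df.groupBy("city").agg(F.sum("sales"))
```

The APIs converge on the same split-apply-combine idea; the performance differences the article measures come from execution (eager single-threaded Pandas vs. multi-threaded Polars vs. distributed Spark).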

The article explains the benefits of dimensional modeling, including simpler data models, faster data retrieval, and close alignment with actual business processes, then provides a step-by-step guide to creating a dimensional model using dbt.
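The core of a dimensional model is a fact table joined to dimension tables on surrogate keys. A minimal star-schema sketch using SQLite (the tutorial itself uses dbt; the table and column names here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One dimension, one fact, joined on a surrogate key.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_orders  (order_id INTEGER, customer_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO fact_orders  VALUES (100, 1, 50.0), (101, 1, 25.0), (102, 2, 10.0);
""")

# Analytical queries become a simple join-and-aggregate over the star.
rows = con.execute("""
SELECT c.name, SUM(f.amount)
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.name
ORDER BY c.name
""").fetchall()
```

This is the "close alignment with business processes" in practice: the fact table records events (orders), and each dimension answers one business question (who, what, when).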

How to create a cost-effective data lake using AWS serverless services and DuckDB: Data lakes are centralized repositories that store and manage data of all types and sizes. Modern data lakes follow the separation of storage and compute, allowing organizations to scale each layer independently, significantly reduce costs, and improve flexibility and performance.

Tools & Resources

Parquet is a binary file format that is quicker to read and write, and smaller, than CSV. It has an explicit schema with type information, eliminating the need to infer types. Parquet also avoids character-encoding confusion and provides a single way to represent missing data. The format is partly row-oriented and partly column-oriented: data is broken into row groups, and within each row group into column chunks. The index sits at the end of the file, which makes streaming impossible, but the format has explicit support for splitting data across multiple files.

Data is hard. Getting data out of a system is just as hard. Pedram covers common use cases for extracting data from application databases and the two common loading strategies: full loads and incremental loads, including change data capture (CDC) and key-based replication.
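Key-based replication is the simplest incremental strategy: remember the highest key extracted last time, and pull only rows beyond it. A minimal sketch against SQLite (table and column names are hypothetical):

```python
import sqlite3

# Stand-in for the source application database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])

def extract_incremental(con, last_seen_id):
    # Key-based replication: fetch only rows whose key exceeds the
    # high-water mark recorded after the previous run.
    return con.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()

first_run = extract_incremental(src, 0)   # full load: all three rows
delta_run = extract_incremental(src, 2)   # incremental: only id 3
```

The trade-off versus CDC: key-based replication is trivial to implement but misses updates and deletes to already-extracted rows, which is exactly the gap CDC closes by tailing the database's change log.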