Showing posts 1-10 of 57 from data-engineering topic (Page 1 of 6). Checking out all my favorite topics here.
Apache OpenDAL in Rust to Access Any Kind of Data ServicesOpenDAL is a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified way such as S3, FTP, FS, Google Drive, HDFS, etc. They has been rewritten in Rust for the Core and have a binding from many various language like Python, Node.js, C, etc..
DuckDBIn this post, I want to explore the features and capabilities of DuckDB, an open-source, in-process SQL OLAP database management system written in C++11 that has been gaining popularity recently. According to what people have said, DuckDB is designed to be easy to use and flexible, allowing you to run complex queries on relational datasets using either local, file-based DuckDB instances or the cloud service MotherDuck.
Airflow control the parallelism and concurrency (draw)How to control parallelism and concurrency
Fossil Data Platform Rewritten in Rust 🦀My data engineering team at Fossil recently released some of Rust-based components of our Data Platform after faced performance and maintenance challenges of the old Python codebase. I would like to share the insights and lessons learned during the process of migrating Fossil's Data Platform from Python to Rust.
Running Spark in GitHub ActionsThis post provides a quick and easy guide on how to run Apache Spark in GitHub Actions for testing purposes
GPT vs Traditional NLP ModelsThe field of Natural Language Processing (NLP) has seen remarkable advancements in recent years, and the emergence of the Generative Pre-trained Transformer (GPT) has revolutionized the way NLP models operate. GPT is a cutting-edge language model that employs deep learning to generate human-like text. Unlike conventional NLP models, which required extensive training on specific tasks, GPT is pre-trained on vast amounts of data and can be fine-tuned for various NLP tasks
Ask ChatGPT about 20 important concepts of Apache SparkI asked ChatGPT to explain 20 important concepts of Apache Spark. Let's see what it has to say!
Rust Data Engineering: Processing Dataframes with PolarsIf you're interested in data engineering with Rust, you might want to check out Polars, a Rust DataFrame library with Pandas-like API.
Data Engineering Tools written in RustThis blog post will provide an overview of the data engineering tools available in Rust, their advantages and benefits, as well as a discussion on why Rust is a great choice for data engineering.
Why ClickHouse Should Be the Go-To Choice for Your Next Data Platform?Recently, I was working on building a new Logs dashboard at Fossil to serve our internal team for log retrieval, and I found ClickHouse to be a very interesting and fast engine for this purpose. In this post, I'll share my experience with using ClickHouse as the foundation of a light-weight data platform and how it compares to another popular choice, Athena. We'll also explore how ClickHouse can be integrated with other tools such as Kafka to create a robust and efficient data pipeline.