Airflow: controlling parallelism and concurrency (drawing)
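Airflow-level configuration allows for a larger scheduling capacity and frequency, for example the [core] options parallelism, max_active_tasks_per_dag and max_active_runs_per_dag.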
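Airflow-level options can be set in airflow.cfg or through environment variables named AIRFLOW__{SECTION}__{KEY}. The snippet below is a minimal sketch of the environment-variable route, assuming the relevant options are the [core] settings parallelism, max_active_tasks_per_dag and max_active_runs_per_dag. The values and the helper name to_env are illustrative assumptions, not recommendations.

```python
import os

# Illustrative values only (assumptions, not tuning advice).
core_settings = {
    "parallelism": 64,               # upper bound on concurrently running task instances
    "max_active_tasks_per_dag": 32,  # default cap on concurrent tasks for each DAG
    "max_active_runs_per_dag": 8,    # default cap on concurrent runs for each DAG
}


def to_env(section, settings):
    """Map option names to AIRFLOW__{SECTION}__{KEY} variable names (hypothetical helper)."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for key, value in settings.items()
    }


env = to_env("core", core_settings)

# The variables must be in place before the Airflow processes start.
os.environ.update(env)

for name, value in env.items():
    print(f"{name}={value}")
```

In a real deployment these variables would usually be set in the service definition or container environment rather than from Python.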
DAGs have configurations that improve efficiency:
max_active_tasks: Overrides max_active_tasks_per_dag.
max_active_runs: Overrides max_active_runs_per_dag.
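As a rough illustration, the two DAG-level overrides are passed as keyword arguments to the DAG constructor. This is a sketch under stated assumptions: the dag_id, the start date and the limit values are hypothetical, and the Airflow import is deferred inside build_dag so the settings can be inspected without Airflow installed.

```python
from datetime import datetime

# Keyword arguments for airflow.DAG; all values are illustrative assumptions.
dag_kwargs = {
    "dag_id": "example_concurrency_limits",  # hypothetical DAG name
    "start_date": datetime(2024, 1, 1),
    "max_active_tasks": 4,  # overrides max_active_tasks_per_dag for this DAG
    "max_active_runs": 1,   # overrides max_active_runs_per_dag for this DAG
}


def build_dag():
    """Create the DAG; requires Airflow to be installed."""
    from airflow import DAG  # deferred import, see note above

    return DAG(**dag_kwargs)
```

In a DAG file, calling build_dag() at module level and assigning the result to a global variable lets the scheduler discover it.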
Operators or tasks also have configurations that improve efficiency and scheduling priority:
max_active_tis_per_dag: This parameter controls the number of concurrent running task instances across dag_runs per task.
pool: See Pools.
priority_weight: See Priority Weights.
queue: See Queues; for CeleryExecutor deployments only.
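The task-level settings are ordinary operator arguments. The sketch below collects them in one place. The task id, pool name, queue name and numeric values are hypothetical, the pool is assumed to already exist in the Airflow instance, and EmptyOperator from airflow.operators.empty is assumed to be available (its import path can differ between Airflow versions).

```python
# Keyword arguments accepted by Airflow operators; values are illustrative assumptions.
task_kwargs = {
    "task_id": "load_partition",  # hypothetical task name
    "max_active_tis_per_dag": 2,  # at most 2 running instances of this task across DAG runs
    "pool": "db_pool",            # hypothetical pool, assumed to be created beforehand
    "priority_weight": 10,        # higher values are scheduled ahead of lower ones
    "queue": "default",           # only meaningful with CeleryExecutor
}


def build_task(dag):
    """Attach a placeholder task to the given DAG; requires Airflow to be installed."""
    from airflow.operators.empty import EmptyOperator  # deferred import

    return EmptyOperator(dag=dag, **task_kwargs)
```

Any real operator accepts the same four arguments, since they are defined on the base operator class.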