LogoDuyệtSr. Data Engineer
HomeAboutPhotosInsightsCV

Footer

Logo

Resources

  • Rust Tiếng Việt
  • /archives
  • /series
  • /tags
  • Status

me@duyet.net

  • About
  • LinkedIn
  • Resume
  • Projects

© 2026 duyet.net | Sr. Data Engineer | 2026-02-27

Running Spark in GitHub Actions

This is how to quick and easy guide on how to run Apache Spark in GitHub Actions for testing purposes.

Imagine that I start with an example PySpark app.py script that reads data from a JSON file and performs some basic queries.

import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: spark-submit app.py <input>")
        sys.exit(1)

    input_file_path = sys.argv[1]

    spark = SparkSession.builder.appName("Spark").getOrCreate()
    df = spark.read.json(input_file_path)

    df.createOrReplaceTempView("df")

    spark.sql("SELECT * FROM df LIMIT 10").show()
    spark.sql("SELECT COUNT(*) FROM df").show()
    spark.sql("SELECT cn, COUNT(*) FROM df GROUP BY cn ORDER BY 2 DESC").show()
    spark.sql("SELECT cn, MAX(temp) AS max_temp FROM df GROUP BY cn ORDER BY 2 DESC").show()

In local, we can submit it via

spark-submit app.py <input>

To run this PySpark script in GitHub Actions, I've create a workflow file named spark-submit.yaml in the .github/workflows/ directory. The spark-submit.yaml file defines the steps that GitHub Actions should take to run the PySpark script using the spark-submit command.

File: .github/workflows/spark-submit.yaml

name: Spark Submit

on:
  push:
    branches:
      - 'master'

jobs:
  spark-submit:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        python:
          - '3.10'
          - '3.11'
        spark:
          - 3.3.2
          - 3.4.0

    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python }}

      - uses: actions/setup-java@v3
        with:
          java-version: '17'
          distribution: 'temurin'

      - uses: vemonet/setup-spark@v1
        with:
          spark-version: ${{ matrix.spark }}
          hadoop-version: '3'

      - run: spark-submit app.py data.json

Here is the result detail for each run in Github Actions:

The spark-submit.yaml file also specifies the matrix strategy to test the PySpark script on multiple versions of Python and Spark. In this example, the PySpark script will be tested on:

  • Python 3.10, Spark 3.3.2.
  • Python 3.10, Spark 3.4.0.
  • Python 3.11, Spark 3.3.2.
  • Python 3.11, Spark 3.4.0.

This is useful for ensuring that your PySpark script works correctly on different versions of Python and Spark. Additionally, you can add Spark unit tests and run them via Github Actions to ensure that all tests pass before merging PRs.

That's it. You can find all code in my repo: https://github.com/duyet/spark-in-github-actions

References

  • duyet/spark-in-github-actions
  • GitHub Actions Documentation
  • Github Actions matrix strategy
May 7, 2023·3 years ago
|Data|
DataData EngineeringApache SparkGithub
|Edit|

Related Posts

Cài Apache Spark standalone bản pre-built

Mình nhận được nhiều phản hồi từ bài viết BigData - Cài đặt Apache Spark trên Ubuntu 14.04 rằng sao cài khó và phức tạp thế. Thực ra bài viết đó mình hướng dẫn cách build và install từ source.

May 31, 2017·9 years ago
Read more

DuckDB

In this post, I want to explore the features and capabilities of DuckDB, an open-source, in-process SQL OLAP database management system written in C++11 that has been gaining popularity recently. According to what people have said, DuckDB is designed to be easy to use and flexible, allowing you to run complex queries on relational datasets using either local, file-based DuckDB instances or the cloud service MotherDuck.

Sep 3, 2023·2 years ago
Read more

Airflow control the parallelism and concurrency (draw)

How to control parallelism and concurrency

Jul 16, 2023·3 years ago
Read more

GPT vs Traditional NLP Models

The field of Natural Language Processing (NLP) has seen remarkable advancements in recent years, and the emergence of the Generative Pre-trained Transformer (GPT) has revolutionized the way NLP models operate. GPT is a cutting-edge language model that employs deep learning to generate human-like text. Unlike conventional NLP models, which required extensive training on specific tasks, GPT is pre-trained on vast amounts of data and can be fine-tuned for various NLP tasks

Apr 1, 2023·3 years ago
Read more