Apache Spark on Docker
Note: This post is over 11 years old. The information may be outdated.
Docker and Spark are two powerful technologies that remain highly relevant in 2025. While the original docker-spark repository demonstrates basic containerization, this guide has been updated to reflect modern best practices.
Updated for 2025: This post was originally published in 2015. It has been significantly revised to replace deprecated tools (boot2docker), use current Spark versions (3.4+), add Docker Compose examples, include PySpark workflows, and mention Kubernetes deployment patterns.
Install Docker (2025 Edition)
Docker Desktop (Recommended for 2024+)
Install Docker Desktop for your platform:
- macOS: Docker Desktop for Mac
- Windows: Docker Desktop for Windows
- Linux: Docker Engine Installation
Ubuntu/Debian Linux
sudo apt-get update
sudo apt-get install docker.io docker-compose
sudo usermod -aG docker $USER
newgrp docker
Verify Installation
docker run hello-world
This will pull and run the hello-world image, confirming Docker is working correctly.
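If you script your setup, you can check for a working Docker binary before proceeding. A minimal sketch (the version-string format `Docker version X.Y.Z, build …` is the common output shape; treat the parsing as an assumption, not a guarantee):

```python
import re
import subprocess


def parse_docker_version(output: str):
    """Extract the semantic version from `docker --version` output.

    Typical output looks like: "Docker version 24.0.2, build cb74dfc"
    (the build suffix varies by installation).
    """
    match = re.search(r"Docker version (\d+\.\d+\.\d+)", output)
    return match.group(1) if match else None


def docker_version():
    """Run `docker --version` and return the parsed version,
    or None if the binary is missing or the output is unrecognized."""
    try:
        result = subprocess.run(
            ["docker", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_docker_version(result.stdout)
```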
Modern Spark on Docker Context (2025)
When containerizing Apache Spark in 2025, consider these important points:
- Use Current Spark Versions: Spark 3.4+ provides significant performance improvements and Python 3.10+ support
- Deployment Architecture:
- Docker Compose: Ideal for local development and testing
- Kubernetes: Industry standard for production workloads (see Spark on Kubernetes)
- Best Practices:
- Use lightweight base images (e.g., `eclipse-temurin:11-jre-slim` or `python:3.11-slim`)
- Implement security scanning for container images
- Use minimal, multi-stage builds
- PySpark Support: Modern deployments typically support both Scala and Python APIs
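To illustrate the multi-stage build practice, here is a hedged Dockerfile sketch for a small PySpark driver application; `requirements.txt` and `app.py` are placeholders for your own project files, not part of any official image:

```dockerfile
# Stage 1: install dependencies in a throwaway builder image
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: copy only the installed packages into a slim runtime image,
# keeping build tooling out of the final container
FROM python:3.11-slim
COPY --from=builder /install /usr/local
WORKDIR /app
COPY app.py .
CMD ["python", "app.py"]
```

The final image carries only the runtime and installed packages, which keeps it small and reduces the attack surface flagged by image scanners.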
Pull the image from Docker Hub
docker pull duyetdev/docker-spark
Building the image
docker build --rm -t duyetdev/docker-spark .
Running the Image
Historical Note: Boot2Docker
Note: boot2docker was deprecated in favor of Docker Desktop (2016+). If you're using a legacy system, skip this section.
Modern Approach: Docker Compose (Recommended)
For containerized Spark with proper networking and resource management, use Docker Compose:
version: '3.8'
services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
      - '4040:4040'
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080']
      interval: 30s
      timeout: 10s
      retries: 3
  spark-worker:
    image: bitnami/spark:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    ports:
      - '8081:8081'
Then run:
docker-compose up -d
Access the Spark Master UI at http://localhost:8080
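Containers take a few seconds to become healthy, so scripts that follow `docker-compose up -d` often need to wait for the UI. A minimal polling helper, assuming only that the Master UI answers HTTP 200 on its port:

```python
import time
import urllib.error
import urllib.request


def wait_for_ui(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse.

    Handy for blocking a script until the Spark Master UI
    (e.g. http://localhost:8080) is reachable after startup.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

For example, `wait_for_ui("http://localhost:8080", timeout=120)` returns `True` once the Master UI is serving pages, and `False` if it never comes up within two minutes.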
Legacy: Direct Docker Run (Not Recommended)
For quick testing, you can also run a single standalone container directly:
docker run -it -p 8080:8080 -p 7077:7077 \
-h spark-master \
bitnami/spark:latest
Testing Spark (2025 Edition)
Using Spark Shell (Scala)
Connect to the running container and test with Scala:
docker-compose exec spark-master spark-shell --master spark://spark-master:7077
# In the Spark shell:
scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000
Using PySpark (Python) - Recommended
Modern Spark workflows typically use Python. Test with PySpark:
docker-compose exec spark-master pyspark --master spark://spark-master:7077
# In the PySpark shell:
>>> sc.parallelize(range(1, 1001)).count()
1000
>>> df = spark.createDataFrame([(i, i*2) for i in range(1, 100)], ["id", "value"])
>>> df.show(5)
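Conceptually, `parallelize` splits the input into partitions, each worker counts its partition locally, and the driver sums the partial counts. A minimal pure-Python sketch of that idea (no Spark required; the helper names are illustrative, not Spark APIs):

```python
def split_into_partitions(data, num_partitions):
    """Split `data` into roughly equal contiguous partitions,
    mimicking how parallelize distributes elements."""
    data = list(data)
    base, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder
        partitions.append(data[start:start + size])
        start += size
    return partitions


def distributed_count(data, num_partitions=4):
    """Count elements by summing per-partition counts, the same
    shape of computation count() performs on an RDD."""
    partitions = split_into_partitions(data, num_partitions)
    per_partition = [len(p) for p in partitions]  # "workers" count locally
    return sum(per_partition)                     # "driver" aggregates


# distributed_count(range(1, 1001)) -> 1000, matching the PySpark result above
```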
Submit a Spark Job
For production-like testing, submit a job:
docker-compose exec spark-master spark-submit \
--master spark://spark-master:7077 \
--class org.apache.spark.examples.SparkPi \
/opt/spark/examples/jars/spark-examples_2.13-3.4.0.jar 100
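SparkPi estimates π by Monte Carlo sampling: it scatters random points in the unit square and counts how many fall inside the unit circle. A single-process Python sketch of the same computation (Spark's version distributes the sampling loop across workers):

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: the fraction of random points in
    the square [-1, 1] x [-1, 1] that land inside the unit circle
    approaches pi / 4 as num_samples grows."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = 0
    for _ in range(num_samples):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples
```

The `100` argument passed to SparkPi above plays a similar role to `num_samples` here: more slices/samples give a tighter estimate at the cost of more computation.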
Monitor with Web UI
- Spark Master UI: http://localhost:8080
- Spark Worker UI: http://localhost:8081
- Application UI: http://localhost:4040 (while job is running)
Additional Resources for 2025
Related Posts
Installing the pre-built standalone Apache Spark
I received a lot of feedback on the post "BigData - Installing Apache Spark on Ubuntu 14.04" asking why the installation was so difficult and complicated. That post actually walked through building and installing Spark from source.
PySpark - Missing Python libraries on Workers
Apache Spark runs on a cluster; with Java this is straightforward. With Python, the required packages must be installed on every Worker node, otherwise you will run into missing-library errors.
PySpark Getting Started
Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).
Installing Apache Spark on Ubuntu 14.04
While exploring BigData options for some projects, I decided to go with Apache Spark instead of Hadoop. According to the Apache Spark homepage, it runs up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk, and it is compatible with most distributed storage systems (HDFS, HBase, Cassandra, ...). You can implement algorithms on Spark using Java, Scala, or Python.