Apache Spark on Docker
Docker and Spark are two powerful technologies that remain highly relevant in 2025. While the original docker-spark repository demonstrates basic containerization, this guide has been updated to reflect modern best practices.
Updated for 2025: This post was originally published in 2015. It has been significantly revised to replace deprecated tools (boot2docker), use current Spark versions (3.4+), add Docker Compose examples, include PySpark workflows, and mention Kubernetes deployment patterns.
Install Docker (2025 Edition)
Docker Desktop (Recommended)
Install Docker Desktop for your platform:
- macOS: Docker Desktop for Mac
- Windows: Docker Desktop for Windows
- Linux: Docker Engine Installation
Ubuntu/Debian Linux
sudo apt-get update
sudo apt-get install docker.io docker-compose
# Add your user to the docker group so docker can run without sudo
sudo usermod -aG docker $USER
newgrp docker
Verify Installation
docker run hello-world
This will pull and run the hello-world image, confirming Docker is working correctly.
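You can also confirm that both the Docker CLI and Compose are on your PATH (version numbers will vary by install):
docker --version
docker-compose --version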
Modern Spark on Docker Context (2025)
When containerizing Apache Spark in 2025, consider these important points:
- Use Current Spark Versions: Spark 3.4+ provides significant performance improvements and Python 3.10+ support
- Deployment Architecture:
- Docker Compose: Ideal for local development and testing
- Kubernetes: Industry standard for production workloads (see Spark on Kubernetes)
- Best Practices:
- Use lightweight base images (e.g., eclipse-temurin:11-jre-slim or python:3.11-slim)
- Implement security scanning for container images
- Use minimal, multi-stage builds (see the Dockerfile sketch after this list)
- PySpark Support: Modern deployments typically support both Scala and Python APIs
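As a minimal sketch of the multi-stage pattern, the Dockerfile below builds a hypothetical Maven-based Spark application in a builder stage and copies only the resulting jar onto a Spark runtime image. The pom.xml, src/ layout, and app.jar artifact name are placeholders for your own project:
# Build stage: compile the application with Maven
FROM maven:3.9-eclipse-temurin-11 AS builder
WORKDIR /build
COPY pom.xml .
COPY src ./src
RUN mvn -q package -DskipTests

# Runtime stage: copy only the built jar onto the Spark image
FROM bitnami/spark:3.4
COPY --from=builder /build/target/app.jar /opt/app/app.jar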
Pull the image from Docker Hub
docker pull duyetdev/docker-spark
Building the image
docker build --rm -t duyetdev/docker-spark .
Running the Image
Historical Note: Boot2Docker
Note: boot2docker was deprecated in favor of Docker Desktop (2016+). Unless you are maintaining a legacy setup, skip straight to the Docker Compose approach below.
Modern Approach: Docker Compose (Recommended)
For containerized Spark with proper networking and resource management, use Docker Compose:
version: '3.8'
services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
      - '4040:4040'
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080']
      interval: 30s
      timeout: 10s
      retries: 3
  spark-worker:
    image: bitnami/spark:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    ports:
      - '8081:8081'
Then run:
docker-compose up -d
Access the Spark Master UI at http://localhost:8080
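Because Compose manages the worker as a service, you can add local capacity by scaling it. Note that the fixed 8081:8081 host mapping above only allows a single worker, so remove the host side of that mapping (i.e., use just '8081') before scaling:
docker-compose up -d --scale spark-worker=3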
Legacy: Direct Docker Run (Not Recommended)
For quick single-container testing (the Bitnami image starts in master mode by default):
docker run -it -p 8080:8080 -p 7077:7077 \
-h spark-master \
bitnami/spark:latest
Testing Spark (2025 Edition)
Using Spark Shell (Scala)
Connect to the running container and test with Scala:
docker-compose exec spark-master spark-shell --master spark://spark-master:7077
# In the Spark shell:
scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000
Using PySpark (Python) - Recommended
Modern Spark workflows typically use Python. Test with PySpark:
docker-compose exec spark-master pyspark --master spark://spark-master:7077
# In the PySpark shell:
>>> sc.parallelize(range(1, 1001)).count()
1000
>>> df = spark.createDataFrame([(i, i*2) for i in range(1, 100)], ["id", "value"])
>>> df.show(5)
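For a slightly fuller DataFrame example using only standard Spark 3.x APIs, you can aggregate inside the same shell:
>>> from pyspark.sql import functions as F
>>> spark.range(1, 1001).withColumn("bucket", F.col("id") % 10) \
...     .groupBy("bucket").agg(F.sum("id").alias("total")).show(3)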
Submit a Spark Job
For production-like testing, submit the bundled SparkPi example. The examples jar ships inside the image, so the path and version suffix must match the Spark and Scala build you are running; with the Bitnami image the jars live under /opt/bitnami/spark/examples/jars/:
docker-compose exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.4.0.jar 100
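The same kind of smoke test works through the Python API using the bundled pi.py example (the path below assumes the Bitnami image layout; adjust it for other images):
docker-compose exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  /opt/bitnami/spark/examples/src/main/python/pi.py 100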
Monitor with Web UI
- Spark Master UI: http://localhost:8080
- Spark Worker UI: http://localhost:8081
- Application UI: http://localhost:4040 (while job is running)
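For scripted monitoring, the standalone master UI also exposes its state (workers, running applications) as JSON:
curl -s http://localhost:8080/json/ | python3 -m json.tool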