Run a Spark Application
This guide shows how to submit an Apache Spark job with a SparkApplication custom resource, monitor it to completion, and inspect the result. It uses the built-in Spark Pi example that ships in the Spark image.
Prerequisites
- Alauda Build of Spark Operator is installed and its CSV reports Succeeded (see Installation).
- kubectl access to the target cluster.
1. Create a namespace and Spark RBAC
In cluster mode the Spark driver creates and manages its executor pods, so the driver's ServiceAccount needs permission to manage pods (and the services / configmaps Spark uses) in its namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: spark-demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver
  namespace: spark-demo
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["create", "get", "list", "watch", "delete", "deletecollection", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-driver
  namespace: spark-demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-driver
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-demo
Save as spark-rbac.yaml and apply:
kubectl apply -f spark-rbac.yaml
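Before submitting a job, it can help to confirm that the binding took effect. The sketch below uses `kubectl auth can-i` with ServiceAccount impersonation to probe a subset of the verbs granted in the Role. `check_driver_rbac` is a hypothetical helper name, and the sketch assumes your own user is allowed to impersonate ServiceAccounts.

```shell
#!/bin/sh
# Sketch: probe the driver ServiceAccount's permissions with `kubectl auth can-i`.
# check_driver_rbac is a hypothetical helper, not part of kubectl or the operator.
check_driver_rbac() {
  ns="${1:-spark-demo}"
  sa="${2:-spark}"
  missing=0
  for resource in pods services configmaps persistentvolumeclaims; do
    for verb in create get list watch delete; do
      # Impersonate the ServiceAccount; kubectl prints "yes" or "no".
      answer=$(kubectl auth can-i "$verb" "$resource" \
        --as="system:serviceaccount:${ns}:${sa}" -n "$ns" 2>/dev/null) || true
      if [ "$answer" != "yes" ]; then
        echo "missing: $verb $resource"
        missing=$((missing + 1))
      fi
    done
  done
  # Succeed only when every probed permission was granted.
  [ "$missing" -eq 0 ]
}
```

After sourcing the function, `check_driver_rbac spark-demo spark` prints one line per missing permission and returns a non-zero status if any are missing.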
2. Submit a SparkApplication
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-demo
spec:
  type: Scala
  mode: cluster
  image: docker.io/library/spark:4.0.1
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar"
  arguments: ["100"]
  sparkVersion: "4.0.1"
  restartPolicy:
    type: Never
  driver:
    coreRequest: "500m"
    coreLimit: "1000m"
    memory: 512m
    serviceAccount: spark
    securityContext:
      runAsUser: 185
      runAsGroup: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities: { drop: ["ALL"] }
      seccompProfile: { type: RuntimeDefault }
  executor:
    instances: 2
    coreRequest: "500m"
    coreLimit: "1000m"
    memory: 512m
    securityContext:
      runAsUser: 185
      runAsGroup: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities: { drop: ["ALL"] }
      seccompProfile: { type: RuntimeDefault }
- spec.image: the Apache Spark runtime image for the driver and executors. On an air-gapped cluster, use the copy relocated into the platform registry (recorded in the CSV relatedImages) or your internal mirror, e.g. build-harbor.alauda.cn/mlops/spark:4.0.1.
- spec.mainApplicationFile: path or URL to the application artifact. local:// refers to a path inside the image; the Spark examples jar ships in the Spark image.
- spec.arguments: arguments passed to the application; for Spark Pi this is the number of partitions.
- spec.driver.serviceAccount: must reference the ServiceAccount created in step 1.
- spec.driver.securityContext / spec.executor.securityContext: required to satisfy a restricted Pod Security Admission policy. The Spark image runs as UID/GID 185.
- spec.executor.instances: number of executor pods to run.
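For the air-gapped case, the image reference in the manifest has to point at the mirror. The following is a minimal sketch of that rewrite, assuming the mirror keeps the image's final path component and tag (as in the build-harbor.alauda.cn/mlops/spark:4.0.1 example) and that the manifest has a single-line, unquoted `image:` entry. Both helper names are hypothetical.

```shell
#!/bin/sh
# relocate_image (hypothetical helper): keep the last path component and tag
# of an image reference and prefix it with the mirror path.
relocate_image() {
  image="$1"   # e.g. docker.io/library/spark:4.0.1
  mirror="$2"  # e.g. build-harbor.alauda.cn/mlops
  printf '%s/%s\n' "${mirror%/}" "${image##*/}"
}

# rewrite_manifest_image (hypothetical helper): read a manifest on stdin and
# rewrite every "image: <ref>" line; all other lines pass through unchanged.
# Assumes unquoted, single-line image references.
rewrite_manifest_image() {
  mirror="$1"
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *"image: "*)
        prefix="${line%%image: *}"   # leading indentation
        ref="${line#*image: }"       # original image reference
        printf '%simage: %s\n' "$prefix" "$(relocate_image "$ref" "$mirror")"
        ;;
      *)
        printf '%s\n' "$line"
        ;;
    esac
  done
}
```

For example, `rewrite_manifest_image build-harbor.alauda.cn/mlops < spark-pi.yaml > spark-pi-mirror.yaml` writes a copy of the manifest that pulls from the mirror; check the relocated path against the CSV relatedImages before relying on it.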
Save as spark-pi.yaml and apply:
kubectl apply -f spark-pi.yaml
3. Monitor the application
The operator reports progress in .status.applicationState.state, which moves through SUBMITTED → RUNNING → COMPLETED.
# high-level status
kubectl get sparkapplication spark-pi -n spark-demo
# just the state field
kubectl get sparkapplication spark-pi -n spark-demo \
-o jsonpath='{.status.applicationState.state}{"\n"}'
# driver and executor pods (driver pod is named <app-name>-driver)
kubectl get pods -n spark-demo
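In a script or CI job, the state check above can be wrapped in a polling loop. The sketch below treats COMPLETED as success and FAILED as failure; FAILED is not mentioned in this guide and is assumed here to be the operator's failure state, while any other value is treated as still in progress. `wait_for_spark_app` is a hypothetical helper name, and the timeout and interval defaults are arbitrary.

```shell
#!/bin/sh
# wait_for_spark_app (hypothetical helper): poll .status.applicationState.state
# until the application completes, fails, or the timeout expires.
# Returns 0 on COMPLETED, 1 on FAILED, 2 on timeout.
wait_for_spark_app() {
  app="$1"
  ns="$2"
  timeout="${3:-600}"   # seconds; keep interval >= 1 so the loop terminates
  interval="${4:-5}"
  elapsed=0
  state=""
  while [ "$elapsed" -le "$timeout" ]; do
    state=$(kubectl get sparkapplication "$app" -n "$ns" \
      -o jsonpath='{.status.applicationState.state}' 2>/dev/null) || state=""
    case "$state" in
      COMPLETED) echo "state: $state"; return 0 ;;
      FAILED)    echo "state: $state"; return 1 ;;  # assumed failure state
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s (last state: ${state:-none})"
  return 2
}
```

A call such as `wait_for_spark_app spark-pi spark-demo 900 10` blocks until the run finishes, so the next step can branch on the exit status.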
4. Verify the result
When the application reaches COMPLETED, check the driver log for the computed value of Pi:
kubectl logs spark-pi-driver -n spark-demo | grep "Pi is roughly"
You should see a single line that begins with "Pi is roughly", followed by an approximation of Pi close to 3.14.
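To check the result in a script rather than by eye, the value can be pulled out of the driver log and sanity-checked. This is a minimal sketch; `extract_pi` and `pi_looks_sane` are hypothetical helper names, and the 3.0 to 3.3 tolerance is an arbitrary choice for illustration.

```shell
#!/bin/sh
# extract_pi (hypothetical helper): read driver log text on stdin and print
# the number that follows "Pi is roughly" on the first matching line.
extract_pi() {
  sed -n 's/^Pi is roughly \([0-9][0-9.]*\).*$/\1/p' | head -n 1
}

# pi_looks_sane (hypothetical helper): succeed when the value lies strictly
# between 3.0 and 3.3, a loose tolerance chosen for this sketch.
pi_looks_sane() {
  awk -v v="$1" 'BEGIN { exit !(v + 0 > 3.0 && v + 0 < 3.3) }'
}
```

For example, `pi_looks_sane "$(kubectl logs spark-pi-driver -n spark-demo | extract_pi)"` succeeds only if the log contains a plausible estimate.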
5. Clean up
kubectl delete sparkapplication spark-pi -n spark-demo
# optionally remove the demo namespace
kubectl delete namespace spark-demo
Schedule a recurring job
To run a Spark application on a schedule, wrap the same application spec in a ScheduledSparkApplication:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: spark-demo
spec:
  schedule: "@every 5m"
  concurrencyPolicy: Forbid
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 1
  template:
    type: Scala
    mode: cluster
    image: docker.io/library/spark:4.0.1
    imagePullPolicy: IfNotPresent
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar"
    arguments: ["100"]
    sparkVersion: "4.0.1"
    restartPolicy:
      type: Never
    driver:
      coreRequest: "500m"
      coreLimit: "1000m"
      memory: 512m
      serviceAccount: spark
    executor:
      instances: 2
      coreRequest: "500m"
      coreLimit: "1000m"
      memory: 512m
- spec.schedule: a cron expression (or @every <duration>) controlling when runs start.
- spec.concurrencyPolicy: Allow, Forbid, or Replace; controls overlapping runs.
- spec.template: the SparkApplication spec to run on each tick.
Inspect the schedule and its generated runs:
kubectl get scheduledsparkapplication spark-pi-scheduled -n spark-demo
kubectl get sparkapplication -n spark-demo
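When the namespace also holds one-off applications, it can be useful to list only the runs that belong to a given schedule. The sketch below filters by name prefix, on the assumption that generated runs are named after the schedule with a suffix appended; verify this against the names your cluster actually shows. `list_scheduled_runs` is a hypothetical helper name.

```shell
#!/bin/sh
# list_scheduled_runs (hypothetical helper): print the names of
# SparkApplications whose name starts with "<schedule-name>-".
# Assumes generated runs are named "<schedule-name>-<suffix>".
list_scheduled_runs() {
  sched="$1"
  ns="$2"
  kubectl get sparkapplication -n "$ns" -o name 2>/dev/null |
    sed 's|.*/||' |              # strip the "<resource>/" prefix from -o name
    grep "^${sched}-" || true    # no matches is not treated as an error
}
```

Running `list_scheduled_runs spark-pi-scheduled spark-demo` prints one run name per line, or nothing if the schedule has not fired yet.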
Common SparkApplication fields
For the full schema, see the upstream Spark Operator API reference.