Run Spark Rapids
This page describes how to run Spark jobs on GPUs using the NVIDIA RAPIDS Accelerator, covering setup, configuration, and verification.
Prerequisites
- Ensure you have access to a cluster with GPU nodes and required permissions.
- Java, Hadoop, Spark, and Hive are already installed and accessible in your environment.
- CUDA libraries compatible with your RAPIDS version are installed.
Steps to Run Spark Rapids
- Download the required JAR files.
wget http://repo1.acceldata.dev/repository/odp-central/com/nvidia/rapids-4-spark_2.12/25.06.0.3.3.6.2-1/rapids-4-spark_2.12-25.06.0.3.3.6.2-1-cuda11.jar
wget https://repo1.maven.org/maven2/ai/rapids/cudf/25.06.0/cudf-25.06.0-cuda11.jar
- Set environment variables.
export HIVE_HOME=/usr/odp/3.3.6.2-1/hive
export SPARK_HOME=/usr/odp/3.3.6.2-1/spark3
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure the variables above reflect your cluster's directory structure.
- Validate CUDA availability.
Before running your Spark job, check CUDA availability:
nvidia-smi
This command lists the available GPUs, the driver version, and the highest CUDA version the driver supports.
- Launch spark-shell with the RAPIDS plugin.
$SPARK_HOME/bin/spark-shell \
--master yarn \
--conf spark.yarn.queue=GPU \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.1 \
--conf spark.resources.discoveryScript=$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
--conf spark.executor.resource.gpu.discoveryScript=$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
--conf spark.metrics.enabled=false \
--jars rapids-4-spark_2.12-25.06.0.3.3.6.2-1-cuda11.jar,cudf-25.06.0-cuda11.jar
Adjust the script paths and versions based on your actual deployment.
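With spark.executor.resource.gpu.amount=1 and spark.task.resource.gpu.amount=0.1, each executor is allocated one GPU and up to 10 tasks can share it concurrently; tune the task amount to control how many tasks run on a GPU at the same time.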
- Run a sample job.
In the Spark shell, try running a basic DataFrame operation to test GPU acceleration:
val df = spark.range(1, 1000000)
df.selectExpr("id", "id * 2 as double_id").show()
or
val df = spark.range(1, 100000000).toDF("id")
val result = df.groupBy($"id" % 100).count()
result.show()
Monitor the Spark UI (typically at port 4040) to verify that GPU resources are being allocated and used for the tasks.
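To confirm that a query is actually planned on the GPU, you can also inspect the physical plan from the shell. When the RAPIDS plugin handles an operation, the corresponding operators appear with a Gpu prefix (for example GpuRange or GpuProject), while operations that fall back to the CPU keep their usual names. A minimal check using the first sample DataFrame above:
// Print the physical plan; GPU-enabled operators carry a "Gpu" prefix
df.selectExpr("id", "id * 2 as double_id").explain()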
- Validate the job execution.
- Check the Spark application logs (via the YARN ResourceManager) for any RAPIDS library loading or GPU assignment errors.
- Confirm RAPIDS acceleration is being used by looking for log entries that reference com.nvidia.spark.rapids; you can also verify the plugin settings from the running shell (see the snippet after this list).
- You can also enable additional debug logging for more visibility:
--conf spark.rapids.sql.logging.enabled=true
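As a quick sanity check that the plugin configuration took effect, you can read the settings back inside the running spark-shell session; the expected values below assume the launch command shown earlier:
// Values should match the --conf options passed on the spark-shell command line
spark.sparkContext.getConf.get("spark.plugins")            // expect com.nvidia.spark.SQLPlugin
spark.sparkContext.getConf.get("spark.rapids.sql.enabled") // expect true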
Optional Steps
- Tuning: Adjust spark.executor.memory, spark.executor.cores, and spark.executor.instances for optimal performance.
- Library Version Check: Make sure the Spark, CUDA, and cuDF versions are compatible.
- Python Jobs: If running with PySpark, update the procedure above accordingly (e.g., launch pyspark instead of spark-shell, with the same --conf and --jars options).