Migrating Spark Jobs to xDP

Overview

This guide helps you package an existing Spark application as a container image and run it on xDP. It covers how to write a Dockerfile using the xDP Spark base image, how to write generic code for both xStore mode and Datastore mode, and how xDP handles credentials and governance automatically at runtime.

xDP supports two ways for Spark jobs to access data:

xStore mode — Spark discovers catalogs dynamically via the xStore plugin. The platform injects all configuration and credentials. Your code uses plain spark.sql() with three-part table names. This is the recommended approach for new jobs when your Compute Cluster is linked to xStore.
Datastore mode — Spark connects directly to registered data stores (S3, ADLS, HDFS, Hive). xDP injects credentials as environment variables. Your code reads those variables and configures connectors.
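To make the xStore description concrete, here is a minimal sketch of what job code might look like in that mode. The helper `three_part_query` and the names `sales_catalog`, `public`, and `orders` are hypothetical placeholders, not part of xDP; the point is only that the job contains no connector or credential configuration.

```python
def three_part_query(catalog: str, schema: str, table: str) -> str:
    """Build a SELECT over a three-part table name (catalog.schema.table)."""
    return f"SELECT * FROM {catalog}.{schema}.{table}"


def main() -> None:
    # Imported inside main() so the helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    # No catalog, endpoint, or credential settings appear here: in xStore mode
    # the platform is described as injecting them at runtime.
    spark = SparkSession.builder.appName("XStoreJob").getOrCreate()

    # Hypothetical names; replace with a catalog registered in your xStore.
    df = spark.sql(three_part_query("sales_catalog", "public", "orders"))
    df.show(5)
    spark.stop()
```

A job script built this way would call `main()` as its entry point.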

For complete working examples, see the public acceldata-io/xdp-examples repository.

Prerequisites

Before you begin, verify the following:

Apache Spark is installed on your target Compute Cluster. See App Spark.
Your data sources are registered — as xStore catalogs (for xStore mode) or as Data Stores (for Datastore mode).

Data Access Modes at a Glance

Aspect	xStore Mode	Datastore Mode
Catalog discovery	Dynamic via xStore	Static, per-datastore config at submit time
Credential handling	Automatic — OAuth2 token injected by platform	Env vars injected per registered Data Store
Code complexity	`spark.sql("SELECT * FROM catalog.schema.table")`	Manual connector config per source type
Governance	xCentral policies enforced automatically	No governance enforcement in job code
When to use	Compute Cluster linked to xStore	No xStore link, or direct raw storage access

Step 1 — Write the Dockerfile

xStore Jobs (PySpark)

xStore jobs use the xDP Spark base image with the xStore connector pre-installed. The image includes the xStore Spark plugin, GVFS, Iceberg, Delta Lake, Hadoop AWS, and all major JDBC drivers — no extra JARs are needed.

    # xStore base image: includes xStore connector, GVFS, Iceberg, Delta Lake, JDBC drivers
    ARG BASE_IMAGE=<BASE_IMAGE>
    FROM ${BASE_IMAGE}

    USER root
    WORKDIR /app

    # xDP injects Kerberos/Hadoop config at runtime for Hive (ODP) catalogs
    # These directories are created here so the platform can mount config into them
    RUN mkdir -p /opt/acceldata/odp_hive_catalog/conf \
        /opt/acceldata/odp_hive_catalog/keytab

    ENV HADOOP_CONF_DIR=/opt/acceldata/odp_hive_catalog/conf \
        KRB5_CONFIG=/opt/acceldata/odp_hive_catalog/conf/krb5.conf

    # Copy application scripts
    COPY xstore/ /app/xstore/
    RUN chmod +x /app/xstore/*.py && chown -R 185:185 /app

    # Run as non-root (UID 185 is the spark user in the xDP base image)
    USER 185

Datastore Jobs (PySpark)

Datastore jobs use the standard xDP Spark image. You copy your application code and install any additional Python packages.

    # Standard xDP Spark base image (Python + JDK)
    ARG BASE_IMAGE=<BASE_IMAGE>
    FROM ${BASE_IMAGE}

    USER root
    WORKDIR /app

    ENV SPARK_HOME=/opt/spark
    ENV SPARK_JARS_DIR=$SPARK_HOME/jars

    # Kerberos / Hadoop config paths (pre-set in the base image)
    ENV HADOOP_CONF_DIR=/etc/hadoop/conf \
        JAVA_SECURITY_KRB5_CONF=/etc/krb5.conf \
        KRB5_CONFIG=/etc/krb5.conf \
        CLASSPATH=$SPARK_JARS_DIR/* \
        PATH=$PATH:$SPARK_JARS_DIR \
        SPARK_USER="185" \
        HADOOP_USER="185"

    # Copy application code
    COPY S3/ /app/S3/
    COPY ADLS/ /app/ADLS/
    COPY ODP/ /app/ODP/

    # Copy extra JARs (e.g., hadoop-aws, azure-storage)
    COPY jars/ /opt/spark/jars/

    # Install Python dependencies in a virtual environment
    RUN apt-get update && apt-get install -y python3-pip python3-venv && \
        python3 -m venv /opt/venv && \
        /opt/venv/bin/pip install --upgrade pip

    # Uncomment to install from a requirements file
    # COPY requirements.txt .
    # RUN /opt/venv/bin/pip install -r requirements.txt

    ENV PATH="/opt/venv/bin:$PATH"

    RUN chmod 777 /app
    USER 185

Java / Scala (both modes)

    # Use the xStore base image for xStore jobs, or the standard image for Datastore jobs
    ARG BASE_IMAGE=<BASE_IMAGE>
    FROM ${BASE_IMAGE}

    USER root
    WORKDIR /app

    # Copy the fat JAR produced by your build tool (Gradle/Maven)
    COPY target/my-spark-job-1.0.0.jar /app/my-spark-job.jar

    USER 185

Build Commands

    # Build for linux/amd64 (required for Kubernetes on x86 nodes)
    docker build --platform linux/amd64 -t my-registry.example.com/my-spark-job:1.0.0 .

    # Override the base image at build time
    docker build --platform linux/amd64 \
      --build-arg BASE_IMAGE=191579300362.dkr.ecr.us-east-1.amazonaws.com/acceldata/xdp/dp/spark:3.5.5-scala2.12-java17-python3-ubuntu-xstore-1.0.2 \
      -t my-registry.example.com/my-spark-job:1.0.0 .

    # Push to your registry
    docker push my-registry.example.com/my-spark-job:1.0.0

Image Best Practices

Practice	Why
Use versioned tags (e.g., `1.0.0`, `git-abc1234`)	Prevents silent changes from breaking production jobs; `latest` is not repeatable
Run as non-root user `185`	Required by Kubernetes security policies on most clusters
Never copy credentials into the image	All credentials are injected at runtime by xDP
Use a `.dockerignore` file	Exclude `.git`, `__pycache__`, local configs, and test data to keep the image small
Pin the base image to a specific digest or tag	Ensures reproducible builds across environments
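The tagging practices above can be checked mechanically, for example in a CI step before the image is pushed. The following is a rough sketch only; `is_pinned` is a hypothetical helper, not an xDP or Docker API, and it treats a reference as pinned when it carries a digest or an explicit tag other than `latest`.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the reference has a digest or an explicit tag other than `latest`."""
    if "@sha256:" in image_ref:
        return True
    # A tag, if present, follows a colon in the final path segment; a colon
    # earlier in the reference belongs to a registry port such as localhost:5000.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag, so Docker would resolve the reference to `latest`
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

For instance, `is_pinned("my-registry.example.com/my-spark-job:1.0.0")` passes, while the same reference without a tag does not.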

Step 2 — Write Datastore Spark Code

Info

Use when your Compute Cluster has no xStore link, or when you need direct access to raw storage paths not managed by xStore.

In Datastore mode, the core principle is no hardcoded values. Every credential, endpoint, path, and table name must come from an environment variable read at runtime.

When you declare a Data Store Dependency on a Spark job in xDP, the platform automatically injects the matching environment variables into the driver and executor pods. Your code reads them with os.environ.get().
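Because `os.environ.get()` returns `None` for an unset variable, a missing Data Store Dependency can surface later as an obscure connector error. One way to fail early is sketched below; `require_env` is a hypothetical helper introduced here for illustration, not something xDP provides.

```python
import os


def require_env(*names: str) -> dict:
    """Return the named environment variables, raising if any is unset or empty."""
    values = {name: os.environ.get(name) for name in names}
    missing = sorted(name for name, value in values.items() if not value)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return values
```

A job could call `require_env("DATASTORE_S3_BUCKET_NAME", "DATASTORE_S3_FILE_PATH")` before creating the Spark session, so a misconfigured submission stops with a message that names the missing variables.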

S3

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("S3Job").getOrCreate()

    # xDP injects these when you add an S3 Data Store Dependency
    access_key = os.environ.get("DATASTORE_AWS_ACCESS_KEY_ID")
    secret_key = os.environ.get("DATASTORE_AWS_SECRET_ACCESS_KEY")
    bucket = os.environ.get("DATASTORE_S3_BUCKET_NAME")
    input_path = os.environ.get("DATASTORE_S3_FILE_PATH")
    output_path = os.environ.get("DATASTORE_S3_FILE_PATH_OUTPUT")

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", access_key)
    hadoop_conf.set("fs.s3a.secret.key", secret_key)

    df = spark.read.csv(f"s3a://{bucket}/{input_path}", header=True, inferSchema=True)
    df.write.csv(f"s3a://{bucket}/{output_path}", header=True, mode="overwrite")

    spark.stop()
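An alternative to setting values on the Hadoop configuration after the session exists is to pass them as `spark.hadoop.*` options while building the session. The sketch below takes that approach and also forwards `DATASTORE_S3_REGION` when it is set. The helper names are hypothetical, and the use of `fs.s3a.endpoint.region` assumes an S3A connector version that honours that option.

```python
import os


def s3a_spark_conf(env=os.environ) -> dict:
    """Map the injected S3 variables to spark.hadoop.* settings."""
    conf = {
        # A KeyError here means the S3 Data Store Dependency was not declared.
        "spark.hadoop.fs.s3a.access.key": env["DATASTORE_AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["DATASTORE_AWS_SECRET_ACCESS_KEY"],
    }
    region = env.get("DATASTORE_S3_REGION")
    if region:
        # Assumption: the S3A connector in the image supports this option.
        conf["spark.hadoop.fs.s3a.endpoint.region"] = region
    return conf


def build_session(app_name: str = "S3Job"):
    """Create a SparkSession with the S3A settings applied at build time."""
    from pyspark.sql import SparkSession  # imported lazily; needs PySpark at runtime

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the mapping in a plain function means it can be unit-tested with a dictionary in place of the real environment.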

ADLS (Azure Data Lake Storage)

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ADLSJob").getOrCreate()

    # xDP injects these when you add an ADLS Data Store Dependency
    account = os.environ.get("DATASTORE_AZURE_STORAGE_ACCOUNT_NAME")
    container = os.environ.get("DATASTORE_AZURE_CONTAINER_NAME")
    client_id = os.environ.get("DATASTORE_AZURE_CLIENT_ID")
    secret = os.environ.get("DATASTORE_AZURE_CLIENT_SECRET")
    tenant_id = os.environ.get("DATASTORE_AZURE_TENANT_ID")
    input_path = os.environ.get("ADLS_FILE_PATH")
    output_path = os.environ.get("ADLS_FILE_PATH_OUTPUT")

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
    hadoop_conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
                    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    hadoop_conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
    hadoop_conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", secret)
    hadoop_conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
                    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

    df = spark.read.csv(f"abfss://{container}@{account}.dfs.core.windows.net/{input_path}", header=True, inferSchema=True)
    df.write.csv(f"abfss://{container}@{account}.dfs.core.windows.net/{output_path}", header=True, mode="overwrite")

    spark.stop()
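The read and write calls above build the `abfss://` URI twice by string interpolation, which yields a double slash if the injected path starts with `/`. A small helper, sketched here under the hypothetical name `abfss_uri`, keeps the URI construction in one place and normalizes the path.

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI, tolerating a leading slash in the supplied path."""
    clean_path = path.lstrip("/")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{clean_path}"
```

With this in place, the job would call `spark.read.csv(abfss_uri(container, account, input_path), ...)` and reuse the same helper for the output path.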

HDFS (with Kerberos)

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("HDFSJob") \
        .config("spark.kerberos.keytab", os.environ.get("KERBEROS_KEYTAB")) \
        .config("spark.kerberos.principal", os.environ.get("KERBEROS_PRINCIPAL")) \
        .getOrCreate()

    # xDP injects these when you add a Hadoop Data Store Dependency
    hdfs_url = os.environ.get("URL")  # e.g. hdfs://namenode:8020
    principal = os.environ.get("KERBEROS_PRINCIPAL")
    keytab = os.environ.get("KERBEROS_KEYTAB")
    input_path = os.environ.get("HDFS_FILE_PATH")
    output_path = os.environ.get("HDFS_FILE_OUTPUT_PATH")

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("hadoop.security.authentication", "kerberos")
    hadoop_conf.set("hadoop.security.authorization", "true")

    UserGroupInformation = spark.sparkContext._jvm.org.apache.hadoop.security.UserGroupInformation
    UserGroupInformation.setConfiguration(hadoop_conf)
    UserGroupInformation.loginUserFromKeytab(principal, keytab)

    df = spark.read.csv(f"{hdfs_url}/{input_path}", header=True, inferSchema=True)
    df.write.csv(f"{hdfs_url}/{output_path}", header=True, mode="overwrite")

    spark.stop()

Info

Non-Kerberos HDFS: For clusters using simple authentication, omit the spark.kerberos.* configs and the UserGroupInformation block. Only URL and HDFS_FILE_PATH are needed (plus HDFS_FILE_OUTPUT_PATH if the job writes output, as the example above does).
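If the same image has to run against both Kerberized and simple-authentication clusters, the choice can be driven by whether the Kerberos variables are present. The sketch below assumes that convention; `kerberos_conf` and `hdfs_path` are hypothetical helpers, not part of xDP.

```python
import os


def kerberos_conf(env=os.environ) -> dict:
    """Return spark.kerberos.* settings, or an empty dict for simple authentication."""
    principal = env.get("KERBEROS_PRINCIPAL")
    keytab = env.get("KERBEROS_KEYTAB")
    if bool(principal) != bool(keytab):
        # One without the other is almost certainly a configuration mistake.
        raise RuntimeError("set both KERBEROS_PRINCIPAL and KERBEROS_KEYTAB, or neither")
    if not principal:
        return {}
    return {"spark.kerberos.principal": principal, "spark.kerberos.keytab": keytab}


def hdfs_path(url: str, path: str) -> str:
    """Join the NameNode URL and a file path with exactly one slash between them."""
    return url.rstrip("/") + "/" + path.lstrip("/")
```

When `kerberos_conf()` returns an empty dictionary, the job would skip both the `spark.kerberos.*` options and the UserGroupInformation login shown in the Kerberos example.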

Hive Metastore

    import os
    from pyspark.sql import SparkSession

    # xDP injects the metastore URI when you add the corresponding Data Store Dependency
    metastore_uri = os.environ.get("HIVE_METASTORE_URIS")  # e.g. thrift://hive-metastore:9083
    input_table = os.environ.get("HIVE_INPUT_TABLE")       # e.g. mydb.source_table
    output_table = os.environ.get("HIVE_OUTPUT_TABLE")     # e.g. mydb.output_table

    spark = SparkSession.builder \
        .appName("HiveJob") \
        .config("spark.hadoop.hive.metastore.uris", metastore_uri) \
        .enableHiveSupport() \
        .getOrCreate()

    df = spark.sql(f"SELECT * FROM {input_table}")
    df.write.mode("overwrite").saveAsTable(output_table)

    spark.stop()
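Since the table names arrive through environment variables and are interpolated into a SQL string, it may be worth checking their shape before use. The following is a minimal sketch with a hypothetical helper, `checked_table_name`, that accepts only `table` or `db.table` made of letters, digits, and underscores; real Hive identifiers can be more permissive, so treat the pattern as an assumption.

```python
import re

# Assumption for this sketch: identifier parts are plain ASCII names.
_PART = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def checked_table_name(name: str) -> str:
    """Return `name` unchanged if it is `table` or `db.table` built from plain identifiers."""
    parts = name.split(".")
    if not 1 <= len(parts) <= 2 or not all(_PART.match(part) for part in parts):
        raise ValueError(f"unexpected table name: {name!r}")
    return name
```

The job would then run `spark.sql(f"SELECT * FROM {checked_table_name(input_table)}")`, so a malformed value fails before any query is issued.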

Step 3 — Submit the Job on xDP

Once your image is built and pushed to the registry, create and run it from the xDP UI.

For a full walkthrough of the job creation wizard (image path, script path, Data Store Dependencies, xStore Catalog toggle, Metalake selection, resource settings, and scheduling) and for the difference between xStore and Datastore submission options, see the xDP product documentation.

Reference

Public Code Examples

The acceldata-io/xdp-examples repository contains ready-to-build examples for Python, Java, and Scala across HDFS, S3, and ADLS. Clone it as a starting point for your own jobs.

xStore Base Image — Included Components

When using the xStore base image, the following components are pre-installed. No additional JARs are needed for any of the supported catalog types.

Component	Purpose
xStore Spark Connector	Enables `catalog.schema.table` SQL access via xStore
xStore GVFS	Enables `gvfs://fileset/...` URI access for HDFS/cloud filesets
Apache Iceberg runtime	Iceberg table format support
Delta Lake	Delta table format support
Hadoop AWS (`hadoop-aws`, `aws-java-sdk-bundle`)	S3/S3-compatible storage
PostgreSQL JDBC	PostgreSQL catalog access
Snowflake JDBC	Snowflake catalog access
Databricks JDBC	Unity Catalog access
Kerberos tooling	`krb5-user`, `kinit` for Hive/HDFS Kerberos auth

Datastore — Environment Variables Injected by xDP

The following variables are available in your job when you declare the corresponding Data Store Dependency.

S3

Variable	Description
`DATASTORE_AWS_ACCESS_KEY_ID`	AWS access key ID
`DATASTORE_AWS_SECRET_ACCESS_KEY`	AWS secret access key
`DATASTORE_S3_BUCKET_NAME`	S3 bucket name
`DATASTORE_S3_REGION`	AWS region (e.g., `us-east-1`)

Pass additional runtime values (file paths, output paths) as Job-level Environment Variables in the xDP UI.

ADLS

Variable	Description
`DATASTORE_AZURE_STORAGE_ACCOUNT_NAME`	Azure Storage Account name
`DATASTORE_AZURE_CONTAINER_NAME`	Blob container name
`DATASTORE_AZURE_CLIENT_ID`	Service principal client ID
`DATASTORE_AZURE_CLIENT_SECRET`	Service principal secret
`DATASTORE_AZURE_TENANT_ID`	Azure AD tenant ID

Hadoop / HDFS

Variable	Description
`URL`	HDFS NameNode URL (e.g., `hdfs://namenode:8020`)
`KERBEROS_PRINCIPAL`	Kerberos principal for authentication
`KERBEROS_KEYTAB`	Path to the keytab file inside the container

Hive Metastore

Variable	Description
`HIVE_METASTORE_URIS`	Thrift URI for the Hive Metastore (e.g., `thrift://hive-metastore:9083`)

Extra JARs for Datastore Mode

Datastore mode jobs may require additional JARs depending on the data source. Download these and place them in your jars/ directory before building your image.

Data Source	Required JARs
S3	`hadoop-aws-3.3.4.jar`, `aws-java-sdk-bundle-1.12.262.jar`
ADLS	`hadoop-azure-3.3.4.jar`, `azure-storage-8.6.6.jar`
Hive	Included in the xDP Spark base image

Download S3 JARs:

    mkdir -p jars
    wget -P jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
    wget -P jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
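The download URLs above follow the standard Maven Central repository layout, so the equivalent URLs for other JARs can be derived from their coordinates. The sketch below shows that derivation; `maven_jar_url` is a hypothetical helper, and the group IDs listed for the ADLS JARs are assumptions that should be verified against Maven Central before use.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Return the Maven Central download URL for a JAR given its coordinates."""
    group_path = group_id.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


# Versions come from the table above; the group IDs are assumed, not confirmed here.
ADLS_JARS = [
    ("org.apache.hadoop", "hadoop-azure", "3.3.4"),
    ("com.microsoft.azure", "azure-storage", "8.6.6"),
]

for coordinates in ADLS_JARS:
    print(maven_jar_url(*coordinates))
```

The printed URLs can be passed to `wget -P jars` in the same way as the S3 JARs.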
