Migrating Spark Jobs to xDP
Overview
This guide helps you package an existing Spark application as a container image and run it on xDP. It covers how to write a Dockerfile using the xDP Spark base image, how to write generic code for both xStore mode and Datastore mode, and how xDP handles credentials and governance automatically at runtime.
xDP supports two ways for Spark jobs to access data:
- xStore mode: Spark discovers catalogs dynamically via the xStore plugin. The platform injects all configuration and credentials. Your code uses plain spark.sql() with three-part table names. This is the recommended approach for new jobs when your Compute Cluster is linked to xStore.
- Datastore mode: Spark connects directly to registered data stores (S3, ADLS, HDFS, Hive). xDP injects credentials as environment variables. Your code reads those variables and configures connectors.
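As an illustration of how little job code xStore mode needs, here is a minimal sketch. The helper names `qualified_name` and `read_table` are hypothetical, and `sales.public.orders` is a made-up table; the sketch assumes a SparkSession whose catalogs the platform has already configured.

```python
import re

# Hypothetical helpers for illustration; catalog, schema, and table values are placeholders.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Join three identifiers into catalog.schema.table, rejecting unexpected characters."""
    for part in (catalog, schema, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unexpected identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


def read_table(spark, catalog: str, schema: str, table: str):
    """Run a plain spark.sql() query; no credentials or connector config in job code."""
    return spark.sql(f"SELECT * FROM {qualified_name(catalog, schema, table)}")
```

In a job this would be called as `read_table(spark, "sales", "public", "orders")`, where `spark` comes from `SparkSession.builder.getOrCreate()`.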
For complete working examples, see the public acceldata-io/xdp-examples repository.
Prerequisites
Before you begin, verify the following:
- Apache Spark is installed on your target Compute Cluster. See App Spark
- Your data sources are registered — as xStore catalogs (for xStore mode) or as Data Stores (for Datastore mode).
Data Access Modes at a Glance
| Aspect | xStore Mode | Datastore Mode |
|---|---|---|
| Catalog discovery | Dynamic via xStore | Static, per-datastore config at submit time |
| Credential handling | Automatic — OAuth2 token injected by platform | Env vars injected per registered Data Store |
| Code complexity | spark.sql("SELECT * FROM catalog.schema.table") | Manual connector config per source type |
| Governance | xCentral policies enforced automatically | No governance enforcement in job code |
| When to use | Compute Cluster linked to xStore | No xStore link, or direct raw storage access |
Step 1 — Write the Dockerfile
xStore Jobs (PySpark)
xStore jobs use the xDP Spark base image with the xStore connector pre-installed. The image includes the xStore Spark plugin, GVFS, Iceberg, Delta Lake, Hadoop AWS, and all major JDBC drivers — no extra JARs are needed.
```dockerfile
# xStore base image: includes xStore connector, GVFS, Iceberg, Delta Lake, JDBC drivers
ARG BASE_IMAGE=<BASE_IMAGE>
FROM ${BASE_IMAGE}

USER root
WORKDIR /app

# xDP injects Kerberos/Hadoop config at runtime for Hive (ODP) catalogs
# These directories are created here so the platform can mount config into them
RUN mkdir -p /opt/acceldata/odp_hive_catalog/conf \
    /opt/acceldata/odp_hive_catalog/keytab

ENV HADOOP_CONF_DIR=/opt/acceldata/odp_hive_catalog/conf \
    KRB5_CONFIG=/opt/acceldata/odp_hive_catalog/conf/krb5.conf

# Copy application scripts
COPY xstore/ /app/xstore/

RUN chmod +x /app/xstore/*.py && chown -R 185:185 /app

# Run as non-root (UID 185 is the spark user in the xDP base image)
USER 185
```
Datastore Jobs (PySpark)
Datastore jobs use the standard xDP Spark image. You copy your application code and install any additional Python packages.
```dockerfile
# Standard xDP Spark base image (Python + JDK)
ARG BASE_IMAGE=<BASE_IMAGE>
FROM ${BASE_IMAGE}

USER root
WORKDIR /app

ENV SPARK_HOME=/opt/spark
ENV SPARK_JARS_DIR=$SPARK_HOME/jars

# Kerberos / Hadoop config paths (pre-set in the base image)
ENV HADOOP_CONF_DIR=/etc/hadoop/conf \
    JAVA_SECURITY_KRB5_CONF=/etc/krb5.conf \
    KRB5_CONFIG=/etc/krb5.conf \
    CLASSPATH=$SPARK_JARS_DIR/* \
    PATH=$PATH:$SPARK_JARS_DIR \
    SPARK_USER="185" \
    HADOOP_USER="185"

# Copy application code
COPY S3/ /app/S3/
COPY ADLS/ /app/ADLS/
COPY ODP/ /app/ODP/

# Copy extra JARs (e.g., hadoop-aws, azure-storage)
COPY jars/ /opt/spark/jars/

# Install Python dependencies in a virtual environment
RUN apt-get update && apt-get install -y python3-pip python3-venv && \
    python3 -m venv /opt/venv && \
    /opt/venv/bin/pip install --upgrade pip

# Uncomment to install from a requirements file
# COPY requirements.txt .
# RUN /opt/venv/bin/pip install -r requirements.txt

ENV PATH="/opt/venv/bin:$PATH"

RUN chmod 777 /app

USER 185
```
Java / Scala (both modes)
```dockerfile
# Use the xStore base image for xStore jobs, or the standard image for Datastore jobs
ARG BASE_IMAGE=<BASE_IMAGE>
FROM ${BASE_IMAGE}

USER root
WORKDIR /app

# Copy the fat JAR produced by your build tool (Gradle/Maven)
COPY target/my-spark-job-1.0.0.jar /app/my-spark-job.jar

USER 185
```
Build Commands
```bash
# Build for linux/amd64 (required for Kubernetes on x86 nodes)
docker build --platform linux/amd64 -t my-registry.example.com/my-spark-job:1.0.0 .

# Override the base image at build time
docker build --platform linux/amd64 \
  --build-arg BASE_IMAGE=191579300362.dkr.ecr.us-east-1.amazonaws.com/acceldata/xdp/dp/spark:3.5.5-scala2.12-java17-python3-ubuntu-xstore-1.0.2 \
  -t my-registry.example.com/my-spark-job:1.0.0 .

# Push to your registry
docker push my-registry.example.com/my-spark-job:1.0.0
```
Image Best Practices
| Practice | Why |
|---|---|
| Use versioned tags (e.g., 1.0.0, git-abc1234) | Prevents silent changes from breaking production jobs; latest is not repeatable |
| Run as non-root user 185 | Required by Kubernetes security policies on most clusters |
| Never copy credentials into the image | All credentials are injected at runtime by xDP |
| Use a .dockerignore file | Exclude .git, __pycache__, local configs, and test data to keep the image small |
| Pin the base image to a specific digest or tag | Ensures reproducible builds across environments |
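The tagging and pinning practices in the table above could be enforced with a small pre-build check. The sketch below is one possible approach, not part of xDP; `is_pinned` is a hypothetical helper that treats a reference as pinned when it carries a digest or any tag other than latest.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the reference has a digest or a tag other than 'latest'."""
    if "@sha256:" in image_ref:
        return True
    # Inspect only the last path component so a registry port is not read as a tag.
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False  # an untagged reference resolves to 'latest' by default
    tag = last_component.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

A CI step might call this on both the base image and the output image and fail the build when either returns False.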
Step 2 — Write Datastore Spark Code
Use this mode when your Compute Cluster has no xStore link, or when you need direct access to raw storage paths not managed by xStore.
In Datastore mode, the core principle is no hardcoded values. Every credential, endpoint, path, and table name must come from an environment variable read at runtime.
When you declare a Data Store Dependency on a Spark job in xDP, the platform automatically injects the matching environment variables into the driver and executor pods. Your code reads them with os.environ.get().
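Because `os.environ.get()` returns None for a missing variable, a forgotten dependency can surface later as a confusing connector error. One way to fail early is a small wrapper like the sketch below; `require_env` is a hypothetical helper, not an xDP API.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "check the Data Store Dependency or job-level variables for this job"
        )
    return value
```

A job would then read, for example, `bucket = require_env("DATASTORE_S3_BUCKET_NAME")` in place of a bare `os.environ.get()` call.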
S3
```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("S3Job").getOrCreate()

# xDP injects these when you add an S3 Data Store Dependency
access_key = os.environ.get("DATASTORE_AWS_ACCESS_KEY_ID")
secret_key = os.environ.get("DATASTORE_AWS_SECRET_ACCESS_KEY")
bucket = os.environ.get("DATASTORE_S3_BUCKET_NAME")
input_path = os.environ.get("DATASTORE_S3_FILE_PATH")
output_path = os.environ.get("DATASTORE_S3_FILE_PATH_OUTPUT")

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)

df = spark.read.csv(f"s3a://{bucket}/{input_path}", header=True, inferSchema=True)
df.write.csv(f"s3a://{bucket}/{output_path}", header=True, mode="overwrite")

spark.stop()
```
ADLS (Azure Data Lake Storage)
```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ADLSJob").getOrCreate()

# xDP injects these when you add an ADLS Data Store Dependency
account = os.environ.get("DATASTORE_AZURE_STORAGE_ACCOUNT_NAME")
container = os.environ.get("DATASTORE_AZURE_CONTAINER_NAME")
client_id = os.environ.get("DATASTORE_AZURE_CLIENT_ID")
secret = os.environ.get("DATASTORE_AZURE_CLIENT_SECRET")
tenant_id = os.environ.get("DATASTORE_AZURE_TENANT_ID")
input_path = os.environ.get("ADLS_FILE_PATH")
output_path = os.environ.get("ADLS_FILE_PATH_OUTPUT")

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
hadoop_conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hadoop_conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
hadoop_conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", secret)
hadoop_conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
                f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

df = spark.read.csv(f"abfss://{container}@{account}.dfs.core.windows.net/{input_path}", header=True, inferSchema=True)
df.write.csv(f"abfss://{container}@{account}.dfs.core.windows.net/{output_path}", header=True, mode="overwrite")

spark.stop()
```
HDFS (with Kerberos)
```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HDFSJob") \
    .config("spark.kerberos.keytab", os.environ.get("KERBEROS_KEYTAB")) \
    .config("spark.kerberos.principal", os.environ.get("KERBEROS_PRINCIPAL")) \
    .getOrCreate()

# xDP injects these when you add a Hadoop Data Store Dependency
hdfs_url = os.environ.get("URL")  # e.g. hdfs://namenode:8020
principal = os.environ.get("KERBEROS_PRINCIPAL")
keytab = os.environ.get("KERBEROS_KEYTAB")
input_path = os.environ.get("HDFS_FILE_PATH")
output_path = os.environ.get("HDFS_FILE_OUTPUT_PATH")

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("hadoop.security.authentication", "kerberos")
hadoop_conf.set("hadoop.security.authorization", "true")

UserGroupInformation = spark.sparkContext._jvm.org.apache.hadoop.security.UserGroupInformation
UserGroupInformation.setConfiguration(hadoop_conf)
UserGroupInformation.loginUserFromKeytab(principal, keytab)

df = spark.read.csv(f"{hdfs_url}/{input_path}", header=True, inferSchema=True)
df.write.csv(f"{hdfs_url}/{output_path}", header=True, mode="overwrite")

spark.stop()
```
Non-Kerberos HDFS: For clusters using simple authentication, omit the spark.kerberos.* configs and the UserGroupInformation block. Only URL and HDFS_FILE_PATH are needed.
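If one script has to serve both Kerberos and simple-authentication clusters, the choice described in the note above could be driven by whether the Kerberos variables are present. This is a sketch under that assumption; `kerberos_spark_conf` is a hypothetical helper.

```python
def kerberos_spark_conf(env) -> dict:
    """Return spark.kerberos.* settings when both variables are set, else an empty dict."""
    principal = env.get("KERBEROS_PRINCIPAL")
    keytab = env.get("KERBEROS_KEYTAB")
    if principal and keytab:
        return {
            "spark.kerberos.principal": principal,
            "spark.kerberos.keytab": keytab,
        }
    if principal or keytab:
        # One without the other is almost certainly a misconfiguration.
        raise ValueError("Set both KERBEROS_PRINCIPAL and KERBEROS_KEYTAB, or neither")
    return {}


# Intended use inside the job (needs pyspark and os, so shown as comments):
#   builder = SparkSession.builder.appName("HDFSJob")
#   for key, value in kerberos_spark_conf(os.environ).items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

When the function returns an empty dict, the job would also skip the UserGroupInformation login step.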
Hive Metastore
```python
import os
from pyspark.sql import SparkSession

# xDP injects the metastore URI when you add a Data Store Dependency
metastore_uri = os.environ.get("HIVE_METASTORE_URIS")  # e.g. thrift://hive-metastore:9083
input_table = os.environ.get("HIVE_INPUT_TABLE")       # e.g. mydb.source_table
output_table = os.environ.get("HIVE_OUTPUT_TABLE")     # e.g. mydb.output_table

spark = SparkSession.builder \
    .appName("HiveJob") \
    .config("spark.hadoop.hive.metastore.uris", metastore_uri) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql(f"SELECT * FROM {input_table}")
df.write.mode("overwrite").saveAsTable(output_table)

spark.stop()
```
Step 3 — Submit the Job on xDP
Once your image is built and pushed to the registry, create and run the job from the xDP UI.
- For a full walkthrough of the job creation wizard — image path, script path, Data Store Dependencies, xStore Catalog toggle, Metalake selection, resource settings, and scheduling
- For the difference between xStore and Datastore submission options
Reference
Public Code Examples
The acceldata-io/xdp-examples repository contains ready-to-build examples for Python, Java, and Scala across HDFS, S3, and ADLS. Clone it as a starting point for your own jobs.
xStore Base Image — Included Components
When using the xStore base image, the following components are pre-installed. No additional JARs are needed for any of the supported catalog types.
| Component | Purpose |
|---|---|
| xStore Spark Connector | Enables catalog.schema.table SQL access via xStore |
| xStore GVFS | Enables gvfs://fileset/... URI access for HDFS/cloud filesets |
| Apache Iceberg runtime | Iceberg table format support |
| Delta Lake | Delta table format support |
| Hadoop AWS (hadoop-aws, aws-java-sdk-bundle) | S3/S3-compatible storage |
| PostgreSQL JDBC | PostgreSQL catalog access |
| Snowflake JDBC | Snowflake catalog access |
| Databricks JDBC | Unity Catalog access |
| Kerberos tooling | krb5-user, kinit for Hive/HDFS Kerberos auth |
Datastore — Environment Variables Injected by xDP
The following variables are available in your job when you declare the corresponding Data Store Dependency.
S3
| Variable | Description |
|---|---|
| DATASTORE_AWS_ACCESS_KEY_ID | AWS access key ID |
| DATASTORE_AWS_SECRET_ACCESS_KEY | AWS secret access key |
| DATASTORE_S3_BUCKET_NAME | S3 bucket name |
| DATASTORE_S3_REGION | AWS region (e.g., us-east-1) |
Pass additional runtime values (file paths, output paths) as Job-level Environment Variables in the xDP UI.
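The S3 variables above can be turned into Spark settings in one place. The sketch below is illustrative: `s3a_spark_conf` and `s3a_uri` are hypothetical helpers, and it assumes the hadoop-aws version in use recognizes `fs.s3a.endpoint.region` (present in the 3.3.x line as far as I know; check the version you ship).

```python
def s3a_spark_conf(env) -> dict:
    """Map the injected S3 variables to spark.hadoop.fs.s3a.* settings."""
    conf = {
        "spark.hadoop.fs.s3a.access.key": env["DATASTORE_AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["DATASTORE_AWS_SECRET_ACCESS_KEY"],
    }
    region = env.get("DATASTORE_S3_REGION")
    if region:
        # Assumption: the bundled hadoop-aws honors this property.
        conf["spark.hadoop.fs.s3a.endpoint.region"] = region
    return conf


def s3a_uri(bucket: str, path: str) -> str:
    """Build an s3a:// URI, tolerating a leading slash in the job-level path."""
    return f"s3a://{bucket}/{path.lstrip('/')}"
```

Passing these through `SparkSession.builder.config(key, value)` before `getOrCreate()` relies on Spark copying `spark.hadoop.*` properties into the Hadoop configuration, which avoids reaching into `_jsc` from job code.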
ADLS
| Variable | Description |
|---|---|
| DATASTORE_AZURE_STORAGE_ACCOUNT_NAME | Azure Storage Account name |
| DATASTORE_AZURE_CONTAINER_NAME | Blob container name |
| DATASTORE_AZURE_CLIENT_ID | Service principal client ID |
| DATASTORE_AZURE_CLIENT_SECRET | Service principal secret |
| DATASTORE_AZURE_TENANT_ID | Azure AD tenant ID |
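The ADLS variables all feed property names that share the same account host suffix, so a job could derive them from one mapping. A minimal sketch, using the same Hadoop property names as the ADLS sample in Step 2; `adls_oauth_conf` and `abfss_uri` are hypothetical helpers.

```python
def adls_oauth_conf(env) -> dict:
    """Build the per-account OAuth properties from the injected ADLS variables."""
    host = f"{env['DATASTORE_AZURE_STORAGE_ACCOUNT_NAME']}.dfs.core.windows.net"
    tenant_id = env["DATASTORE_AZURE_TENANT_ID"]
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": env["DATASTORE_AZURE_CLIENT_ID"],
        f"fs.azure.account.oauth2.client.secret.{host}": env["DATASTORE_AZURE_CLIENT_SECRET"],
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(env, path: str) -> str:
    """Build an abfss:// URI for a path inside the injected container."""
    account = env["DATASTORE_AZURE_STORAGE_ACCOUNT_NAME"]
    container = env["DATASTORE_AZURE_CONTAINER_NAME"]
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"
```

Each key and value in the returned mapping can be applied with `hadoop_conf.set(key, value)` in a loop.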
Hadoop / HDFS
| Variable | Description |
|---|---|
| URL | HDFS NameNode URL (e.g., hdfs://namenode:8020) |
| KERBEROS_PRINCIPAL | Kerberos principal for authentication |
| KERBEROS_KEYTAB | Path to the keytab file inside the container |
Hive Metastore
| Variable | Description |
|---|---|
| HIVE_METASTORE_URIS | Thrift URI for the Hive Metastore (e.g., thrift://hive-metastore:9083) |
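A malformed metastore URI tends to fail late, at the first table access. A job could validate the value at startup with something like the sketch below. `parse_metastore_uris` is a hypothetical helper; it accepts a comma-separated list because Hive's hive.metastore.uris setting allows one, though xDP may only ever inject a single URI.

```python
from urllib.parse import urlparse


def parse_metastore_uris(value: str) -> list:
    """Split and validate thrift://host:port entries from HIVE_METASTORE_URIS."""
    uris = [item.strip() for item in value.split(",") if item.strip()]
    if not uris:
        raise ValueError("HIVE_METASTORE_URIS is empty")
    for uri in uris:
        parsed = urlparse(uri)
        if parsed.scheme != "thrift" or not parsed.hostname or parsed.port is None:
            raise ValueError(f"expected thrift://host:port, got {uri!r}")
    return uris
```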
Extra JARs for Datastore Mode
Datastore mode jobs may require additional JARs depending on the data source. Download these and place them in your jars/ directory before building your image.
| Data Source | Required JARs |
|---|---|
| S3 | hadoop-aws-3.3.4.jar, aws-java-sdk-bundle-1.12.262.jar |
| ADLS | hadoop-azure-3.3.4.jar, azure-storage-8.6.6.jar |
| Hive | Included in the xDP Spark base image |
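Before running docker build, a short check can confirm that the JARs listed in the table above are present in jars/. This is a sketch rather than an xDP tool; `REQUIRED_JARS` simply restates the table and `missing_jars` is a hypothetical helper.

```python
from pathlib import Path

# File names copied from the table above; adjust if you use other versions.
REQUIRED_JARS = {
    "S3": ["hadoop-aws-3.3.4.jar", "aws-java-sdk-bundle-1.12.262.jar"],
    "ADLS": ["hadoop-azure-3.3.4.jar", "azure-storage-8.6.6.jar"],
}


def missing_jars(source: str, jars_dir: str = "jars") -> list:
    """Return the required JAR file names for a data source that are not in jars_dir."""
    directory = Path(jars_dir)
    return [name for name in REQUIRED_JARS[source] if not (directory / name).is_file()]
```

An empty list from `missing_jars("S3")` means the build context has everything the COPY jars/ step expects for S3.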
Download S3 JARs:
```bash
mkdir -p jars
wget -P jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
wget -P jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
```
Related Documentation
For additional help, contact our Support Team!
©2026, Acceldata Inc — All Rights Reserved.