Migrating Spark Jobs to xDP

Overview

This guide helps you package an existing Spark application as a container image and run it on xDP. It covers how to write a Dockerfile using the xDP Spark base image, how to write generic code for both xStore mode and Datastore mode, and how xDP handles credentials and governance automatically at runtime.

xDP supports two ways for Spark jobs to access data:

  • xStore mode — Spark discovers catalogs dynamically via the xStore plugin. The platform injects all configuration and credentials. Your code uses plain spark.sql() with three-part table names. This is the recommended approach for new jobs when your Compute Cluster is linked to xStore.
  • Datastore mode — Spark connects directly to registered data stores (S3, ADLS, HDFS, Hive). xDP injects credentials as environment variables. Your code reads those variables and configures connectors.
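
The three-part naming used in xStore mode can be sketched as follows. This is a minimal illustration, not part of xDP: the helper names `qualified_table` and `select_all` and the catalog, schema, and table names are hypothetical. The resulting string is what a job would pass to spark.sql().

```python
import re

# Hypothetical helpers: xDP does not require them. They only illustrate
# the catalog.schema.table naming used in xStore mode.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Return 'catalog.schema.table' after validating each part."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _IDENTIFIER.match(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return ".".join(parts)


def select_all(catalog: str, schema: str, table: str) -> str:
    """Build the SQL text that a job would pass to spark.sql()."""
    return f"SELECT * FROM {qualified_table(catalog, schema, table)}"


if __name__ == "__main__":
    # Placeholder names, not real xStore catalogs.
    print(select_all("sales_catalog", "public", "orders"))
```

Validating each part before building the SQL string keeps a table name supplied through configuration from injecting extra SQL.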

For complete working examples, see the public acceldata-io/xdp-examples repository.

Prerequisites

Before you begin, verify the following:

  • Apache Spark is installed on your target Compute Cluster. See App Spark.
  • Your data sources are registered — as xStore catalogs (for xStore mode) or as Data Stores (for Datastore mode).

Data Access Modes at a Glance

| Aspect | xStore Mode | Datastore Mode |
| --- | --- | --- |
| Catalog discovery | Dynamic via xStore | Static, per-datastore config at submit time |
| Credential handling | Automatic: OAuth2 token injected by platform | Env vars injected per registered Data Store |
| Code complexity | spark.sql("SELECT * FROM catalog.schema.table") | Manual connector config per source type |
| Governance | xCentral policies enforced automatically | No governance enforcement in job code |
| When to use | Compute Cluster linked to xStore | No xStore link, or direct raw storage access |

Step 1 — Write the Dockerfile

xStore Jobs (PySpark)

xStore jobs use the xDP Spark base image with the xStore connector pre-installed. The image includes the xStore Spark plugin, GVFS, Iceberg, Delta Lake, Hadoop AWS, and all major JDBC drivers — no extra JARs are needed.


Datastore Jobs (PySpark)

Datastore jobs use the standard xDP Spark image. You copy your application code and install any additional Python packages.


Java / Scala (both modes)


Build Commands


Image Best Practices

| Practice | Why |
| --- | --- |
| Use versioned tags (e.g., 1.0.0, git-abc1234) | Prevents silent changes from breaking production jobs; latest is not repeatable |
| Run as non-root user 185 | Required by Kubernetes security policies on most clusters |
| Never copy credentials into the image | All credentials are injected at runtime by xDP |
| Use a .dockerignore file | Exclude .git, __pycache__, local configs, and test data to keep the image small |
| Pin the base image to a specific digest or tag | Ensures reproducible builds across environments |
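
The tagging and pinning practices can be checked mechanically, for example in a CI step before a job is registered. The sketch below is an assumption about how such a check might look; `is_pinned` is a hypothetical helper and the image references are placeholders.

```python
def is_pinned(image_ref: str) -> bool:
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image_ref:
        return True
    # Only look for a tag after the last '/', so a registry port such as
    # registry.example.com:5000 is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the runtime falls back to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    # Placeholder references for illustration only.
    for ref in ("registry.example.com:5000/team/job:1.0.0", "team/job:latest"):
        print(ref, is_pinned(ref))
```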

Step 2 — Write Datastore Spark Code

Use Datastore mode when your Compute Cluster has no xStore link, or when you need direct access to raw storage paths not managed by xStore.

In Datastore mode, the core principle is no hardcoded values. Every credential, endpoint, path, and table name must come from an environment variable read at runtime.

When you declare a Data Store Dependency on a Spark job in xDP, the platform automatically injects the matching environment variables into the driver and executor pods. Your code reads them with os.environ.get().
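
One way to apply the no-hardcoded-values principle is to read every required variable up front and fail fast when one is missing. This is a minimal sketch; `require_env` and `load_settings` are hypothetical helper names, not part of xDP.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


def load_settings(names):
    """Read several variables at once, reporting every missing one together."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {n: os.environ[n] for n in names}
```

Failing at startup, before a SparkSession is created, surfaces a missing Data Store Dependency as a clear error rather than as a connector failure deep inside the job.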

S3

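A minimal sketch of S3 access in Datastore mode, assuming the Hadoop S3A connector and the DATASTORE_* variables listed in the Reference section. The helper names and the INPUT_PATH variable are hypothetical; INPUT_PATH stands in for a Job-level Environment Variable you would define yourself.

```python
import os


def s3a_spark_conf(env=os.environ) -> dict:
    """Map the injected S3 variables to Hadoop S3A settings for Spark."""
    return {
        "spark.hadoop.fs.s3a.access.key": env["DATASTORE_AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["DATASTORE_AWS_SECRET_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.endpoint.region": env["DATASTORE_S3_REGION"],
    }


def s3a_path(env=os.environ, key_var: str = "INPUT_PATH") -> str:
    """Build an s3a:// URI from the bucket variable and a job-level path variable."""
    bucket = env["DATASTORE_S3_BUCKET_NAME"]
    key = env.get(key_var, "").lstrip("/")
    return f"s3a://{bucket}/{key}"


# Usage inside a job (requires pyspark in the image):
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("s3-job")
#   for k, v in s3a_spark_conf().items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
#   df = spark.read.parquet(s3a_path())
```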

ADLS (Azure Data Lake Storage)

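A minimal sketch of ADLS access in Datastore mode. It assumes the Hadoop ABFS driver with service-principal (OAuth client credentials) authentication and the Azure public cloud endpoint dfs.core.windows.net; your environment may use a different driver or endpoint. The helper names are hypothetical.

```python
import os


def abfs_spark_conf(env=os.environ) -> dict:
    """Map the injected Azure variables to ABFS OAuth settings for Spark."""
    account = env["DATASTORE_AZURE_STORAGE_ACCOUNT_NAME"]
    host = f"{account}.dfs.core.windows.net"  # assumption: Azure public cloud
    tenant = env["DATASTORE_AZURE_TENANT_ID"]
    prefix = "spark.hadoop.fs.azure.account"
    return {
        f"{prefix}.auth.type.{host}": "OAuth",
        f"{prefix}.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"{prefix}.oauth2.client.id.{host}": env["DATASTORE_AZURE_CLIENT_ID"],
        f"{prefix}.oauth2.client.secret.{host}": env["DATASTORE_AZURE_CLIENT_SECRET"],
        f"{prefix}.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant}/oauth2/token",
    }


def abfs_path(relative_path: str, env=os.environ) -> str:
    """Build an abfss:// URI for a path inside the injected container."""
    account = env["DATASTORE_AZURE_STORAGE_ACCOUNT_NAME"]
    container = env["DATASTORE_AZURE_CONTAINER_NAME"]
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"
```

Each returned key can be passed to SparkSession.builder.config() before the session is created.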

HDFS (with Kerberos)

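A minimal sketch of HDFS configuration in Datastore mode, assuming the URL, KERBEROS_PRINCIPAL, KERBEROS_KEYTAB, and HDFSFILEPATH variables. Setting fs.defaultFS from URL is an assumption about how a job might wire the NameNode address, and the helper names are hypothetical. The JVM-side UserGroupInformation login is not shown here.

```python
import os


def hdfs_spark_conf(env=os.environ) -> dict:
    """Build Spark settings for HDFS, adding Kerberos settings only when present."""
    conf = {"spark.hadoop.fs.defaultFS": env["URL"]}
    principal = env.get("KERBEROS_PRINCIPAL")
    keytab = env.get("KERBEROS_KEYTAB")
    if principal and keytab:
        # Standard Spark 3.x settings for keytab-based login.
        conf["spark.kerberos.principal"] = principal
        conf["spark.kerberos.keytab"] = keytab
    return conf


def hdfs_path(env=os.environ) -> str:
    """Join the NameNode URL and the file path into one hdfs:// URI."""
    return env["URL"].rstrip("/") + "/" + env["HDFSFILEPATH"].lstrip("/")
```

Because the Kerberos settings are added only when both variables are set, the same function serves clusters that use simple authentication.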

Non-Kerberos HDFS: For clusters using simple authentication, omit the spark.kerberos.* configs and the UserGroupInformation block. Only URL and HDFSFILEPATH are needed.

Hive Metastore

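A minimal sketch of Hive Metastore configuration in Datastore mode, assuming the HIVE_METASTORE_URIS variable holds one or more comma-separated thrift:// URIs. The helper name `hive_spark_conf` is hypothetical.

```python
import os


def hive_spark_conf(env=os.environ) -> dict:
    """Build Spark settings that point the session at an external Hive Metastore."""
    uris = env["HIVE_METASTORE_URIS"]
    for uri in uris.split(","):
        if not uri.strip().startswith("thrift://"):
            raise ValueError(f"expected a thrift:// URI, got {uri!r}")
    return {
        # Same effect as calling enableHiveSupport() on the session builder.
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": uris,
    }


# Usage inside a job (requires pyspark in the image):
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("hive-job")
#   for k, v in hive_spark_conf().items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
#   spark.sql("SHOW DATABASES").show()
```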

Step 3 — Submit the Job on xDP

Once your image is built and pushed to the registry, create and run the job from the xDP UI.

  • The job creation wizard covers the image path, script path, Data Store Dependencies, the xStore Catalog toggle, Metalake selection, resource settings, and scheduling.
  • Submission options differ between xStore mode and Datastore mode.

Reference

Public Code Examples

The acceldata-io/xdp-examples repository contains ready-to-build examples for Python, Java, and Scala across HDFS, S3, and ADLS. Clone it as a starting point for your own jobs.

xStore Base Image — Included Components

When using the xStore base image, the following components are pre-installed. No additional JARs are needed for any of the supported catalog types.

| Component | Purpose |
| --- | --- |
| xStore Spark Connector | Enables catalog.schema.table SQL access via xStore |
| xStore GVFS | Enables gvfs://fileset/... URI access for HDFS/cloud filesets |
| Apache Iceberg runtime | Iceberg table format support |
| Delta Lake | Delta table format support |
| Hadoop AWS (hadoop-aws, aws-java-sdk-bundle) | S3/S3-compatible storage |
| PostgreSQL JDBC | PostgreSQL catalog access |
| Snowflake JDBC | Snowflake catalog access |
| Databricks JDBC | Unity Catalog access |
| Kerberos tooling | krb5-user, kinit for Hive/HDFS Kerberos auth |

Datastore — Environment Variables Injected by xDP

The following variables are available in your job when you declare the corresponding Data Store Dependency.

S3

| Variable | Description |
| --- | --- |
| DATASTORE_AWS_ACCESS_KEY_ID | AWS access key ID |
| DATASTORE_AWS_SECRET_ACCESS_KEY | AWS secret access key |
| DATASTORE_S3_BUCKET_NAME | S3 bucket name |
| DATASTORE_S3_REGION | AWS region (e.g., us-east-1) |

Pass additional runtime values (file paths, output paths) as Job-level Environment Variables in the xDP UI.

ADLS

| Variable | Description |
| --- | --- |
| DATASTORE_AZURE_STORAGE_ACCOUNT_NAME | Azure Storage Account name |
| DATASTORE_AZURE_CONTAINER_NAME | Blob container name |
| DATASTORE_AZURE_CLIENT_ID | Service principal client ID |
| DATASTORE_AZURE_CLIENT_SECRET | Service principal secret |
| DATASTORE_AZURE_TENANT_ID | Azure AD tenant ID |

Hadoop / HDFS

| Variable | Description |
| --- | --- |
| URL | HDFS NameNode URL (e.g., hdfs://namenode:8020) |
| KERBEROS_PRINCIPAL | Kerberos principal for authentication |
| KERBEROS_KEYTAB | Path to the keytab file inside the container |

Hive Metastore

| Variable | Description |
| --- | --- |
| HIVE_METASTORE_URIS | Thrift URI for the Hive Metastore (e.g., thrift://hive-metastore:9083) |

Extra JARs for Datastore Mode

Datastore mode jobs may require additional JARs depending on the data source. Download these and place them in your jars/ directory before building your image.

| Data Source | Required JARs |
| --- | --- |
| S3 | hadoop-aws-3.3.4.jar, aws-java-sdk-bundle-1.12.262.jar |
| ADLS | hadoop-azure-3.3.4.jar, azure-storage-8.6.6.jar |
| Hive | Included in the xDP Spark base image |

Download S3 JARs:

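A minimal sketch for locating the S3 JARs, assuming they are fetched from Maven Central using the standard repository layout. The `maven_jar_url` helper is hypothetical; the coordinates match the JAR names and versions in the table above.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# (group ID, artifact ID, version) for the S3 JARs listed in the table.
S3_JARS = [
    ("org.apache.hadoop", "hadoop-aws", "3.3.4"),
    ("com.amazonaws", "aws-java-sdk-bundle", "1.12.262"),
]


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the download URL for a JAR using the Maven repository layout."""
    group_path = group_id.replace(".", "/")
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


if __name__ == "__main__":
    # Print the URLs; download each one into your jars/ directory.
    for coords in S3_JARS:
        print(maven_jar_url(*coords))
```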