Troubleshoot Spark 3 Dynamic Allocation and Shuffle Service Issues After ODP Upgrade

After upgrading from Spark2 to Spark3 on ODP, Spark applications may fail when dynamic allocation and the external shuffle service are enabled.

Symptoms can include:

Spark jobs hanging
Executor allocation failures
Shuffle-related exceptions
PySpark startup failures
Spark service check failures

Symptoms

Applications run successfully without dynamic allocation:

    spark.dynamicAllocation.enabled=false

but fail when the following settings are enabled:

    spark.dynamicAllocation.enabled=true
    spark.dynamicAllocation.minExecutors
    spark.dynamicAllocation.maxExecutors
    spark.shuffle.service.enabled=true

Cause

During Spark2 → Spark3 migration, shuffle service configuration may not fully align with the Spark3 deployment.

Common causes include:

Incorrect Shuffle Classpath

    yarn.nodemanager.aux-services.spark3_shuffle.classpath

does not point to the Spark3 shuffle libraries.
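
One way to test this cause is to check whether the directory behind the classpath value actually contains a YARN shuffle JAR. The following is a minimal sketch, not a definitive check: the function name `find_shuffle_jars` is hypothetical, and the file pattern assumes the usual Spark naming `spark-<version>-yarn-shuffle.jar`, which may differ in a given ODP build.

```shell
# Hypothetical helper: list YARN shuffle JARs that sit directly inside the
# directory the spark3_shuffle classpath points to.
# Assumption: the JAR is named spark-<version>-yarn-shuffle.jar.
find_shuffle_jars() {
  local dir="$1" jar found=1
  for jar in "$dir"/spark-*-yarn-shuffle*.jar; do
    # If nothing matches, the pattern stays literal and -e is false.
    if [ -e "$jar" ]; then
      printf '%s\n' "$jar"
      found=0
    fi
  done
  # Exit status 0 if at least one JAR was found, 1 otherwise.
  return "$found"
}

# Example, run on a NodeManager host (directory taken from the Resolution
# section of this article):
#   find_shuffle_jars /usr/odp/current/spark3-client/aux
```

If the function prints nothing and returns a non-zero status, the classpath probably does not reach the Spark3 shuffle libraries.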

Incorrect Shuffle Service Port

    spark.shuffle.service.port

is configured with a non-functional port.
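
To see which port is configured and whether anything answers on it, a sketch along the following lines may help. Both function names are hypothetical, the parser only handles simple `key value` or `key=value` lines, and the connection test relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command being available.

```shell
# Hypothetical helper: print the value of a key from a Spark properties file
# (for example spark-defaults.conf). Handles "key value" and "key=value"
# lines; the last occurrence wins. Returns non-zero if the key is absent.
get_spark_conf() {
  local file="$1" key="$2"
  awk -v k="$key" '
    /^[ \t]*#/ { next }                 # skip comment lines
    {
      line = $0
      sub(/^[ \t]+/, "", line)          # trim leading whitespace
      n = match(line, /[= \t]/)         # first separator character
      if (n == 0) next
      name = substr(line, 1, n - 1)
      value = substr(line, n + 1)
      sub(/^[= \t]+/, "", value)        # drop remaining separators
      sub(/[ \t]+$/, "", value)         # trim trailing whitespace
      if (name == k) v = value
    }
    END { if (v != "") print v; else exit 1 }
  ' "$file"
}

# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within 3 seconds. Assumes bash with /dev/tcp support.
port_is_listening() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example (the conf path and host name are placeholders for your cluster):
#   port=$(get_spark_conf /path/to/spark-defaults.conf spark.shuffle.service.port)
#   port_is_listening nodemanager-host "$port" && echo "shuffle port reachable"
```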

Legacy Spark2 Components

Spark2 symlinks remain present and interfere with Spark3 client execution.
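
Leftover Spark2 entries can be listed before deciding what to remove. This is a minimal sketch: the function name `list_spark2_links` is hypothetical, and the directory in the usage comment is an assumption based on the `/usr/odp/current/` path used elsewhere in this article.

```shell
# Hypothetical helper: list entries named spark2-* in a directory and show
# the target of each symlink, including dangling ones.
list_spark2_links() {
  local dir="$1" entry
  for entry in "$dir"/spark2-*; do
    if [ -L "$entry" ]; then
      # Symlink: print where it points.
      printf '%s -> %s\n' "$entry" "$(readlink "$entry")"
    elif [ -e "$entry" ]; then
      # Regular file or directory with a spark2-* name.
      printf '%s\n' "$entry"
    fi
  done
}

# Example (assumed location of the client symlinks):
#   list_spark2_links /usr/odp/current
```

The function only reports entries; it does not delete anything.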

Resolution

Update Shuffle Port

Configure:

    spark.shuffle.service.port=7337

Configure the location of the JAR files for the external shuffle service:

    yarn.nodemanager.aux-services.spark3_shuffle.classpath=/usr/odp/current/spark3-client/aux/*

Restart:

    Spark Services
    YARN Services

Validate Dynamic Allocation

Re-enable:

    spark.dynamicAllocation.enabled=true
    spark.shuffle.service.enabled=true

Submit a Spark application and verify that executor allocation functions normally.
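
For a test submission, the settings above can also be passed on the command line. The wrapper below is a sketch under stated assumptions: the function name and the `DRY_RUN` switch are hypothetical, and the executor bounds are supplied by the caller because this article does not prescribe values.

```shell
# Hypothetical wrapper: submit an application to YARN with dynamic allocation
# and the external shuffle service enabled.
# Usage: submit_with_dynamic_allocation MIN_EXECUTORS MAX_EXECUTORS APP [ARGS...]
# Set DRY_RUN=1 to print the command instead of running it.
submit_with_dynamic_allocation() {
  local min_exec="$1" max_exec="$2"
  shift 2
  ${DRY_RUN:+echo} spark-submit \
    --master yarn \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors="$min_exec" \
    --conf spark.dynamicAllocation.maxExecutors="$max_exec" \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.shuffle.service.port=7337 \
    "$@"
}

# Example (the bounds and the application path are placeholders):
#   DRY_RUN=1 submit_with_dynamic_allocation 1 4 /path/to/app.py
```

Running with `DRY_RUN=1` first makes it easy to review the exact flags before a real submission.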

PySpark Failure After Spark3 Migration

Symptoms

    Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set

followed by:

    TypeError: code() argument 13 must be str, not int

Cause

PySpark is launching the Spark2 runtime instead of Spark3.
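
To confirm which installation a launcher resolves to, it can help to follow the symlink chain behind the command on `PATH`. The function name `resolve_launcher` below is hypothetical, and the sketch assumes a `readlink` that supports `-f` (GNU coreutils).

```shell
# Hypothetical helper: print the fully resolved path of a command found on
# PATH, following any symlinks. Returns non-zero if the command is not found.
resolve_launcher() {
  local name="$1" path
  path=$(command -v "$name") || return 1
  readlink -f "$path"
}

# Example:
#   resolve_launcher pyspark
#   echo "SPARK_MAJOR_VERSION=${SPARK_MAJOR_VERSION:-<unset>}"
```

A resolved path that lands in a Spark2 directory is consistent with the cause described above.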

Resolution

Verify Spark2 is no longer required.
Remove obsolete Spark2 references:

    spark2-client
    spark2-historyserver
    spark2-thriftserver

Restart Spark services.

Validate:

    spark-shell

and

    pyspark

start successfully.

Validation

Run:

    spark-shell

Run:

    pyspark

Submit:

    spark-submit

with dynamic allocation enabled.

Confirm:

Executors are allocated successfully.
Shuffle operations complete successfully.
No Livy or YARN errors are reported.

Best Practices

Remove obsolete Spark2 components after migration.
Verify shuffle service configuration before enabling dynamic allocation.
Validate Spark shell, PySpark, and Spark submit workflows after upgrade.
Test production jobs before enabling dynamic allocation in production.

Summary

Spark3 upgrades may expose issues related to external shuffle services, dynamic allocation, and legacy Spark2 references.

Correcting the Spark3 shuffle configuration, validating the shuffle service port, and removing obsolete Spark2 components typically resolve these issues and restore normal Spark operation.
