Troubleshoot Spark 3 Dynamic Allocation and Shuffle Service Issues After ODP Upgrade
After upgrading from Spark2 to Spark3 on ODP, Spark applications may fail when dynamic allocation and the external shuffle service are enabled.
Symptoms can include:
- Spark jobs hanging
- Executor allocation failures
- Shuffle-related exceptions
- PySpark startup failures
- Spark service check failures
Symptoms
Applications run successfully without dynamic allocation:
    spark.dynamicAllocation.enabled=false
but fail when the following settings are enabled:
    spark.dynamicAllocation.enabled=true
    spark.dynamicAllocation.minExecutors
    spark.dynamicAllocation.maxExecutors
    spark.shuffle.service.enabled=true
Cause
During Spark2 → Spark3 migration, shuffle service configuration may not fully align with the Spark3 deployment.
Common causes include:
Incorrect Shuffle Classpath
    yarn.nodemanager.aux-services.spark3_shuffle.classpath
does not point to the Spark3 shuffle libraries.
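One way to check this on a NodeManager host is to expand the configured classpath pattern and look for a shuffle jar. The following is a minimal sketch, not an official tool: the helper name `find_shuffle_jars` is hypothetical, it assumes the property holds a single glob pattern, and it assumes the shuffle jar's file name contains `yarn-shuffle`.

```python
import glob
import os


def find_shuffle_jars(classpath_pattern):
    """Expand a classpath glob and return entries that look like a YARN shuffle jar."""
    matches = glob.glob(classpath_pattern)
    # Assumption: the shuffle jar's file name contains "yarn-shuffle".
    return sorted(p for p in matches if "yarn-shuffle" in os.path.basename(p))


if __name__ == "__main__":
    # Pattern taken from the classpath value used in this article.
    pattern = "/usr/odp/current/spark3-client/aux/*"
    jars = find_shuffle_jars(pattern)
    print(jars if jars else f"No shuffle jar found under {pattern}")
```

An empty result suggests the classpath points at a directory that does not contain the Spark3 shuffle libraries.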
Incorrect Shuffle Service Port
    spark.shuffle.service.port
is configured with a non-functional port.
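A quick way to see whether anything is listening on the configured port is a plain TCP connection attempt. This sketch assumes it is run on a NodeManager host (hence `localhost`) and uses a hypothetical helper name, `port_is_open`. A successful connection only shows that some process accepts connections on that port, not that it is the Spark3 shuffle service.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the script runs on a NodeManager host; 7337 is the port used in this article.
    print(port_is_open("localhost", 7337))
```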
Legacy Spark2 Components
Spark2 symlinks remain present and interfere with Spark3 client execution.
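To see whether such entries are still present, the directory that holds the component symlinks can be listed. The sketch below only reports entries and removes nothing. The base directory `/usr/odp/current` is an assumption inferred from the `spark3-client` path used in this article, and `find_spark2_entries` is a hypothetical helper name.

```python
import os


def find_spark2_entries(base_dir):
    """List entries in base_dir whose names start with 'spark2', with symlink targets."""
    found = []
    for name in sorted(os.listdir(base_dir)):
        if name.startswith("spark2"):
            path = os.path.join(base_dir, name)
            # Record where a symlink points; plain files and directories get None.
            target = os.readlink(path) if os.path.islink(path) else None
            found.append((name, target))
    return found


if __name__ == "__main__":
    base = "/usr/odp/current"  # assumed location, adjust for your cluster
    if os.path.isdir(base):
        for name, target in find_spark2_entries(base):
            print(name, "->", target)
    else:
        print(f"{base} does not exist on this host")
```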
Resolution
Update Shuffle Port
Configure:
    spark.shuffle.service.port=7337
Configure the location of the JAR files for the external shuffle service:
    yarn.nodemanager.aux-services.spark3_shuffle.classpath=/usr/odp/current/spark3-client/aux/*
Restart:
- Spark services
- YARN services
Validate Dynamic Allocation
Re-enable:
    spark.dynamicAllocation.enabled=true
    spark.shuffle.service.enabled=true
Submit a Spark application and verify that executor allocation functions normally.
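For a test submission, the settings can also be passed per job with `--conf` rather than changed cluster-wide. The sketch below only builds and prints the command. The application path `my_job.py`, the executor bounds, and the helper name `build_submit_command` are placeholders chosen for illustration.

```python
import shlex


def build_submit_command(app_path, min_executors=1, max_executors=4):
    """Build a spark-submit argument list with dynamic allocation enabled."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.shuffle.service.enabled": "true",
    }
    cmd = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


if __name__ == "__main__":
    # "my_job.py" is a placeholder application; the command is printed, not executed.
    print(shlex.join(build_submit_command("my_job.py")))
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later if the printed command looks right.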
PySpark Failure After Spark3 Migration
Symptoms
    Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
followed by:
    TypeError: code() argument 13 must be str, not int
Cause
PySpark is launching the Spark2 runtime instead of Spark3.
Resolution
- Verify Spark2 is no longer required.
- Remove obsolete Spark2 references:
    spark2-client
    spark2-historyserver
    spark2-thriftserver
Restart Spark services.
Validate that both commands start successfully:
    spark-shell
    pyspark
Validation
Run:
    spark-shell
Run:
    pyspark
Submit:
    spark-submit
with dynamic allocation enabled.
Confirm:
- Executors are allocated successfully.
- Shuffle operations complete successfully.
- No Livy or YARN errors are reported.
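Executor allocation can also be checked programmatically through Spark's monitoring REST API, which exposes an executor list per application. The following is a rough sketch: `fetch_executors` needs a reachable Spark UI or history server URL and an application ID, both of which you must supply, and the sample payload in the demo is made up for illustration.

```python
import json
from urllib.request import urlopen


def count_active_executors(executors):
    """Count active executors, excluding the driver entry."""
    return sum(1 for e in executors if e.get("id") != "driver" and e.get("isActive"))


def fetch_executors(base_url, app_id):
    """Fetch the executor list for one application from the Spark REST API."""
    with urlopen(f"{base_url}/api/v1/applications/{app_id}/executors") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Made-up payload in the shape of the executor list; no network call is made here.
    sample = [
        {"id": "driver", "isActive": True},
        {"id": "1", "isActive": True},
        {"id": "2", "isActive": False},
    ]
    print(count_active_executors(sample))
```

With dynamic allocation working, the count should rise above zero while a job with pending tasks is running and fall back as idle executors are released.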
Best Practices
- Remove obsolete Spark2 components after migration.
- Verify shuffle service configuration before enabling dynamic allocation.
- Validate Spark shell, PySpark, and Spark submit workflows after upgrade.
- Test production jobs before enabling dynamic allocation in production.
Summary
Spark3 upgrades may expose issues related to external shuffle services, dynamic allocation, and legacy Spark2 references.
Correcting the Spark3 shuffle configuration, validating the shuffle service port, and removing obsolete Spark2 components typically resolve these issues and restore normal Spark operation.