Troubleshoot Spark 3 Dynamic Allocation and Shuffle Service Issues After ODP Upgrade

After upgrading from Spark2 to Spark3 on ODP, Spark applications may fail when dynamic allocation and the external shuffle service are enabled.

Symptoms can include:

  • Spark jobs hanging
  • Executor allocation failures
  • Shuffle-related exceptions
  • PySpark startup failures
  • Spark service check failures

Symptoms

Applications run successfully when dynamic allocation is disabled, but fail when dynamic allocation and the external shuffle service are enabled.
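As a minimal sketch of the two runs being compared, the helper below prints the --conf flags that switch both features on or off together. The property names spark.dynamicAllocation.enabled and spark.shuffle.service.enabled are standard Spark settings; the helper name dyn_alloc_flags is hypothetical, and the exact commands used in the original comparison are not known.

```bash
# Print spark-submit flags that turn dynamic allocation and the external
# shuffle service on ("true") or off ("false") together.
# dyn_alloc_flags is an illustrative helper, not part of Spark or ODP.
dyn_alloc_flags() {
  enabled="$1"
  printf '%s\n' \
    "--conf" "spark.dynamicAllocation.enabled=${enabled}" \
    "--conf" "spark.shuffle.service.enabled=${enabled}"
}
```

Assuming a YARN cluster and placeholder values for the class and jar, the same application could be submitted once with spark-submit --master yarn $(dyn_alloc_flags false) --class <main-class> <application-jar> and once with $(dyn_alloc_flags true). If only the second run fails, the shuffle service configuration is a likely place to look.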

Cause

During Spark2 → Spark3 migration, shuffle service configuration may not fully align with the Spark3 deployment.

Common causes include:

Incorrect Shuffle Classpath

The classpath configured in YARN for the external shuffle service does not point to the Spark3 shuffle libraries.

Incorrect Shuffle Service Port

The shuffle service is configured with a non-functional port.
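One way to check whether a configured port is functional is to look for a listening socket on a NodeManager host, where the external shuffle service runs. The sketch below assumes the Linux ss utility and its usual ss -ltn column layout. The helper name port_is_listening is hypothetical, and 7337 in the usage note is only Spark's documented default for spark.shuffle.service.port, not necessarily the value used on an ODP cluster.

```bash
# Read `ss -ltn`-style output on stdin and succeed (exit 0) if any local
# address ends in the given port; exit 1 otherwise.
# port_is_listening is an illustrative helper, not a Spark or ODP tool.
port_is_listening() {
  port="$1"
  awk -v p="$port" '
    NR > 1 {                      # skip the header line
      n = split($4, parts, ":")   # column 4 is "Local Address:Port"
      if (parts[n] == p) found = 1
    }
    END { exit found ? 0 : 1 }'
}
```

On a NodeManager host this could be used as ss -ltn | port_is_listening 7337 && echo "shuffle port is listening". A missing listener on the configured port would be consistent with the cause described here.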

Legacy Spark2 Components

Spark2 symlinks remain present and interfere with Spark3 client execution.

Resolution

Update Shuffle Port

Set the shuffle service port to a functional port.

Configure the location of the jar files for the external shuffle service (property yarn.nodemanager.aux-services.spar.) so that it points to the Spark3 shuffle libraries.
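To see which jars the NodeManager is configured to load for the shuffle service, the value can be read back from yarn-site.xml. This is a rough sketch that assumes each <name> and <value> element fits on a single line, as in typical generated Hadoop configuration files. The helper name get_property is hypothetical, and the file path and full property name in the usage note are assumptions that depend on the aux-service name configured on the cluster.

```bash
# Print the <value> of the first property whose <name> matches exactly.
# Prints nothing if the property is absent.
# get_property is an illustrative helper; it is not a full XML parser.
get_property() {
  file="$1"
  name="$2"
  awk -v n="$name" '
    index($0, "<name>" n "</name>") { hit = 1 }
    hit && match($0, /<value>[^<]*<\/value>/) {
      # strip the 7-character <value> and 8-character </value> tags
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }' "$file"
}
```

Assuming the configuration lives in /etc/hadoop/conf and the aux service is named spark_shuffle, a check might look like get_property /etc/hadoop/conf/yarn-site.xml yarn.nodemanager.aux-services.spark_shuffle.classpath. The printed path should refer to the Spark3 installation rather than Spark2.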

Restart the YARN NodeManagers, which host the external shuffle service, so that the new configuration takes effect.

Validate Dynamic Allocation

Re-enable dynamic allocation and the external shuffle service.

Submit a Spark application and verify that executor allocation functions normally.

PySpark Failure After Spark3 Migration

Symptoms

PySpark fails during startup after the migration to Spark3.

Cause

PySpark is launching the Spark2 runtime instead of Spark3.
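A quick way to confirm this cause is to resolve the launcher found on the PATH to its real location and see whether it still lands in a Spark2 directory. The sketch assumes a readlink that supports -f (as in GNU coreutils) and that Spark2 install paths contain the string spark2. The helper name points_to_spark2 is hypothetical.

```bash
# Exit 0 if the fully resolved path of the given launcher contains "spark2",
# exit 1 if it does not, and exit 2 if the path cannot be resolved.
# points_to_spark2 is an illustrative helper, not part of Spark or ODP.
points_to_spark2() {
  target=$(readlink -f "$1") || return 2
  case "$target" in
    *spark2*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example, points_to_spark2 "$(command -v pyspark)" && echo "pyspark still resolves to Spark2" would flag a leftover Spark2 symlink without modifying anything.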

Resolution

  • Verify Spark2 is no longer required.
  • Remove obsolete Spark2 references, such as leftover Spark2 symlinks.
  • Restart Spark services.

Validate that PySpark and the Spark shell start successfully.

Validation

Start the Spark shell, start PySpark, and then submit a Spark application with dynamic allocation enabled.

Confirm:

  • Executors are allocated successfully.
  • Shuffle operations complete successfully.
  • No Livy or YARN errors are reported.

Best Practices

  • Remove obsolete Spark2 components after migration.
  • Verify shuffle service configuration before enabling dynamic allocation.
  • Validate Spark shell, PySpark, and Spark submit workflows after upgrade.
  • Test production jobs before enabling dynamic allocation in production.

Summary

Spark3 upgrades may expose issues related to external shuffle services, dynamic allocation, and legacy Spark2 references.

Correcting the Spark3 shuffle configuration, validating the shuffle service port, and removing obsolete Spark2 components typically resolve these issues and restore normal Spark operation.
