Celeborn vs YARN External Shuffle Service

Architecture Comparison

The diagram below illustrates the fundamental architectural difference between YARN ESS and Celeborn. ESS scatters shuffle data across the local disks of every compute node, so a stage with M map tasks and N reduce tasks needs on the order of M×N fetch connections. Celeborn consolidates all data for a given partition onto a single shuffle worker, cutting the connection count to roughly M+N.
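
As a back-of-the-envelope illustration of the difference, the snippet below compares the connection counts implied by the two patterns for a few stage sizes (the numbers are illustrative only):

```scala
// Illustrative arithmetic: connection counts implied by the two shuffle patterns,
// assuming a stage with n map tasks and n reduce tasks.
val stageSizes = Seq(100, 1000, 10000)

for (n <- stageSizes) {
  val essConnections      = n.toLong * n // ESS: every reducer fetches from every mapper's node (M×N)
  val celebornConnections = 2L * n       // Celeborn: mappers push, reducers fetch (M+N)
  println(s"n=$n  ESS=$essConnections  Celeborn=$celebornConnections")
}
```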

Feature Comparison

Feature                | YARN ESS                          | Celeborn
-----------------------|-----------------------------------|-------------------------------------
Architecture           | Co-located with compute nodes     | Dedicated shuffle cluster
Data Location          | Local disks on compute nodes      | Remote shuffle workers
Network Pattern        | M×N (mappers × reducers)          | M+N (mappers + reducers)
I/O Pattern            | Random I/O (many small files)     | Sequential I/O (consolidated files)
Replication            | None                              | Configurable (1 or 2 copies)
Node Failure           | Shuffle data lost, stage retries  | Data survives, job continues
Dynamic Allocation     | Limited by shuffle locality       | Full executor elasticity
Storage Disaggregation | Not supported                     | Supported (HDFS, S3)
Multi-Tenant Isolation | No isolation                      | Quota and worker tags supported
High Availability      | Single point of failure           | Raft-based Master HA (3+ nodes)
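
To make the architecture and replication rows concrete, the sketch below shows roughly how a Spark 3.x job is pointed at a Celeborn cluster instead of ESS. The master endpoint is a placeholder, and the exact configuration keys can vary between Celeborn releases, so treat this as a sketch to verify against the Celeborn documentation for your version.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: route a Spark 3.x job's shuffle through Celeborn instead of YARN ESS.
// Hostname, port, and version-specific keys are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("celeborn-shuffle-example")
  // Shuffle manager shipped with the Celeborn Spark client jar.
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
  // Celeborn Master(s) that coordinate the dedicated shuffle workers.
  .config("spark.celeborn.master.endpoints", "celeborn-master-1:9097")
  // Turn off the YARN external shuffle service for this job.
  .config("spark.shuffle.service.enabled", "false")
  // Keep a second copy of each partition so a worker loss does not force stage retries.
  .config("spark.celeborn.client.push.replicate.enabled", "true")
  .getOrCreate()
```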

Performance Comparison

Metric              | YARN ESS                   | Celeborn                   | Improvement
--------------------|----------------------------|----------------------------|--------------------
Shuffle Time        | Baseline                   | 30–50% faster              | Network efficiency
Job Stability       | Node failure = stage retry | Node failure = continue    | Data replication
Disk I/O            | Random small files         | Sequential large files     | 2–5× throughput
Network Connections | O(M×N)                     | O(M+N)                     | Drastically reduced
Memory Pressure     | On compute nodes           | On dedicated shuffle nodes | Resource isolation

When to Use Celeborn vs ESS

Scenario                      | Recommendation
------------------------------|------------------------
Small jobs (< 100 GB shuffle) | ESS may be sufficient
Large jobs (> 100 GB shuffle) | Celeborn recommended
Dynamic allocation required   | Celeborn required
Cloud / Kubernetes deployment | Celeborn recommended
Spot / preemptible instances  | Celeborn required
Multi-tenant cluster          | Celeborn recommended
Frequent node failures        | Celeborn recommended
Disaggregated storage needed  | Celeborn required
Single-user development       | ESS may be sufficient
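
For the dynamic-allocation and spot-instance rows, the sketch below pairs Celeborn with Spark's built-in dynamic allocation settings on Spark 3.x. The endpoint is again a placeholder, and whether spark.dynamicAllocation.shuffleTracking.enabled is still needed depends on the Spark and Celeborn versions in use, so treat the exact combination as an assumption to check.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: dynamic allocation on top of Celeborn (assumes the Celeborn client jar is on the classpath).
val spark = SparkSession.builder()
  .appName("celeborn-dynamic-allocation")
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
  .config("spark.celeborn.master.endpoints", "celeborn-master-1:9097") // placeholder endpoint
  .config("spark.shuffle.service.enabled", "false")                    // no ESS dependency
  .config("spark.dynamicAllocation.enabled", "true")                   // let Spark add and remove executors
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")   // may be unnecessary on some versions
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()
```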

Migration Path

Phase 1 — Pilot

  • Deploy Celeborn alongside existing ESS
  • Test with non-critical workloads that opt in to Celeborn per job (see the sketch after this list)
  • Compare performance metrics in parallel
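
One way to run the pilot is to leave ESS as the cluster-wide default and have only pilot jobs override their shuffle settings. A minimal per-job opt-in might look like the sketch below, reusing the placeholder endpoint from the earlier example.

```scala
import org.apache.spark.sql.SparkSession

// Pilot job opts in to Celeborn; the cluster default (spark-defaults.conf) still points at ESS.
val spark = SparkSession.builder()
  .appName("celeborn-pilot-job")
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
  .config("spark.celeborn.master.endpoints", "celeborn-master-1:9097") // placeholder
  .config("spark.shuffle.service.enabled", "false")                    // bypass ESS for this job only
  .getOrCreate()
```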

Phase 2 — Gradual Rollout

  • Enable Celeborn for specific teams or jobs via worker tags
  • Monitor stability and performance

Phase 3 — Full Migration

  • Make Celeborn the default shuffle service cluster-wide
  • Disable ESS on compute nodes
  • Optimize Celeborn cluster sizing based on observed usage