Introduction to Celeborn

What is Celeborn?

Apache Celeborn is a Remote Shuffle Service (RSS) designed to improve the efficiency, stability, and flexibility of shuffle operations in distributed compute engines. It supports Apache Spark, Apache Flink, Apache Tez, and MapReduce.

Why Celeborn?

Traditional shuffle frameworks have significant limitations that become critical at scale:

| Problem | Traditional Shuffle | Celeborn Solution |
|---|---|---|
| Network Efficiency | M × N connections between Mappers and Reducers | Consolidated M + N connections via Celeborn workers |
| Disk I/O | Random I/O on compute nodes | Sequential I/O on dedicated shuffle nodes |
| Dynamic Allocation | Limited by shuffle data locality | Full executor elasticity |
| Node Failure | Shuffle data lost, job fails or retries | Data replicated; job continues without retry |
| Storage | Large local disks required on compute nodes | Dedicated shuffle storage (local, HDFS, S3) |
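
To give a feel for the Network Efficiency row, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the M × N versus M + N counts quoted in the table: the function names and the example task counts are hypothetical, and the model assumes one logical connection per mapper-reducer pair in the traditional case and one per task in the consolidated case. Real connection counts depend on the engine and on how many Celeborn workers are deployed.

```python
def direct_shuffle_connections(mappers: int, reducers: int) -> int:
    """Traditional shuffle: every reducer fetches from every mapper (M x N)."""
    return mappers * reducers


def consolidated_shuffle_connections(mappers: int, reducers: int) -> int:
    """Simplified remote-shuffle model: each task talks to the shuffle service (M + N)."""
    return mappers + reducers


if __name__ == "__main__":
    # Hypothetical job size, chosen only for illustration.
    m, n = 1000, 500
    print("direct:", direct_shuffle_connections(m, n))
    print("consolidated:", consolidated_shuffle_connections(m, n))
```

Under these assumptions the direct count grows with the product of the two task counts, while the consolidated count grows only with their sum, which is why the gap widens as jobs scale.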

Key Benefits

  • Performance: 2–5× improvement in shuffle-heavy workloads
  • Stability: Data replication prevents job failures from node loss
  • Elasticity: Enables true dynamic resource allocation
  • Disaggregation: Separates compute from shuffle storage
  • Multi-Engine: Supports Spark, Flink, Tez, and MapReduce