Monitoring and Metrics

Prometheus Endpoints

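The endpoint listing for this section is not reproduced here. As a minimal sketch of how the endpoints might be polled, the snippet below builds a metrics URL per node role and fetches the raw Prometheus text. The path `/metrics/prometheus` and the ports 9098 (master) and 9096 (worker) are assumptions, not values taken from this page; verify them against your deployment's configuration.

```python
from typing import Optional
from urllib.request import urlopen

# Assumed values, not confirmed by this page: adjust to your deployment.
METRICS_PATH = "/metrics/prometheus"
ASSUMED_PORTS = {"master": 9098, "worker": 9096}


def metrics_url(host: str, role: str, port: Optional[int] = None) -> str:
    """Build the metrics URL for a master or worker node."""
    if role not in ASSUMED_PORTS:
        raise ValueError(f"unknown role: {role!r}")
    return f"http://{host}:{port or ASSUMED_PORTS[role]}{METRICS_PATH}"


def fetch_metrics(host: str, role: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition from one node."""
    with urlopen(metrics_url(host, role), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_metrics("master-1", "master")` would request `http://master-1:9098/metrics/prometheus`, where `master-1` is a placeholder host name.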

Key Metrics

Master Metrics

Metric                   Description
RegisteredShuffleCount   Total registered shuffles across all applications
RunningApplicationCount  Number of currently active applications
WorkerCount              Total registered workers
AvailableWorkerCount     Workers in a healthy state
ActiveShuffleSize        Total bytes of active shuffle data
IsActiveMaster           1 if this node is the Raft leader, 0 otherwise
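To act on these values programmatically, the Prometheus text output has to be parsed first. The following is a rough sketch, assuming the raw text has already been fetched from a master node. The exposition name `metrics_IsActiveMaster_Value` is a guess at how the `IsActiveMaster` gauge appears on the wire; check the actual output of your masters and pass the real name if it differs.

```python
def parse_samples(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are dropped, so if a metric name repeats with different
    labels, the last sample wins. That is enough for single-valued
    master gauges, but not for per-device or per-label series.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if rest:
            # rest[0] is the value; an optional timestamp may follow.
            samples[name] = float(rest[0])
    return samples


def is_active_master(samples: dict,
                     metric_name: str = "metrics_IsActiveMaster_Value") -> bool:
    """True if the node reports itself as the Raft leader."""
    return samples.get(metric_name) == 1.0
```

Running `is_active_master` against each master in turn is one way to locate the current leader; exactly one node should report 1.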

Worker Metrics

Metric                   Description
ActiveShuffleSize        Bytes of shuffle data currently stored on this worker
ActiveShuffleFileCount   Number of shuffle files on this worker
PausePushDataStatus      Backpressure status; non-zero means the writer is being throttled
DiskBuffer               In-memory buffer used for pending disk writes
NettyMemory              Off-heap memory consumed by the Netty networking layer
DeviceCelebornFreeBytes  Remaining free disk space per device
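Two of these metrics lend themselves to simple alerting: a non-zero `PausePushDataStatus` and a low `DeviceCelebornFreeBytes`. The sketch below shows one possible check over values that have already been collected. The `WorkerSnapshot` type and the 50 GiB threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerSnapshot:
    """Hypothetical container for values scraped from one worker."""
    pause_push_data_status: float
    free_bytes_by_device: dict = field(default_factory=dict)


def worker_alerts(snap: WorkerSnapshot,
                  min_free_bytes: int = 50 * 1024**3) -> list:
    """Return human-readable alerts for one worker snapshot."""
    alerts = []
    # Non-zero means the writer is being throttled (see table above).
    if snap.pause_push_data_status != 0:
        alerts.append("push data paused (backpressure)")
    # Check each device separately, since free space is per device.
    for device, free in sorted(snap.free_bytes_by_device.items()):
        if free < min_free_bytes:
            alerts.append(f"low disk on {device}: {free} bytes free")
    return alerts
```

In practice these conditions are usually better expressed as Prometheus alerting rules; a script like this is mainly useful for ad hoc checks.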

Prometheus Scrape Configuration

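The scrape configuration for this section is not reproduced here. As an illustration only, the helper below renders a `scrape_configs` fragment for `prometheus.yml` with one job for masters and one for workers. The job names, target addresses, and the `/metrics/prometheus` path are assumptions; `job_name`, `metrics_path`, `static_configs`, and `targets` are standard Prometheus configuration keys.

```python
def render_scrape_config(masters, workers, path="/metrics/prometheus"):
    """Render a prometheus.yml scrape_configs fragment as YAML text.

    masters and workers are lists of 'host:port' strings.
    """
    def job(name, targets):
        quoted = ", ".join(f"'{t}'" for t in targets)
        return (
            f"  - job_name: '{name}'\n"
            f"    metrics_path: '{path}'\n"
            f"    static_configs:\n"
            f"      - targets: [{quoted}]\n"
        )

    return (
        "scrape_configs:\n"
        + job("celeborn-master", masters)
        + job("celeborn-worker", workers)
    )


if __name__ == "__main__":
    # Placeholder hosts and ports; replace with your own nodes.
    print(render_scrape_config(["master-1:9098"], ["worker-1:9096"]))
```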

Grafana Dashboards

Import pre-built Grafana dashboards from the Celeborn installation:

  • $CELEBORN_HOME/assets/grafana/celeborn-dashboard.json — Internal Celeborn metrics
  • $CELEBORN_HOME/assets/grafana/celeborn-jvm-dashboard.json — JVM / GC metrics

REST API Reference

Endpoint              Method  Description
/api/v1/masters       GET     List all master nodes and their roles (leader/follower)
/api/v1/workers       GET     List all registered workers and their status
/api/v1/shuffles      GET     List active shuffles with size and partition info
/api/v1/applications  GET     List running applications using Celeborn
/ping                 GET     Health check; returns HTTP 200 when the service is healthy
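A small client for the endpoints listed above might look like the following sketch. It assumes the service is reachable at a base URL such as `http://master-1:9098` (a placeholder) and that the `/api/v1/...` endpoints return JSON; neither detail is stated on this page, so inspect a real response before relying on it.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Paths taken from the table above.
ENDPOINTS = {
    "masters": "/api/v1/masters",
    "workers": "/api/v1/workers",
    "shuffles": "/api/v1/shuffles",
    "applications": "/api/v1/applications",
    "ping": "/ping",
}


def endpoint_url(base_url: str, name: str) -> str:
    """Join a base URL with one of the known endpoint paths."""
    return base_url.rstrip("/") + ENDPOINTS[name]


def is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """True if /ping answers with HTTP 200, False on any error."""
    try:
        with urlopen(endpoint_url(base_url, "ping"), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


def get_json(base_url: str, name: str, timeout: float = 5.0):
    """GET one endpoint and decode the body, assuming it is JSON."""
    with urlopen(endpoint_url(base_url, name), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `get_json("http://master-1:9098", "workers")` would request the worker list, and `is_healthy` can back a liveness probe.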