Monitoring and Metrics

Prometheus Endpoints

    # Master metrics
    http://<master-host>:9098/metrics/prometheus

    # Worker metrics
    http://<worker-host>:9096/metrics/prometheus
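To read these endpoints from a script rather than from Prometheus itself, the response can be parsed as Prometheus text exposition. The following is a minimal sketch using only the Python standard library. The helper names `parse_prometheus_text` and `fetch_metrics` are hypothetical, and the parser only handles simple `name{labels} value` lines. The sample names Celeborn actually emits may carry prefixes, suffixes, or labels beyond the short names listed in the tables on this page, so inspect the raw output before relying on specific keys.

```python
import re
from urllib.request import urlopen

# Matches key="value" pairs inside the {...} label section.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {name: [(labels, value), ...]}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, _, value_part = rest.rpartition("}")
            labels = dict(_LABEL_RE.findall(label_part))
        else:
            name, _, value_part = line.partition(" ")
            labels = {}
        fields = value_part.split()
        if not fields:
            continue  # malformed line with no value
        # The first field is the sample value; an optional timestamp may follow.
        samples.setdefault(name.strip(), []).append((labels, float(fields[0])))
    return samples


def fetch_metrics(host, port, path="/metrics/prometheus", timeout=5.0):
    """Fetch and parse one endpoint, e.g. port 9098 (master) or 9096 (worker)."""
    with urlopen(f"http://{host}:{port}{path}", timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

For example, `fetch_metrics("<master-host>", 9098)` would return a dictionary keyed by sample name, with each entry holding the label set and the numeric value.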

Key Metrics

Master Metrics

Metric	Description
RegisteredShuffleCount	Total registered shuffles across all applications
RunningApplicationCount	Number of currently active applications
WorkerCount	Total registered workers
AvailableWorkerCount	Workers in a healthy state
ActiveShuffleSize	Total bytes of active shuffle data
IsActiveMaster	1 if this node is the Raft leader, 0 otherwise
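
Because IsActiveMaster is 1 only on the Raft leader, comparing its value across all masters is one way to locate the current leader. The sketch below assumes the per-host values have already been collected into a dictionary; the function name `find_active_master` and the choice to treat multiple leaders as an error are illustrative assumptions, not part of Celeborn.

```python
def find_active_master(is_active_by_host):
    """Return the host whose IsActiveMaster value is 1.

    Returns None when no master reports itself as leader (for example,
    while an election is in progress) and raises ValueError when more
    than one does, since that indicates stale or inconsistent samples.
    """
    leaders = [host for host, value in is_active_by_host.items() if value == 1]
    if len(leaders) > 1:
        raise ValueError(f"multiple masters report as leader: {sorted(leaders)}")
    return leaders[0] if leaders else None


# Hypothetical values scraped from three masters.
leader = find_active_master({"master1": 0, "master2": 1, "master3": 0})
print(leader)
```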

Worker Metrics

Metric	Description
ActiveShuffleSize	Bytes of shuffle data currently stored on this worker
ActiveShuffleFileCount	Number of shuffle files on this worker
PausePushDataStatus	Backpressure status — non-zero means writer is being throttled
DiskBuffer	In-memory buffer used for pending disk writes
NettyMemory	Off-heap memory consumed by Netty networking layer
DeviceCelebornFreeBytes	Remaining free disk space per device
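
Two of these metrics lend themselves to simple alerting: a non-zero PausePushDataStatus and a low DeviceCelebornFreeBytes on any device. The sketch below shows one way to turn already-collected values into warnings. The function name `worker_warnings` and the 10 GiB default threshold are assumptions chosen for illustration; pick a threshold that fits your disks.

```python
def worker_warnings(pause_push_data_status, free_bytes_by_device,
                    min_free_bytes=10 * 1024**3):
    """Return human-readable warnings for a single worker.

    pause_push_data_status: value of PausePushDataStatus for the worker.
    free_bytes_by_device: DeviceCelebornFreeBytes keyed by device name.
    min_free_bytes: hypothetical alert threshold (default 10 GiB).
    """
    warnings = []
    if pause_push_data_status != 0:
        warnings.append("push data is paused: the writer is being throttled")
    for device, free in sorted(free_bytes_by_device.items()):
        if free < min_free_bytes:
            warnings.append(f"low disk space on {device}: {free} bytes free")
    return warnings


# Hypothetical readings for one worker with two devices.
for message in worker_warnings(1, {"sda": 5 * 1024**3, "sdb": 200 * 1024**3}):
    print(message)
```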

Prometheus Scrape Configuration

    # prometheus.yml
    scrape_configs:
      - job_name: 'celeborn'
        metrics_path: /metrics/prometheus
        scrape_interval: 15s
        static_configs:
          - targets:
            - 'master1:9098'
            - 'master2:9098'
            - 'master3:9098'
            - 'worker1:9096'
            - 'worker2:9096'
            - 'worker3:9096'

Grafana Dashboards

Import pre-built Grafana dashboards from the Celeborn installation:

$CELEBORN_HOME/assets/grafana/celeborn-dashboard.json — Internal Celeborn metrics
$CELEBORN_HOME/assets/grafana/celeborn-jvm-dashboard.json — JVM / GC metrics

REST API Reference

Endpoint	Method	Description
/api/v1/masters	GET	List all master nodes and their roles (leader/follower)
/api/v1/workers	GET	List all registered workers and their status
/api/v1/shuffles	GET	List active shuffles with size and partition info
/api/v1/applications	GET	List running applications using Celeborn
/ping	GET	Health check — returns HTTP 200 when service is healthy
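
The /ping endpoint can back a simple liveness probe. The sketch below treats an HTTP 200 response as healthy and anything else, including connection errors and timeouts, as unhealthy. The function name `is_healthy` is hypothetical, and the base URL (host and HTTP port) is left to the caller because this page does not state which port serves the REST API.

```python
from urllib.request import urlopen


def is_healthy(base_url, timeout=3.0, opener=urlopen):
    """Return True if GET <base_url>/ping answers with HTTP 200.

    opener defaults to urllib's urlopen and is injectable so the probe
    can be exercised without a live service.
    """
    url = base_url.rstrip("/") + "/ping"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts are all
        # OSError subclasses; any of them counts as unhealthy.
        return False
```

A caller might run `is_healthy("http://<master-host>:<http-port>")` on a schedule and alert after several consecutive failures rather than on the first one.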

Last updated on Apr 23, 2026
