Capacity Planning

A minimum of 3 nodes is required for Kubernetes control-plane functionality.

Capacity requirements are driven by the applications you run on the cluster, such as Spark, Trino, and JupyterHub.

Collect the following metrics for your environment before sizing nodes:

Metric	What to Measure
Concurrent Spark jobs	Peak and average running jobs
Concurrent Trino queries	Maximum sustained query load
Active JupyterHub sessions	Simultaneous notebook users
Data scan volume	Terabytes scanned per hour/day
Processing throughput	Required gigabytes per second
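
As a rough illustration of how the peak and average concurrency figures could be derived, the sketch below computes both from a list of job start and end times. The function name `concurrency_stats` and the `(start, end)` input format are assumptions made for this example; in practice the timestamps would come from whatever job history or query log your environment exposes.

```python
from typing import Iterable, Tuple


def concurrency_stats(intervals: Iterable[Tuple[float, float]]) -> Tuple[int, float]:
    """Return (peak, time-weighted average) concurrency.

    `intervals` holds hypothetical (start, end) timestamps in any consistent
    unit, for example seconds since the start of the observation window.
    """
    events = []
    for start, end in intervals:
        if end <= start:
            continue  # skip empty or malformed intervals
        events.append((start, 1))   # a job starts
        events.append((end, -1))    # a job ends
    if not events:
        return 0, 0.0

    # Sort by time; at equal timestamps, ends (-1) are processed before starts (+1)
    # so that back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    peak = 0
    running = 0
    busy_area = 0.0  # integral of concurrency over time
    prev_time = events[0][0]
    for time, delta in events:
        busy_area += running * (time - prev_time)
        running += delta
        peak = max(peak, running)
        prev_time = time

    window = events[-1][0] - events[0][0]
    return peak, busy_area / window


if __name__ == "__main__":
    jobs = [(0, 10), (5, 15), (10, 20)]
    peak, average = concurrency_stats(jobs)
    print(f"peak={peak}, average={average:.2f}")
```

The same function applies to Trino queries or JupyterHub sessions, as long as each one can be reduced to a start and end time.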

Application	Recommendation
Spark	Allocate 60% to 70% of executor memory for direct processing
Trino	Size memory pools to match anticipated query complexity
JupyterHub	Set per-user memory limits (typically 2 GB to 16 GB)
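
The arithmetic behind these recommendations can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, the JupyterHub estimate assumes every active session may reach its memory limit at the same time, and the Spark helper only applies the 60% to 70% range from the table to a given executor size.

```python
from typing import Tuple


def jupyterhub_memory_gb(peak_sessions: int, per_user_limit_gb: float) -> float:
    """Worst-case memory if every active session reaches its per-user limit."""
    return peak_sessions * per_user_limit_gb


def spark_processing_memory_gb(
    executor_memory_gb: float, low: float = 0.60, high: float = 0.70
) -> Tuple[float, float]:
    """Lower and upper bound of executor memory set aside for direct processing."""
    return executor_memory_gb * low, executor_memory_gb * high


if __name__ == "__main__":
    # Example inputs are illustrative, not recommendations.
    print(jupyterhub_memory_gb(peak_sessions=50, per_user_limit_gb=4))
    low, high = spark_processing_memory_gb(executor_memory_gb=32)
    print(f"{low:.1f} GB to {high:.1f} GB")
```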

Requirement	Specification
Type	NVMe SSD (mandatory for shuffle operations)
Mount protocol	JBOD, with each disk mounted separately (`/mnt/disk1`, `/mnt/disk2`, etc.)
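
A short sketch of how the JBOD layout might be checked and passed on to Spark is shown below. The helper names are hypothetical. `spark.local.dir` is a standard Spark property that accepts a comma-separated list of scratch directories, but whether your deployment reads that property or configures local directories another way (for example through cluster-manager settings) is an assumption to verify.

```python
import os
from typing import Callable, Iterable, List


def jbod_mount_paths(disk_count: int, prefix: str = "/mnt/disk") -> List[str]:
    """Build the expected mount points: /mnt/disk1, /mnt/disk2, ..."""
    return [f"{prefix}{i}" for i in range(1, disk_count + 1)]


def missing_mounts(
    paths: Iterable[str], is_mounted: Callable[[str], bool] = os.path.ismount
) -> List[str]:
    """Return the paths that are not currently mount points on this node."""
    return [p for p in paths if not is_mounted(p)]


def spark_local_dir_value(paths: Iterable[str]) -> str:
    """Join the paths into a comma-separated value, as spark.local.dir accepts."""
    return ",".join(paths)


if __name__ == "__main__":
    paths = jbod_mount_paths(disk_count=4)
    print("missing:", missing_mounts(paths))
    print("spark.local.dir =", spark_local_dir_value(paths))
```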

Use existing object storage (S3, GCS, Azure Blob) as the data lake repository. Object storage is also required to host Spark History Server event logs.

Path	Minimum	Recommended
Inter-node communication	25 Gbps	100 Gbps
Storage network	Dedicated high-bandwidth link to object storage	--
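
To relate these link speeds to the data scan volume you measured, the following sketch converts between data volume, link speed, and time. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and an ideal link with no protocol overhead unless an `efficiency` factor is supplied; the function names and the efficiency parameter are illustrative.

```python
def transfer_seconds(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move `data_tb` terabytes over a `link_gbps` link."""
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second


def required_gbps(scan_tb_per_hour: float) -> float:
    """Sustained link speed (Gbps) needed to scan `scan_tb_per_hour` TB each hour."""
    bits_per_hour = scan_tb_per_hour * 1e12 * 8
    return bits_per_hour / 3600 / 1e9


if __name__ == "__main__":
    for gbps in (25, 100):
        print(f"1 TB over {gbps} Gbps: {transfer_seconds(1, gbps):.0f} s")
    print(f"10 TB/hour needs about {required_gbps(10):.1f} Gbps sustained")
```

Under these idealized assumptions, 1 TB takes 320 seconds at 25 Gbps and 80 seconds at 100 Gbps; real throughput will be lower once overhead and contention are accounted for.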

Consider segregating data-plane and control-plane traffic onto separate networks.
