Connectivity

Spark Connect

Spark Connect is a client-server architecture that decouples Spark client applications from the cluster, enabling remote connectivity through the standard DataFrame API. Developers can connect from an IDE or notebook such as PyCharm, Jupyter, or VS Code, using a lightweight 1.5 MB client instead of the full 355 MB PySpark installation.

Install:


Example:
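
A minimal sketch of connecting from Python. It assumes a Spark Connect server is already listening on the default port 15002 and that a Connect-capable client is installed, for example with `pip install pyspark-client`. The helper names `connect_url` and `run_example` are hypothetical and exist only for illustration.

```python
# Sketch only: assumes a running Spark Connect server and an installed
# Connect-capable client (e.g. the pyspark-client package).


def connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_example(url: str):
    """Open a remote session, run a small DataFrame query, and return its rows."""
    # Imported lazily so this module still loads where the client is absent.
    from pyspark.sql import SparkSession

    # builder.remote() points the session at a Spark Connect endpoint
    # instead of starting a local JVM-backed driver.
    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        df = spark.range(5).selectExpr("id", "id * 2 AS doubled")
        return df.collect()
    finally:
        spark.stop()


# Usage (needs a reachable server):
#   rows = run_example(connect_url())
```

The DataFrame calls are the same ones used with a classic session; only the way the session is built changes.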


Spark Connect improvements in Spark 4.1.1:

| Enhancement | Detail |
| --- | --- |
| Protobuf plan compression | Execution plans compressed with zstd, reducing network overhead for large/complex plans |
| Chunked Arrow result streaming | Query results streamed in chunks over gRPC, improving stability for large result sets |
| Large local relations | Removed the previous 2 GB size limit, enabling DataFrames from large Pandas or in-memory objects |
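
To make "local relation" concrete: when a DataFrame is created from data that lives in the client process, that data has to travel to the server as part of the request. The sketch below shows the client-side call, assuming `spark` is an existing Spark Connect session; the helper name `make_local_dataframe` and the column names are hypothetical.

```python
def make_local_dataframe(spark, n_rows: int):
    """Create a DataFrame from rows held in client memory.

    `spark` is assumed to be a session obtained through Spark Connect.
    The rows below exist only in the client until createDataFrame
    ships them to the server.
    """
    rows = [(i, float(i) * 0.5) for i in range(n_rows)]
    return spark.createDataFrame(rows, schema=["id", "value"])


# Usage (needs a live session):
#   df = make_local_dataframe(spark, 1_000_000)
#   df.count()
```

The same call also accepts a Pandas DataFrame in place of the list of tuples.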

Spark ML on Connect

Spark ML on Spark Connect is now Generally Available for the Python client. A new model size estimation mechanism lets the driver decide how to cache each model: models are kept in memory or spilled to disk based on their estimated size.
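
A minimal sketch of training a model through a Spark Connect session, assuming `spark` was created with `SparkSession.builder.remote(...)` and that NumPy is available on the client for `pyspark.ml.linalg`. The training data and the helper name `train_logistic_regression` are made up for illustration.

```python
# Hypothetical toy data: (feature vector, label) pairs.
TRAINING_ROWS = [
    ([0.0, 1.1], 0.0),
    ([2.0, 1.0], 1.0),
    ([2.2, -1.5], 1.0),
    ([0.1, 1.3], 0.0),
]


def train_logistic_regression(spark, rows=TRAINING_ROWS):
    """Fit a LogisticRegression model remotely and return its predictions."""
    # Imported lazily so this module still loads where the client is absent.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    train = spark.createDataFrame(
        [(Vectors.dense(features), label) for features, label in rows],
        schema=["features", "label"],
    )
    # fit() runs on the server; the client holds a reference to the
    # fitted model rather than the model itself.
    model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)
    return model.transform(train).select("features", "label", "prediction")


# Usage (needs a live session):
#   train_logistic_regression(spark).show()
```

The estimator API is the same as in classic PySpark; the caching behavior described above happens on the server side and needs no client code.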
