Connectivity

Spark Connect

Spark Connect is a client-server architecture that decouples Spark client applications from the cluster, enabling remote connectivity through the standard DataFrame API. Developers can connect from IDEs and notebooks such as PyCharm, Jupyter, and VS Code using a lightweight 1.5 MB client instead of the full 355 MB PySpark installation.

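Install: the lightweight client is published on PyPI as the `pyspark-client` package, so `pip install pyspark-client` is enough to get started.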

Example:
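
A minimal sketch of connecting to a Spark Connect server from Python, assuming a server is already running and reachable. The helper names `build_remote_url` and `run_example` are hypothetical and exist only for this illustration. The `sc://host:port/;key=value` connection-string shape and the default port 15002 follow the Spark Connect documentation. The host name and parameters you pass in depend on your own deployment.

```python
def build_remote_url(host: str, port: int = 15002, **params: str) -> str:
    """Build a Spark Connect connection string (hypothetical helper).

    The result has the form sc://host:port, optionally followed by
    /;key=value;key=value for connection parameters such as use_ssl or token.
    """
    url = f"sc://{host}:{port}"
    if params:
        url += "/;" + ";".join(f"{key}={value}" for key, value in params.items())
    return url


def run_example(remote_url: str) -> int:
    """Connect to a running Spark Connect server and run a small query."""
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    # builder.remote() creates a Spark Connect session instead of a local JVM.
    spark = SparkSession.builder.remote(remote_url).getOrCreate()
    try:
        # The plan is built on the client and executed on the remote cluster.
        even_ids = spark.range(10).filter("id % 2 = 0")
        even_ids.show()
        return even_ids.count()
    finally:
        spark.stop()
```

With a server listening locally, `run_example(build_remote_url("localhost"))` would run the query remotely. Nothing here starts a server, so the call fails if none is reachable.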


Spark Connect improvements in Spark 4.1.1:

| Enhancement | Detail |
| --- | --- |
| Protobuf plan compression | Execution plans are compressed with zstd, reducing network overhead for large or complex plans |
| Chunked Arrow result streaming | Query results are streamed in chunks over gRPC, improving stability for large result sets |
| Large local relations | The previous 2 GB size limit is removed, enabling DataFrames built from large Pandas or other in-memory objects |

Spark ML on Connect

Spark ML on Spark Connect is now generally available (GA) for the Python client. A new model size estimation mechanism drives model caching on the driver: each model is kept in memory or spilled to disk depending on its estimated size.
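
A minimal sketch of training a model through a Spark Connect session, assuming a reachable server and that `VectorAssembler` and `LogisticRegression` are among the estimators supported over Connect in your Spark version. The function name `train_on_connect` and the toy rows in `TRAINING_ROWS` are hypothetical and exist only for illustration. The driver-side model caching described above happens on the server and is not something this client code controls.

```python
# Toy training data (hypothetical): label, then two numeric features.
TRAINING_ROWS = [
    (0.0, 0.1, 1.2),
    (0.0, 0.3, 0.9),
    (1.0, 2.1, 0.2),
    (1.0, 1.8, 0.4),
]


def train_on_connect(remote_url: str) -> list:
    """Fit a logistic regression over Spark Connect and return predictions."""
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(remote_url).getOrCreate()
    try:
        df = spark.createDataFrame(TRAINING_ROWS, ["label", "x1", "x2"])

        # Combine the numeric columns into the single vector column ML expects.
        assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
        assembled = assembler.transform(df)

        # fit() runs on the cluster; the client holds a reference to the model.
        model = LogisticRegression(maxIter=10, regParam=0.01).fit(assembled)

        rows = model.transform(assembled).select("prediction").collect()
        return [row["prediction"] for row in rows]
    finally:
        spark.stop()
```

The code is the same as it would be against a local session. Only the `remote(...)` call on the session builder changes where the work runs.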
