Connectivity

Spark Connect

Spark Connect is a client-server architecture that decouples Spark client applications from the cluster, enabling remote connectivity using the standard DataFrame API. Developers can connect from any IDE — PyCharm, Jupyter, VS Code — using a lightweight 1.5 MB client instead of the full 355 MB PySpark installation.

Install:

    pip install pyspark-client

Example:

    from pyspark.sql import SparkSession

    # Connect to a remote Spark Connect server (default port 15002).
    spark = (
        SparkSession.builder
        .appName("Spark Connect Example")
        .remote("sc://your-spark-server:15002")
        .getOrCreate()
    )

    # The path is resolved on the server side, not on the client machine.
    df = spark.read.format("csv").option("header", "true").load("/path/to/data.csv")
    df.show()
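The address passed to the session builder is a Spark Connect connection string. As a minimal sketch, assuming the documented `sc://host:port/;key=value` layout, the hypothetical helper `build_connect_url` below (not part of PySpark) assembles such a string. The `use_ssl` and `token` parameters and the token value are illustrative only.

```python
from urllib.parse import quote


def build_connect_url(host, port=15002, **params):
    """Build a connection string of the form sc://host:port/;key=value;key=value.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    base = f"sc://{host}:{port}"
    if not params:
        return base
    # Optional parameters follow "/;" and are separated by semicolons.
    pairs = ";".join(
        f"{key}={quote(str(value), safe='')}" for key, value in params.items()
    )
    return f"{base}/;{pairs}"


# Placeholder host and token; substitute real values for an actual server.
url = build_connect_url("your-spark-server", use_ssl="true", token="ABCDEFG")
print(url)  # sc://your-spark-server:15002/;use_ssl=true;token=ABCDEFG
```

The resulting string can be passed to the session builder in place of a hand-written address.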

Spark Connect improvements in Spark 4.1.1:

Enhancement	Detail
Protobuf plan compression	Execution plans compressed with zstd, reducing network overhead for large/complex plans
Chunked Arrow result streaming	Query results streamed in chunks over gRPC, improving stability for large result sets
Large local relations	Removed the previous 2 GB size limit, enabling DataFrames from large Pandas or in-memory objects
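The first two rows of the table can be illustrated with a small standalone sketch. It is not Spark's implementation: `zlib` from the Python standard library stands in for zstd, the plan text and result payload are made-up byte strings, and `compress_plan` and `iter_chunks` are hypothetical names. It shows only why a repetitive plan shrinks under compression and how a large payload can be delivered in bounded pieces and reassembled.

```python
import zlib


def compress_plan(plan_bytes):
    """Compress a serialized plan (zlib is a stand-in for zstd here)."""
    return zlib.compress(plan_bytes)


def iter_chunks(payload, chunk_size):
    """Yield consecutive slices of at most chunk_size bytes."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


# A made-up, highly repetitive "plan", as deeply nested plans tend to be.
plan = b"Project [id, name] <- Filter (id > 10) <- Scan table\n" * 1000
compressed = compress_plan(plan)
print(len(plan), "bytes before,", len(compressed), "bytes after compression")

# A made-up 10,240-byte result payload, delivered in 4,096-byte chunks.
result = bytes(range(256)) * 40
chunks = list(iter_chunks(result, 4096))
reassembled = b"".join(chunks)
print(len(chunks), "chunks, reassembled correctly:", reassembled == result)
```

Bounding the size of each message is what keeps a single oversized response from destabilizing the connection; the receiver only ever holds one chunk in flight.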

Spark ML on Connect

Spark ML on Spark Connect is now Generally Available for the Python client. A new model size estimation mechanism allows intelligent model caching on the driver — models are cached in memory or spilled to disk based on estimated size.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import SparkSession

    # Connect to the remote Spark Connect server.
    spark = SparkSession.builder.remote("sc://your-spark-server:15002").getOrCreate()

    # LIBSVM input already provides the "label" and "features" columns.
    data = spark.read.format("libsvm").load("/path/to/data.txt")

    # Training runs on the server; the client holds a reference to the fitted model.
    lr = LogisticRegression(maxIter=10, regParam=0.3)
    model = lr.fit(data)
    print(model.coefficients)
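The size-based caching decision described above can be sketched in a few lines. This is a toy illustration, not Spark's mechanism: the `SizeAwareCache` class, its 1 KB threshold, and the use of pickled length as the size estimate are all assumptions made for the example.

```python
import os
import pickle
import tempfile


class SizeAwareCache:
    """Toy cache: small objects stay in memory, large ones are spilled to disk."""

    def __init__(self, memory_limit_bytes):
        self.memory_limit_bytes = memory_limit_bytes
        self._in_memory = {}
        self._on_disk = {}
        self._spill_dir = tempfile.mkdtemp(prefix="model-cache-")

    def put(self, key, obj):
        """Store obj and report where it went: "memory" or "disk"."""
        payload = pickle.dumps(obj)
        # Pickled length is a crude stand-in for a real size estimate.
        if len(payload) <= self.memory_limit_bytes:
            self._in_memory[key] = obj
            return "memory"
        path = os.path.join(self._spill_dir, f"{key}.pkl")
        with open(path, "wb") as handle:
            handle.write(payload)
        self._on_disk[key] = path
        return "disk"

    def get(self, key):
        """Return the cached object, reading it back from disk if it was spilled."""
        if key in self._in_memory:
            return self._in_memory[key]
        with open(self._on_disk[key], "rb") as handle:
            return pickle.load(handle)


cache = SizeAwareCache(memory_limit_bytes=1024)
small = cache.put("small", [0.1, 0.2, 0.3])       # a few dozen bytes
large = cache.put("large", list(range(10_000)))   # tens of kilobytes
print(small, large)  # memory disk
```

Estimating the size before deciding where to place an object is what lets a cache of this kind hold many small models cheaply without letting one large model exhaust memory.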

Last updated on May 14, 2026
