Data Sources and Extensions

Python Data Source APIs

Spark 4.1.1 supports creating custom data sources and sinks entirely in Python — no Scala or Java required — for both batch and streaming queries.

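    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

    class CustomDataSource(DataSource):
        @classmethod
        def name(cls):
            return "custom_source"

        def schema(self):
            # Example schema; replace with the columns your source produces
            return "id INT, value STRING"

        def reader(self, schema):
            return CustomReader(schema, self.options)

    class CustomReader(DataSourceReader):
        def __init__(self, schema, options):
            self.schema = schema
            self.options = options

        def partitions(self):
            # A single partition, so read() is called once
            return [InputPartition(0)]

        def read(self, partition):
            # _load_records is a placeholder for your own loader; it should yield tuples matching the schema
            for row in self._load_records(self.options.get("path")):
                yield row

    # The format name comes from name(), so register() takes only the class
    spark.dataSource.register(CustomDataSource)
    df = spark.read.format("custom_source").option("path", "/path/to/data").load()
    df.show()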
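
The reader above returns a single partition, so one task produces every row. As a rough sketch of how the work might be split across several tasks, the example below plans contiguous row ranges in plain Python and wraps each range in an `InputPartition`. The names `plan_ranges`, `rows_for_range`, `make_range_source`, the `range_sketch` format, and the `totalRows` and `numPartitions` options are all hypothetical and chosen for illustration. `pyspark` is imported lazily inside the factory so the planning helpers can be tried without a Spark installation.

```python
def plan_ranges(total_rows, num_partitions):
    """Split [0, total_rows) into contiguous (start, end) ranges."""
    base, extra = divmod(total_rows, num_partitions)
    ranges, start = [], 0
    for i in range(num_partitions):
        # The first `extra` ranges take one additional row each
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def rows_for_range(start, end):
    """Yield one (id, value) tuple per row in the half-open range."""
    for i in range(start, end):
        yield (i, f"row-{i}")


def make_range_source():
    """Build the data source class; pyspark is only needed when this is called."""
    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

    class RangeReader(DataSourceReader):
        def __init__(self, options):
            # Hypothetical options, passed as strings via .option(...)
            self.total = int(options.get("totalRows", "10"))
            self.parts = int(options.get("numPartitions", "3"))

        def partitions(self):
            # One InputPartition per planned range
            return [InputPartition(r) for r in plan_ranges(self.total, self.parts)]

        def read(self, partition):
            # Called once per partition with the value planned above
            start, end = partition.value
            yield from rows_for_range(start, end)

    class RangeSource(DataSource):
        @classmethod
        def name(cls):
            return "range_sketch"

        def schema(self):
            return "id INT, value STRING"

        def reader(self, schema):
            return RangeReader(self.options)

    return RangeSource


print(plan_ranges(10, 3))
```

With a running session, this could be registered with `spark.dataSource.register(make_range_source())` and read with `spark.read.format("range_sketch").option("numPartitions", "4").load()`.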

XML Connector

Built-in XML support for reading and writing XML files with configurable row and root tag options.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("XML Connector").getOrCreate()

    # Each <record> element becomes one row
    df = spark.read \
        .format("xml") \
        .option("rowTag", "record") \
        .load("/path/to/data.xml")

    # Rows are written as <record> elements inside a <records> root
    df.write \
        .format("xml") \
        .option("rootTag", "records") \
        .option("rowTag", "record") \
        .save("/path/to/output.xml")
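
To make the row and root tag options concrete, the sketch below uses only Python's standard library to show the mapping they describe: the root tag wraps the whole document, and each element named by the row tag becomes one record. This is an illustration of the idea, not how Spark parses XML. The sample document and the `rows_from_xml` helper are hypothetical, and unlike Spark this sketch leaves every value as a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "records" plays the rootTag role, "record" the rowTag role
SAMPLE = """<records>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</records>"""


def rows_from_xml(text, row_tag):
    """Return one dict per element named row_tag, keyed by child tag."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in elem}
        for elem in root.iter(row_tag)
    ]


rows = rows_from_xml(SAMPLE, "record")
for row in rows:
    print(row)
```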

Data Source V2 (DSV2)

DSV2 provides a cleaner, higher-performance data source API with support for constraint pushdown, replacing older V1 APIs.

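    from pyspark.sql.datasource import DataSource

    # MyDSV2Reader and MyDSV2Writer are placeholders for your own
    # DataSourceReader and DataSourceWriter subclasses
    class CustomDSV2(DataSource):
        @classmethod
        def name(cls):
            return "custom_dsv2"

        def schema(self):
            return "id INT, value STRING"  # example schema

        def reader(self, schema):
            return MyDSV2Reader(schema, self.options)

        def writer(self, schema, overwrite):
            return MyDSV2Writer(self.options)

    # The format name comes from name(), so register() takes only the class
    spark.dataSource.register(CustomDSV2)
    df = spark.read.format("custom_dsv2").option("option_key", "value").load()
    df.write.format("custom_dsv2").option("option_key", "value").mode("append").save()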
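
The example above refers to a writer class without defining it. As a minimal sketch of what the write side might look like, the code below defines a sink that only counts rows: each task returns a commit message from `write`, and the driver aggregates the messages in `commit`. The names `count_rows`, `total_committed`, `make_counting_sink`, `CountMessage`, and the `counting_sink` format are hypothetical. `pyspark` is imported lazily inside the factory so the helpers can be exercised on their own.

```python
from dataclasses import dataclass


def count_rows(rows):
    """Count the rows in one partition's iterator."""
    return sum(1 for _ in rows)


def total_committed(counts):
    """Sum per-task counts, skipping tasks that reported nothing (None)."""
    return sum(c for c in counts if c is not None)


def make_counting_sink():
    """Build a write-only data source; pyspark is only needed when this is called."""
    from pyspark.sql.datasource import (
        DataSource,
        DataSourceWriter,
        WriterCommitMessage,
    )

    @dataclass
    class CountMessage(WriterCommitMessage):
        count: int  # rows handled by one write task

    class CountingWriter(DataSourceWriter):
        def write(self, iterator):
            # Runs on executors, once per partition
            return CountMessage(count=count_rows(iterator))

        def commit(self, messages):
            # Runs on the driver after all tasks have succeeded
            counts = [m.count if m is not None else None for m in messages]
            print(f"committed {total_committed(counts)} rows")

        def abort(self, messages):
            # Runs on the driver if any task failed; clean up partial output here
            print("write aborted")

    class CountingSink(DataSource):
        @classmethod
        def name(cls):
            return "counting_sink"

        def writer(self, schema, overwrite):
            return CountingWriter()

    return CountingSink


print(total_committed([2, None, 3]))
```

With a running session, the sink could be registered with `spark.dataSource.register(make_counting_sink())` and used as `df.write.format("counting_sink").mode("append").save()`.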

Delta Lake 4.0

Delta Lake 4.0, compatible with Spark 4.1.1, introduces major lakehouse improvements.

Feature	Description
Delta Connect	Full Spark Connect support for Delta operations
Coordinated Commits	Multi-cloud concurrent write support
Liquid Clustering	Adaptive clustering for faster reads and writes
Time Travel	Query historical versions by timestamp or version

    CREATE TABLE orders (id INT, amount DOUBLE, status STRING) USING delta;
    INSERT INTO orders VALUES (1, 99.99, 'pending'), (2, 149.50, 'completed');

    -- Time travel
    SELECT * FROM orders VERSION AS OF 0;
    SELECT * FROM orders TIMESTAMP AS OF '2025-01-01T00:00:00';

    -- Optimize with liquid clustering
    ALTER TABLE orders CLUSTER BY (status);
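
When time travel queries are issued from Python, it can help to build the statement in one place so that a version and a timestamp are never passed together. The helper below is a small sketch under that assumption. `time_travel_query` is a hypothetical name, and it performs no escaping, so the table name and timestamp are assumed to come from trusted code rather than user input.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time travel SELECT for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        # int() rejects anything that is not a whole version number
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    # The timestamp is inserted verbatim; it must be a trusted string
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


print(time_travel_query("orders", version=0))
print(time_travel_query("orders", timestamp="2025-01-01T00:00:00"))
```

The resulting string could then be passed to `spark.sql(...)` in a session where Delta Lake is configured.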

Last updated on May 14, 2026
