PySpark Improvements

Arrow-Native UDFs (@arrow_udf)

Spark 4.1.1 introduces the @arrow_udf decorator for scalar functions that accept and return pyarrow.Array objects directly — bypassing Pandas conversion overhead entirely.


Arrow-Native UDTFs (@arrow_udtf)

The @arrow_udtf decorator enables table functions that process entire pyarrow.RecordBatch objects at once, rather than row-by-row — dramatically faster for splitting and exploding operations.


Python Worker Logging

Debugging Python UDFs has historically been difficult because logs get lost in executor stdout/stderr. Spark 4.1.1 introduces dedicated UDF log capture, exposable via a built-in table-valued function.

Enable:

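A sketch; the configuration key here is an assumption taken from the 4.1 release notes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn on capture of Python UDF worker logs (config key assumed)
spark.conf.set("spark.sql.pyspark.worker.logging.enabled", "true")
```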

Example:


Python Data Source API with Filter Pushdown

The Python Data Source API (for custom data sources written in Python) gains filter pushdown in Spark 4.1.1 — the data source can now receive query predicates from the optimizer and apply them at the source, reducing data movement.


Python UDTFs

Python UDTFs produce multiple rows per input row — ideal for exploding, splitting, or generating records from a single input.


Pandas 2.x Support

Spark 4.1.1 supports pandas 2.x with Arrow-backed conversions for fast, zero-copy exchange between pandas and Spark DataFrames.


PySpark UDF Unified Profiler

Profiles PySpark UDFs for CPU and memory usage to identify bottlenecks.

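A sketch; the profiler values ("perf", "memory") and the `spark.profile` accessor are taken from the PySpark profiling docs and should be verified for your version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "perf" profiles CPU time; "memory" profiles allocations
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

# ... run queries that call Python UDFs ...

spark.profile.show(type="perf")   # per-UDF results on the driver
spark.profile.clear()             # reset collected profiles
```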