SQL Features

ANSI Mode (Enabled by Default)

ANSI mode is on by default in Spark 4.1.1, aligning Spark SQL with the ANSI SQL standard. Operations that used to fail silently (overflowing arithmetic, malformed casts, division by zero) now raise errors instead of returning NULL or a wrapped value.

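A minimal sketch of inspecting and overriding the setting for one session, assuming the standard `spark.sql.ansi.enabled` configuration key:

```sql
-- Show the current value of the ANSI flag for this session.
SET spark.sql.ansi.enabled;

-- Fall back to the legacy Spark 3.x behavior while migrating (session-scoped).
SET spark.sql.ansi.enabled = false;

-- Re-enable ANSI mode once the affected queries have been updated.
SET spark.sql.ansi.enabled = true;
```

Turning the flag off is best treated as a temporary migration aid, since it restores the silent NULL and wrap-around results described in this section.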

Behavior changes vs Spark 3.x:

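| Operation | Spark 3.x | Spark 4.1.1 (ANSI default) |
| --- | --- | --- |
| Integer overflow | Silent wrap-around | Throws ArithmeticException (ARITHMETIC_OVERFLOW) |
| Invalid type cast | Returns NULL | Throws a runtime cast error (CAST_INVALID_INPUT) |
| Division by zero | Returns NULL | Throws ArithmeticException (DIVIDE_BY_ZERO) |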
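
The three rows above can be reproduced with literal expressions. This is a sketch rather than a migration recipe; it also shows the built-in `try_add`, `try_cast`, and `try_divide` functions, which return NULL instead of failing and are one way to keep the old tolerant behavior for individual expressions.

```sql
-- With ANSI mode on, each of these statements fails at runtime.
SELECT 2147483647 + 1;        -- INT overflow
SELECT CAST('abc' AS INT);    -- malformed cast input
SELECT 1 / 0;                 -- division by zero

-- The try_* variants return NULL for the same inputs.
SELECT try_add(2147483647, 1);
SELECT try_cast('abc' AS INT);
SELECT try_divide(1, 0);
```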

SQL Scripting (GA)

SQL Scripting is now Generally Available and enabled by default. Scripts are written as BEGIN ... END compound statements and bring loops, conditionals, variables, and error handling directly into SQL.

New in 4.1.1: CONTINUE HANDLER for error recovery and multi-variable DECLARE syntax.

Example — control flow:

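A minimal sketch of a script that combines a local variable, a WHILE loop, and an IF branch. The variable names are hypothetical, and the assignments use the `SET VAR` form on the assumption that it applies to script-local variables in the same way as to session variables.

```sql
BEGIN
  DECLARE counter INT DEFAULT 0;
  DECLARE total INT DEFAULT 0;

  -- Add up the even numbers from 1 to 10.
  WHILE counter < 10 DO
    SET VAR counter = counter + 1;
    IF counter % 2 = 0 THEN
      SET VAR total = total + counter;
    END IF;
  END WHILE;

  -- 2 + 4 + 6 + 8 + 10 = 30
  SELECT total AS sum_of_evens;
END;
```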

Example — error handling with CONTINUE HANDLER:

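A hedged sketch of a handler that records an error and lets the script carry on. It assumes that `CONTINUE` takes the same position as `EXIT` in the handler declaration and that the handler can be bound to the `DIVIDE_BY_ZERO` error condition by name; `handled_errors` is a hypothetical variable.

```sql
BEGIN
  DECLARE handled_errors INT DEFAULT 0;

  -- Assumed syntax: on DIVIDE_BY_ZERO, run the handler body,
  -- then resume with the statement after the one that failed.
  DECLARE CONTINUE HANDLER FOR DIVIDE_BY_ZERO
  BEGIN
    SET VAR handled_errors = handled_errors + 1;
  END;

  SELECT 1 / 0;                               -- fails under ANSI mode and triggers the handler
  SELECT handled_errors AS errors_recovered;  -- still runs because the handler is CONTINUE
END;
```

With an EXIT handler the compound statement would end after the handler body, so the final SELECT would not run.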

VARIANT Data Type (GA)

The VARIANT data type is now Generally Available, providing a standardized way to store semi-structured data such as JSON without a rigid schema. A major performance enhancement in 4.1.1 is shredding: commonly queried fields within a VARIANT column are automatically extracted and stored as typed Parquet columns, so queries that touch only those fields read far less data.

Performance benchmarks (shredded vs alternatives):

| Comparison | Read Performance Gain |
| --- | --- |
| VARIANT with shredding vs non-shredded VARIANT | 8x faster |
| VARIANT with shredding vs JSON strings | 30x faster |
| Write performance (trade-off) | 20–50% slower writes |

Example:

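A minimal sketch of building and querying VARIANT values with `parse_json`, `variant_get`, and `try_variant_get`. The `events` view and its JSON payloads are hypothetical, and because a temporary view is never written to Parquet, the example illustrates the query syntax only, not shredding.

```sql
-- Hypothetical sample data: JSON text parsed into a VARIANT column.
CREATE OR REPLACE TEMPORARY VIEW events AS
SELECT id, parse_json(raw) AS payload
FROM VALUES
  (1, '{"user": {"name": "ada", "age": 36}, "tags": ["a", "b"]}'),
  (2, '{"user": {"name": "grace"}, "tags": []}')
AS t(id, raw);

-- Extract typed fields by JSON path; a missing path yields NULL.
SELECT
  id,
  variant_get(payload, '$.user.name', 'string') AS user_name,
  try_variant_get(payload, '$.user.age', 'int') AS user_age
FROM events;
```

`try_variant_get` also returns NULL when the value cannot be cast to the requested type, which is convenient for fields whose shape varies between rows.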

Recursive CTE

Spark 4.1.1 adds native support for Recursive Common Table Expressions, enabling traversal of hierarchical data structures — org charts, bill of materials, graph topologies — directly in SQL.

Example — org chart traversal:

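A minimal sketch of walking a reporting hierarchy with `WITH RECURSIVE`. The `employees` view, its columns, and its rows are hypothetical sample data.

```sql
-- Hypothetical org chart: each row points at its manager.
CREATE OR REPLACE TEMPORARY VIEW employees AS
SELECT * FROM VALUES
  (1, 'Ada',   CAST(NULL AS INT)),
  (2, 'Grace', 1),
  (3, 'Linus', 2),
  (4, 'Edsger', 1)
AS t(id, name, manager_id);

WITH RECURSIVE org_chart (id, name, manager_id, depth) AS (
  -- Anchor: employees with no manager sit at depth 0.
  SELECT id, name, manager_id, 0
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  -- Recursive step: attach direct reports of the rows found so far.
  SELECT e.id, e.name, e.manager_id, o.depth + 1
  FROM employees e
  JOIN org_chart o ON e.manager_id = o.id
)
SELECT id, name, depth
FROM org_chart
ORDER BY depth, id;
```

The recursion stops when the recursive step produces no new rows, so the data must be free of cycles (or the query must carry its own depth limit in the WHERE clause).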

Approximate Data Sketches

Spark 4.1.1 expands approximate aggregation beyond HyperLogLog with two new native sketch types for efficient approximate analytics on massive datasets.

| Sketch Type | SQL Function | Use Case |
| --- | --- | --- |
| KLL (Quantiles) | kll_sketch_agg, kll_sketch_percentile | Approximate percentiles/quantiles with minimal memory |
| Theta | theta_sketch_agg, theta_sketch_distinct | Approximate set operations (union, intersection, difference) |

KLL example — approximate percentiles:

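A hedged sketch of a percentile estimate. The function names are assumptions: it uses the type-suffixed forms `kll_sketch_agg_bigint` and `kll_sketch_get_quantile_bigint`, which may differ from the shorthand names in the table above, so check them against the built-in function reference for your Spark version. The `requests` view is hypothetical.

```sql
-- Hypothetical data: 1000 request latencies, 1..1000 ms.
CREATE OR REPLACE TEMPORARY VIEW requests AS
SELECT id AS latency_ms FROM range(1, 1001);

-- Assumed API: build a KLL sketch over a BIGINT column,
-- then read the value at rank 0.95 (the approximate p95).
SELECT
  kll_sketch_get_quantile_bigint(kll_sketch_agg_bigint(latency_ms), 0.95) AS p95_latency_ms
FROM requests;
```

The result is an approximation, so it should land near 950 for this data rather than match it exactly.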

Theta example — approximate distinct counts across datasets:

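A hedged sketch of combining two Theta sketches. Besides `theta_sketch_agg`, it assumes the companion functions `theta_union`, `theta_intersection`, and `theta_sketch_estimate`, named by analogy with the existing `hll_sketch_*` family; verify them against the built-in function reference. The two views are hypothetical.

```sql
-- Hypothetical data: two overlapping sets of user IDs.
CREATE OR REPLACE TEMPORARY VIEW web_visits AS
SELECT id AS user_id FROM range(0, 1000);

CREATE OR REPLACE TEMPORARY VIEW app_visits AS
SELECT id AS user_id FROM range(500, 1500);

-- Build one sketch per dataset, then combine the sketches
-- instead of re-scanning the raw rows.
WITH sketches AS (
  SELECT
    (SELECT theta_sketch_agg(user_id) FROM web_visits) AS web,
    (SELECT theta_sketch_agg(user_id) FROM app_visits) AS app
)
SELECT
  theta_sketch_estimate(theta_union(web, app))        AS users_on_either,
  theta_sketch_estimate(theta_intersection(web, app)) AS users_on_both
FROM sketches;
```

Because the sketches are small binary values, they can also be stored per day or per segment and merged later, which is the main advantage over recomputing exact distinct counts.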

Collation Support

Spark 4.1.1 supports string collation, allowing locale-aware and case-insensitive string comparisons — essential for multilingual applications.

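A minimal sketch of the `COLLATE` clause, assuming the built-in collation names `UTF8_LCASE` (case-insensitive) and `UNICODE_CI` (locale-aware, case-insensitive). The `names` view is hypothetical.

```sql
-- Default collation (UTF8_BINARY) compares bytes, so case matters.
SELECT 'Spark' = 'SPARK' AS binary_equal;                        -- false

-- An explicit case-insensitive collation on one operand applies to the comparison.
SELECT ('Spark' COLLATE UTF8_LCASE) = 'SPARK' AS lcase_equal;    -- true

-- Hypothetical data for a sort comparison.
CREATE OR REPLACE TEMPORARY VIEW names AS
SELECT * FROM VALUES ('apple'), ('Banana'), ('cherry') AS t(name);

-- Binary ordering puts uppercase 'Banana' first;
-- UNICODE_CI orders the names alphabetically regardless of case.
SELECT name FROM names ORDER BY name;
SELECT name FROM names ORDER BY name COLLATE UNICODE_CI;
```

A collation can also be declared on a column, for example `name STRING COLLATE UNICODE_CI` in a CREATE TABLE statement, so that every comparison on that column uses it without repeating the clause.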