SQL Features
ANSI Mode (Enabled by Default)
ANSI mode is on by default in Spark 4.1.1, aligning Spark SQL with ANSI SQL standards. It enforces stricter rules for NULL handling, type conversions, and arithmetic.
```
spark.sql.ansi.enabled=true  # default in Spark 4.1.1
```

Behavior changes vs Spark 3.x:
| Operation | Spark 3.x | Spark 4.1.1 (ANSI default) |
|---|---|---|
| Integer overflow | Silent wrap-around | Throws ArithmeticException |
| Invalid type cast | Returns null | Throws a runtime cast error (CAST_INVALID_INPUT) |
| Division by zero | Returns null | Throws ArithmeticException |
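When the legacy null-on-error behavior is still wanted for an individual expression, Spark's `try_*` function family (`try_add`, `try_divide`, `try_cast`, and friends) returns NULL on failure even with ANSI mode enabled:

```sql
-- Per-expression opt-out from ANSI failures via try_* variants
SELECT try_add(2147483647, 1);   -- NULL instead of ArithmeticException
SELECT try_divide(10, 0);        -- NULL instead of ArithmeticException
SELECT try_cast('abc' AS INT);   -- NULL instead of a cast error
```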
```sql
-- Throws ArithmeticException in Spark 4.1.1
SELECT 2147483647 + 1;
```

SQL Scripting (GA)
SQL Scripting is now Generally Available and enabled by default, turning Spark SQL into a fully programmable environment with loops, conditionals, variables, and error handling — directly in SQL.
New in 4.1.1: CONTINUE HANDLER for error recovery and multi-variable DECLARE syntax.
Example — control flow:
```sql
BEGIN
  DECLARE total_count INT DEFAULT 0;
  DECLARE dept_id INT DEFAULT 100;

  SET total_count = (SELECT COUNT(*) FROM employees WHERE department_id = dept_id);

  IF total_count > 50 THEN
    INSERT INTO large_departments VALUES (dept_id, total_count);
  ELSE
    INSERT INTO small_departments VALUES (dept_id, total_count);
  END IF;
END;
```

Example — error handling with CONTINUE HANDLER:
```sql
BEGIN
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
    INSERT INTO error_log VALUES (CURRENT_TIMESTAMP, 'Error occurred');

  INSERT INTO target_table SELECT * FROM source_table;
END;
```

VARIANT Data Type (GA)
The VARIANT data type is now Generally Available, providing a standardized way to store semi-structured data like JSON without rigid schemas. A major performance enhancement in 4.1.1 is shredding — commonly queried fields within a VARIANT column are automatically extracted and stored as typed Parquet columns, dramatically reducing I/O.
Performance benchmarks (shredded vs alternatives):
| Comparison | Performance Impact |
|---|---|
| VARIANT with shredding vs non-shredded VARIANT | 8x faster |
| VARIANT with shredding vs JSON strings | 30x faster |
| Write performance (trade-off) | 20–50% slower writes |
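As a sketch of typed extraction (assuming a table like `events(id BIGINT, payload VARIANT)`), the `variant_get` function pulls out a field and casts it to a target type in one step; field paths accessed this way are exactly what shredding can serve from dedicated Parquet columns:

```sql
-- Extract payload.score as an INT; returns NULL where the path is absent
SELECT id, variant_get(payload, '$.score', 'int') AS score
FROM events;
```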
Example:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Variant Example").getOrCreate()

# Create table with VARIANT column
spark.sql("""
    CREATE TABLE events (
        id BIGINT,
        payload VARIANT
    ) USING PARQUET
""")

# Insert semi-structured data
spark.sql("""
    INSERT INTO events VALUES
        (1, parse_json('{"user": "alice", "action": "login", "score": 95}')),
        (2, parse_json('{"user": "bob", "action": "purchase", "amount": 49.99}'))
""")

# Query specific fields
spark.sql("SELECT id, payload:user, payload:action FROM events").show()
```

Recursive CTE
Spark 4.1.1 adds native support for Recursive Common Table Expressions, enabling traversal of hierarchical data structures — org charts, bill of materials, graph topologies — directly in SQL.
Example — org chart traversal:
```sql
WITH RECURSIVE org_hierarchy AS (
    -- Anchor: start with top-level managers
    SELECT employee_id, name, manager_id, 0 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive: find each manager's direct reports
    SELECT e.employee_id, e.name, e.manager_id, oh.level + 1
    FROM employees e
    JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT * FROM org_hierarchy ORDER BY level, name;
```

Approximate Data Sketches
Spark 4.1.1 expands approximate aggregation beyond HyperLogLog with two new native sketch types for efficient approximate analytics on massive datasets.
| Sketch Type | SQL Function | Use Case |
|---|---|---|
| KLL (Quantiles) | kll_sketch_agg, kll_sketch_percentile | Approximate percentiles/quantiles with minimal memory |
| Theta | theta_sketch_agg, theta_sketch_distinct | Approximate set operations (union, intersection, difference) |
KLL example — approximate percentiles:
```sql
-- Build a KLL sketch over a large column
SELECT kll_sketch_agg(response_time_ms) AS response_sketch
FROM request_logs;

-- Query p50, p95, p99 from the sketch
SELECT kll_sketch_percentile(response_sketch, array(0.5, 0.95, 0.99))
FROM (
    SELECT kll_sketch_agg(response_time_ms) AS response_sketch
    FROM request_logs
);
```

Theta example — approximate distinct counts across datasets:
```sql
-- Approximate unique users across two date ranges
SELECT theta_sketch_distinct(
    theta_sketch_union(
        (SELECT theta_sketch_agg(user_id) FROM events WHERE dt = '2025-01-01'),
        (SELECT theta_sketch_agg(user_id) FROM events WHERE dt = '2025-01-02')
    )
) AS approx_unique_users_2days;
```

Collation Support
Spark 4.1.1 supports string collation, allowing locale-aware and case-insensitive string comparisons — essential for multilingual applications.
```sql
-- UNICODE_CI gives locale-aware, case-insensitive comparison
CREATE TABLE products (name STRING COLLATE UNICODE_CI);

SELECT * FROM products
WHERE name = 'café' COLLATE UNICODE_CI;
```
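Beyond locale-specific collations, Spark's built-in `UTF8_LCASE` collation is a cheap way to get case-insensitive string semantics; a minimal sketch:

```sql
-- UTF8_LCASE compares strings case-insensitively
SELECT 'SparkSQL' = 'sparksql' COLLATE UTF8_LCASE;  -- true

-- A column declared with this collation ignores case in =, GROUP BY, and ORDER BY
CREATE TABLE users (email STRING COLLATE UTF8_LCASE);
```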