SQL Features
ANSI Mode (Enabled by Default)
ANSI mode is on by default in Spark 4.1.1, aligning Spark SQL with ANSI SQL standards. It enforces stricter rules for NULL handling, type conversions, and arithmetic.
```
spark.sql.ansi.enabled=true  # default in Spark 4.1.1
```

Behavior changes vs Spark 3.x:
| Operation | Spark 3.x | Spark 4.1.1 (ANSI default) |
|---|---|---|
| Integer overflow | Silent wrap-around | Throws ArithmeticException |
| Invalid type cast | Returns null | Throws a runtime cast error (CAST_INVALID_INPUT) |
| Division by zero | Returns null | Throws ArithmeticException |
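When the legacy null-on-error behavior is still wanted for an individual expression, Spark's `try_*` function family (`try_add`, `try_divide`, `try_cast`, and friends) returns NULL on failure even with ANSI mode enabled:

```sql
-- Per-expression opt-out from ANSI failures via try_* variants
SELECT try_add(2147483647, 1);   -- NULL instead of ArithmeticException
SELECT try_divide(10, 0);        -- NULL instead of ArithmeticException
SELECT try_cast('abc' AS INT);   -- NULL instead of a cast error
```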
```sql
-- Throws ArithmeticException in Spark 4.1.1
SELECT 2147483647 + 1;
```

SQL Scripting (GA)
SQL Scripting is now Generally Available and enabled by default, turning Spark SQL into a fully programmable environment with loops, conditionals, variables, and error handling — directly in SQL.
New in 4.1.1: CONTINUE HANDLER for error recovery and multi-variable DECLARE syntax.
Example — control flow:
```sql
BEGIN
  DECLARE total_count INT DEFAULT 0;
  DECLARE dept_id INT DEFAULT 100;

  SET total_count = (SELECT COUNT(*) FROM employees WHERE department_id = dept_id);

  IF total_count > 50 THEN
    INSERT INTO large_departments VALUES (dept_id, total_count);
  ELSE
    INSERT INTO small_departments VALUES (dept_id, total_count);
  END IF;
END;
```

Example — error handling with CONTINUE HANDLER:
```sql
BEGIN
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
    INSERT INTO error_log VALUES (CURRENT_TIMESTAMP, 'Error occurred');

  INSERT INTO target_table SELECT * FROM source_table;
END;
```

VARIANT Data Type (GA)
The VARIANT data type is now Generally Available, providing a standardized way to store semi-structured data like JSON without rigid schemas. A major performance enhancement in 4.1.1 is shredding — commonly queried fields within a VARIANT column are automatically extracted and stored as typed Parquet columns, dramatically reducing I/O.
Performance benchmarks (shredded vs alternatives):
| Comparison | Performance Impact |
|---|---|
| VARIANT with shredding vs non-shredded VARIANT | 8x faster |
| VARIANT with shredding vs JSON strings | 30x faster |
| Write performance (trade-off) | 20–50% slower writes |
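As a sketch of typed extraction (assuming a table like `events(id BIGINT, payload VARIANT)`), the `variant_get` function pulls out a field and casts it to a target type in one step; field paths accessed this way are exactly what shredding can serve from dedicated Parquet columns:

```sql
-- Extract payload.score as an INT; returns NULL where the path is absent
SELECT id, variant_get(payload, '$.score', 'int') AS score
FROM events;
```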
Example:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Variant Example").getOrCreate()

# Create table with VARIANT column
spark.sql("""
    CREATE TABLE events (
        id BIGINT,
        payload VARIANT
    ) USING PARQUET
""")

# Insert semi-structured data
spark.sql("""
    INSERT INTO events VALUES
        (1, parse_json('{"user": "alice", "action": "login", "score": 95}')),
        (2, parse_json('{"user": "bob", "action": "purchase", "amount": 49.99}'))
""")

# Query specific fields
spark.sql("SELECT id, payload:user, payload:action FROM events").show()
```

Recursive CTE
Spark 4.1.1 adds native support for Recursive Common Table Expressions, enabling traversal of hierarchical data structures — org charts, bill of materials, graph topologies — directly in SQL.
Example — org chart traversal:
```sql
WITH RECURSIVE org_hierarchy AS (
    -- Anchor: start with top-level managers
    SELECT employee_id, name, manager_id, 0 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive: find each manager's direct reports
    SELECT e.employee_id, e.name, e.manager_id, oh.level + 1
    FROM employees e
    JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT * FROM org_hierarchy ORDER BY level, name;
```

Approximate Data Sketches
Spark 4.1.1 expands approximate aggregation beyond HyperLogLog with two new native sketch types for efficient approximate analytics on massive datasets.
| Sketch Type | SQL Function | Use Case |
|---|---|---|
| KLL (Quantiles) | kll_sketch_agg, kll_sketch_percentile | Approximate percentiles/quantiles with minimal memory |
| Theta | theta_sketch_agg, theta_sketch_distinct | Approximate set operations (union, intersection, difference) |
KLL example — approximate percentiles:
```sql
-- Build a KLL sketch over a large column
SELECT kll_sketch_agg(response_time_ms) AS response_sketch
FROM request_logs;

-- Query p50, p95, p99 from the sketch
SELECT kll_sketch_percentile(response_sketch, array(0.5, 0.95, 0.99))
FROM (
    SELECT kll_sketch_agg(response_time_ms) AS response_sketch
    FROM request_logs
);
```

Theta example — approximate distinct counts across datasets:
```sql
-- Approximate unique users across two date ranges
SELECT theta_sketch_distinct(
    theta_sketch_union(
        (SELECT theta_sketch_agg(user_id) FROM events WHERE dt = '2025-01-01'),
        (SELECT theta_sketch_agg(user_id) FROM events WHERE dt = '2025-01-02')
    )
) AS approx_unique_users_2days;
```

Collation Support
Spark 4.1.1 supports string collation, allowing locale-aware and case-insensitive string comparisons — essential for multilingual applications.
```sql
-- UNICODE_CI gives locale-aware, case-insensitive comparison
CREATE TABLE products (name STRING COLLATE UNICODE_CI);

SELECT * FROM products
WHERE name = 'café' COLLATE UNICODE_CI;
```
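Beyond locale-specific collations, Spark's built-in `UTF8_LCASE` collation is a cheap way to get case-insensitive string semantics; a minimal sketch:

```sql
-- UTF8_LCASE compares strings case-insensitively
SELECT 'SparkSQL' = 'sparksql' COLLATE UTF8_LCASE;  -- true

-- A column declared with this collation ignores case in =, GROUP BY, and ORDER BY
CREATE TABLE users (email STRING COLLATE UTF8_LCASE);
```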