Gluten with Velox

Apache Gluten with Velox backend is a native execution plugin for Apache Spark that accelerates SQL query performance without requiring any code changes to your existing Spark applications.

Key Benefits

  • Zero Code Changes – Works with existing Spark SQL, DataFrame, and Dataset APIs
  • Significant Performance Gains – 2–5× faster on analytical workloads
  • Drop-in Replacement – Simply add JAR and configuration
  • Transparent Acceleration – Automatically offloads supported operations to native Velox engine
  • Production Ready – Battle-tested on TPC-DS, TPC-H benchmarks

What is Gluten?

Apache Gluten is a Spark plugin that offloads SQL query execution from JVM to native C++ execution engines. It acts as a “glue” layer between Apache Spark and vectorized execution engines like Velox.

Architecture Overview

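At a high level, Gluten intercepts Spark's physical plan and hands supported fragments to the native engine (a simplified sketch; Gluten serializes plan fragments using the Substrait IR, and details vary by version):

```
Spark SQL / DataFrame API
          │
   Catalyst planning (JVM)
          │
   Gluten plugin ── converts plan fragments to Substrait
          │                          │
   Velox native engine        fallback: vanilla Spark
   (vectorized C++)           operators (JVM)
          │
   columnar results returned to Spark
```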

Why Velox?

Velox is Meta's unified execution engine that powers:

  • Meta's data warehouse (Presto)
  • Stream processing systems
  • Machine learning infrastructure

Key Features

  • Vectorized execution – processes data in batches
  • SIMD instructions – hardware-accelerated operations
  • Advanced codegen – runtime code generation
  • Memory efficiency – Arrow-based zero-copy design

How It Works

Transparent Query Acceleration

When you enable Gluten, Spark automatically:

  1. Analyzes your SQL query plan
  2. Identifies operations that can be accelerated (scans, filters, joins, aggregations)
  3. Offloads those operations to the Velox native engine
  4. Falls back to Spark for unsupported operations
  5. Returns results – your application receives the same output, just faster!

Supported Operations

✅ Fully Supported

  • Table scans (Parquet, ORC, CSV)
  • Filters and predicates
  • Projections
  • Hash joins (inner, left, right, full outer, semi, anti)
  • Hash aggregations (sum, avg, min, max, count)
  • Sort operations
  • Window functions
  • String operations
  • Date/time operations
  • Math functions

⚠️ Partial Support

  • Some complex UDFs
  • Certain window function combinations
  • Specialized data types

❌ Not Supported

  • Custom data sources
  • Complex nested UDFs
  • Some exotic data types

Performance Expectations

Based on TPC-DS and production workloads:

Workload Type           Expected Speedup
Scan-heavy queries      2–3×
Join-heavy queries      3–5×
Aggregation-heavy       2–4×
Complex analytics       2–3×
Simple queries          1.5–2×

Real-world Examples

  • TPC-DS Query 72: 45s → 12s (3.75× speedup)
  • TPC-DS Query 95: 120s → 35s (3.4× speedup)

Getting Started

Prerequisites

Component           Version
Apache Spark        3.5.x
Java                8 or 11 (upcoming release)
Operating System    Linux x86_64
CPU                 AVX2 support recommended

Step 1: Obtain Gluten JAR

Contact your Acceldata account manager for the Gluten JAR download location, then place the JAR where your Spark deployment can load it (for example, distributed via --jars or on the cluster's shared classpath).

Step 2: Enable Gluten (No Code Changes!)

Option A: Via spark-submit

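A minimal sketch (the JAR path and application name are placeholders; the config keys are the ones used throughout this guide):

```shell
# JAR path and app name are examples -- adjust to your deployment.
spark-submit \
  --jars /path/to/gluten-velox-bundle.jar \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=10g \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  your_app.py
```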

Option B: Via spark-defaults.conf

Add to $SPARK_HOME/conf/spark-defaults.conf:

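A representative set of entries (values are illustrative starting points, not tuned recommendations):

```
spark.plugins                   org.apache.gluten.GlutenPlugin
spark.memory.offHeap.enabled    true
spark.memory.offHeap.size       10g
spark.shuffle.manager           org.apache.spark.shuffle.sort.ColumnarShuffleManager
```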

Option C: Programmatic Configuration (Scala/Java)

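A minimal Scala sketch (app name and sizes are placeholders; plugin configs must be set before the session is created):

```scala
import org.apache.spark.sql.SparkSession

// Plugin configs take effect only if set before session creation.
val spark = SparkSession.builder()
  .appName("gluten-example")
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "10g")
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  .getOrCreate()
```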

Option D: PySpark

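The equivalent PySpark sketch (same caveats as the Scala version):

```python
from pyspark.sql import SparkSession

# Plugin configs take effect only if set before session creation.
spark = (
    SparkSession.builder
    .appName("gluten-example")
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "10g")
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    .getOrCreate()
)
```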

Step 3: Run Your Application

That's it! Your application runs exactly as before, but faster.

Configuration Guide

Minimal Configuration (Quick Start)

For JDK 11

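A sketch for JDK 11. The exact --add-opens module list depends on your Gluten build; treat the modules below as representative assumptions, not a definitive list:

```shell
# JAR path, app name, and --add-opens modules are assumptions.
spark-submit \
  --jars /path/to/gluten-velox-bundle.jar \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=10g \
  --conf spark.driver.extraJavaOptions="--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED" \
  --conf spark.executor.extraJavaOptions="--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED" \
  your_app.py
```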

For JDK 8


Note: JDK 8 does not require the --add-opens parameters.


TPC-DS Benchmark Configuration (Tested)

This is the exact configuration used by Acceldata for TPC-DS benchmarking.

For JDK 11


For JDK 8


Performance Tuning – Core Gluten Parameters

1. Memory Configuration

  • spark.memory.offHeap.enabled – enable off-heap memory
  • spark.memory.offHeap.size – off-heap memory size (recommend 8–12g per executor)
  • spark.gluten.memory.fraction – fraction of off-heap for Gluten (recommend ~0.7)

Rule of thumb

Off-heap memory ≈ 60–75% of executor memory. Example: executor-memory=16g → offHeap.size=10–12g.
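The arithmetic behind the rule of thumb can be sketched as a small helper (the 0.7 midpoint is an illustrative choice within the 60–75% band):

```python
def recommended_offheap_gb(executor_memory_gb: int, fraction: float = 0.7) -> int:
    """Rule of thumb: off-heap ~= 60-75% of executor memory (0.7 is a midpoint)."""
    return round(executor_memory_gb * fraction)

print(recommended_offheap_gb(16))  # 11 -> i.e. offHeap.size=11g for a 16g executor
```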

2. Shuffle Configuration

  • spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager
  • spark.gluten.sql.columnar.shuffle.codec = lz4 (or zstd)

3. Velox Backend Settings

Key parameters to tune:

  • spark.gluten.sql.columnar.backend.velox.maxBatchSize = 16384–32768
  • spark.gluten.sql.columnar.backend.velox.memCacheSize = 2–4g
  • spark.gluten.sql.columnar.backend.velox.aggregationPreferredSize = 1048576–4194304
  • spark.gluten.sql.columnar.backend.velox.bloomFilterEnabled = true (join-heavy workloads)

4. Spilling Configuration

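A sketch of the spill settings (parameter names are listed in the appendix; the spill path is an example, point it at fast local disk):

```shell
# Let large aggregations/joins spill to disk instead of failing with OOM.
--conf spark.gluten.sql.columnar.backend.velox.spillEnabled=true \
--conf spark.gluten.sql.columnar.backend.velox.spillPath=/tmp/gluten-spill
```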

5. Adaptive Query Execution (AQE)

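AQE is standard Spark functionality and works alongside Gluten; a typical set of flags:

```shell
# Standard Spark AQE settings.
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.coalescePartitions.enabled=true \
--conf spark.sql.adaptive.skewJoin.enabled=true
```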

Monitoring and Validation

Verify Gluten is Active

Method 1 – Spark UI

  • Open Spark UI: http://<driver-host>:4040
  • Environment tab → check:
  • spark.plugins = org.apache.gluten.GlutenPlugin
  • spark.gluten.sql.enabled = true

Method 2 – Query Plan

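A Scala sketch (the table is illustrative; exact operator names vary by Gluten version):

```scala
// Any query against your own data works here.
val df = spark.sql(
  "SELECT l_returnflag, sum(l_quantity) FROM lineitem GROUP BY l_returnflag")
df.explain()
// Offloaded operators typically carry a *Transformer suffix in the plan
// (e.g. FilterExecTransformer, HashAggregateExecTransformer); vanilla
// Spark operator names indicate that stage fell back to the JVM.
```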

Method 3 – Logs

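A quick log check (log locations vary by deployment and cluster manager):

```shell
# Look for plugin initialization messages in the driver/executor logs.
grep -i "gluten" $SPARK_HOME/logs/* | head
```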

Performance Metrics

  • Query execution time (Spark UI → SQL tab)
  • Shuffle read/write size and time
  • GC time
  • Off-heap memory usage and spilling
  • CPU utilization

Example comparison

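An A/B comparison sketch: run the same job with Gluten off, then on, and compare wall-clock times (assumes the plugin configs already live in spark-defaults.conf; the script name is a placeholder):

```shell
# spark.gluten.sql.enabled is the master switch.
time spark-submit --conf spark.gluten.sql.enabled=false my_query.py
time spark-submit --conf spark.gluten.sql.enabled=true  my_query.py
```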

Benchmark Your Workload

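One way to get stable numbers: repeat the run a few times and compare medians to smooth out warm-up and caching effects (GNU time syntax, Linux only; script name is a placeholder):

```shell
for i in 1 2 3; do
  /usr/bin/time -f "run $i: %e s" spark-submit my_query.py
done
```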

Best Practices

1. Data Format Recommendations

Best performance

  • Parquet (columnar, compressed)
  • ORC (columnar, compressed)

Good

  • CSV (large files)
  • JSON (structured)

Poor

  • Small files (<128MB)
  • Plain uncompressed text
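For example, a CSV-backed table can be rewritten as Parquet so scans run columnar (table names are illustrative):

```sql
-- Convert a CSV-backed table to Parquet for columnar scans.
CREATE TABLE sales_parquet
USING PARQUET
AS SELECT * FROM sales_csv;
```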

2. Partition Strategy

  • Partition by date/time for time-series
  • Aim for 128–256MB file/partition size
  • Avoid thousands of tiny files
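A partitioning sketch for time-series data (schema is illustrative):

```sql
-- Partition a time-series table by date.
CREATE TABLE events (
  event_id   BIGINT,
  payload    STRING,
  event_date DATE
)
USING PARQUET
PARTITIONED BY (event_date);
```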

3. Query Optimization

Do

  • Use WHERE predicates for pushdown
  • Use partition pruning
  • SELECT only needed columns
  • Enable bloom filters for joins

Avoid

  • SELECT * on wide tables
  • Complex nested UDFs
  • Excessive small shuffles

4. Resource Allocation

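An example allocation that follows the 60–75% off-heap rule of thumb from the tuning section (values and app name are placeholders):

```shell
spark-submit \
  --executor-memory 16g \
  --executor-cores 4 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=11g \
  your_app.py
```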

5. Fallback Strategy

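Gluten already falls back per-operator automatically; if an entire job misbehaves, the master switch lets you disable it for that job only (script name is a placeholder):

```shell
spark-submit --conf spark.gluten.sql.enabled=false problem_job.py
```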

Troubleshooting

Issue 1: “Plugin GlutenPlugin could not be loaded”

Symptoms

The driver fails at startup and the log reports “Plugin GlutenPlugin could not be loaded”.

Fix

  • Verify that the JAR exists at the configured path
  • Ensure JAR is included in --jars
  • Check Spark version compatibility (Gluten 1.4.0 ↔ Spark 3.5.x)
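The checks above can be scripted roughly like this (the JAR path is an example; the class name comes from the spark.plugins value used throughout this guide):

```shell
# Confirm the JAR exists and is readable.
ls -l /path/to/gluten-velox-bundle.jar
# Confirm the plugin class is actually inside the JAR.
unzip -l /path/to/gluten-velox-bundle.jar | grep GlutenPlugin
```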

Issue 2: Native Library Load Failure

Symptoms

Executors (or the driver) fail at startup because the bundled native libraries cannot be loaded, typically with a java.lang.UnsatisfiedLinkError.

Fix

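A rough diagnostic sketch (the extracted library path is a hypothetical example):

```shell
# Prebuilt Gluten binaries need a reasonably recent glibc.
ldd --version | head -1
# If you can locate the extracted native library, look for unresolved
# dependencies; "not found" entries usually mean the OS is too old.
ldd /tmp/gluten-extract/libvelox.so | grep "not found"
```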

Issue 3: Out of Memory (OOM)

Fix

  • Increase off-heap memory (spark.memory.offHeap.size)
  • Enable spilling (spark.gluten.sql.columnar.backend.velox.spillEnabled=true)
  • Reduce batch size (spark.gluten.sql.columnar.backend.velox.maxBatchSize)
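Combined, the mitigations look roughly like this (values are starting points, not tuned recommendations):

```shell
--conf spark.memory.offHeap.size=16g \
--conf spark.gluten.sql.columnar.backend.velox.spillEnabled=true \
--conf spark.gluten.sql.columnar.backend.velox.maxBatchSize=8192
```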

Issue 4: Result Mismatch vs Vanilla Spark

Fix

  • Check data types (decimals, timestamps)
  • Ensure query is supported by Gluten
  • Temporarily disable Gluten (set spark.gluten.sql.enabled=false) and compare results

Issue 5: Slower with Gluten

  • Very small data (<100MB) → use vanilla Spark
  • Excessive ColumnarToRow conversions → inspect plan and simplify query
  • Increase spark.memory.offHeap.size
  • Tune spark.gluten.sql.columnar.shuffle.codec (try lz4 or zstd)

Debugging Tips

Enable more logging:

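For example, raise the log level for the Gluten package (the logger name assumes the org.apache.gluten package; match it to your Gluten version):

```
# In $SPARK_HOME/conf/log4j2.properties
logger.gluten.name = org.apache.gluten
logger.gluten.level = debug
```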

Check plans: call df.explain(true) (or explain("formatted")) and compare which operators were offloaded and which fell back to vanilla Spark.

Inspect Gluten-related configs:

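A small Scala sketch that lists every Gluten-related setting in the active session:

```scala
// spark.conf.getAll returns the session's runtime configuration.
spark.conf.getAll
  .filter { case (k, _) => k.startsWith("spark.gluten") }
  .foreach { case (k, v) => println(s"$k = $v") }
```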

FAQ

General

Q: Do I need to rewrite my Spark applications? A: No. Gluten works transparently with existing Spark SQL, DataFrame, and Dataset code.

Q: Will query results be the same? A: Yes. If you see differences, treat them as bugs and escalate.

Q: Does Gluten work with PySpark / SparkR / Scala / Java? A: Yes.

Q: Does Gluten support UDFs? A: Built-in functions are supported; custom UDFs usually fall back to Spark.

Q: What about Structured Streaming? A: Works for SQL operations in micro-batch mode.

Performance

  • Typical speedup: 2–5× on analytical workloads
  • Limited gains on very small queries (<100MB)
  • Fully compatible with dynamic allocation

Technical

  • File formats: Parquet, ORC, CSV, JSON (best with Parquet / ORC)
  • Works with Hive metastore and Hive tables
  • Compatible with Delta Lake / Iceberg / Hudi through Spark
  • Storage: S3, HDFS, ADLS, GCS, etc.

Disable for a specific query:

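A Scala sketch toggling the master switch around a single query (the query text is a placeholder):

```scala
// Disable Gluten for one query, then restore it.
spark.conf.set("spark.gluten.sql.enabled", "false")
val result = spark.sql("SELECT ...").collect()   // runs on vanilla Spark
spark.conf.set("spark.gluten.sql.enabled", "true")
```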

Compatibility & Deployment

  • Spark versions: 3.3.x, 3.4.x, 3.5.x (Gluten 1.4.0)
  • Java: 8 and 11
  • Cluster managers: YARN, Kubernetes, Standalone
  • Arch: x86_64 (ARM64 planned)

Rollback: remove Gluten JAR + plugin configs. Works in notebooks (Jupyter, Zeppelin) via Spark conf.

Quick Reference

Essential Config

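A compact recap of the settings used throughout this guide (JAR path and sizes are illustrative):

```shell
--jars /path/to/gluten-velox-bundle.jar \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.gluten.sql.enabled=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=10g \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
```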

Quick Checks

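Two quick sanity checks (file locations are examples and vary by deployment):

```shell
# Configured?
grep -i gluten $SPARK_HOME/conf/spark-defaults.conf
# Loaded? Search the driver log for plugin messages.
grep -i "GlutenPlugin" /path/to/driver.log
```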

Appendix – Key Gluten Parameters

  • spark.gluten.sql.enabled – master enable/disable switch
  • spark.gluten.sql.columnar.backend.lib – backend engine (velox)
  • spark.gluten.memory.fraction – fraction of off-heap for Gluten
  • spark.gluten.sql.columnar.preferColumnar – prefer columnar exec
  • spark.gluten.sql.columnar.forceShuffledHashJoin – force hash join
  • spark.gluten.sql.columnar.shuffle.codec – shuffle codec (lz4)
  • spark.gluten.sql.columnar.backend.velox.maxBatchSize – rows per batch
  • spark.gluten.sql.columnar.backend.velox.memCacheSize – cache size
  • spark.gluten.sql.columnar.backend.velox.spillEnabled – enable spilling
  • spark.gluten.sql.columnar.backend.velox.spillPath – spill directory
  • spark.gluten.sql.columnar.backend.velox.bloomFilterEnabled – bloom filters
  • spark.gluten.sql.columnar.backend.velox.aggregationPreferredSize – aggregation table size

Support and Resources

  • Apache Gluten GitHub – project source and issues
  • Velox GitHub – execution engine documentation
  • Apache Spark Tuning Guide – general Spark performance best practices