Spark Installation

The Spark Installation capability in xDP provides a streamlined, one-click process for deploying and managing Apache Spark on your Kubernetes compute clusters. It automates the provisioning, configuration, and integration of Spark, turning a complex setup process into a simple, repeatable workflow. Your data teams can focus on building data pipelines and analytics applications rather than managing infrastructure, and they get a production-ready, observable, and governed Spark environment from day one.

Key Concepts

  • Application Catalog: A curated registry within xDP containing essential data processing and analytics applications like Spark, Trino, and JupyterHub. All applications in the catalog are pre-configured for rapid deployment on xDP-managed compute clusters.
  • Compute Cluster: The target Kubernetes environment where applications are installed. xDP manages the lifecycle of these applications within the selected cluster.
  • Spark History Server: An optional but critical plugin that provides a web UI for inspecting the logs and metrics of completed and running Spark applications. This is essential for debugging, performance tuning, and maintaining a complete operational history (Data Lineage) of your Spark jobs.
  • Accelerator Plugins: Optional plugins that use a native C++ library to accelerate Spark SQL queries. Enabling one can significantly improve performance for SQL-heavy workloads by offloading execution to a native engine.
  • Configuration as Code: xDP allows you to manage Spark's configuration using a simple form or a powerful YAML editor. This approach enables you to version control, audit, and programmatically replicate your Spark environment configurations, aligning with modern GitOps and DataOps practices.
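
As one illustration of the version-control and audit point above, the following minimal sketch uses Python's standard difflib module to show what changed between two saved copies of a Spark configuration. The YAML contents and the snapshot labels are hypothetical and are not taken from xDP's actual schema.

```python
import difflib


def config_diff(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines between two configuration snapshots."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="spark-config@previous",  # hypothetical snapshot label
            tofile="spark-config@current",     # hypothetical snapshot label
            lineterm="",
        )
    )


# Hypothetical snapshots; the real structure depends on your xDP version.
OLD = 'spark-defaults:\n  "spark.executor.memory": "2g"\n'
NEW = 'spark-defaults:\n  "spark.executor.memory": "4g"\n'

for line in config_diff(OLD, NEW):
    print(line)
```

In practice a Git history gives you the same information; the sketch only shows that a text-based configuration makes such comparisons straightforward.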

Capabilities

  • Automated Provisioning: Deploy a fully configured Apache Spark environment to any registered compute cluster with a single click, eliminating manual setup and configuration drift.
  • Integrated Observability: xDP automatically injects a Spark listener into your jobs, capturing detailed runtime metrics. This feeds into the platform's data observability features, helping you monitor Data Quality, performance, and cost.
  • Centralized Configuration: Manage all Spark settings, from core properties to advanced plugin configurations, through a unified interface. Switch between a user-friendly form for common settings and a full YAML editor for complete control.
  • Plugin Management: Easily enable or disable key services like the Spark History Server for debugging or performance accelerators to meet the specific needs of your workloads.

Tutorial (Getting Started)

This tutorial guides you through installing your first Apache Spark application on a compute cluster.

Prerequisites

  • You have an active xDP account with privileges to install applications.
  • A Compute Cluster is already registered and running in xDP. See Compute Clusters for more information.
  • The target cluster has sufficient CPU, memory, and storage resources to run Spark.
  • Supported Spark versions: 3.0.x through 4.0.x (any >= 3.0.0). Spark jobs run against a bundled base image — a pre-built container packaging a specific Spark version. Acceldata ships tested images for 3.3.3 (default), 3.3.4, 3.5.5, and 4.0.0; other >= 3.0.0 versions can be used with a custom image. You specify both the container image and the sparkVersion field independently at job submission.
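
The version rule in the last prerequisite can be sketched as a small check. This is an illustration only: the function names and return labels are hypothetical, and the sketch enforces just the ">= 3.0.0" lower bound stated above.

```python
# Versions with tested images, as listed in the prerequisites above.
BUNDLED_VERSIONS = ("3.3.3", "3.3.4", "3.5.5", "4.0.0")
MIN_VERSION = (3, 0, 0)


def parse_version(version: str) -> tuple[int, int, int]:
    """Split a 'major.minor.patch' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def image_requirement(spark_version: str) -> str:
    """Classify a version as 'bundled', 'custom-image', or 'unsupported'.

    The labels are hypothetical; they are not values used by xDP.
    """
    if parse_version(spark_version) < MIN_VERSION:
        return "unsupported"
    if spark_version in BUNDLED_VERSIONS:
        return "bundled"
    return "custom-image"


print(image_requirement("3.4.1"))
```

Because the container image and the sparkVersion field are set independently, a check like this can help catch a mismatch before a job is submitted.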

Your First Installation

  1. Select the Spark Application: From the main navigation, go to Platform > Apps. This displays the Application Catalog. Locate the Apache Spark card and click Install.
  2. Configure Optional Plugins: The Install Plugins screen allows you to enable additional services. For this tutorial, enable the Spark History Server. This service is invaluable for debugging and analyzing job performance later. Click Continue.

Example: Your team needs to debug a failed nightly ETL job. The Spark History Server provides access to the detailed logs and execution plan, allowing you to quickly identify the root cause of the failure.

  3. Set Application Parameters: Review the application configuration. You can use the Form Editor for a simplified view or switch to the YAML Editor for advanced customization. For now, the default settings are sufficient. Click Save & Continue.
  4. Complete the Installation: The final screen confirms that your configuration has been applied and the installation process has started. The application will be available shortly.
  5. Monitor Installation Status: Monitor the installation progress from this screen. Once complete, click Go to Applications to return to the catalog, where the Spark application shows the updated status.

How-to Guides

Enable the Spark History Server Post-Installation

If you initially installed Spark without the History Server, you can enable it at any time.

  1. Navigate to Platform > Apps.
  2. Locate the installed Apache Spark application and click the Edit button. You are taken to the "Install Plugins" configuration step.
  3. Toggle the Spark History Server switch to the on position.
  4. Click Continue and then Save & Continue through the remaining steps to apply the change.
  5. Verification: xDP applies the new configuration to your cluster. After a few moments, the History Server is deployed and accessible.
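
One way to confirm that the History Server is reachable is to query Spark's REST API, which the History Server exposes under /api/v1/applications. The sketch below assumes you already know the base URL at which your deployment exposes the service; the host name shown is hypothetical, and 18080 is only Spark's default History Server port, which xDP may map differently.

```python
import json
from urllib.request import urlopen


def list_applications(base_url: str, fetch=None) -> list[dict]:
    """Return the applications known to a Spark History Server.

    `fetch` is an optional callable (URL -> bytes) so the HTTP call can be
    swapped out, for example in tests.
    """
    url = base_url.rstrip("/") + "/api/v1/applications"
    if fetch is None:
        def fetch(target: str) -> bytes:
            with urlopen(target, timeout=10) as response:
                return response.read()
    return json.loads(fetch(url))


def summarize(applications: list[dict]) -> list[tuple[str, str, bool]]:
    """Reduce each entry to (id, name, completed flag of the first listed attempt)."""
    rows = []
    for app in applications:
        attempts = app.get("attempts", [])
        completed = bool(attempts) and bool(attempts[0].get("completed", False))
        rows.append((app["id"], app["name"], completed))
    return rows


# Example call (hypothetical host):
# print(summarize(list_applications("http://spark-history.example.internal:18080")))
```

An empty list is a valid response for a freshly deployed server that has not yet recorded any jobs.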

Customize Spark Properties using YAML

For advanced control, use the YAML editor to add or override default Spark properties.

  1. Navigate to Platform > Apps, find your Spark installation, and click Edit.
  2. Continue to the "Configure Application" step (Step 2).
  3. Select the YAML Editor tab.
  4. Locate the relevant section to add your custom properties. For example, to set a default executor memory, you might add it under a spark-defaults configuration key.

     # Example snippet - actual structure may vary
     spark-defaults:
       "spark.executor.memory": "4g"
       "spark.driver.memory": "2g"

  5. Click Save & Continue to apply your changes.
  6. Verification: The new configuration is applied. Any new Spark jobs submitted through xDP will now use these default properties.
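
Before saving, it can help to sanity-check memory values, since a malformed size string is an easy mistake to make. The following is a minimal sketch, not part of xDP: the helper name is hypothetical, and the pattern accepts only a deliberately conservative subset of size strings (an integer followed by k, m, g, or t), whereas Spark itself may accept additional forms.

```python
import re

# Conservative subset of size strings: a positive integer plus k, m, g, or t.
_SIZE_PATTERN = re.compile(r"[1-9][0-9]*[kmgt]")
_MEMORY_KEYS = ("spark.executor.memory", "spark.driver.memory")


def invalid_memory_settings(spark_defaults: dict[str, str]) -> dict[str, str]:
    """Return the memory properties whose values do not look like '4g' or '512m'."""
    return {
        key: value
        for key, value in spark_defaults.items()
        if key in _MEMORY_KEYS
        and not _SIZE_PATTERN.fullmatch(str(value).lower())
    }


# Values taken from the example snippet above, plus one malformed entry.
print(invalid_memory_settings({
    "spark.executor.memory": "4g",
    "spark.driver.memory": "2 GB",
}))
```

Only the two memory properties are checked; every other key passes through untouched.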

Reference

Configuration Options

The following optional plugins can be configured during the Spark installation process.

Parameter | Description | Default
Spark History Server | Enables the Spark History Server, which provides a UI to view logs and details for completed and running Spark applications. Essential for debugging and performance analysis. | Disabled
Accelerators | Enables the available accelerator backend for Spark SQL. This can improve query performance by offloading processing to a native C++ execution engine. Recommended for SQL-intensive workloads. | Disabled

The YAML editor provides access to the underlying Helm chart values for the Spark deployment. The specific keys depend on the chart version, but common customizable parameters include resource requests/limits, node selectors, and Spark default properties.
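
Helm overlays user-supplied values on a chart's defaults by merging nested maps, so an override only needs to contain the keys you want to change. The sketch below imitates that behavior in plain Python to show the effect. Every key name in it is hypothetical; check the values exposed in your YAML editor for the real structure.

```python
from copy import deepcopy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults; non-dict values are replaced."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults and user overrides.
CHART_DEFAULTS = {
    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
    "nodeSelector": {},
}
USER_OVERRIDES = {
    "resources": {"requests": {"memory": "4Gi"}},
    "nodeSelector": {"workload": "spark"},
}

print(merge_values(CHART_DEFAULTS, USER_OVERRIDES))
```

Note that the CPU request survives the merge because the override touches only the memory key.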

Best Practices

  • Tip: Always enable the Spark History Server for development and production environments. The operational insight it provides for debugging and performance tuning far outweighs the minor resource cost.
  • Resource Sizing: Before installing, ensure your target compute cluster has adequate resources. Monitor resource utilization after installation to right-size your Spark workloads and prevent resource contention.
  • Use the YAML Editor for Production: While the Form Editor is convenient, use the YAML editor to manage configurations for production environments. You can save the YAML file in a version control system like Git to track changes and automate deployments.
  • Isolate Workloads: Consider deploying separate Spark application instances for different teams or projects on the same compute cluster. This can be achieved by running the installation multiple times with different application names.
  • Security: When configuring properties in YAML, use secrets management tools for sensitive values like passwords or access keys rather than hardcoding them directly in the configuration.
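
The security recommendation above can be supported by a simple pre-commit style check. The following is a sketch under stated assumptions: the "env:" reference prefix is a hypothetical convention invented for this example, not an xDP feature, and the marker list is illustrative rather than exhaustive.

```python
import os

# Substrings that suggest a property holds a credential (illustrative only).
SENSITIVE_MARKERS = ("password", "secret", "token", "access.key")
ENV_PREFIX = "env:"  # hypothetical convention for "read this from the environment"


def find_hardcoded_secrets(properties: dict[str, str]) -> list[str]:
    """Return keys that look sensitive but carry a literal value."""
    return sorted(
        key
        for key, value in properties.items()
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS)
        and not str(value).startswith(ENV_PREFIX)
    )


def resolve(properties: dict[str, str], environ=os.environ) -> dict[str, str]:
    """Replace 'env:NAME' references with the value of environment variable NAME."""
    resolved = {}
    for key, value in properties.items():
        text = str(value)
        if text.startswith(ENV_PREFIX):
            resolved[key] = environ[text[len(ENV_PREFIX):]]
        else:
            resolved[key] = text
    return resolved


props = {
    "spark.hadoop.fs.s3a.secret.key": "env:S3_SECRET",
    "spark.hadoop.fs.s3a.access.key": "literal-value",
    "spark.executor.memory": "4g",
}
print(find_hardcoded_secrets(props))
```

A dedicated secrets manager remains the better choice for production; the sketch only shows how literal credentials can be flagged before a configuration is committed.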