Airflow Installation
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Within xDP, the Airflow application provides a robust, enterprise-grade orchestration engine to manage complex data pipelines. It enables you to define Directed Acyclic Graphs (DAGs) of tasks, manage their dependencies, and ensure reliable execution at scale. This capability replaces basic job scheduling with a powerful framework for building resilient, observable, and automated data workflows, directly improving the timeliness and consistency of your data operations.
Architecture
xDP integrates Airflow as the unified orchestration layer for all jobs and workflows. When you create and schedule a job or a multi-step workflow, xDP transparently generates the corresponding Airflow DAG. This architecture ensures that all data operations, from a simple scheduled Spark job to a complex ETL pipeline, benefit from Airflow's advanced features like dependency management, retries, and monitoring.
Key Concepts
- Directed Acyclic Graph (DAG): A collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. In xDP, both single jobs and multi-step pipelines are managed as DAGs.
- Operator: A pre-defined template for a single task in a workflow. xDP uses a custom `XDPJobOperator` to execute Spark jobs, notebook tasks, and other operations defined within the platform.
- Metadata Database: A database, typically PostgreSQL, that Airflow uses to store the state of all tasks and workflows. This is critical for tracking execution history, managing connections, and ensuring operational consistency.
- DAG Storage: A persistent, shared storage location, such as MinIO or Amazon S3, where Airflow reads DAG definition files. xDP automatically syncs your workflow definitions to this location.
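To make the DAG concept concrete, here is a minimal sketch in plain Python (standard library only, not Airflow or xDP code). The task names are hypothetical. It shows how a dependency map yields a valid run order, and why the graph must be acyclic: a cycle leaves no task that can start first.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def is_acyclic(dependencies):
    """Return True if the dependency map contains no cycle."""
    try:
        execution_order(dependencies)
        return True
    except CycleError:
        return False


order = execution_order(pipeline)
print(order)  # "extract" comes first and "load" comes last
```

A scheduler such as Airflow applies the same idea at larger scale: a task becomes eligible to run only after everything it depends on has finished.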
Capabilities
- Centralized Orchestration: Install and manage a dedicated Airflow instance for your compute cluster to handle all scheduling and workflow execution.
- Flexible Dependency Configuration: Choose between using xDP's internal, managed PostgreSQL and MinIO services for quick setup, or connect to your own external databases and S3-compatible storage for production workloads.
- Resource Customization: Fine-tune CPU and memory resources for Airflow components (Scheduler, Webserver, Workers) using either a simplified form editor or a full-featured YAML editor for advanced control.
- Seamless Integration: Once installed, Airflow becomes the underlying engine for the Workflows feature, enabling you to build complex data pipelines without leaving the xDP interface.
Tutorial (Getting Started)
Prerequisites
- You have an active xDP account with permissions to install applications on a compute cluster.
- A compute cluster, such as `democluster`, is running and available.
- Supported version: Airflow 1.2
Install Your First Airflow Instance
This tutorial guides you through installing Apache Airflow on your cluster using the default settings, which leverage xDP's internal services for metadata and DAG storage.
- From the xDP sidebar, navigate to Platform > Apps. This displays the application catalog.
- Locate the AIRFLOW application card and click Install.
- The installation wizard begins. The `democluster` is automatically selected as the target. The first step is Configure PostgreSQL.
- Leave the Use Internal PostgreSQL toggle enabled. This instructs xDP to automatically provision and configure a dedicated PostgreSQL instance for Airflow's metadata. Click Next.
- In the Configure DAG Storage step, leave the Use Internal MinIO toggle enabled. This provisions an internal object storage bucket for your Airflow DAG files. Click Next.
- The final step is Configure Application. Here you can review and customize the Airflow deployment. For this initial setup, the default parameters are sufficient. Click Save & Continue to begin the installation.
- xDP now deploys Airflow to your cluster. You can monitor the progress from the Apps page. Once complete, the status on the AIRFLOW card changes from "Not Installed" to "Installed".
You have successfully installed Apache Airflow! You can now proceed to the Workflows section to start building data pipelines.
Best Practices
- Use External Services for Production: For production environments, configure Airflow with an external, highly available PostgreSQL database and an S3-compatible object store. This decouples Airflow's state from the lifecycle of the cluster, so task history and DAG files survive if the cluster is rebuilt or removed.
- Manage DAGs with Git-Sync: While you can manually upload DAGs to your configured storage, the best practice is to store them in a Git repository. Use a CI/CD pipeline to automatically sync changes from your repository to the S3 bucket that Airflow monitors.
- Tip: Start with the default resource allocations provided by xDP. Monitor the CPU and memory usage of the Airflow scheduler and webserver pods under a typical workload, and only increase resources as needed to optimize cost and performance.
- Secure Sensitive Information: Use Airflow Connections and Variables to store secrets like database passwords, API keys, and tokens. Avoid hardcoding sensitive information directly in your DAG files.
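As one illustration of keeping secrets out of DAG files, the sketch below reads a password from the process environment at runtime and builds a connection URI from it. Airflow can load a connection from an environment variable named `AIRFLOW_CONN_<CONN_ID>` whose value is such a URI. Whether and how your xDP deployment exposes environment variables to Airflow is an assumption here, and the connection ID, host, user, and variable names are hypothetical.

```python
import os
from urllib.parse import quote


def build_connection_uri(scheme, user, host, port, database, password_env):
    """Build a connection URI, reading the password from an environment variable."""
    # The secret is looked up at runtime, so it never appears in source control.
    password = os.environ[password_env]
    # Percent-encode credentials so characters such as '@' or '/' keep the URI valid.
    return (
        f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


def airflow_conn_env_name(conn_id):
    """Return the environment variable name Airflow checks for a connection ID."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


# Hypothetical usage, assuming WAREHOUSE_DB_PASSWORD is injected by a secret store:
# uri = build_connection_uri("postgresql", "etl_user", "db.example.internal",
#                            5432, "analytics", "WAREHOUSE_DB_PASSWORD")
# os.environ[airflow_conn_env_name("warehouse_db")] = uri
```

DAG code can then refer to the connection by its ID only, and rotating the password requires no change to any DAG file.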
For additional help, contact our Support Team!
©2026, Acceldata Inc — All Rights Reserved.