MultiNode Installation

This section provides detailed steps for installing and configuring Apache Airflow in your environment using RPM packages and Management Pack (Mpack). Apache Airflow is a powerful tool for managing complex workflows and data processing pipelines. By utilizing RPM packages and Mpack, the installation and integration processes are simplified, ensuring a smooth deployment. Follow these instructions to effectively set up and manage workflows using Apache Airflow on your system.

Here’s how to set up Airflow across two nodes:

Node 1:

  • Airflow Scheduler
  • Airflow Webserver
  • Airflow Celery Worker

Node 2:

  • Airflow Celery Worker

Configuration Details

Node 1

  • Install Airflow.
  • Configure Airflow to run the Scheduler, Webserver, and Worker components.
  • Adjust airflow.cfg to set the correct hostname and port for the Webserver, as in the sample snippet below.
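
In airflow.cfg, the relevant entries look like the following sketch; the hostname is a placeholder and 8080 is the Webserver default:

  [webserver]
  web_server_host = 0.0.0.0
  web_server_port = 8080
  base_url = http://node1.example.com:8080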

Node 2

  • Install Airflow.
  • Configure Airflow to run only the Worker component.
  • Ensure the Worker’s configuration references the correct broker URL for communication with the Scheduler and the message queue (e.g., RabbitMQ); see the sketch below.
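
With RabbitMQ running on Node 1, the Worker’s broker settings in airflow.cfg follow this pattern; the credentials, host, and virtual host are placeholders, and result_backend uses db+mysql:// instead for a MySQL backend:

  [celery]
  broker_url = amqp://airflow:your_password@node1.example.com:5672/airflow_vhost
  result_backend = db+postgresql://airflow:your_password@node1.example.com:5432/airflow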

General Configuration

  • Install all necessary dependencies on both nodes, including Python packages, database connectors, and other required libraries.
  • If using a database backend, set up the database instances and configure the connection settings in airflow.cfg.
  • Confirm network settings allow for proper communication between Node 1 and Node 2, focusing on required Airflow ports and any related services (e.g., RabbitMQ).
  • Test the configuration thoroughly to ensure all Airflow components communicate effectively across both nodes; a quick connectivity check is sketched after this list.
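
Before starting services, a quick reachability check from Node 2 can catch firewall problems early. The ports shown are the defaults; adjust them to your deployment:

Bash
  nc -zv node1.example.com 5672   # RabbitMQ broker
  nc -zv node1.example.com 8080   # Airflow Webserver
  nc -zv node1.example.com 5432   # PostgreSQL (use 3306 for MySQL)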

This setup distributes Airflow components over two nodes, optimizing resource utilization and component management based on your specific infrastructure and needs.

RHEL 8 Setup for Node 1

Prerequisites

Install the necessary development tools and a Python interpreter at version 3.8 or above. This documentation uses Python 3.8, the minimum version supported by Apache Airflow.

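A typical RHEL 8 setup, assuming the python38 packages from the AppStream repository:

Bash
  sudo dnf groupinstall -y "Development Tools"
  sudo dnf install -y python38 python38-devel python38-pip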

Database Setup

To get the most out of Airflow, even for evaluation, configure a database backend using either PostgreSQL or MySQL. By default, Airflow uses SQLite, which is intended primarily for development purposes.

Airflow supports only specific versions of each database engine, so it is crucial to verify your version’s compatibility; older versions may lack support for certain SQL statements:

  • PostgreSQL: Versions 12 through 16
  • MySQL: Version 8.0, Innovation
  • MSSQL (experimental; support is discontinued as of version 2.9.0): Versions 2017 and 2019
  • SQLite: Version 3.15.0 and above

Before proceeding with Apache Airflow installation, establish a compatible database, selecting between PostgreSQL or MySQL based on your preferences and system requirements. Confirm that the chosen database version meets the minimum compatibility prerequisites:

  • MySQL: Version 8.0 or higher
  • PostgreSQL: Version 12 or higher

Note: Oracle databases are not supported by Apache Airflow.

Follow the respective instructions below for your chosen database system to initialize and configure it for use with Apache Airflow.

PostgreSQL Database Setup

To use PostgreSQL with Apache Airflow, perform the following steps to install and configure it:

  1. Install PostgreSQL:
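On RHEL 8, PostgreSQL is available as an AppStream module. The module version below is an example; choose one within the supported range of 12 through 16:

Bash
  sudo dnf module enable -y postgresql:13
  sudo dnf install -y postgresql-server
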
  2. Initialize and Start PostgreSQL:
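The standard RHEL 8 commands initialize the data directory and enable the service at boot:

Bash
  sudo postgresql-setup --initdb
  sudo systemctl enable --now postgresql
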
  3. Create PostgreSQL Database and User for Airflow:

To set up the database and user for Apache Airflow in PostgreSQL, follow these steps:

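A minimal sketch: open a psql session as the postgres system user, then create the database and role, replacing the password with your own:

Bash
  sudo -u postgres psql

SQL
  CREATE DATABASE airflow;
  CREATE USER airflow WITH PASSWORD 'your_password';
  GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;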

Now, the PostgreSQL database named airflow and the user airflow with the specified settings and privileges have been created. Proceed with the next steps to configure Apache Airflow with this PostgreSQL database.

  4. Configure PostgreSQL Settings for Airflow:

After creating the Airflow database and user in PostgreSQL, modify the PostgreSQL configuration to allow connections from the Apache Airflow server. Follow these steps:

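First, edit postgresql.conf so that PostgreSQL listens for remote connections; the path assumes a default RHEL 8 data directory:

Bash
  sudo vi /var/lib/pgsql/data/postgresql.conf

Set, for example:

  listen_addresses = '*'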

Save and close the file.

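Next, edit pg_hba.conf to allow the Airflow nodes to authenticate; the subnet below is a placeholder for your own network:

Bash
  sudo vi /var/lib/pgsql/data/pg_hba.conf

Add a line such as:

  host    airflow    airflow    10.0.0.0/24    md5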

Save and close the file.

  5. Restart PostgreSQL to Apply Changes:
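On RHEL 8:

Bash
  sudo systemctl restart postgresql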

MySQL Database Setup for Airflow

To set up MySQL as the database backend for Apache Airflow, perform the following steps:

  1. Install MySQL Server:
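On RHEL 8, the AppStream repository provides MySQL 8.0:

Bash
  sudo dnf install -y mysql-server
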
  2. Install the mysqlclient Python package:
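mysqlclient compiles against the MySQL client headers, so this sketch installs those first:

Bash
  sudo dnf install -y gcc mysql-devel python38-devel
  pip3 install mysqlclient
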
  3. Start the MySQL service:
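On RHEL 8 the service is named mysqld:

Bash
  sudo systemctl enable --now mysqld
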
  4. Install MySQL Connector for Python:
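For example, via pip:

Bash
  pip3 install mysql-connector-python
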
  5. Secure MySQL installation (optional but recommended):
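Run the interactive hardening script:

Bash
  sudo mysql_secure_installation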

Follow the prompts to secure the MySQL installation, including setting a root password.

  6. Create Database and User for Airflow:
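Open a MySQL shell as root:

Bash
  mysql -u root -p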

Enter the root password when prompted, then run the following inside the MySQL shell:

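A minimal sketch; replace the password with your own. The explicit utf8mb4 character set matches Airflow’s recommendation for MySQL:

SQL
  CREATE DATABASE airflow CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  CREATE USER 'airflow'@'%' IDENTIFIED BY 'your_password';
  GRANT ALL PRIVILEGES ON airflow.* TO 'airflow'@'%';
  FLUSH PRIVILEGES;
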
  7. Restart MySQL to Apply Changes:
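On RHEL 8:

Bash
  sudo systemctl restart mysqld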

Now, the MySQL database is set up with a database named airflow and a user named airflow with the necessary privileges. Proceed to configure Apache Airflow to use this MySQL database as its backend.

Apache Airflow Installation using Mpack

Create symbolic links so that the system’s default python3 resolves to Python 3.8 or later.

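A sketch using the RHEL 8 alternatives mechanism, assuming Python 3.8 is installed at /usr/bin/python3.8 and registered as an alternative:

Bash
  sudo alternatives --set python3 /usr/bin/python3.8
  python3 --version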

This guide outlines the installation and setup process for Apache Airflow using Management Pack (Mpack) on an Ambari-managed cluster.

Install and Configure the Mpack:

Install Mpack:

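The bundle path and file name below are placeholders for the Airflow Mpack you obtained:

Bash
  ambari-server install-mpack --mpack=/path/to/airflow-mpack.tar.gz --verbose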

Uninstall the previous Mpack, if required:

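The Mpack name below is a placeholder; use the name reported by your existing installation:

Bash
  ambari-server uninstall-mpack --mpack-name=airflow-ambari-mpack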

Change Symlinks:


Restart Ambari Server:

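Apply the changes:

Bash
  ambari-server restart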

Your Apache Airflow installation, utilizing RPM packages and Mpack, is now ready for use on your Ambari-managed cluster.

Before starting the installation or setting up the Airflow service through the Ambari UI, make sure that the RabbitMQ configuration on the master node is complete.

RHEL 8 Setup for Node 2

Prerequisites

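Node 2 needs the same base tooling as Node 1; a typical sketch assuming the AppStream python38 packages:

Bash
  sudo dnf groupinstall -y "Development Tools"
  sudo dnf install -y python38 python38-devel python38-pip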

Install the necessary database connectors for the database you are using.

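For example, install only the connector that matches your backend:

Bash
  pip3 install psycopg2-binary   # PostgreSQL
  pip3 install mysqlclient       # MySQL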

Install Apache Airflow using the Ambari UI:

  • Add a service.
  • Select Airflow.
  • Select the Scheduler and Webserver nodes.

The webserver can be hosted on multiple nodes, but the scheduler must be located on the master node.

  • Select the nodes for the Celery workers; this guide uses a two-node setup.
  • Enter the database and RabbitMQ details.

Database Options:

  1. Select either MySQL or PostgreSQL for your backend database:
  2. Set up the Airflow backend database connection string and Celery configurations. You will need to enter specific details such as the database name, password, username, type of database (MySQL or PostgreSQL), and host IP. The script provided will automatically generate the necessary configuration details for the database connection string and Celery settings.

Input Database Information in the Ambari UI:

  • Database Name
  • Password
  • Username
  • Database Type: Select MySQL or PostgreSQL
  • Host IP
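
From these inputs, the backend connection string follows Airflow’s SQLAlchemy URI format; for example, one of the following depending on the database type, with placeholder credentials and host:

  sql_alchemy_conn = mysql+mysqldb://airflow:your_password@10.0.0.1:3306/airflow
  sql_alchemy_conn = postgresql+psycopg2://airflow:your_password@10.0.0.1:5432/airflow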

If you are using RabbitMQ, configure and add the following RabbitMQ settings:

  • RabbitMQ Username
  • RabbitMQ Password
  • RabbitMQ Virtual Host
  • Celery Broker
  • After the service starts, try to access the Flower UI.
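
Flower listens on port 5555 by default; the hostname below is a placeholder:

Bash
  curl -I http://node1.example.com:5555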