Multi-Node Installation

This section provides detailed steps for installing and configuring Apache Airflow in your environment using RPM packages and a Management Pack (Mpack). Apache Airflow is a platform for authoring, scheduling, and monitoring complex workflows and data processing pipelines. Using RPM packages and an Mpack simplifies installation and integration with an Ambari-managed cluster, ensuring a smooth deployment. Follow these instructions to set up and manage workflows with Apache Airflow on your system.

Here’s how to set up Airflow across two nodes:

Node 1:

  • Airflow Scheduler
  • Airflow Webserver
  • Airflow Celery Worker

Node 2:

  • Airflow Celery Worker

Configuration Details

Node 1

  • Install Airflow.
  • Configure Airflow to run the Scheduler, Webserver, and Worker components.
  • Adjust the airflow.cfg to set the correct hostname and port for the Webserver.
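A sketch of the relevant airflow.cfg entries; the values shown are Airflow's defaults, and web_server_host and web_server_port are the standard option names in the [webserver] section:

    [webserver]
    web_server_host = 0.0.0.0
    web_server_port = 8080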

Node 2

  • Install Airflow.
  • Configure Airflow to run only the Worker component.
  • Ensure the Worker’s configuration references the correct broker URL for communications with the Scheduler and the message queue (e.g., RabbitMQ).

General Configuration

  • Install all necessary dependencies on both nodes, including Python packages, database connectors, and other required libraries.
  • If using a database backend, set up the database instances and configure the connection settings in airflow.cfg.
  • Confirm network settings allow proper communication between Node 1 and Node 2, focusing on the required Airflow ports and any related services (e.g., RabbitMQ); a quick connectivity check is sketched after this list.
  • Test the configuration thoroughly to ensure all Airflow components communicate effectively across both nodes.
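As a quick check from Node 2, the ports below are the usual defaults and may differ in your deployment (8080 for the Airflow webserver, 5672 for RabbitMQ, 5432 for PostgreSQL or 3306 for MySQL):

    # Run from Node 2; replace {node1_IP} with Node 1's address
    nc -zv {node1_IP} 8080    # Airflow webserver
    nc -zv {node1_IP} 5672    # RabbitMQ broker
    nc -zv {node1_IP} 5432    # PostgreSQL (use 3306 for MySQL)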

This setup distributes Airflow components over two nodes, optimizing resource utilization and component management based on your specific infrastructure and needs.

RHEL8 Setup for Node 1

Prerequisites

Before starting, install the necessary development tools and Python packages. Ensure you have Python 3.11 or newer, as Airflow supports Python versions from 3.11 onwards.

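For example, on RHEL8 (exact package names can vary with your minor release and enabled repositories):

    sudo dnf groupinstall -y "Development Tools"
    sudo dnf install -y python3.11 python3.11-devel python3.11-pip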

Database Setup

For effective testing and operation, configure a database backend, choosing between PostgreSQL and MySQL. While Airflow defaults to SQLite for development, it supports:

  • PostgreSQL: Versions 12 through 16
  • MySQL: Version 8.0 and the Innovation release track
  • MSSQL (experimental; support removed in Airflow 2.9.0): Versions 2017 and 2019
  • SQLite: Version 3.15.0 and above

Ensure the database version you select is compatible with Airflow.

  • MySQL: Version 8.0 or higher
  • PostgreSQL: Version 12 or higher

Note: Oracle databases are not supported by Apache Airflow.

Follow the respective instructions below for your chosen database system to initialize and configure it for use with Apache Airflow.

PostgreSQL Database Setup

To use PostgreSQL with Apache Airflow, perform the following steps to install and configure it:

  1. Install psycopg2-binary Python Package:
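For example, with pip for the Python 3.11 interpreter (adjust the interpreter to match your environment):

    python3.11 -m pip install psycopg2-binary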
  2. Install PostgreSQL:
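On RHEL8, PostgreSQL is available as an AppStream module; the stream version below is an example, so choose any version supported by Airflow (12 through 16):

    sudo dnf module enable -y postgresql:15
    sudo dnf install -y postgresql-server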
  3. Initialize and Start PostgreSQL:
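Using the standard RHEL8 tooling:

    sudo postgresql-setup --initdb
    sudo systemctl enable --now postgresql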
  4. Create PostgreSQL Database and User for Airflow:

To set up the database and user for Apache Airflow in PostgreSQL, perform the following steps:

  • Access the PostgreSQL Shell:
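Connect as the postgres system user:

    sudo -u postgres psql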
  • Inside the PostgreSQL shell, execute the following commands:
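A typical setup looks like the following; the database and user names match those used throughout this guide, and the password is a placeholder you should replace:

    CREATE DATABASE airflow;
    CREATE USER airflow WITH PASSWORD 'airflow_password';
    GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;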

If you are using PostgreSQL 15 or newer, also run the following commands:

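PostgreSQL 15 removed the default CREATE privilege on the public schema, so grant it to the airflow user explicitly:

    \c airflow
    GRANT ALL ON SCHEMA public TO airflow;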

The PostgreSQL database 'airflow' and the user 'airflow' now exist with the required settings and privileges. Continue with the following steps to configure Apache Airflow to use this PostgreSQL database.

  5. Configure PostgreSQL Settings for Airflow:

Once the Airflow database and user have been set up in PostgreSQL, adjust the PostgreSQL configuration to permit connections from the Apache Airflow server. Proceed with the following steps:

  • Open the PostgreSQL Configuration File:
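The path below is the RHEL8 default data directory and may differ in your installation:

    sudo vi /var/lib/pgsql/data/postgresql.conf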
  • Inside the file, modify the following settings:
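At minimum, let PostgreSQL listen on interfaces the Airflow nodes can reach; listening on all interfaces is shown here as an example (restrict it in production):

    listen_addresses = '*'
    port = 5432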
  • Save and close the file.
  • Open the pg_hba.conf file:
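Again assuming the default RHEL8 data directory:

    sudo vi /var/lib/pgsql/data/pg_hba.conf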
  • At the end of the file, add one entry for each Airflow node, replacing {host_IP} with the actual IP address of each machine running Apache Airflow, then save and close the file.
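A typical entry, one line per node; the md5 auth method is an example, and scram-sha-256 also works on recent PostgreSQL versions:

    host    all    airflow    {host_IP}/32    md5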
  6. Restart PostgreSQL to apply changes:
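    sudo systemctl restart postgresql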

MySQL Database Setup for Airflow

To set up MySQL as the database backend for Apache Airflow, perform the following steps:

  1. Install MySQL Server:
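On RHEL8, MySQL 8.0 is available from the AppStream repository:

    sudo dnf install -y mysql-server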
  2. Install the mysqlclient Python package:
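mysqlclient compiles against the MySQL development headers, so install those first; the package names are the RHEL8 defaults:

    sudo dnf install -y gcc mysql-devel
    python3.11 -m pip install mysqlclient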
  3. Start the MySQL service:
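    sudo systemctl enable --now mysqld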
  4. Install MySQL Connector for Python:
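    python3.11 -m pip install mysql-connector-python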
  5. Secure MySQL installation (optional but recommended):
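    sudo mysql_secure_installation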

Follow the prompts to secure the MySQL installation, including setting a root password.

  6. Create Database and User for Airflow:
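Open a MySQL shell as root:

    mysql -u root -p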

Enter the root password when prompted, then run the following inside the MySQL shell:

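A typical setup; the utf8mb4 character set follows the Airflow documentation's recommendation, and the password is a placeholder you should replace:

    CREATE DATABASE airflow CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    CREATE USER 'airflow'@'localhost' IDENTIFIED BY 'airflow_password';
    GRANT ALL PRIVILEGES ON airflow.* TO 'airflow'@'localhost';
    FLUSH PRIVILEGES;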

If you are using a multi-node setup, add users for the other nodes as well:

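For example, for the second node (replace {node2_IP} with that node's address):

    CREATE USER 'airflow'@'{node2_IP}' IDENTIFIED BY 'airflow_password';
    GRANT ALL PRIVILEGES ON airflow.* TO 'airflow'@'{node2_IP}';
    FLUSH PRIVILEGES;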
  7. Restart MySQL to Apply Changes:
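    sudo systemctl restart mysqld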

The MySQL database 'airflow' and the user 'airflow' are now configured with the required privileges. Next, configure Apache Airflow to use this MySQL database as its backend.

Apache Airflow Installation using Mpack

Create symbolic links so that the system Python resolves to Python 3.11 or newer:

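A minimal sketch, assuming Python 3.11 is installed at /usr/bin/python3.11:

    sudo ln -sf /usr/bin/python3.11 /usr/bin/python3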

The following steps outline the installation and setup process for Apache Airflow using Management Pack (Mpack) on an Ambari-managed cluster.

Install Mpack:

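The archive path below is a placeholder for wherever you downloaded the Airflow Mpack:

    sudo ambari-server install-mpack --mpack=/path/to/airflow-mpack.tar.gz --verbose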

Uninstall the previous Mpack, if required:

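The Mpack name below is a placeholder; use the name of the Mpack actually installed on your server:

    sudo ambari-server uninstall-mpack --mpack-name=airflow-mpack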

Change Symlinks:


Restart Ambari Server:

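    sudo ambari-server restart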

The Mpack setup is now complete, and your Ambari-managed cluster is ready for the RPM-based Apache Airflow installation.

Before starting the installation or setting up the Airflow service through the Ambari UI, make sure that the RabbitMQ configuration on the master node is complete.

RHEL8 Setup for Node 2

Prerequisites

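Node 2 needs the same base tooling as Node 1 (package names may vary with your RHEL8 release):

    sudo dnf groupinstall -y "Development Tools"
    sudo dnf install -y python3.11 python3.11-devel python3.11-pip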

Install the necessary database connectors for the database you are using.

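Install whichever connector matches the backend chosen for Node 1:

    # PostgreSQL backend
    python3.11 -m pip install psycopg2-binary

    # MySQL backend
    python3.11 -m pip install mysqlclient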

Before proceeding with the Airflow installation from Ambari, ensure that you have set up the Apache Airflow repository on both nodes.

Install Apache Airflow using the Ambari UI

  • Add Service: launch the Add Service wizard from the Ambari dashboard.
  • Select Airflow from the list of available services.
  • Select the nodes for the Scheduler and Webserver.

The webserver can be hosted on multiple nodes, but the scheduler must be located on the master node.

  • Select the nodes for the Celery workers; in this two-node setup, both nodes run a worker.
  • Enter the database and RabbitMQ details:

Database Options:

  1. Select either MySQL or PostgreSQL for your backend database:
  2. Set up the Airflow backend database connection string and Celery configurations. You will need to enter specific details such as the database name, password, username, type of database (MySQL or PostgreSQL), and host IP. The script provided automatically generates the configuration for the database connection string and Celery settings; sketches of the resulting entries follow the lists below.

Input Database Information in the Ambari UI:

  • Database Name
  • Password
  • Username
  • Database Type: Select MySQL or PostgreSQL
  • Host IP
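These values are assembled into Airflow's SQLAlchemy connection string. A sketch, where the braced placeholders stand for the values entered above (in Airflow 2.3+ this setting lives in the [database] section of airflow.cfg):

    [database]
    # PostgreSQL
    sql_alchemy_conn = postgresql+psycopg2://{username}:{password}@{host_IP}:5432/{database_name}
    # MySQL
    # sql_alchemy_conn = mysql+mysqldb://{username}:{password}@{host_IP}:3306/{database_name}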

If you are using RabbitMQ, configure and add the following RabbitMQ settings:

  • RabbitMQ Username
  • RabbitMQ Password
  • RabbitMQ Virtual Host
  • Celery Broker
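These values form the Celery broker URL. A sketch, assuming RabbitMQ's default port 5672 and the placeholders above ({rabbitmq_host} is the node running RabbitMQ); the result_backend line applies only if the database also serves as the Celery result backend:

    [celery]
    broker_url = amqp://{rabbitmq_username}:{rabbitmq_password}@{rabbitmq_host}:5672/{rabbitmq_virtual_host}
    result_backend = db+postgresql://{username}:{password}@{host_IP}:5432/{database_name}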