Apache Pinot Batch Import using HDFS

Overview

This document provides a step-by-step guide for batch importing data into Apache Pinot using HDFS as the storage backend.

Before running any Pinot commands, make sure Java 11 is active in your shell and export any other required environment variables.

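A minimal sketch of the environment setup, assuming OpenJDK 11 is installed under /usr/lib/jvm (adjust the path to your installation):

```bash
# Point the shell at a Java 11 installation; the path below is an assumption.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export PATH=$JAVA_HOME/bin:$PATH

# Verify the active version before running any Pinot commands.
java -version
```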

The steps below use a single sample table; repeat them to create different tables and schemas as needed.

Prepare the Data

Create a directory to store raw data.

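For example (the local path is illustrative):

```bash
# Local staging directory for the raw input files.
mkdir -p /tmp/pinot-quick-start/rawdata
```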

Create a sample CSV file with transcript data:

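A sample dataset along the lines of the Pinot quick-start transcript example; the columns and rows are illustrative:

```bash
cat <<EOF > /tmp/pinot-quick-start/rawdata/transcript.csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
EOF
```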

Define Schema and Table Configuration

Schema Definition

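A schema matching the sample CSV above, in the standard Pinot schema format; the field names and types are assumptions based on the sample data:

```bash
cat <<EOF > /tmp/pinot-quick-start/transcript-schema.json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
EOF
```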

Table Definition

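An OFFLINE table configuration referencing the transcript schema; the tenant names, replication factor, and index settings are assumptions and should be adjusted for your cluster:

```bash
cat <<EOF > /tmp/pinot-quick-start/transcript-table-offline.json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": []
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}
EOF
```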

Upload Schema and Table Configuration

Register the schema:

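Assuming the Pinot controller runs on localhost:9000, the schema can be registered with the pinot-admin AddSchema command:

```bash
bin/pinot-admin.sh AddSchema \
  -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
  -controllerHost localhost \
  -controllerPort 9000 \
  -exec
```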

Register the table:

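Similarly, the table configuration can be registered with AddTable:

```bash
bin/pinot-admin.sh AddTable \
  -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
  -controllerHost localhost \
  -controllerPort 9000 \
  -exec
```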

Configure Batch Ingestion Job

Configuration for a Kerberos-Enabled Cluster

If your cluster uses Kerberos, add the properties below to the Pinot Controller configuration.

Note: In Ambari, add these properties under Advanced pinot-controller-conf, save the change, and restart the Pinot service.

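The property names below follow the Apache Pinot HDFS plugin documentation (note that Pinot's documented key really is spelled "principle"); the principal, keytab, and Hadoop conf paths are placeholders:

```properties
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/etc/hadoop/conf
pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle=<your-kerberos-principal>
pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab=<path-to-your-keytab>
```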

Update these values to match your environment.

If you are not using Kerberos, remove the following properties from the batch job YAML file below, and remove the Kerberos configuration shown above from the Pinot Controller conf.

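In the job spec below, the Kerberos-specific entries are the two keys under pinotFSSpecs → configs:

```yaml
# Remove these two entries from pinotFSSpecs -> configs if Kerberos is disabled.
hadoop.kerberos.principle: '<your-kerberos-principal>'
hadoop.kerberos.keytab: '<path-to-your-keytab>'
```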

Create a batch ingestion job configuration file:

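A sketch of a standalone job spec that reads the CSV input from HDFS and pushes segments to the controller; the NameNode host, input/output directories, and controller URI are placeholders to replace with your own values:

```bash
cat <<EOF > /tmp/pinot-quick-start/batch-job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://<namenode-host>:8020/pinot/rawdata'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'hdfs://<namenode-host>:8020/pinot/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf'
      hadoop.kerberos.principle: '<your-kerberos-principal>'
      hadoop.kerberos.keytab: '<path-to-your-keytab>'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://localhost:9000/tables/transcript/schema'
  tableConfigURI: 'http://localhost:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
EOF
```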

Set Up Hadoop Environment Variables

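The exact variables depend on your distribution; a typical setup points HADOOP_HOME and HADOOP_CONF_DIR at the Hadoop client install and exposes the Pinot plugins directory to the job (all paths below are assumptions):

```bash
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-bin

# Make the Pinot plugins visible to the ingestion job.
export HADOOP_CLIENT_OPTS="-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins"
```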

Run the Batch Ingestion Job

Execute the following command to launch the data ingestion job:

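Assuming the job spec from above is saved at /tmp/pinot-quick-start/batch-job-spec.yml:

```bash
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml
```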