Documentation
ODP 3.3.6.3-1
Operational Workflow of Pinot
Apache Pinot follows a structured workflow when pulling data from HDFS, processing it into segments, and pushing it into the Pinot cluster. Let's break it down step by step.
Internal Working of Pinot's Batch Ingestion from HDFS
When you push data from a CSV file stored in HDFS, Pinot follows these steps:
Reading the CSV File from HDFS (Input Directory)
- Pinot reads the CSV file(s) from the `inputDirURI` (`hdfs:///pinot/input/`).
- It uses a record reader (e.g., `CSVRecordReader`) to parse the data.
Processing and Creating Segments (Staging Directory)
- Pinot divides the dataset into chunks called segments.
- Each segment is a self-contained, optimized storage unit that holds a portion of the dataset.
- Segments are generated in a staging directory (`stagingDir: 'hdfs:///pinot/staging/'`).
- The processing happens either locally (standalone mode) or using Hadoop (as a MapReduce job) for scalability.
Storing Processed Segments (Output Directory)
- After segment creation, Pinot writes the segments to the output directory (`outputDirURI: 'hdfs:///pinot/output/'`).
- The segments are tarred (`.tar.gz`) for efficient storage and transfer.
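The tarred layout is easy to illustrate locally. The following sketch fakes a segment directory with a dummy file (the directory and file names are made up for the demo, not what Pinot actually writes) and packages it the same way the batch job packages real segments:

```shell
# Illustrative only: fake a segment directory and tar it the way the
# ingestion job packages real segments. All names here are invented.
WORK=$(mktemp -d)
mkdir -p "$WORK/airline_OFFLINE_0"
echo "dummy column data" > "$WORK/airline_OFFLINE_0/columns.psf"

# Package the segment directory as a .tar.gz, as the batch job does.
tar -czf "$WORK/airline_OFFLINE_0.tar.gz" -C "$WORK" airline_OFFLINE_0

# Inspect the archive contents.
tar -tzf "$WORK/airline_OFFLINE_0.tar.gz"
```

With a real deployment the same `tar -tzf` inspection works on archives copied down from `hdfs:///pinot/output/`.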
Pushing Segments to Pinot Cluster
The generated segments are then uploaded to the Pinot Controller.
This is done using:
- Tar Push → Segments are copied to Pinot via HTTP.
- URI Push → The controller downloads segments from the output directory (HDFS).
The segments are then distributed across Pinot Servers, ready for queries.
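In practice this whole pull-and-push flow is typically kicked off with Pinot's batch ingestion launcher, `pinot-admin.sh LaunchDataIngestionJob`; the push mode (tar vs. URI) is chosen in the job spec rather than on the command line. As a sketch (the installation path and spec file location are assumptions for your deployment), the invocation looks like the following. The block only prints the command as a dry run, since actually executing it requires a Pinot installation and a running cluster:

```shell
# Dry run: print the launcher invocation instead of executing it.
# PINOT_HOME and the spec path are assumptions; adjust for your setup.
PINOT_HOME=${PINOT_HOME:-/opt/pinot}
JOB_SPEC=/path/to/hadoopIngestionJobSpec.yaml

echo "${PINOT_HOME}/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ${JOB_SPEC}"
```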
Explanation of Key Directories
| Directory | Purpose |
|---|---|
| Input Directory (`inputDirURI`) | Stores raw CSV files in HDFS. |
| Staging Directory (`stagingDir`) | Temporary storage for intermediate files before segment creation. |
| Output Directory (`outputDirURI`) | Stores the final Pinot segments before they are pushed to the Controller. |
Pinot Pull Operation Flow
Job Execution Begins
- Pinot reads the `hadoopIngestionJobSpec.yaml` file.
- It identifies the input location, output location, and processing parameters.
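The spec file is what wires all of these pieces together. A minimal sketch follows; the field values are assumptions based on the paths used in this walkthrough, and the exact runner class names and available fields depend on your Pinot version and plugins:

```yaml
# Sketch of a batch ingestion job spec; values are illustrative.
executionFrameworkSpec:
  name: 'standalone'             # or 'hadoop' for a MapReduce-based run
jobType: SegmentCreationAndTarPush   # SegmentCreationAndUriPush for URI push
inputDirURI: 'hdfs:///pinot/input/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'hdfs:///pinot/output/'
# stagingDir: 'hdfs:///pinot/staging/'   # used by the Hadoop-based runner
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'airline_table'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```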
Segment Generation (MapReduce or Standalone)
- Pinot reads CSV files from HDFS (inputDirURI).
- Data is processed using the configured record reader.
- The data is split into multiple segments based on partitioning logic.
- Segments are compressed and stored in `outputDirURI`.
Segment Push to Pinot
- Pinot registers the segments with the Controller.
- The Broker updates its metadata to include new segments.
- Segments are then assigned to Pinot Servers for query execution.
Query Execution
- Once segments are available, Pinot Brokers start handling queries.
- Queries fetch data from Pinot Servers, which read from the stored segments.
Example Workflow with Paths
Assume:
- Your raw CSV file is at `hdfs:///pinot/input/airline_data.csv`.
- The staging directory is `hdfs:///pinot/staging/`.
- The processed segments will be saved in `hdfs:///pinot/output/`.
Step-by-Step Execution:
- Pinot reads the CSV from `hdfs:///pinot/input/`.
- It splits the data into segments (`hdfs:///pinot/staging/`).
- It creates tarred segment files (e.g., `hdfs:///pinot/output/segment_1.tar.gz`).
- It pushes the segments to the Pinot Controller.
- Pinot Servers load the segments, and queries can be executed.
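The directory flow above can be mimicked end to end on a local filesystem, which is a handy way to internalize it before pointing at HDFS. Everything in this sketch is a stand-in: plain directories instead of HDFS, and `split` instead of Pinot's segment builder:

```shell
# Local mock of the input -> staging -> output flow. Plain directories
# stand in for HDFS, and `split` stands in for segment generation.
BASE=$(mktemp -d)
mkdir -p "$BASE/input" "$BASE/staging" "$BASE/output"

# Step 1: "raw CSV" lands in the input directory.
printf 'a,1\nb,2\nc,3\nd,4\n' > "$BASE/input/airline_data.csv"

# Step 2: split the data into 2-line "segments" in the staging directory.
split -l 2 "$BASE/input/airline_data.csv" "$BASE/staging/segment_"

# Step 3: tar each staged segment into the output directory.
for seg in "$BASE"/staging/segment_*; do
  tar -czf "$BASE/output/$(basename "$seg").tar.gz" -C "$BASE/staging" "$(basename "$seg")"
done

ls "$BASE/output"
```

In the real pipeline, steps 4 and 5 (the push to the Controller and segment loading on the Servers) are handled by the ingestion job and the cluster itself.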
High-Level Diagram
```text
+-------------------------+
| HDFS (Raw CSV Files)    |  <---- (Step 1: inputDirURI)
+-------------------------+
            |
            v
+-------------------------+
| Pinot Segment Creator   |  <---- (Step 2: stagingDir)
+-------------------------+
            |
            v
+-------------------------+
| HDFS (Segment Output)   |  <---- (Step 3: outputDirURI)
+-------------------------+
            |
            v
+-------------------------+
| Pinot Controller        |  <---- (Step 4: Segment Push)
+-------------------------+
            |
            v
+-------------------------+
| Pinot Servers (Query)   |  <---- (Step 5: Query Execution)
+-------------------------+
```

Verification
Check HDFS Directories
```bash
hdfs dfs -ls hdfs:///pinot/input/
hdfs dfs -ls hdfs:///pinot/staging/
hdfs dfs -ls hdfs:///pinot/output/
```

Check Segments in Pinot
```bash
curl -X GET "http://localhost:9000/tables/airline_table/segments"
```

Query Data in Pinot
```sql
SELECT * FROM airline_table LIMIT 10;
```
Last updated on May 6, 2025
Next to read:
Leverage HDFS as Deep Storage for Real-Time Tables