Documentation
ODP 3.3.6.3-1
Operational Workflow of Pinot
Apache Pinot follows a structured workflow when pulling data from HDFS, processing it into segments, and pushing it into the Pinot cluster. Let's break it down step by step.
Internal Working of Pinot's Batch Ingestion from HDFS
When you push data from a CSV file stored in HDFS, Pinot follows these steps:
Reading the CSV File from HDFS (Input Directory)
- Pinot reads the CSV file(s) from the `inputDirURI` (`hdfs:///pinot/input/`).
- It uses a record reader (e.g., `CSVRecordReader`) to parse the data.
Processing and Creating Segments (Staging Directory)
- Pinot divides the dataset into chunks called segments.
- Each segment is a self-contained, optimized storage unit that holds a portion of the dataset.
- Segments are generated in a staging directory (`stagingDir: 'hdfs:///pinot/staging/'`).
- The processing happens either locally (standalone mode) or using Hadoop (as a MapReduce job) for scalability.
Storing Processed Segments (Output Directory)
- After segment creation, Pinot writes the segments to the output directory (`outputDirURI: 'hdfs:///pinot/output/'`).
- The segments are tarred (`.tar.gz`) for efficient storage and transfer.
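The tarred layout is easy to illustrate locally. The following sketch fakes a segment directory with a dummy file (the directory and file names are made up for the demo, not what Pinot actually writes) and packages it the same way the batch job packages real segments:

```shell
# Illustrative only: fake a segment directory and tar it the way the
# ingestion job packages real segments. All names here are invented.
WORK=$(mktemp -d)
mkdir -p "$WORK/airline_OFFLINE_0"
echo "dummy column data" > "$WORK/airline_OFFLINE_0/columns.psf"

# Package the segment directory as a .tar.gz, as the batch job does.
tar -czf "$WORK/airline_OFFLINE_0.tar.gz" -C "$WORK" airline_OFFLINE_0

# Inspect the archive contents.
tar -tzf "$WORK/airline_OFFLINE_0.tar.gz"
```

With a real deployment the same `tar -tzf` inspection works on archives copied down from `hdfs:///pinot/output/`.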
Pushing Segments to Pinot Cluster
The generated segments are then uploaded to the Pinot Controller.
This is done using:
- Tar Push → Segments are copied to Pinot via HTTP.
- URI Push → The controller downloads segments from the output directory (HDFS).
The segments are then distributed across Pinot Servers, ready for queries.
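In practice this whole pull-and-push flow is typically kicked off with Pinot's batch ingestion launcher, `pinot-admin.sh LaunchDataIngestionJob`; the push mode (tar vs. URI) is chosen in the job spec rather than on the command line. As a sketch (the installation path and spec file location are assumptions for your deployment), the invocation looks like the following. The block only prints the command as a dry run, since actually executing it requires a Pinot installation and a running cluster:

```shell
# Dry run: print the launcher invocation instead of executing it.
# PINOT_HOME and the spec path are assumptions; adjust for your setup.
PINOT_HOME=${PINOT_HOME:-/opt/pinot}
JOB_SPEC=/path/to/hadoopIngestionJobSpec.yaml

echo "${PINOT_HOME}/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ${JOB_SPEC}"
```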
Explanation of Key Directories
| Directory | Purpose |
|---|---|
| Input Directory (`inputDirURI`) | Stores raw CSV files in HDFS. |
| Staging Directory (`stagingDir`) | Temporary storage for intermediate files before segment creation. |
| Output Directory (`outputDirURI`) | Stores the final Pinot segments before they are pushed to the Controller. |
Pinot Pull Operation Flow
Job Execution Begins
- Pinot reads the `hadoopIngestionJobSpec.yaml` file.
- It identifies the input location, output location, and processing parameters.
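The spec file is what wires all of these pieces together. A minimal sketch follows; the field values are assumptions based on the paths used in this walkthrough, and the exact runner class names and available fields depend on your Pinot version and plugins:

```yaml
# Sketch of a batch ingestion job spec; values are illustrative.
executionFrameworkSpec:
  name: 'standalone'             # or 'hadoop' for a MapReduce-based run
jobType: SegmentCreationAndTarPush   # SegmentCreationAndUriPush for URI push
inputDirURI: 'hdfs:///pinot/input/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'hdfs:///pinot/output/'
# stagingDir: 'hdfs:///pinot/staging/'   # used by the Hadoop-based runner
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'airline_table'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```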
Segment Generation (MapReduce or Standalone)
- Pinot reads CSV files from HDFS (inputDirURI).
- Data is processed using the configured record reader.
- The data is split into multiple segments based on partitioning logic.
- Segments are compressed and stored in `outputDirURI`.
Segment Push to Pinot
- Pinot registers the segments with the Controller.
- The Broker updates its metadata to include new segments.
- Segments are then assigned to Pinot Servers for query execution.
Query Execution
- Once segments are available, Pinot Brokers start handling queries.
- Queries fetch data from Pinot Servers, which read from the stored segments.
Example Workflow with Paths
Assume:
- Your raw CSV file is at `hdfs:///pinot/input/airline_data.csv`.
- The staging directory is `hdfs:///pinot/staging/`.
- The processed segments will be saved in `hdfs:///pinot/output/`.
Step-by-Step Execution:
- Pinot reads the CSV from `hdfs:///pinot/input/`.
- It splits the data into segments (`hdfs:///pinot/staging/`).
- It creates tarred segment files (e.g., `hdfs:///pinot/output/segment_1.tar.gz`).
- It pushes the segments to the Pinot Controller.
- Pinot Servers load the segments, and queries can be executed.
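The directory flow above can be mimicked end to end on a local filesystem, which is a handy way to internalize it before pointing at HDFS. Everything in this sketch is a stand-in: plain directories instead of HDFS, and `split` instead of Pinot's segment builder:

```shell
# Local mock of the input -> staging -> output flow. Plain directories
# stand in for HDFS, and `split` stands in for segment generation.
BASE=$(mktemp -d)
mkdir -p "$BASE/input" "$BASE/staging" "$BASE/output"

# Step 1: "raw CSV" lands in the input directory.
printf 'a,1\nb,2\nc,3\nd,4\n' > "$BASE/input/airline_data.csv"

# Step 2: split the data into 2-line "segments" in the staging directory.
split -l 2 "$BASE/input/airline_data.csv" "$BASE/staging/segment_"

# Step 3: tar each staged segment into the output directory.
for seg in "$BASE"/staging/segment_*; do
  tar -czf "$BASE/output/$(basename "$seg").tar.gz" -C "$BASE/staging" "$(basename "$seg")"
done

ls "$BASE/output"
```

In the real pipeline, steps 4 and 5 (the push to the Controller and segment loading on the Servers) are handled by the ingestion job and the cluster itself.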
High-Level Diagram
```text
+-------------------------+
| HDFS (Raw CSV Files)    |  <---- (Step 1: inputDirURI)
+-------------------------+
            |
            v
+-------------------------+
| Pinot Segment Creator   |  <---- (Step 2: stagingDir)
+-------------------------+
            |
            v
+-------------------------+
| HDFS (Segment Output)   |  <---- (Step 3: outputDirURI)
+-------------------------+
            |
            v
+-------------------------+
| Pinot Controller        |  <---- (Step 4: Segment Push)
+-------------------------+
            |
            v
+-------------------------+
| Pinot Servers (Query)   |  <---- (Step 5: Query Execution)
+-------------------------+
```

Verification
Check HDFS Directories
```bash
hdfs dfs -ls hdfs:///pinot/input/
hdfs dfs -ls hdfs:///pinot/staging/
hdfs dfs -ls hdfs:///pinot/output/
```

Check Segments in Pinot
```bash
curl -X GET "http://localhost:9000/tables/airline_table/segments"
```

Query Data in Pinot
```sql
SELECT * FROM airline_table LIMIT 10;
```
Last updated on May 6, 2025
Next to read:
Leverage HDFS as Deep Storage for Real-Time Tables