Operational Workflow of Pinot
Apache Pinot follows a structured workflow when pulling data from HDFS, processing it into segments, and pushing it into the Pinot cluster. Let's break it down step by step.
Internal Working of Pinot's Batch Ingestion from HDFS
When you push data from a CSV file stored in HDFS, Pinot follows these steps:
Reading the CSV File from HDFS (Input Directory)
- Pinot reads the CSV file(s) from the inputDirURI (hdfs:///pinot/input/).
- It uses a record reader (e.g., CSVRecordReader) to parse the data.
Processing and Creating Segments (Staging Directory)
- Pinot divides the dataset into chunks called segments.
- Each segment is a self-contained, optimized storage unit that holds a portion of the dataset.
- Segments are generated in a staging directory (stagingDir: 'hdfs:///pinot/staging/').
- The processing happens either locally (standalone mode) or using Hadoop (MapReduce job) for scalability.
Storing Processed Segments (Output Directory)
- After segment creation, Pinot writes them to the output directory (outputDirURI: 'hdfs:///pinot/output/').
- These are tarred (.tar.gz) for efficient storage and transfer.
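To make the packaging step concrete, the sketch below builds a mock segment directory locally and tars it into a single .tar.gz file. The segment name and the empty placeholder files are assumptions for illustration only; real segments are produced by the ingestion job, and their exact contents depend on the Pinot version and segment format.

```shell
# Mock a segment directory locally; the names below are illustrative placeholders.
workdir="$(mktemp -d)"
segment_name="airline_table_segment_1"   # hypothetical segment name
mkdir -p "${workdir}/${segment_name}/v3"
touch "${workdir}/${segment_name}/v3/metadata.properties" \
      "${workdir}/${segment_name}/v3/columns.psf"

# Package the directory as a gzipped tarball, one archive per segment.
tar -czf "${workdir}/${segment_name}.tar.gz" -C "${workdir}" "${segment_name}"

# List the archive contents to see what would be uploaded or downloaded.
tar -tzf "${workdir}/${segment_name}.tar.gz"
```

The same tar -tzf listing can be run against a real segment file after copying it out of the output directory.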
Pushing Segments to Pinot Cluster
The generated segments are then uploaded to the Pinot Controller.
This is done using:
- Tar Push → The segment tar files are uploaded to the Controller over HTTP.
- URI Push → The controller downloads segments from the output directory (HDFS).
The segments are then distributed across Pinot Servers, ready for queries.
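In the ingestion job spec, the push mode is selected by the jobType field (for example SegmentCreationAndTarPush or SegmentCreationAndUriPush). The helper below is a small sketch for switching that field in an existing spec file. The name set_job_type is hypothetical, and the demo writes a throwaway spec rather than touching a real one.

```shell
# set_job_type <spec-file> <jobType>: rewrite the top-level jobType line in place.
# Hypothetical helper; it assumes jobType sits on its own unindented line.
set_job_type() {
  sed -i.bak "s/^jobType:.*/jobType: $2/" "$1" && rm -f "$1.bak"
}

# Demo on a throwaway file so no real job spec is modified.
demo_spec="$(mktemp)"
printf '%s\n' \
  "jobType: SegmentCreationAndTarPush" \
  "outputDirURI: 'hdfs:///pinot/output/'" > "${demo_spec}"

set_job_type "${demo_spec}" SegmentCreationAndUriPush
cat "${demo_spec}"
```

With URI push, the Controller has to be able to read the output directory itself, so it presumably needs the hdfs scheme configured on its side as well.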
Explanation of Key Directories
Directory | Purpose |
---|---|
Input Directory (inputDirURI) | Stores raw CSV files in HDFS. |
Staging Directory (stagingDir) | Temporary storage for processing intermediate files before segment creation. |
Output Directory (outputDirURI) | Stores the final Pinot segments before pushing to the Controller. |
Pinot Pull Operation Flow
Job Execution Begins
- Pinot reads the hadoopIngestionJobSpec.yaml file.
- It identifies the input location, output location, and processing parameters.
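A job spec for this flow might look like the sketch below, written out with a shell heredoc. The directory URIs and the table name come from this page; the Hadoop configuration path, the Controller address, the file name pattern, and the choice of tar push are assumptions, and the runner class names should be checked against the Pinot version in use.

```shell
# Write a sample job spec; the quoted 'EOF' keeps the YAML literal (no shell expansion).
cat > hadoopIngestionJobSpec.yaml <<'EOF'
executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 'hdfs:///pinot/staging/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///pinot/input/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'hdfs:///pinot/output/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf'   # assumed location of the Hadoop config
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'airline_table'
  schemaURI: 'http://localhost:9000/tables/airline_table/schema'
  tableConfigURI: 'http://localhost:9000/tables/airline_table'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'   # assumed Controller address
EOF
```

Note that stagingDir lives under executionFrameworkSpec.extraConfigs, while inputDirURI and outputDirURI are top-level fields.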
Segment Generation (MapReduce or Standalone)
- Pinot reads CSV files from HDFS (inputDirURI).
- Data is processed using the configured record reader.
- The data is written out as multiple segments; with the standard job runners, this is typically one segment per input file.
- Segments are compressed and stored in outputDirURI.
Segment Push to Pinot
- The push job uploads or registers the segments with the Controller.
- The Controller assigns the segments to Pinot Servers, which download and load them.
- Brokers update their routing information to include the new segments.
Query Execution
- Once segments are available, Pinot Brokers start handling queries.
- Brokers route each query to the Pinot Servers hosting the relevant segments; the Servers read from their loaded segments and return partial results for the Broker to merge.
Example Workflow with Paths
Assume:
- Your raw CSV file is at hdfs:///pinot/input/airline_data.csv.
- The staging directory is hdfs:///pinot/staging/.
- The processed segments will be saved in hdfs:///pinot/output/.
Step-by-Step Execution:
- Pinot reads CSV from hdfs:///pinot/input/
- Splits data into segments (hdfs:///pinot/staging/)
- Creates tarred segment files (hdfs:///pinot/output/segment_1.tar.gz)
- Pushes segments to the Pinot Controller
- Pinot Servers load segments, and queries can be executed
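The steps above can be sketched as a small driver script. It is a dry run by default, printing each command instead of executing it, so it can be reviewed before it touches HDFS or the cluster. The run helper, the installation directory, the version placeholder, and the local file name airline_data.csv are assumptions; the hadoop jar invocation follows the pattern commonly shown for launching a Pinot ingestion job on Hadoop and should be checked against your distribution.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumed locations and version; adjust to your installation.
PINOT_VERSION="${PINOT_VERSION:-1.0.0}"
PINOT_DISTRIBUTION_DIR="${PINOT_DISTRIBUTION_DIR:-/opt/pinot}"
JOB_SPEC="${JOB_SPEC:-hadoopIngestionJobSpec.yaml}"

# Step 1: place the raw CSV in the input directory.
run hdfs dfs -mkdir -p hdfs:///pinot/input/
run hdfs dfs -put -f airline_data.csv hdfs:///pinot/input/

# Steps 2-4: generate segments and push them, as configured in the job spec.
export HADOOP_CLIENT_OPTS="-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins"
run hadoop jar \
  "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  -jobSpecFile "${JOB_SPEC}"

# Step 5 is preceded by a quick check that segment files landed in the output directory.
run hdfs dfs -ls hdfs:///pinot/output/
```

Running the script with DRY_RUN=0 executes the same commands for real, which requires the hdfs and hadoop CLIs on the PATH.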
High-Level Diagram
+-------------------------+
| HDFS (Raw CSV Files)    |  <---- (Step 1: InputDirURI)
+-------------------------+
             |
             v
+-------------------------+
| Pinot Segment Creator   |  <---- (Step 2: StagingDir)
+-------------------------+
             |
             v
+-------------------------+
| HDFS (Segment Output)   |  <---- (Step 3: OutputDirURI)
+-------------------------+
             |
             v
+-------------------------+
| Pinot Controller        |  <---- (Step 4: Segment Push)
+-------------------------+
             |
             v
+-------------------------+
| Pinot Servers (Query)   |  <---- (Step 5: Query Execution)
+-------------------------+
Verification
Check HDFS Directories
hdfs dfs -ls hdfs:///pinot/input/
hdfs dfs -ls hdfs:///pinot/staging/
hdfs dfs -ls hdfs:///pinot/output/
Check Segments in Pinot
curl -X GET "http://localhost:9000/tables/airline_table/segments"
Query Data in Pinot
SELECT * FROM airline_table LIMIT 10;
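The query can also be sent from the command line through the Broker's SQL endpoint. The sketch below wraps that call in a helper; query_pinot is a hypothetical name, and the Broker address (localhost on port 8099) is an assumption that should be adjusted to your deployment.

```shell
PINOT_BROKER="${PINOT_BROKER:-http://localhost:8099}"   # assumed Broker address

# query_pinot <sql>: POST a SQL string to the Broker and print the JSON response.
# The SQL is embedded as-is, so it must not contain double quotes or backslashes.
query_pinot() {
  curl -s -X POST \
    -H "Content-Type: application/json" \
    -d "{\"sql\": \"$1\"}" \
    "${PINOT_BROKER}/query/sql"
}
```

For example, calling query_pinot 'SELECT COUNT(*) FROM airline_table' after the Servers have loaded the segments gives a quick check that the ingested rows are queryable.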