Apache HDFS
The Hadoop Distributed File System, or HDFS, is a foundational technology for storing massive amounts of data in a distributed context. It is designed to scale from a single server to thousands of machines while maintaining high availability and resilience. HDFS is well known for its capacity to efficiently manage large volumes of both structured and unstructured data.
Apache HDFS in ADOC
Within Acceldata's Data Observability Cloud (ADOC), the HDFS integration is an essential component that improves data observability and reliability. ADOC leverages the distributed architecture of HDFS to deliver deep insights and analytics, helping ensure optimal data quality and performance.
Steps to Add HDFS as a Data Source
To add HDFS as a Data source:
- Click Register from the left pane.
- Click Add Data Source.
- Select the HDFS Data Source. The HDFS Data Source Basic Details page is displayed.

- Enter a name for the data source in the Data Source name field.
- (Optional) Enter a description for the data source in the Description field.
- Enable the Data Reliability capability by switching on the toggle switch.
- Select a Data Plane from the Select Data Plane drop-down menu.

- Enter your Name Node URI.
- Enter the Cluster Name.
- Click the Test Connection button. After the connection test succeeds, the button label changes to Connected.
- Click the Next button.
- Add the Asset Name, Path Expression, File Type, and the File Monitoring Channel Type.

- Turn on the Crawler Execution Schedule toggle switch, then select a time tag and time zone to schedule the execution of crawlers for Data Reliability.
- Click the Submit button.
HDFS is now added as a data source. You can choose to crawl your HDFS account now or later.
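As a quick sanity check outside ADOC, the Name Node URI entered above can be probed over the WebHDFS REST API (enabled by default on port 9870 in Hadoop 3.x). The sketch below only builds the probe URL; the host name is a placeholder, not a real endpoint.

```python
# Illustrative sketch: build the WebHDFS URL used to probe a NameNode.
# The host below is a placeholder; WebHDFS must be enabled on the cluster.
from urllib.parse import urlencode

def webhdfs_url(name_node_uri: str, path: str = "/", op: str = "LISTSTATUS") -> str:
    """Return the WebHDFS REST endpoint for the given operation."""
    return f"{name_node_uri.rstrip('/')}/webhdfs/v1{path}?{urlencode({'op': op})}"

# The URI you would paste into the Name Node URI field:
url = webhdfs_url("http://namenode.example.com:9870")
# A plain urllib.request.urlopen(url) against a live cluster should then
# return a JSON FileStatuses listing if the NameNode is reachable.
```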

Control Plane Concurrent Connections and Queueing Mechanism
The ADOC Control Plane (CP) now supports a queueing mechanism for managing concurrent connections at the data source level. This feature is aimed at controlling and optimizing the execution of jobs, thereby preventing overload on customer databases and improving system performance and reliability. This guide provides an overview of how concurrent job execution is managed and queued, as well as details on the configuration process for manual and scheduled executions.
Key Features
- Concurrency Control at Datasource Level: Define the maximum number of concurrent jobs allowed for a specific data source.
- Queueing Mechanism for Jobs: Introduce a queueing mechanism to manage jobs that exceed the configured concurrency limit, ensuring smooth execution without overloading the database.
- Support for Multiple Job Types: Currently supports data quality, reconciliation, and profiling jobs.
- Flexibility in Slot Allocation: Users can set the number of available slots as per their performance needs.
Concurrency Control and Queueing Mechanism
Why Is Concurrency Control Needed?
Previously, no concurrency control existed to manage jobs on the Control Plane. Users could submit a huge number of jobs at once, potentially overwhelming their database and causing performance degradation or even system failures. The new concurrency management mechanism ensures that only a fixed number of jobs run concurrently, with additional jobs queued.
The concurrency control and queueing mechanism has been implemented for SAP Hana data sources. The new feature allows users to set the maximum number of concurrent jobs for a particular data source. If the number of jobs triggered exceeds the defined limit, the remaining jobs are queued until a slot becomes available.
How the Mechanism Works
- Job Slots: Users can define the number of slots available for concurrent job execution for a given data source. For example, if a data source is configured with a maximum of 5 concurrent jobs, only five jobs will run simultaneously.
- Queueing Mechanism: If more than five jobs are triggered, the excess jobs are moved to a queue and marked as "waiting." As soon as a running job completes, a slot is freed, and a job from the queue is picked for execution.
- Slot Monitoring: A background service continuously monitors the availability of job slots, checking every minute to see if a queued job can be started.
Configuration
Setting Concurrent Job Limits
When configuring a new data source or editing an existing one, users have the option to enable job concurrency control. By default, this setting is disabled, but it can be enabled, and users can set the Maximum Slots to define how many jobs can run concurrently.
Steps to Configure Job Concurrency:
- Navigate to the data source configuration page.
- Enable Job Concurrency Control by toggling the setting.
- Enter the number of slots (e.g., 1, 5, 10) that should be available for concurrent job execution.
- Save the configuration.
Example: Job Queueing with One Slot

- Slot Setting: Suppose a user sets the Maximum Slots to 1 for a particular data source.
- Job Submission: The user then triggers three profiling jobs simultaneously.
- Queueing: Only one job will start immediately. The remaining two jobs are queued, and their status is shown as waiting.
- Slot Release: Once the first job completes, a slot is released, and the next job in the queue is started.
Benefits
- Prevents Overload: By limiting the number of concurrent jobs, the feature helps prevent overloading of customer databases, thus maintaining performance and avoiding potential crashes.
- Flexible Configuration: Users can adjust the number of concurrent slots based on their performance needs, giving them control over the workload being processed.
- Scalable: While this feature is currently implemented for SAP Hana data sources, it can be extended to other data sources such as Snowflake with minimal changes.
The queueing mechanism for concurrent connections at the data source level is critical for maintaining system stability and optimal performance when running multiple jobs. By restricting the number of concurrent jobs and queueing the rest, the Control Plane can manage workloads effectively without overwhelming the database.
From v2.12.1 onwards, MapR can be configured with Hive both with and without SASL support.
Key Features of MapR HDFS:
MapR Integration
- Seamless Compatibility: ADOC now seamlessly integrates with MapR, providing robust support for managing and monitoring Hive and HDFS data sources within the MapR ecosystem.
- Comprehensive Data Support: This integration facilitates efficient handling and observability of data, ensuring high data reliability and integrity across your MapR deployments.
SASL and Non-SASL Configurations:
- With SASL: For environments where security is paramount, ADOC supports MapR configured with Hive using SASL. This ensures enhanced security and authentication for data transactions.
- Without SASL: Recognizing diverse operational needs, ADOC also supports MapR configurations without SASL, offering flexibility while maintaining efficient data management and monitoring.
Streamlined Data Operations:
- Unified Monitoring: With this integration, users gain a unified view of their data operations across Hive and HDFS within MapR, simplifying data management tasks.
- Advanced Analytics and Observability: Users can leverage ADOC's advanced analytics capabilities for deeper insights and proactive observability of their data in MapR environments.
MapR HIVE without SASL
Integration Approach
- Shadowing All MapR Hadoop Jars: As a critical step in the integration process, ADOC shadows all MapR Hadoop jars. This ensures that the ADOC platform can effectively interact with the MapR environment without conflicts or compatibility issues.
- Adding MapR-Related Properties: To optimize the integration, specific MapR-related properties have been added. These properties enable ADOC to accurately recognize and interface with the MapR HIVE components, ensuring efficient data management and monitoring.
MapR HIVE with SASL
Integration Approach
Shadowing MapR Hadoop Jars:
Purpose: To ensure compatibility and smooth functioning of Hadoop components within the ADOC framework.
Process: Replace standard Hadoop jars with MapR-specific Hadoop jars in the ADOC environment. This step is crucial for ensuring that all Hadoop functionalities are aligned with MapR's modified Hadoop components.
Adding MapR Security Options:
Configuration: Incorporate MapR security features into the ADOC environment. This includes setting up SASL (Simple Authentication and Security Layer) to secure the communication channels.
Details: Adjust ADOC's configuration files to include necessary security parameters and authentication mechanisms required for SASL integration.
Setting up MapR Hive with SASL in ADOC:
1. Ensure that all MapR and Hive services are operational in your environment.
2. Follow the steps for shadowing MapR Hadoop jars, ensuring that they are correctly placed in the ADOC classpath.
2.1. Create a Custom Hadoop Classpath File: Create a custom Hadoop classpath file, e.g., mapr_classpath.sh, to manage the classpath configuration.
2.2. Add MapR Client JARs to the Classpath File: Edit the custom classpath file to include the MapR client JARs. For example:
```shell
export HADOOP_CLASSPATH=/opt/mapr/lib/*:$HADOOP_CLASSPATH
```
2.3. Update Hadoop Configuration Files: Update the Hadoop configuration files, such as core-site.xml and hdfs-site.xml, to include the MapR-specific properties. This allows Hadoop to interact with the MapR file system and other MapR services.
2.4. Restart Hadoop Services: After making these changes, restart the Hadoop services to apply the new configurations.
3. Update the ADOC configuration files with the required MapR security options, paying special attention to SASL properties.
4. Run sample applications to test the connectivity and functionality of Hive and MapRFS in the ADOC environment. This should include tests for data querying, loading, and other common Hive operations.
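For the core-site.xml update in step 2.3 above, the MapR-specific entries typically point the default file system at MapR-FS. The fragment below is illustrative only; confirm the exact property values against the configuration shipped with your MapR client installation.

```xml
<!-- Illustrative core-site.xml fragment for a MapR client (verify against your install) -->
<property>
  <name>fs.defaultFS</name>
  <value>maprfs:///</value>
</property>
<property>
  <name>fs.maprfs.impl</name>
  <value>com.mapr.fs.MapRFileSystem</value>
</property>
```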
Same Cluster MapR HDFS vs Hive Reconciliation
ADOC v2.12.1 introduced the capability to reconcile data between the MapR Hadoop Distributed File System (HDFS) and Hive within the same cluster. This feature ensures the consistency and integrity of data across these two critical components of the MapR ecosystem.
Reconciliation Process
1. Data Synchronization:
Objective: To maintain data consistency between MapR HDFS and Hive.
Method: ADOC performs periodic scans to compare data stored in MapR HDFS with the metadata in Hive. This process identifies discrepancies between the file system and the Hive tables.
2. Discrepancy Resolution:
Approach: When inconsistencies are detected, ADOC initiates a resolution process. This might involve updating Hive metadata to reflect the current state of the data in MapR HDFS or vice versa.
Automation: The reconciliation process is largely automated, with ADOC handling most discrepancies. However, certain complex cases might require manual intervention.
3. Reporting and Alerts:
Notifications: ADOC provides alerts and detailed reports about discrepancies and the actions taken to resolve them.
Dashboard: A dedicated section in the ADOC dashboard displays the reconciliation status, highlighting any ongoing or resolved issues.
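The scan-and-compare step can be pictured as a set comparison between HDFS paths and the locations Hive metadata points at. The sketch below is a simplification under that assumption; real reconciliation also compares schemas, partitions, and row-level data, and the paths shown are invented for the example.

```python
# Illustrative simplification: reconcile HDFS paths against Hive table locations.
def find_discrepancies(hdfs_paths: set[str], hive_locations: set[str]) -> dict[str, list[str]]:
    """Return paths present on one side but missing on the other."""
    return {
        "in_hdfs_not_hive": sorted(hdfs_paths - hive_locations),
        "in_hive_not_hdfs": sorted(hive_locations - hdfs_paths),
    }

report = find_discrepancies(
    {"/data/sales/2024", "/data/sales/2025"},  # files seen on HDFS
    {"/data/sales/2024", "/data/sales/2023"},  # locations in Hive metadata
)
# report flags /data/sales/2025 (no Hive metadata) and /data/sales/2023 (no HDFS data)
```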
| Reconciliation Considerations | Description |
|---|---|
| Data Volume and Frequency | High data volumes and frequent updates in either MapR HDFS or Hive can impact the reconciliation process. It's important to schedule scans considering the workload to optimize performance. |
| Schema Changes | Any schema changes in Hive tables should be monitored closely as they can lead to inconsistencies with the data stored in MapR HDFS. |
| Permissions and Security | Ensure proper permissions are set up in both MapR HDFS and Hive to allow ADOC to access and modify data and metadata as required. |
| Error Handling | Implement robust error handling mechanisms to address any issues that might arise during the reconciliation process. |
Cross Cluster MapR HDFS/Hive and Apache HDFS/Hive Reconciliation
Data Comparison Across Clusters:
- Objective: To ensure data consistency between MapR HDFS/Hive in one cluster and Apache HDFS/Hive in another.
- Method: ADOC executes a cross-cluster comparison, analyzing data files in MapR HDFS and their corresponding metadata in Hive, and comparing them against the data and metadata in Apache HDFS/Hive.
Metadata Synchronization:
- Process: The system synchronizes metadata across the clusters to reflect the current state of the data accurately. This involves aligning table structures, formats, and other relevant metadata parameters.
Discrepancy Identification and Resolution:
- Detection: ADOC identifies discrepancies between data files and metadata across the different clusters.
- Resolution Strategy: The system employs pre-defined rules to determine the best course of action for resolving discrepancies. This might involve updating metadata, flagging data inconsistencies, or recommending manual review.
Cross-Cluster Communication:
- Connectivity: Establish secure and efficient communication channels between the different clusters to facilitate data and metadata exchange.
- Data Transfer: Ensure that the data transfer across clusters is optimized for performance and security, considering the large volume of data typically involved.
Error Handling and Logging:
- Logging: Maintain comprehensive logs of all reconciliation activities, including any discrepancies found and actions taken.
- Error Management: Implement robust mechanisms to handle errors that might occur during the reconciliation process.
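One common way to compare table contents across clusters without shipping all rows is an order-insensitive fingerprint, e.g. a row count plus combined row hashes. The sketch below illustrates that general technique; it is not ADOC's method, and the fingerprint scheme is an assumption chosen for the example.

```python
# Illustrative technique: order-insensitive table fingerprints for cross-cluster comparison.
# Not ADOC's implementation; the scheme (count + XOR of row hashes) is an example choice.
import hashlib

def table_fingerprint(rows) -> tuple[int, str]:
    """Return (row count, XOR-combined SHA-256 row hashes); insensitive to row order."""
    count, acc = 0, 0
    for row in rows:
        count += 1
        acc ^= int.from_bytes(hashlib.sha256(repr(row).encode()).digest()[:8], "big")
    return count, f"{acc:016x}"

def tables_match(rows_a, rows_b) -> bool:
    """True when both sides have the same rows regardless of retrieval order."""
    return table_fingerprint(rows_a) == table_fingerprint(rows_b)
```

Each cluster computes its fingerprint locally, so only two small values cross the network, which matters given the latency and bandwidth considerations below.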
Operational Considerations
- Network Latency and Bandwidth: Account for network latency and bandwidth constraints when designing the reconciliation process across clusters.
- Cluster Configurations: Ensure compatibility in configurations and versions between the MapR and Apache clusters to facilitate seamless reconciliation.
- Security and Compliance: Adhere to security protocols and compliance requirements, especially when handling sensitive data across clusters.