Working with Ozone File System

ODP currently does not support Ozone as the default file system. However, ODP Ozone is configured to work independently of HDFS.

Prerequisites

To enable ofs support in applications, configure them to use the necessary JARs and ozone-site.xml.

  • Install the Ozone client from the Ambari UI on the node where you want to enable ofs support with the Hadoop service.
  • Add the ozone-filesystem-hadoop3.jar to the application classpath:
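An illustrative way to do this, assuming the Ozone client JARs live under /usr/odp/current/ozone-client (adjust the path and version to your installation):

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/odp/current/ozone-client/share/ozone/lib/ozone-filesystem-hadoop3-*.jar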
  • Add the following configs to core-site.xml:
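The exact entries depend on your release; the properties documented by Apache Ozone for the rooted ofs filesystem look like this:

    <property>
      <name>fs.ofs.impl</name>
      <value>org.apache.hadoop.fs.ozone.RootedOzoneFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.ofs.impl</name>
      <value>org.apache.hadoop.fs.ozone.RootedOzFs</value>
    </property>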
  • Add the following configs from ozone-site.xml to hdfs-site.xml:
    Property                          Value
    ozone.om.service.ids              omservice
    ozone.om.address.omservice.om0    <om-node1-host>:9862
    ozone.om.address.omservice.om1    <om-node2-host>:9862
    ozone.om.address.omservice.om2    <om-node3-host>:9862
    ozone.om.nodes.omservice          om0,om1,om2
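In hdfs-site.xml these entries take the usual property form, for example:

    <property>
      <name>ozone.om.service.ids</name>
      <value>omservice</value>
    </property>
    <property>
      <name>ozone.om.address.omservice.om0</name>
      <value><om-node1-host>:9862</value>
    </property>

and so on for om1, om2, and ozone.om.nodes.omservice.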
  • Include the ozone-filesystem-hadoop3-1.4.0*.jar file in the mapreduce.application.classpath property in the mapred-site.xml file.
  • Restart Hadoop services.

HDFS with OFS

To run hdfs dfs operations against Ozone storage, use the ofs scheme:

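A general form, assuming the OM service id omservice configured above:

    hdfs dfs <command> ofs://omservice/<volume>/<bucket>/<path>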

Examples:

  • List files:
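For example, assuming a volume vol1 and a bucket bucket1:

    hdfs dfs -ls ofs://omservice/vol1/bucket1/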
  • Create directory:
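For example:

    hdfs dfs -mkdir ofs://omservice/vol1/bucket1/dir1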
  • Upload file:
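For example, with a local file localfile.txt:

    hdfs dfs -put ./localfile.txt ofs://omservice/vol1/bucket1/dir1/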
  • Read a file:
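For example:

    hdfs dfs -cat ofs://omservice/vol1/bucket1/dir1/localfile.txt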

Note: INFO logs from retry.RetryInvocationHandler reporting com.google.protobuf.ServiceException may be ignored; the client contacts each OM host one by one to identify the leader OM.


YARN with Ozone

YARN enables the execution of jobs that interact with data stored in or written to the Ozone file system.

  • If Ranger authorization is enabled, grant the necessary permissions to OFS (Ozone File System) buckets, HDFS paths, and YARN queues to allow the required operations as per the job requirements.
  • Authenticate users with Kerberos credentials when operating in a secure cluster.
  • Submit the job.

Below is an example job that performs a word count on data from a file in OFS and writes the output file containing the word count result back to OFS.

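A representative invocation, assuming the stock MapReduce examples JAR (its location is assumed to mirror the /usr/odp/current layout used elsewhere in this document) and the vol1/bucket1 paths used above:

    yarn jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount \
      ofs://omservice/vol1/bucket1/input.txt \
      ofs://omservice/vol1/bucket1/wordcount-output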

Job failing with: INFO mapreduce.Job: Task Id : <task-id>, Status : FAILED and Error: java.io.IOException: Cannot resolve OM host omservice in the URI

Configure the MapReduce job to use ozone-site.xml. Alternatively, you can pass the configurations at runtime:

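For example, the OM settings from the table above can be passed with -D options to the same word count job:

    yarn jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount \
      -Dozone.om.service.ids=omservice \
      -Dozone.om.nodes.omservice=om0,om1,om2 \
      -Dozone.om.address.omservice.om0=<om-node1-host>:9862 \
      -Dozone.om.address.omservice.om1=<om-node2-host>:9862 \
      -Dozone.om.address.omservice.om2=<om-node3-host>:9862 \
      ofs://omservice/vol1/bucket1/input.txt \
      ofs://omservice/vol1/bucket1/wordcount-output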

Hive with Ozone

Although Hive installation and operations use HDFS as the default file system, Ozone can be configured to be a parallel file system for Hive operations.

Configure Hive to work with Ozone:

  • Navigate to the Ambari UI > Hive > Configs > Advanced Hive-env and add the following:
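The exact line depends on your release; a typical addition makes the Ozone filesystem JAR visible to Hive, for example (path and version are illustrative):

    export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH}:/usr/odp/current/ozone-client/share/ozone/lib/ozone-filesystem-hadoop3-*.jar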
  • Restart Hive.
  • If Ranger authorization is enabled, grant the necessary permissions to OFS (Ozone File System) buckets, HDFS paths, and Hive URL to allow the required operations as per the job requirements.
  • Authenticate users with Kerberos credentials when operating in a secure cluster.

Store tables in OFS:

To create tables in OFS, add LOCATION '<OFS_URI>' to the CREATE TABLE command. This makes the Hive table reside at the specified location in Ozone, and all subsequent data changes to the table take effect at the given OFS_URI.

Here are sample Hive operations with Hive accessing OFS:

  • Connect to Beeline
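For example (host, port, and principal are placeholders for your HiveServer2 values):

    beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default;principal=hive/_HOST@<REALM>"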
  • Create a new table in OFS:
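A minimal sketch, assuming a table named employee stored under the vol1/bucket1 path (depending on your Hive version, a non-warehouse location may require CREATE EXTERNAL TABLE):

    CREATE TABLE employee (id INT, name STRING)
    LOCATION 'ofs://omservice/vol1/bucket1/employee';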
  • Validate the new table:
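For example, DESCRIBE FORMATTED should report the ofs:// location:

    SHOW TABLES;
    DESCRIBE FORMATTED employee;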
  • Add values to the table:
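For example:

    INSERT INTO employee VALUES (1, 'Alice'), (2, 'Bob');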
  • Validate the newly added values:
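For example:

    SELECT * FROM employee;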

Spark with Ozone

While Ozone is capable of operating independently, the current version of Ambari does not facilitate Spark installation without HDFS integration.

Apache Spark can access data from Apache Ozone and perform tasks.

To access Apache Ozone, configure Spark:

  • Configure Spark shell to use /usr/odp/current/ozone-client/share/ozone/lib/ozone-filesystem-hadoop3-client-1.4.0.3.2.3.3-2.jar.
  • If Ranger authorization is enabled, grant the necessary permissions to OFS (Ozone File System) buckets to allow the required operations as per the job requirements.
  • Authenticate users with Kerberos credentials when operating in a secure cluster.

Accessing Apache Ozone Data in Apache Spark:

  • Create sample data to be read by Spark:
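For example, a small employee.csv created locally (contents are illustrative):

    cat > employee.csv <<EOF
    id,name,department
    1,Alice,Engineering
    2,Bob,Finance
    3,Carol,Marketing
    EOF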
  • Upload the employee.csv file to Ozone:
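For example, assuming the vol1/bucket1 paths used earlier (create the volume and bucket first if they do not exist):

    ozone sh volume create /vol1
    ozone sh bucket create /vol1/bucket1
    hdfs dfs -put employee.csv ofs://omservice/vol1/bucket1/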
  • Provide the necessary permissions under Ozone policies for the Spark user to access the respective bucket and file, if Ranger authorization is enabled.
  • Allow the Spark user to submit YARN applications.
  • Launch spark-shell:
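For example, using the client JAR mentioned above:

    spark-shell --jars /usr/odp/current/ozone-client/share/ozone/lib/ozone-filesystem-hadoop3-client-1.4.0.3.2.3.3-2.jar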
  • Access the .csv file content in Ozone as a Spark DataFrame:
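For example, entered at the scala> prompt of the spark-shell started above (the bucket path is illustrative):

    val df = spark.read.option("header", "true").csv("ofs://omservice/vol1/bucket1/employee.csv")
    df.show()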

Custom PySpark Job

To run a Spark job using OFS, run the following command:

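A representative command, assuming a PySpark application file named ozone_wordcount.py (see the sample below) and the Ozone client JAR used in the Spark configuration step:

    spark-submit --master yarn \
      --jars /usr/odp/current/ozone-client/share/ozone/lib/ozone-filesystem-hadoop3-client-1.4.0.3.2.3.3-2.jar \
      ozone_wordcount.py \
      ofs://omservice/vol1/bucket1/input.txt \
      ofs://omservice/vol1/bucket1/wordcount-output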

For a secure cluster, add --keytab <keytab> --principal <principal> to the above command.

Here is a sample custom job that accesses Ozone data with Apache Spark and writes the output back to Ozone.

  • Custom PySpark application using ofs to access data and write output:
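A minimal sketch of such an application; the file name ozone_wordcount.py and the argument handling are assumptions, not part of the original document:

    # ozone_wordcount.py - read a text file from ofs, count words, write the result back to ofs
    import sys
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        input_path, output_path = sys.argv[1], sys.argv[2]  # e.g. ofs://omservice/vol1/bucket1/...
        spark = SparkSession.builder.appName("OzoneWordCount").getOrCreate()
        lines = spark.read.text(input_path).rdd.map(lambda row: row[0])
        counts = (lines.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
        counts.toDF(["word", "count"]).write.csv(output_path)
        spark.stop()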
  • Upload the sample input file to ofs:
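For example, assuming a local input.txt:

    hdfs dfs -put input.txt ofs://omservice/vol1/bucket1/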
  • Provide the necessary permissions for the Spark user to access the respective bucket and key, if Ranger authorization is enabled.
  • To run the PySpark app in a secure cluster:
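For example, combining the spark-submit command above with Kerberos credentials (the keytab path and principal are placeholders):

    spark-submit --master yarn \
      --keytab /etc/security/keytabs/spark.headless.keytab \
      --principal spark-user@<REALM> \
      --jars /usr/odp/current/ozone-client/share/ozone/lib/ozone-filesystem-hadoop3-client-1.4.0.3.2.3.3-2.jar \
      ozone_wordcount.py \
      ofs://omservice/vol1/bucket1/input.txt \
      ofs://omservice/vol1/bucket1/wordcount-output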
  • To validate the output in ofs:
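For example:

    hdfs dfs -ls ofs://omservice/vol1/bucket1/wordcount-output/
    hdfs dfs -cat ofs://omservice/vol1/bucket1/wordcount-output/part-*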