Working with Ozone File System

ODP does not currently support Ozone2 as the default file system. However, ODP Ozone2 is configured to work independently of HDFS.

Prerequisites

To enable ofs support in applications, configure them to use the necessary JARs and ozone2-site.xml.

  • Add the ozone-filesystem-hadoop3.jar to the application classpath.
  • Add the required ofs configs to core-site.xml.
  • Add the following configs from ozone2-site.xml to hdfs-site.xml in Ambari:

      Property                          Value
      ozone.om.service.ids              omservice
      ozone.om.address.omservice.om0    <om-node1-host>:9862
      ozone.om.address.omservice.om1    <om-node2-host>:9862
      ozone.om.address.omservice.om2    <om-node3-host>:9862
      ozone.om.nodes.omservice          om0,om1,om2
      ozone.om.kerberos.keytab.file     /etc/security/keytabs/ozone.om.service.keytab
      ozone.om.kerberos.principal       om/_HOST@ADSRE.COM

  • Restart the Hadoop services.
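
The first two prerequisites can be sketched as follows. This is a minimal sketch, assuming the standard Apache Ozone property names for ofs (fs.ofs.impl and fs.AbstractFileSystem.ofs.impl); the JAR directory and the scratch file path are assumptions, so adjust them for your installation. The script only writes a fragment to merge into core-site.xml through Ambari; it does not modify any cluster configuration.

```shell
#!/usr/bin/env bash
# Assumption: JAR location; adjust to where your ODP release installs it.
OZONE_FS_JAR="/usr/odp/current/ozone2-client/share/ozone/lib/ozone-filesystem-hadoop3.jar"

# Prepend the Ozone file system JAR to the Hadoop client classpath.
export HADOOP_CLASSPATH="${OZONE_FS_JAR}${HADOOP_CLASSPATH:+:${HADOOP_CLASSPATH}}"

# Assumption: scratch path for the fragment.
FRAGMENT="${TMPDIR:-/tmp}/core-site-ofs-fragment.xml"

# Properties that map the ofs:// scheme to the Ozone file system classes.
cat > "$FRAGMENT" <<'EOF'
<property>
  <name>fs.ofs.impl</name>
  <value>org.apache.hadoop.fs.ozone.RootedOzoneFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.ofs.impl</name>
  <value>org.apache.hadoop.fs.ozone.RootedOzFs</value>
</property>
EOF
echo "Wrote $FRAGMENT"
```

fs.defaultFS is deliberately left untouched, since Ozone2 is not supported as the default file system.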

HDFS with OFS

hdfs dfs operations can be run against Ozone2 storage by passing an ofs:// path.

Here are some examples:

  • List files
  • Create a directory
  • Upload a file
  • Read a file
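
The four operations above might look like the following. This is a sketch, not the exact commands for any particular cluster: the volume vol1, the bucket bucket1, and the local file are hypothetical, while omservice is the OM service ID from the prerequisites. The run helper is an illustrative wrapper that prints each command by default, so the script can be reviewed safely before it is executed.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a cluster node to actually run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assumption: hypothetical volume and bucket under the "omservice" service ID.
BASE="ofs://omservice/vol1/bucket1"

run hdfs dfs -ls "$BASE/"                         # list files
run hdfs dfs -mkdir -p "$BASE/dir1"               # create a directory
run hdfs dfs -put /tmp/sample.txt "$BASE/dir1/"   # upload a local file
run hdfs dfs -cat "$BASE/dir1/sample.txt"         # read the file
```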

INFO logs of the form retry.RetryInvocationHandler: com.google.protobuf.ServiceException can be ignored. The client contacts each OM host in turn to identify the leader OM.

YARN with Ozone2

YARN can run jobs that read data from or write data to the Ozone file system.

  • Add ozone-filesystem-hadoop3-1.4.0*.jar to mapreduce.application.classpath in mapred-site.xml.
  • If Ranger authorization is enabled, grant the permissions on ofs buckets, HDFS paths, and YARN queues that the job requires.
  • On a secure cluster, authenticate as the submitting user with Kerberos.
  • Submit the job.

Here is a sample job that runs wordcount on a file in ofs and stores the result in ofs.
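
A wordcount submission of this kind could be assembled as shown here. This is a sketch under stated assumptions: the examples JAR path, the volume vol1, the bucket bucket1, and the input and output paths are all hypothetical. The script prints the command for review instead of submitting it.

```shell
#!/usr/bin/env bash
# Assumptions: the examples JAR path, volume, and bucket are hypothetical;
# substitute the values for your cluster.
EXAMPLES_JAR="/usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar"
BASE="ofs://omservice/vol1/bucket1"

# wordcount reads the input file and writes its result to a new output directory.
cmd=(yarn jar "$EXAMPLES_JAR" wordcount "$BASE/input/words.txt" "$BASE/output/wordcount")

# Print the command for review; on a cluster node, run it with: "${cmd[@]}"
printf '%s\n' "${cmd[*]}"
```

The output directory must not exist before the job starts, as with any MapReduce job.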

If the job fails with the following error:

INFO mapreduce.Job: Task Id : task-id, Status : FAILED
Error: java.io.IOException: Cannot resolve OM host omservice in the URI

configure the MapReduce job to use ozone-site.xml. Alternatively, pass the configs at runtime.
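
Passing the OM settings at runtime might look like the sketch below. The property names are the ones listed in the prerequisites; the OM host names, the examples JAR path, and the volume and bucket are hypothetical. It relies on wordcount accepting generic -D options ahead of its input and output arguments, and it prints the command instead of submitting it.

```shell
#!/usr/bin/env bash
# Assumptions: OM host names, the examples JAR path, and the volume/bucket
# are hypothetical. Property names come from ozone2-site.xml.
OM_HOSTS=(om1.example.com om2.example.com om3.example.com)
EXAMPLES_JAR="/usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar"
BASE="ofs://omservice/vol1/bucket1"

# Build -D options that let tasks resolve the "omservice" OM service ID.
opts=(-Dozone.om.service.ids=omservice -Dozone.om.nodes.omservice=om0,om1,om2)
for i in "${!OM_HOSTS[@]}"; do
  opts+=("-Dozone.om.address.omservice.om${i}=${OM_HOSTS[$i]}:9862")
done

# Generic options go after the program name and before input/output paths.
cmd=(yarn jar "$EXAMPLES_JAR" wordcount "${opts[@]}" \
     "$BASE/input/words.txt" "$BASE/output/wordcount")

# Print the command for review; on a cluster node, run it with: "${cmd[@]}"
printf '%s\n' "${cmd[*]}"
```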

HIVE with Ozone

Although Hive installation and operations use HDFS as the default file system, Ozone can be configured as a parallel file system for Hive operations.

Configure Hive to work with Ozone:

  • Navigate to Ambari UI > Hive > Configs > Advanced Hive-env and add the required Ozone entries.
  • Restart Hive and Tez.
  • If Ranger authorization is enabled, grant the necessary permissions to OFS (Ozone File System) buckets, HDFS paths, and Hive URL to allow the required operations as per the job requirements.
  • Authenticate users with Kerberos credentials when operating in a secure cluster.

If running queries as the end user is enabled in Hive and queries fail with the error below, see https://issues.apache.org/jira/browse/HDDS-664:

org.apache.hadoop.security.authorize.AuthorizationException: User: hive is not allowed to impersonate ...

In Ambari UI > Ozone > Configurations > Custom Core-site, add the following configs and restart the services:

      Property                        Value
      hadoop.proxyuser.hive.groups    *
      hadoop.proxyuser.hive.hosts     *
      hadoop.proxyuser.hive.users     *

Store tables in OFS

To create a table in OFS, add LOCATION '<OFS_URI>' to the CREATE TABLE command. The Hive table then resides at the specified location in Ozone, and all subsequent data changes are applied to the table at that OFS_URI.

Here are sample Hive operations that access OFS:

  • Connect to Beeline
  • Create a new table in ofs
  • Validate the new table
  • Add values to the table
  • Validate the newly added values

SPARK with Ozone2

Although Ozone2 can work independently, Ambari does not currently support installing Spark without HDFS.

Apache Spark can read and process data stored in Apache Ozone2. To enable access, configure Spark:

  • Configure spark-shell to use /usr/odp/current/ozone2-client/share/ozone/lib/ozone-filesystem-hadoop3-client-2.1.0.3.3.6.2-104.jar.

Accessing Apache Ozone data in Apache Spark3

  • Create sample data to be read by Spark3
  • Upload the employee.csv file to Ozone2
  • If Ranger authorization is enabled, grant the spark user access to the bucket and file under the Ozone policies.
  • Allow the spark user to submit YARN applications.
  • Launch spark-shell
  • Read the CSV file in Ozone as a Spark DataFrame
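
The Spark3 steps above can be sketched as follows. The sample rows, the volume vol1, and the bucket bucket1 are hypothetical; the JAR path is the one given earlier in this section. The script creates the CSV locally, prints the upload and spark-shell commands for a cluster node, and writes a short Scala snippet that can be loaded inside spark-shell (for example with :load).

```shell
#!/usr/bin/env bash
# Assumptions: sample rows, volume "vol1", and bucket "bucket1" are hypothetical.
OZONE_FS_JAR="/usr/odp/current/ozone2-client/share/ozone/lib/ozone-filesystem-hadoop3-client-2.1.0.3.3.6.2-104.jar"
CSV="${TMPDIR:-/tmp}/employee.csv"
TARGET="ofs://omservice/vol1/bucket1/employee.csv"

# Create sample data locally.
cat > "$CSV" <<'EOF'
id,name,department
1,alice,engineering
2,bob,finance
EOF

# Commands to run on a cluster node: upload the file, then launch spark-shell.
echo "hdfs dfs -put $CSV $TARGET"
echo "spark-shell --master yarn --jars $OZONE_FS_JAR"

# Scala to run inside spark-shell: read the CSV from Ozone as a DataFrame.
SCALA="${TMPDIR:-/tmp}/read_employee.scala"
cat > "$SCALA" <<EOF
val df = spark.read.option("header", "true").csv("$TARGET")
df.show()
EOF
echo "Wrote $SCALA"
```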

Custom PySpark Job

Run a Spark job that uses ofs with spark-submit. For a secure cluster, add --keytab <keytab> --principal <principal> to the command.

Here is a sample custom job that reads Ozone data with Apache Spark and writes the output to Ozone.

  • Write a custom PySpark application that uses ofs to read data and write output
  • Upload a sample input file to ofs
  • If Ranger authorization is enabled, grant the spark user access to the bucket and key.
  • Run the PySpark application on the secure cluster
  • Validate the output in ofs
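
A custom PySpark job of this shape could be sketched as below. Everything specific here is an assumption made for illustration: the word-count logic, the volume and bucket, the input and output paths, and the keytab and principal. The script writes the application file and prints the upload, submit, and validation commands instead of running them.

```shell
#!/usr/bin/env bash
# Assumptions: app logic, ofs paths, keytab, and principal are hypothetical.
OZONE_FS_JAR="/usr/odp/current/ozone2-client/share/ozone/lib/ozone-filesystem-hadoop3-client-2.1.0.3.3.6.2-104.jar"
BASE="ofs://omservice/vol1/bucket1"
APP="${TMPDIR:-/tmp}/ofs_wordcount.py"

# PySpark application: read text from Ozone, count words, write CSV to Ozone.
cat > "$APP" <<'EOF'
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("ofs-wordcount").getOrCreate()

lines = spark.read.text("ofs://omservice/vol1/bucket1/input/words.txt")
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()
counts.write.mode("overwrite").csv("ofs://omservice/vol1/bucket1/output/wordcount")

spark.stop()
EOF

# Commands to run on a cluster node, printed here for review.
echo "hdfs dfs -put /tmp/words.txt $BASE/input/words.txt"
echo "spark-submit --master yarn --jars $OZONE_FS_JAR" \
     "--keytab /etc/security/keytabs/spark.headless.keytab" \
     "--principal spark@EXAMPLE.COM $APP"
echo "hdfs dfs -ls $BASE/output/wordcount"
```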