Kudu

Overview

Apache Kudu is an open-source distributed data storage engine that makes analytics on fast and changing data easy.

Supported Environment

  • Operating System - RHEL 8 / Rocky Linux 8
  • JDK - 8.x
  • Python - 2.7 and 3.11 or higher

Installation through Ambari Mpack

To install the Ambari Kudu mpack, do the following:

  1. Download the Kudu Mpack from the mirror here.
  2. SCP the Mpack to the Ambari Server.
  3. Run the following command to install it.
  4. Log in to the Ambari UI and add the Kudu service.
  5. Select the hosts for Kudu Masters and Tablet Servers. You must select at least one of each; up to 100 tablet servers is the recommended maximum. Select an odd number of master servers, with three as the recommended minimum.
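
The install command from step 3 might look like the following minimal sketch. It assumes the standard `ambari-server install-mpack` subcommand and a restart afterwards so Ambari loads the new service definition. The path in `MPACK_PATH` and the function name are placeholders, not values from this guide. The commands are wrapped in a function so nothing runs until you call it.

```bash
# Sketch only: MPACK_PATH is a placeholder for wherever step 2 copied the file.
MPACK_PATH="${MPACK_PATH:-/tmp/kudu-mpack.tar.gz}"

install_kudu_mpack() {
  # Register the mpack with Ambari, then restart so the new
  # service definition is picked up.
  ambari-server install-mpack --mpack="$1" --verbose &&
    ambari-server restart
}
```

On the Ambari Server host, this would be run as `install_kudu_mpack "$MPACK_PATH"`, typically as root.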

Uninstallation of Kudu Mpack

  1. Log in to the Ambari UI and navigate to the Kudu section.
  2. Stop the Kudu service, then delete the service.
  3. Run the following command to remove the Kudu mpack.
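
The removal command from step 3 might look like the following hedged sketch. It assumes the standard `ambari-server uninstall-mpack` subcommand. The mpack name `kudu-ambari-mpack` is a guess for illustration only, so substitute the name Ambari registered on your server.

```bash
# Sketch only: the registered mpack name is an assumption.
KUDU_MPACK_NAME="${KUDU_MPACK_NAME:-kudu-ambari-mpack}"

uninstall_kudu_mpack() {
  # Remove the mpack by name, then restart Ambari so the
  # service definition is dropped.
  ambari-server uninstall-mpack --mpack-name="$1" &&
    ambari-server restart
}
```

It would be called as `uninstall_kudu_mpack "$KUDU_MPACK_NAME"` after the Kudu service has been deleted in the UI.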

Configuration

You can set any of the stable or stable advanced flags from the Ambari UI. Evolving flags cannot currently be set there; to use them, add them to the Jinja template in kudu-master-env and/or kudu-tablet-env.

Enabling SSL for the Web Server

Each Kudu master and tablet server runs a web server by default, which you can access at http(s)://$IP:$PORT (master default is 8051, tablet default is 8055). To use SSL, place the SSL key and certificate (in .pem format) on each machine, then set webserver_certificate_file to the path of your cert.pem file and webserver_private_key_file to the path of your key.pem file in both kudu-master-env and kudu-tablet-env. Both files must be present on every node whose web server you want to access over SSL.
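
Because both files must exist on every node, a small pre-check can catch a missing file before the service restarts. The sketch below is an illustration, not part of the mpack: it confirms that both PEM files are readable and then prints the two settings in gflag form. The function name and any paths you pass in are hypothetical.

```bash
# Print the two web server flags after confirming that both
# PEM files are readable on this node.
webserver_ssl_flags() {
  cert="$1"
  key="$2"
  for f in "$cert" "$key"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f" >&2
      return 1
    fi
  done
  printf '%s\n' "--webserver_certificate_file=$cert"
  printf '%s\n' "--webserver_private_key_file=$key"
}
```

For example, `webserver_ssl_flags /etc/kudu/ssl/cert.pem /etc/kudu/ssl/key.pem` (placeholder paths) could be run on each master and tablet server host before saving the values in Ambari.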

Superuser ACL Setting

If you are using Impala and Ranger, it is recommended that you add ‘impala’ to the comma-separated list for superuser_acl in kudu-master-env, and use Impala Ranger (internally Hive) service policies to govern access to Kudu tables.

Security

If Kerberos is enabled in the cluster, authentication between master and tablet servers is enabled automatically. Kudu issues its own internal certificates to servers in the cluster, so no manual intervention for SSL certificates is required.

Ranger

Ranger can be used with Kudu. By default, the ‘kudu’ user is added to the superuser_acl list, and any user in that list bypasses Ranger permissions.

If you want to use Ranger, make sure to check “enable_ranger”.

When using Ranger with SSL, there is an additional step required. You need to import a certificate. One such example is the following:
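
A minimal sketch of such an import, assuming the JDK `keytool` utility and a Java truststore, is shown here. The certificate file, truststore path, password, and alias are all hypothetical parameters, and the function name is illustrative. Use the certificate and truststore that your Ranger SSL setup actually requires.

```bash
import_ranger_cert() {
  # $1 = certificate file, $2 = truststore path,
  # $3 = truststore password, $4 = alias for the new entry
  keytool -importcert -noprompt \
    -alias "$4" \
    -file "$1" \
    -keystore "$2" \
    -storepass "$3"
}
```

A call might look like `import_ranger_cert /tmp/ranger-admin.crt /path/to/truststore.jks "$TRUSTSTORE_PASSWORD" ranger-admin`, where every argument is a placeholder.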


This needs to be done on each node with a Kudu master or tablet server.

If you are using Ranger with SSL, make sure to set your keystore and truststore passwords in the Ambari UI, and set keystore.credential.file and truststore.credential.file as needed.

When using Ranger, you must keep the Impala user as a superuser and apply your Ranger policies to Impala instead of Kudu for any tables that should be accessible through Impala.

Example Ranger policy creation:

From the Kudu Documentation:

In addition to granting privileges to a user by username, privileges can also be granted to table owners using the special {OWNER} username. These policies are evaluated only when a user tries to act on a table they own. For example, a policy can be defined for the {OWNER} user and db=*->table=* resource, and it will automatically be applied when any table is accessed by its owner. This way, administrators don’t need to choose between creating policies one by one for each table and granting access to a wide range of users.

For more information about securing a Kudu cluster, see the Apache Kudu Security page.

Audit Support

At this time, auditing to HDFS is not supported. However, Solr auditing is enabled by default.

Any user that is given superuser privileges in Kudu (by default, the Kudu and Impala users) does not show up in the audits since any superuser bypasses Ranger.

The following changes are required in AMBARI_INFRA_SOLR servers to enable auditing.


Encryption at Rest Support

Data can be encrypted at rest using Ranger KMS. This requires enabling the enable_kms option in the Kudu mpack and creating an encryption key in Ranger. The key must be created before installing the Kudu mpack. Ensure you set ranger_kms_key_name to the name you gave your key in Ranger.

Encryption at rest is only supported during a fresh installation. So, ensure you have the encryption key ready before starting.

Example KMS key:

The following additional properties are required for Kudu in kms-site under ranger-kms:

Add or update the Ranger KMS policy to allow kudu users to access the generated key.

Encryption in motion is already supported through the rpc_encryption option that is enabled by default with Kerberos.

Encryption Limitations

Enabling encryption at rest is only officially supported on newly created clusters. If you enable it on a cluster that already has data, the Kudu servers fail to start. Disabling encryption on an existing cluster is also unsupported.

User Guide

You can access the Kudu tables through a few different means:

  1. Kudu CLI tool: This tool can be accessed at /usr/odp/$(odp-select --version)/kudu/bin/kudu or /usr/bin/kudu. You can view the tables in your cluster by running kudu table list $COMMA_SEPARATED_LIST_OF_MASTERS.
  2. Impala: The information about using Kudu with Impala can be found in the Kudu documentation. When using the Impala Mpack, ensure that the checkbox labeled impala_disable_kudu is unchecked so that Impala enables Kudu integration.
  3. Kudu API: Documentation for using the Kudu C++, Java, and Python client APIs can be found here. The Python libraries require a little more setup: instructions for building them can be found here, but you must build Kudu itself before you can build the library.
  4. Hive: There is some initial support for accessing Kudu tables through Hive, with some caveats. You cannot delete from a Kudu table through Hive, for instance. For more information on how to set it up, you can see the Apache docs here.
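
For the CLI route in item 1, the comma-separated master list can be built from host names. The sketch below is a convenience wrapper, not part of Kudu. It assumes the Kudu master default RPC port of 7051 when a host is given without a port, and the function names are illustrative.

```bash
# Build a comma-separated master list, appending the default
# RPC port 7051 to any host given without a port.
master_addresses() {
  list=""
  for host in "$@"; do
    case "$host" in
      *:*) addr="$host" ;;
      *)   addr="$host:7051" ;;
    esac
    list="${list:+$list,}$addr"
  done
  echo "$list"
}

# List all tables known to the given masters.
list_kudu_tables() {
  kudu table list "$(master_addresses "$@")"
}
```

With hypothetical hosts, `list_kudu_tables master1 master2 master3` runs `kudu table list master1:7051,master2:7051,master3:7051`.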

Additional Information

A new Impala mpack is required to integrate with Kudu as it contains flags that are required to make the Impala-Kudu integration work.

You should ideally have at least three master nodes so that Kudu can tolerate the failure of one master server.
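
The arithmetic behind this recommendation can be sketched as follows. Kudu masters replicate their metadata with Raft consensus, which needs a majority of masters to be available, so n masters can survive floor((n - 1) / 2) failures. The function names are illustrative only.

```bash
# Number of master failures a majority quorum can survive:
# floor((n - 1) / 2).
tolerated_master_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# True only for the recommended layout: an odd count of at least 3.
recommended_master_count() {
  [ "$1" -ge 3 ] && [ $(( $1 % 2 )) -eq 1 ]
}
```

Three masters tolerate one failure and five tolerate two. A fourth master adds no extra tolerance over three, which is why an odd count is preferred.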

You do not need to enter the master and tablet addresses into either kudu-master-env or kudu-tablet-env, because that information is populated automatically. Those fields exist only in case you wish to override Ambari’s automatic configuration for master servers.

Known Issues

Kudu 1.17 supports non-unique primary keys, and automatically adds an auto-incrementing ID column (auto_incrementing_id) to any table that uses a non-unique key. The version of Impala that ships with ODP 3.2.3.3-2 (v4.1.2) does not support creating tables with non-unique keys in Kudu. This feature is available in Impala v4.3.0 or later. This does not affect anything written using the Kudu client libraries.

[IMPALA-11809] Support non unique primary key for Kudu - ASF JIRA

Kudu supports specifying custom hash partitions at the range partition level, but Impala 4.1.2 does not support the syntax for it. Support for this feature is available in Impala 4.2.0 or later.

[IMPALA-11430] Support kudu custom hash partitions at the range level - ASF JIRA
