Integrate HBase-Spark

This user guide demonstrates two approaches to connecting Apache Spark with HBase for data processing and analytics:

  • HBase-Spark Connector: Uses the official ODP HBase Spark connector to create DataFrames and perform Spark operations.
  • Native HBase Client Libraries: Uses HBase’s Java client APIs directly in Spark for fine-grained control over HBase operations.

Both approaches are valid. You can use them together in different parts of your data pipeline, depending on your requirements.

Method 1: HBase-Spark Connector

The HBase-Spark connector provides a native DataFrame API integration, allowing you to work with HBase tables as Spark DataFrames.

Prerequisites

  • Apache Spark 3.x
  • HBase 2.5.x or 2.6.x client
  • HBase-Spark connector archive: hbase-connectors-1.1.0.3.3.6.x-x-bin.tar.gz

From version 3.3.6.2-1 onwards, the connector is installed automatically on every Spark 3 and HBase client.

Steps

  1. Download and extract hbase-connectors-1.1.0.3.3.6.x-x-bin.tar.gz on the edge node that has both Spark and HBase clients installed.
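A minimal sketch, assuming the archive was downloaded to /tmp and is extracted under /opt (adjust the paths for your environment):

Bash

  # Extract the connector archive on the edge node; the download location and
  # target directory below are examples only
  tar -xzf /tmp/hbase-connectors-1.1.0.3.3.6.x-x-bin.tar.gz -C /opt/
  ls /opt/hbase-connectors-1.1.0.3.3.6.x-x/lib/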

With Ambari, install the Spark 3 client on all RegionServer hosts and configure SHC (the HBase-Spark connector) using the steps above if your ODP version does not ship it automatically. Alternatively, install SHC along with the Scala language JAR on each RegionServer so that it is available on the classpath; this requires additional setup. For spark-submit, installing Spark 3 only on the edge node is sufficient.

  2. Append the HBase classpath in Ambari > hbase-env.sh by updating the HBASE_CLASSPATH variable with the required JAR paths.
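For example, assuming the connector JARs were extracted to /opt/hbase-connectors/lib (an illustrative path):

Bash

  # Appended to hbase-env.sh (Ambari > HBase > Configs > Advanced hbase-env);
  # point the path at your extracted connector lib directory
  export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/opt/hbase-connectors/lib/*

Save the configuration and restart the affected HBase services when Ambari prompts for it.
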
  3. Start the Spark Shell with HBase Integration: Run the following commands to authenticate and launch the Spark Shell with the required HBase configurations.
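The principal, keytab, and JAR locations are environment-specific; the following sketch assumes a Kerberized cluster and the connector extracted under /opt/hbase-connectors:

Bash

  # Authenticate (only needed on Kerberos-secured clusters); principal and
  # keytab are examples
  kinit -kt /etc/security/keytabs/myuser.keytab myuser@EXAMPLE.COM

  # Launch the Spark Shell with the connector JARs, the HBase client
  # configuration, and the HBase client libraries on the classpath
  # (all paths below are illustrative)
  spark-shell \
    --jars /opt/hbase-connectors/lib/hbase-spark-1.1.0.3.3.6.x-x.jar,/opt/hbase-connectors/lib/hbase-spark-protocol-shaded-1.1.0.3.3.6.x-x.jar \
    --files /etc/hbase/conf/hbase-site.xml \
    --conf "spark.driver.extraClassPath=/usr/odp/current/hbase-client/lib/*" \
    --conf "spark.executor.extraClassPath=/usr/odp/current/hbase-client/lib/*"
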
  4. Configuration Setup for HBase-Spark Integration: Use the following code to configure HBase and initialize HBaseContext in Spark.
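A minimal sketch of the configuration and HBaseContext initialization (run inside the Spark Shell):

Scala

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.spark.HBaseContext

  // Build an HBase configuration pointing at the cluster's ZooKeeper quorum
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", "<ZK_IPs>")
  hbaseConf.set("hbase.zookeeper.property.clientPort", "<ZK_PORT>")

  // Registering an HBaseContext makes it available to the connector's
  // DataFrame source, which uses it by default
  new HBaseContext(spark.sparkContext, hbaseConf)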

Replace <ZK_IPs> and <ZK_PORT> with your actual ZooKeeper host IPs and port. This setup enables Spark to interact with HBase using the configured context.

  5. Creating DataFrames from HBase Tables: Use the following code snippet to load data from an HBase table into a Spark DataFrame.
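A sketch of the connector's DataFrame read, assuming a Person table with column family cf and columns name and age (all illustrative):

Scala

  // ":key" maps the HBase row key; "cf:qualifier" maps a column in family "cf"
  val personDF = spark.read
    .format("org.apache.hadoop.hbase.spark")
    .option("hbase.columns.mapping",
      "id STRING :key, name STRING cf:name, age INT cf:age")
    .option("hbase.table", "Person")
    .load()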

This example reads data from the Person HBase table and maps the specified columns to the DataFrame. Adjust the column mappings and table name to match your HBase schema.

  6. Working with the DataFrame: Use the following commands to explore and transform the DataFrame created from the HBase table.
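For example, continuing with the personDF DataFrame created in the previous step:

Scala

  // Inspect the schema and a sample of rows
  personDF.printSchema()
  personDF.show(10, truncate = false)

  // Register a temporary view and query the HBase data with Spark SQL
  personDF.createOrReplaceTempView("person_view")
  spark.sql("SELECT name, age FROM person_view WHERE age > 30").show()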

This enables you to validate the data and perform transformations using Spark operations.

Method 2: Native HBase Client Libraries

This approach uses HBase's native Java client APIs within Spark to provide fine-grained control over HBase operations.

  1. Start the Spark Shell: Run the following command to start the Spark Shell with the required HBase and telemetry JARs.
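JAR locations and versions vary by release; the sketch below assumes the HBase client libraries live under /usr/odp/current/hbase-client/lib and uses the OpenTelemetry JARs shipped with the HBase 2.5+/2.6 client:

Bash

  # Paths and versions below are illustrative; adjust them for your install
  HBASE_LIB=/usr/odp/current/hbase-client/lib
  OTEL_LIB=$HBASE_LIB/client-facing-thirdparty

  # The HBase 2.5+/2.6 client needs the OpenTelemetry (telemetry) JARs on the
  # classpath, and Kryo serialization is enabled for better performance
  spark-shell \
    --jars $HBASE_LIB/shaded-clients/hbase-shaded-client-2.6.0.jar,$OTEL_LIB/opentelemetry-api-1.15.0.jar,$OTEL_LIB/opentelemetry-context-1.15.0.jar,$OTEL_LIB/opentelemetry-semconv-1.15.0-alpha.jar \
    --files /etc/hbase/conf/hbase-site.xml \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer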

This command configures the Spark Shell to include necessary telemetry dependencies and optimizes serialization for performance.

  2. Configuration and Connection Setup: This configuration sets up the HBase client and connects to the cluster using the provided ZooKeeper settings.
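A minimal sketch; replace the ZooKeeper placeholders with your cluster's values:

Scala

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

  // Configure the native HBase client against the cluster's ZooKeeper quorum
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", "<ZK_IPs>")
  hbaseConf.set("hbase.zookeeper.property.clientPort", "<ZK_PORT>")

  // Open a connection and an Admin handle for table (DDL) operations
  val connection: Connection = ConnectionFactory.createConnection(hbaseConf)
  val admin = connection.getAdmin
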
  3. Table Operations: This demonstrates how to create a table in HBase, insert data, and read the data using Apache Spark with native HBase client APIs.

Creating a Table: The following code checks whether the table exists; if it does, the existing table is deleted. A new table is then created with a specified column family.

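A sketch using an illustrative table name (spark_person) and column family (cf):

Scala

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, TableDescriptorBuilder}

  val tableName = TableName.valueOf("spark_person")   // example table name

  // Drop the table if it already exists (it must be disabled before deletion)
  if (admin.tableExists(tableName)) {
    admin.disableTable(tableName)
    admin.deleteTable(tableName)
  }

  // Create the table with a single column family "cf"
  val tableDesc = TableDescriptorBuilder.newBuilder(tableName)
    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
    .build()
  admin.createTable(tableDesc)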

Inserting Data: Use the following code to insert sample records into the HBase table.

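Continuing the sketch above; the row keys and values are illustrative:

Scala

  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.util.Bytes

  val table = connection.getTable(tableName)

  // Build two Put operations, each writing columns into family "cf"
  val put1 = new Put(Bytes.toBytes("row1"))
  put1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
  put1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes("30"))

  val put2 = new Put(Bytes.toBytes("row2"))
  put2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Bob"))
  put2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes("25"))

  // Write both rows in a single batch
  table.put(java.util.Arrays.asList(put1, put2))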

Reading Data: Use the following code to scan and display data from the table.

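Continuing the sketch, a full-table scan that prints the columns written above:

Scala

  import org.apache.hadoop.hbase.client.Scan
  import org.apache.hadoop.hbase.util.Bytes

  // Scan the table and print each row's columns
  val scanner = table.getScanner(new Scan())
  try {
    var result = scanner.next()
    while (result != null) {
      val rowKey = Bytes.toString(result.getRow)
      val name   = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
      val age    = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("age")))
      println(s"$rowKey -> name=$name, age=$age")
      result = scanner.next()
    }
  } finally {
    scanner.close()
  }

Close the table and the connection (table.close() and connection.close()) when you are finished.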

Troubleshooting

Common Issues

  • ClassNotFound errors: Ensure that all required JAR files are included in the classpath.
  • Connection timeouts: Verify that the ZooKeeper quorum settings are correct.
  • Permission issues: Confirm that the user has appropriate permissions on the HBase tables.
  • Memory issues: Adjust Spark executor memory settings as needed.

Performance Tips

  • Batch size: Use appropriate batch sizes for bulk operations to optimize throughput.
  • Partitioning: Configure Spark with suitable partitioning to distribute load effectively.
  • Monitoring: Regularly monitor HBase RegionServer metrics for performance insights.
  • Caching: Implement effective caching strategies to reduce repeated data access.

