Documentation for Spark Notebook Examples

Estimating Pi Using PySpark

This program uses the Monte Carlo method to estimate the value of Pi. It demonstrates how to use PySpark to parallelize computations.

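The original code block was not preserved; the sketch below shows the standard Monte Carlo approach in PySpark. The application name and sample count are illustrative choices.

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EstimatePi").getOrCreate()
num_samples = 1_000_000  # illustrative; increase for a tighter estimate

def inside(_):
    # Draw a random point in the unit square and test whether it lies inside the unit circle.
    x, y = random.random(), random.random()
    return x * x + y * y <= 1

count = spark.sparkContext.parallelize(range(num_samples)).filter(inside).count()
print("Pi is roughly", 4.0 * count / num_samples)
spark.stop()
```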

Creating and Displaying a DataFrame

This example showcases creating a Spark DataFrame using a list of Row objects and displaying it.

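A minimal sketch of this pattern; the column names and sample values are assumptions.

```python
from datetime import date
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a DataFrame from a list of Row objects; schema is inferred from the Rows.
df = spark.createDataFrame([
    Row(id=1, name="Alice", joined=date(2023, 1, 15)),
    Row(id=2, name="Bob",   joined=date(2023, 3, 9)),
    Row(id=3, name="Cara",  joined=date(2023, 6, 2)),
])

df.show()
df.printSchema()
```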

Complex Operations with PySpark

This example demonstrates joining DataFrames, applying User-Defined Functions (UDFs), and executing SQL queries.

Data Preparation and Joining:

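A sketch of preparing two small DataFrames and joining them; the table contents and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ComplexOperations").getOrCreate()

# Two small illustrative DataFrames sharing a dept_id column.
employees = spark.createDataFrame(
    [(1, "Alice", 10), (2, "Bob", 20), (3, "Cara", 10)],
    ["emp_id", "name", "dept_id"],
)
departments = spark.createDataFrame(
    [(10, "Engineering"), (20, "Finance")],
    ["dept_id", "dept_name"],
)

# Inner join on the shared dept_id column.
joined = employees.join(departments, on="dept_id", how="inner")
joined.show()
```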

Using UDFs:

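A sketch of registering and applying a UDF, building on the joined DataFrame from the previous step; the transformation itself is illustrative.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a simple UDF that upper-cases a name; any Python function can be wrapped this way.
@udf(returnType=StringType())
def to_upper(s):
    return s.upper() if s is not None else None

# Apply the UDF to add a derived column.
joined.withColumn("name_upper", to_upper("name")).show()
```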

SQL Queries:

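A sketch of running SQL against the joined DataFrame; the view name and query are assumptions.

```python
# Expose the joined DataFrame as a temporary view so it can be queried with SQL.
joined.createOrReplaceTempView("employees_joined")

spark.sql("""
    SELECT dept_name, COUNT(*) AS headcount
    FROM employees_joined
    GROUP BY dept_name
""").show()
```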

Performing Word Count

A simple word count example demonstrating Spark RDD transformations.

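A minimal sketch using an in-memory corpus; in practice the input would come from sc.textFile(...) against HDFS or local files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Small in-memory corpus for illustration.
lines = sc.parallelize([
    "spark makes distributed computing simple",
    "spark rdd transformations are lazy",
])

counts = (
    lines.flatMap(lambda line: line.split())   # split each line into words
         .map(lambda word: (word, 1))          # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)
print(counts.collect())
```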

Using ODP’s Spark version

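The original snippet was not preserved. One common way to point a notebook at a specific Spark build is shown below; it assumes the findspark package is available, and the install path is hypothetical, so replace it with the Spark home shipped by your ODP distribution.

```python
import findspark

# Hypothetical ODP install path; adjust to your distribution's Spark home.
findspark.init("/usr/odp/current/spark3-client")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ODPSparkVersionCheck").getOrCreate()
print(spark.version)  # confirm the notebook picked up the ODP-packaged Spark build
```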

Submit the Job to a Cluster (Optional)

If you’re using a cluster manager like YARN with JupyterHub, you need to adjust the configurations accordingly:

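A sketch of building a YARN-backed session from a notebook; the memory and core settings are assumptions to tune for your cluster, and client deploy mode is used because the notebook acts as the driver.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("NotebookOnYarn")
    .master("yarn")
    .config("spark.submit.deployMode", "client")  # notebook process drives the job
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "2")
    .config("spark.driver.memory", "1g")
    .getOrCreate()
)
```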

Show Hive Tables in Spark

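A minimal sketch following the steps described below; the database name is an assumption.

```python
from pyspark.sql import SparkSession

# Create a SparkSession with Hive support so the Hive metastore is visible from Spark.
spark = (
    SparkSession.builder
    .appName("ShowHiveTables")
    .enableHiveSupport()
    .getOrCreate()
)

# Optional: switch to a specific database ("default" is illustrative).
spark.sql("USE default")

# SHOW TABLES returns a DataFrame listing the tables in the current database.
tables = spark.sql("SHOW TABLES")
tables.show()
```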

How It Works:

  1. Environment Setup: Configures the necessary Spark environment variables.
  2. SparkSession: Initializes a SparkSession with Hive support (if needed).
  3. Switch Database (Optional): You can switch to a specific database if you want to list tables from it.
  4. Show Tables: Executes the SQL command SHOW TABLES, which returns a DataFrame of tables.
  5. Display Output: Displays the list of tables in the output.

Expected Output: a DataFrame listing each table with its database/namespace, table name, and whether it is temporary.

Note

  • Dependencies: Ensure Spark is correctly set up and configured for local or yarn mode depending on the example.
  • UDFs: Register user-defined functions (UDFs) as required for custom transformations.
  • SQL Queries: Use createOrReplaceTempView to run SQL queries on DataFrames.
  • Data Source: Replace hardcoded data with external sources like HDFS, databases, or files for real-world applications.
  • Resource Configurations: Tune spark.executor.memory and spark.driver.memory based on the cluster size and workload requirements.