Custom Packages Installation in JupyterHub Notebook

This document describes three methods for installing Python packages in a JupyterHub environment configured with YarnSpawner and HDFSCM: running pip from a notebook cell after checking the Python path, invoking pip through the subprocess module, and using a shared environment stored on HDFS.

Prerequisites

  • JupyterHub is configured with YarnSpawner and HDFSCM.
  • HDFS is used to store environment files, such as the jupyter-environment.tar.gz file.
  • A Python 3.x environment is set up in the YARN containers.
  • Required packages and their versions are known and available for installation.

Methods to Install Python Packages

Install Packages Using !pip install Based on System Path

From a JupyterHub notebook cell, you can check which Python environment the kernel is using and then install packages with pip. This method is straightforward and works well when a package should be installed only if it is not already present, or when the package to install depends on the environment.

Steps:

  1. Check the system path: Before installing any package, check the Python environment's path to ensure it’s the correct one (especially when running in a containerized environment like YARN).
  2. Install the package using pip: If the environment path is correct, you can install the package using the !pip install command within the notebook.

For example, to install TensorFlow:

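A minimal sketch of the two steps above, using TensorFlow as the example package. It assumes the notebook runs a standard IPython kernel; the helper `needs_install` is a hypothetical name introduced here, and the `!` lines are shown as comments because they are IPython shell syntax rather than plain Python.

```python
import importlib.util
import sys

# Step 1: confirm which interpreter and environment the notebook kernel uses.
print("Interpreter:", sys.executable)
print("Environment prefix:", sys.prefix)


def needs_install(module_name):
    """Return True if `module_name` cannot be found by this interpreter."""
    return importlib.util.find_spec(module_name) is None


# Step 2: install only when the package is missing. In a notebook cell the
# install line would be:
#
#   !pip install tensorflow
#
# To tie pip to the kernel's own interpreter instead of whichever pip is
# first on PATH, the cell could instead be:
#
#   !{sys.executable} -m pip install tensorflow
if needs_install("tensorflow"):
    print("tensorflow is missing; run one of the install lines above in a cell")
else:
    print("tensorflow is already available")
```

Running pip through `{sys.executable} -m pip` is one way to make sure the package lands in the environment the kernel actually imports from.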

Considerations:

  • This method works well for installing packages when running in interactive notebooks.
  • If the container environment is cleared or reset, the installed packages will be lost and need to be reinstalled.
  • Ensure the correct Python executable is being used for package installations to avoid conflicts.

Install Packages Using the Subprocess Module

The subprocess module allows more flexibility for installing packages programmatically within the notebook. This method is useful when you need to execute commands or control the installation process beyond simple !pip commands.

Steps:

  1. Import the subprocess module:
  2. Define the command to install the package: Use the subprocess module to run the pip install command. You can specify the full path to the Python executable to ensure the correct environment is used.

For example, to install TensorFlow:

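The steps above might look like the following sketch, again with TensorFlow as the example. The function names `build_pip_command` and `install_or_report` are hypothetical, and the interpreter defaults to `sys.executable`, which assumes the kernel's own Python is the environment you want to install into.

```python
import subprocess
import sys


def build_pip_command(package, python=sys.executable):
    """Return the argument list for `<python> -m pip install <package>`."""
    return [python, "-m", "pip", "install", package]


def install_or_report(package, python=sys.executable):
    """Install `package`; return True on success, print pip's stderr on failure."""
    # check=False so the exit code can be inspected instead of raising.
    result = subprocess.run(
        build_pip_command(package, python),
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        print(f"pip failed for {package} (exit code {result.returncode})")
        print(result.stderr)
        return False
    return True


# Example call (left commented out because it needs network access):
# install_or_report("tensorflow")
```

Passing the command as a list rather than a single string avoids shell quoting issues when the interpreter path contains spaces.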

Considerations:

  • Compared with !pip install, this method lets you capture pip's exit code and output, which gives finer control over error handling and the installation process.
  • It is suitable for more complex workflows where additional logic or conditions might be required.
  • As with the !pip install method, if the environment is reset, the installed packages will need to be reinstalled.

Download and Install from the Shared JupyterHub Environment on HDFS

To make the required Python packages available across YARN container restarts, you can use a shared JupyterHub environment stored on HDFS. The environment tarball can be created with the venv-pack tool, which captures the entire Python virtual environment, including installed packages. This tarball can then be downloaded from HDFS, unpacked, and activated in the container.

Steps:

  1. Create the Python Environment Using venv: First, create a Python virtual environment and install the required packages within that environment. This step assumes you are working with a specific version of Python, like Python 3.11.
  2. Install Required Packages: Activate the virtual environment and install the necessary Python packages.

For example, to install TensorFlow:

  3. Create a Tarball of the Virtual Environment with venv-pack: Once the packages are installed, use the venv-pack tool to create a tarball of the entire environment. This ensures that all installed dependencies and configurations are captured in a single archive.

This creates a tarball (environment.tar.gz) containing the entire virtual environment, including all installed packages.

  4. Upload the Environment Tarball to HDFS: After creating the tarball, upload it to the appropriate HDFS path where it can be shared across different JupyterHub containers.
  5. Download the JupyterHub Environment Tarball from HDFS: In the YARN container, download the jupyter-environment.tar.gz file from HDFS.
  6. Extract the Environment Tarball: After downloading, untar the environment tarball to a location where it can be accessed by the JupyterHub container.
  7. Activate the Environment: Once extracted, activate the environment by sourcing the bin/activate script, which sets up the correct environment for using installed packages.
  8. Re-upload the Updated Environment File to HDFS: After making any changes or installing additional packages, re-create the tarball using venv-pack and upload it back to HDFS for future use.
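The steps above can be sketched as two command sequences, one for building and publishing the environment and one for fetching it inside a container. Everything cluster-specific here is an assumption: the directory name `jupyter-env`, the HDFS path `/path/on/hdfs/`, the target directory `environment`, and the function names are all hypothetical. The sketch also assumes `venv-pack` and the `hdfs` CLI are on the PATH; the venv-pack flags (`-p`, `-f`, `-o`) are written as assumed from the tool's documentation and should be checked against the installed version.

```python
import shlex
import subprocess

# Hypothetical names and paths for illustration only; adjust to your cluster.
VENV_DIR = "jupyter-env"
LOCAL_TARBALL = "environment.tar.gz"
HDFS_TARBALL = "/path/on/hdfs/jupyter-environment.tar.gz"


def publish_commands(packages, python="python3.11"):
    """Commands to create, populate, pack, and upload the environment."""
    return [
        [python, "-m", "venv", VENV_DIR],
        # Calling the venv's own pip is equivalent to activating it first.
        [f"{VENV_DIR}/bin/pip", "install", *packages],
        # -f is assumed to overwrite an existing local tarball when re-packing.
        ["venv-pack", "-p", VENV_DIR, "-f", "-o", LOCAL_TARBALL],
        # -f overwrites the copy already on HDFS when re-uploading.
        ["hdfs", "dfs", "-put", "-f", LOCAL_TARBALL, HDFS_TARBALL],
    ]


def fetch_commands(target_dir="environment"):
    """Commands to download and extract the tarball inside a container."""
    return [
        ["hdfs", "dfs", "-get", HDFS_TARBALL, LOCAL_TARBALL],
        ["mkdir", "-p", target_dir],
        ["tar", "-xzf", LOCAL_TARBALL, "-C", target_dir],
    ]


def activate_line(target_dir="environment"):
    """Shell line that activates the extracted environment."""
    return f"source {target_dir}/bin/activate"


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Print the commands instead of running them, so they can be reviewed first.
for command in publish_commands(["tensorflow"]) + fetch_commands():
    print(shlex.join(command))
print(activate_line())
```

After installing additional packages, running the last two publish commands again re-creates the tarball and replaces the copy on HDFS. Note that `source` is a shell builtin, so the activation line only affects the shell session it is run in.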

Considerations:

  • Persistent Packages: Because the virtual environment tarball is stored centrally on HDFS, the packages survive container restarts; a new container only needs to download and unpack the tarball.
  • Centralized Management: Storing the environment tarball on HDFS lets different containers share the same environment, providing a consistent set of dependencies.
  • Versioning: Version the environment tarball (for example, by including a version identifier in its file name) so that different instances of JupyterHub do not end up using inconsistent environments.
  • HDFS Permissions: This process requires appropriate HDFS permissions to read and write the environment tarball. Ensure that the JupyterHub process has the necessary access rights to the HDFS path.
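One possible way to handle the versioning consideration is to derive the tarball name from the pinned requirements, so that any change in packages yields a new file name. This is only an illustrative scheme, not something the setup above prescribes; the function name, the default prefix, and the name format are hypothetical.

```python
import hashlib


def versioned_tarball_name(requirements, python_version="3.11",
                           prefix="jupyter-environment"):
    """Derive a tarball name that changes whenever the requirements change."""
    # Normalize so that ordering, case, and blank lines do not affect the name.
    normalized = "\n".join(
        sorted(line.strip().lower() for line in requirements if line.strip())
    )
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-py{python_version}-{digest}.tar.gz"


# Example with made-up pins; the same set of pins always gives the same name.
print(versioned_tarball_name(["tensorflow==2.15.0", "numpy==1.26.4"]))
```

With a scheme like this, two containers can confirm they use the same environment by comparing file names alone.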

Summary of Methods

| Method | Description | Use Case |
|---|---|---|
| Install Using !pip install | Installs packages using pip based on the Python environment path. | Quick, interactive installations in notebooks. |
| Install Using subprocess Module | Uses the subprocess module to execute package installations with better error handling and control. | Programmatically controlling installation. |
| Install from Shared Environment on HDFS | Downloads, extracts, and activates a pre-configured environment stored on HDFS, installs packages, and re-uploads the updated environment. | Persistent package management across YARN restarts. |