Custom Packages Installation in JupyterHub Notebook
This document describes different methods to install Python packages in a JupyterHub environment configured with YarnSpawner and HDFSCM. The methods outlined here cover scenarios where packages need to be installed dynamically based on system paths, using subprocesses, or by utilizing a shared environment stored on HDFS.
Prerequisites
- JupyterHub is configured with YarnSpawner and HDFSCM.
- HDFS is used to store environment files, such as the jupyter-environment.tar.gz file.
- A Python 3.x environment is set up in the YARN containers.
- Required packages and their versions are known and available for installation.
Methods to Install Python Packages
Install Packages Using !pip install Based on System Path
In JupyterHub notebooks, you can check which Python interpreter the kernel is running and install packages into that interpreter with the pip command. This method is straightforward and works well for installing packages on demand, for example only when they are not already present in the environment.
Steps:
- Check the system path: Before installing any package, check the Python environment's path to ensure it’s the correct one (especially when running in a containerized environment like YARN).
import sys
print(sys.executable)
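- Install the package using pip: If the environment path is correct, you can install the package using the !pip install command within the notebook. Invoking pip through sys.executable ensures the package is installed for the interpreter the kernel is actually running.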
!{sys.executable} -m pip install <package-name>
For example, to install TensorFlow:
!{sys.executable} -m pip install tensorflow
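The conditional installation described above ("only if they are not already present") could be sketched as follows. This is a minimal illustration, not part of the original procedure: the helper names is_installed and ensure_package are hypothetical, and the check only covers top-level module names.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if the top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


def ensure_package(package_name, module_name=None):
    """Install package_name with pip only if its module is not importable.

    module_name is the import name when it differs from the pip name
    (hypothetical helper; adjust to your own naming).
    Returns True if an installation was attempted, False if it was skipped.
    """
    if is_installed(module_name or package_name):
        return False
    # Same command as the notebook's "!{sys.executable} -m pip install ..."
    subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
    return True


# Confirm which interpreter the kernel is using before installing anything.
print(sys.executable)
# ensure_package("tensorflow")  # uncomment to install TensorFlow if missing
```

Because the install call is commented out, running the cell as written only prints the interpreter path; nothing is downloaded until ensure_package is called explicitly.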
Considerations:
- This method works well for installing packages when running in interactive notebooks.
- If the container environment is cleared or reset, the installed packages will be lost and need to be reinstalled.
- Ensure the correct Python executable is being used for package installations to avoid conflicts.
Install Packages Using the Subprocess Module
The subprocess module allows more flexibility for installing packages programmatically within the notebook. This method is useful when you need to execute commands or control the installation process beyond simple !pip commands.
Steps:
- Import the subprocess and sys modules (sys provides the path to the current Python executable):
import subprocess
import sys
- Define the command to install the package: Use the subprocess module to run the pip install command. You can specify the full path to the Python executable to ensure the correct environment is used.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '<package-name>'])
For example, to install TensorFlow:
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow'])
Considerations:
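- This method provides better error handling and control over the installation process: subprocess.check_call raises subprocess.CalledProcessError when pip exits with a non-zero status, so failures can be caught and handled in Python.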
- It is suitable for more complex workflows where additional logic or conditions might be required.
- As with the !pip install method, if the environment is reset, the installed packages will need to be reinstalled.
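The error handling mentioned above might look like the sketch below. The function name pip_install and its runner parameter are hypothetical; runner exists only so the function can be exercised without contacting a package index.

```python
import subprocess
import sys


def pip_install(package_name, runner=subprocess.run):
    """Run "pip install" for one package and report the outcome.

    Returns a (success, output) tuple instead of raising, so a notebook
    can decide what to do when an installation fails.
    """
    command = [sys.executable, "-m", "pip", "install", package_name]
    try:
        # check=True makes a non-zero pip exit status raise CalledProcessError.
        result = runner(command, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as error:
        # Hand pip's error output back to the caller for logging or retry logic.
        return False, error.stderr
    return True, result.stdout


# Example usage (performs a real installation when uncommented):
# ok, output = pip_install("tensorflow")
# if not ok:
#     print("Installation failed:", output)
```

Returning a tuple rather than raising is one design choice among several; letting the exception propagate, as subprocess.check_call does, is equally valid when a failed installation should stop the notebook.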
Download and Install from the Shared JupyterHub Environment on HDFS
To ensure that required Python packages are available persistently across YARN container restarts, you can use a shared JupyterHub environment stored on HDFS. The environment tarball can be created using the venv-pack tool, which captures the entire Python virtual environment, including installed packages. This tarball can then be downloaded from HDFS, unpacked, and activated in the container.
Steps:
- Create the Python Environment Using venv: First, create a Python virtual environment and install the required packages within that environment. This step assumes you are working with a specific version of Python, like Python 3.11.
python3.11 -m venv /path/to/target/directory/environment
- Install Required Packages: Activate the virtual environment and install the necessary Python packages.
source /path/to/target/directory/environment/bin/activate
pip install <package-name>
For example, to install TensorFlow:
pip install tensorflow
- Create a Tarball of the Virtual Environment with venv-pack: Once the packages are installed, and with the environment still activated, use the venv-pack tool to create a tarball of the entire environment. This ensures that all installed dependencies and configurations are captured in a single archive.
venv-pack -o /path/to/target/directory/environment.tar.gz
This creates a tarball (environment.tar.gz) containing the entire virtual environment, including all installed packages.
- Upload the Environment Tarball to HDFS: After creating the tarball, upload it to the appropriate HDFS path where it can be shared across different JupyterHub containers.
hdfs dfs -put /path/to/target/directory/environment.tar.gz hdfs:///user/jupyterhub/environments/jupyter-environment.tar.gz
- Download the JupyterHub Environment Tarball from HDFS: In the YARN container, download the jupyter-environment.tar.gz file from HDFS.
hdfs dfs -get hdfs:///user/jupyterhub/environments/jupyter-environment.tar.gz /path/to/local/directory/
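- Extract the Environment Tarball: After downloading, extract the tarball into its own directory so the JupyterHub container can access it. A venv-pack archive typically stores the environment's files (bin/, lib/, and so on) at its top level rather than inside a wrapping directory, so create the environment directory first and extract into it.
mkdir -p /path/to/target/directory/environment
tar -xzvf /path/to/local/directory/jupyter-environment.tar.gz -C /path/to/target/directory/environment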
- Activate the Environment: Once extracted, activate the environment by sourcing the bin/activate script, which sets up the correct environment for using installed packages.
source /path/to/target/directory/environment/bin/activate
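- Re-upload the Updated Environment File to HDFS: After making any changes or installing additional packages, re-create the tarball using venv-pack and upload it back to HDFS for future use. Because the file already exists on HDFS, pass -f to hdfs dfs -put to overwrite it; if the local tarball from an earlier run is still present, remove it before running venv-pack again.
venv-pack -o /path/to/target/directory/environment.tar.gz
hdfs dfs -put -f /path/to/target/directory/environment.tar.gz hdfs:///user/jupyterhub/environments/jupyter-environment.tar.gz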
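The download and extraction steps above could also be scripted from Python, for instance in a container start-up hook. The sketch below only assembles the same hdfs and tar commands shown in the steps; the function names fetch_commands and run_commands are hypothetical, and it assumes the archive holds the environment's files at its top level.

```python
import posixpath
import subprocess


def fetch_commands(hdfs_tarball, download_dir, env_dir):
    """Build the commands that download and unpack the shared environment."""
    # Keep the tarball's file name when saving it locally.
    local_tarball = posixpath.join(download_dir, posixpath.basename(hdfs_tarball))
    return [
        ["hdfs", "dfs", "-get", hdfs_tarball, local_tarball],
        ["mkdir", "-p", env_dir],
        ["tar", "-xzf", local_tarball, "-C", env_dir],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)


commands = fetch_commands(
    "hdfs:///user/jupyterhub/environments/jupyter-environment.tar.gz",
    "/path/to/local/directory",
    "/path/to/target/directory/environment",
)
for command in commands:
    print(" ".join(command))
# run_commands(commands)  # uncomment on a host where hdfs and tar are available
```

Note that sourcing bin/activate only affects the shell that runs it, so activation is left as a separate shell step rather than being run through subprocess.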
Considerations:
- Persistent Packages: This method ensures that the packages are installed persistently across container restarts since the virtual environment tarball is centrally stored on HDFS.
- Centralized Management: Storing the environment tarball on HDFS ensures that the environment is shared across different containers, providing a consistent set of dependencies.
- Versioning: Ensure that the virtual environment is correctly versioned to avoid inconsistencies between different instances of JupyterHub.
- HDFS Permissions: This process requires appropriate HDFS permissions to read and write the environment tarball. Ensure that the JupyterHub process has the necessary access rights to the HDFS path.
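For the versioning consideration, one possible convention (an assumption, not something this procedure prescribes) is to derive the tarball name from the Python version and a hash of the pinned requirements, so that two environments with different contents never share a file name. The function name versioned_tarball_name and the naming scheme are hypothetical.

```python
import hashlib
import sys


def versioned_tarball_name(requirements, python_version=None,
                           prefix="jupyter-environment"):
    """Return a tarball name that changes whenever the requirements change.

    requirements is an iterable of pinned specifiers such as "tensorflow==2.15.0".
    """
    if python_version is None:
        # Default to the major.minor version of the running interpreter.
        python_version = "{}.{}".format(*sys.version_info[:2])
    # Normalize so that ordering, case, and blank lines do not affect the hash.
    normalized = "\n".join(
        sorted(line.strip().lower() for line in requirements if line.strip())
    )
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-py{python_version}-{digest}.tar.gz"


print(versioned_tarball_name(["tensorflow==2.15.0", "numpy==1.26.4"], "3.11"))
```

The resulting name could be used in place of jupyter-environment.tar.gz when uploading, with the spawner configuration pointing at whichever version is current.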
Summary of Methods
Method | Description | Use Case |
---|---|---|
Install Using !pip install | Installs packages using pip based on the Python environment path. | Quick, interactive installations in notebooks. |
Install Using subprocess Module | Uses the subprocess module to execute package installations with better error handling and control. | Programmatically controlling installation. |
Install from Shared Environment on HDFS | Downloads, extracts, and activates a pre-configured environment stored on HDFS, installs packages, and re-uploads the updated environment. | Persistent package management across YARN restarts. |