JupyterHub Installation

JupyterHub Installation in xDP lets you deploy a multi-user, production-ready Jupyter notebook environment directly onto your xDP Compute Clusters. It gives data science and analytics teams a secure, scalable, pre-configured platform for interactive data exploration, analysis, and machine learning model development. Because deployment and configuration are managed through the xDP application catalog, teams avoid manual infrastructure setup and can start working with data sooner.

Key Concepts

Application Catalog: The central repository within xDP where curated, pre-packaged data tools like JupyterHub, Trino, and Spark are available for one-click deployment onto your compute infrastructure.
Compute Cluster: A target Kubernetes environment managed by xDP where applications are deployed and executed. JupyterHub runs as a service within a selected Compute Cluster, leveraging its resources.
Data Store Integration: JupyterHub instances deployed via xDP can be configured to automatically access registered Data Stores. This provides seamless and secure connectivity to data sources like Amazon S3 or HDFS directly from within user notebooks.
Metadata Management: The installation process requires a PostgreSQL database to store essential application metadata, including user information, notebook states, and session data. This ensures the persistence and reliability of the JupyterHub service.

Capabilities

Simplified Provisioning: Deploy a fully configured, multi-user JupyterHub server in minutes using a guided installation wizard, removing the complexity of manual setup.
Pre-configured Kernels: Launch notebooks with out-of-the-box PySpark and Scala kernels that are pre-integrated with the cluster's Spark environment, enabling immediate big data processing.
Centralized Configuration: Manage JupyterHub's operational parameters through a simple Form Editor or a powerful YAML editor for advanced, granular control over your deployment.
Persistent State: Ensure operational continuity by connecting JupyterHub to an internal or external PostgreSQL database for reliable metadata and configuration storage.

Tutorial (Getting Started)

This tutorial guides you through installing your first JupyterHub application on an xDP Compute Cluster using the default settings.

Prerequisites

You have an active Compute Cluster available in your xDP environment.
Your user account has the necessary permissions to deploy applications.
You have navigated to the Apps section from the main xDP sidebar.
Supported version: JupyterHub 4.1.0

Your First Installation

From the application catalog, locate the Jupyter Hub card and click Install.
On the installation screen, select the target Compute Cluster from the dropdown menu. This is the cluster where JupyterHub will be deployed.
Configure the PostgreSQL connection required for storing application metadata. For this initial setup, keep the Use Internal PostgreSQL option enabled. This directs xDP to provision and manage a dedicated database instance for you automatically. Click Next.
Review the application configuration. You can switch between the Form Editor for basic settings and the YAML Editor for advanced customization. For a standard installation, the default values are sufficient.
Click Save & Continue to begin the deployment process.
The final screen confirms that your configuration has been applied and the installation is in progress. The JupyterHub application will be available on the Apps page once the deployment is complete.

How-to Guides

Configure JupyterHub with an External PostgreSQL Database

For production environments, an external, managed PostgreSQL database is recommended.

On the Configure PostgreSQL step (Step 3 in the tutorial), toggle off the Use Internal PostgreSQL option.
The form expands to show fields for your external database connection details.
Enter the Host, Port, Username, Password, and Database Name for your existing PostgreSQL instance.
Ensure network connectivity is allowed from the xDP Compute Cluster to your database host.
Click Next to proceed with the installation using your specified database.
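
The connectivity requirement in this guide can be checked before you run the installer. The following is a minimal sketch, not part of xDP: it only tests whether a TCP connection to the database host and port can be opened, and it is meaningful only when run from a machine or pod inside the Compute Cluster's network. The function name `can_reach` and the host name in the usage comment are hypothetical.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Usage (hypothetical host; 5432 is the default PostgreSQL port):
#   can_reach("postgres.example.internal", 5432)
```

A successful result shows only that the port is reachable. It does not validate the Username, Password, or Database Name entered in the form.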

Customize Application Settings using the YAML Editor

The YAML editor provides full control over the JupyterHub Helm chart values for advanced customization.

Example: You want to automatically shut down user notebook servers after 1 hour (3600 seconds) of inactivity to conserve cluster resources.

On the Configure Application step, select the YAML Editor tab.
Locate the cull section in the YAML configuration.
Modify the following parameters:

    cull:
      enabled: true
      timeout: 3600
      every: 600
Click Save & Continue. The application will be deployed with your custom culling policy.
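
To reason about the culling values in this guide, it can help to estimate the longest time an idle server might stay up. The sketch below assumes that `every` is the interval in seconds between culler checks, as in the upstream JupyterHub Helm chart; this page does not define it, so treat that as an assumption. The function name is hypothetical.

```python
def worst_case_idle_seconds(timeout: int, every: int) -> int:
    """Upper bound on how long a server can sit idle before it is shut down.

    Assumes the culler wakes up every `every` seconds and stops servers
    whose idle time exceeds `timeout`.
    """
    if timeout <= 0 or every <= 0:
        raise ValueError("timeout and every must be positive")
    # A server that crosses the threshold just after a check waits one
    # more full interval before the next check removes it.
    return timeout + every


# Values from the how-to above.
cull = {"enabled": True, "timeout": 3600, "every": 600}
print(worst_case_idle_seconds(cull["timeout"], cull["every"]))  # 4200
```

Under that assumption, a one-hour timeout with a ten-minute check interval means an idle server is stopped after at most 70 minutes.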

Check Application Health Status

After installation, you can monitor the status of your JupyterHub instance.

From the final installation screen, click Go to Applications.
Locate the Jupyter Hub card in the application catalog.
The status changes from "Not Installed" to "Installing" and then to either "Running" or "Error".
Once the status is Running, the application is fully operational.
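
If you want to script this check, the status progression above can be sketched as a polling loop. This page does not describe an xDP status API, so `get_status` here is a hypothetical callable that you would supply yourself. The only thing taken from the page is the set of status strings.

```python
import time
from typing import Callable


def wait_until_settled(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,
    interval_s: float = 10.0,
) -> str:
    """Poll get_status until it returns "Running" or "Error".

    get_status is a hypothetical, caller-supplied function that returns
    the current status string of the application.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in ("Running", "Error"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)


# Demo with a stand-in status source instead of a real xDP call.
fake = iter(["Installing", "Installing", "Running"])
print(wait_until_settled(lambda: next(fake), interval_s=0))  # Running
```

Passing the status source as a parameter keeps the loop independent of how the status is actually obtained.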

Reference

Configuration Options

The following are common parameters you can configure in the YAML Editor to customize your JupyterHub deployment.

Parameter	Description	Default	Required
`global.ecrCronEnabled`	Enables a cron job to refresh ECR credentials. Set to `true` if using private ECR images.	`true`	No
`jupyterhub.cull.enabled`	If `true`, idle notebook servers and kernels are automatically shut down.	`true`	No
`jupyterhub.cull.timeout`	The time in seconds a server or kernel can be idle before it is culled.	`3600`	No
`jupyterhub.cull.concurrency`	The maximum number of culling requests that can run concurrently.	`10`	No
`hub.baseUrl`	The base URL path for the JupyterHub application. For example, `/jupyter`.	`/xdp/dp/{{...}}/jupyter`	Yes
`hub.allowNamedServers`	Allows users to create multiple named notebook servers.	`false`	No
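
The dotted parameter names in this table presumably denote nested keys in the YAML Editor, as is conventional for Helm values; the page does not state this explicitly, so treat it as an assumption. Under that assumption, the following sketch shows how a few of the parameters would expand into a nested structure. The helper name `nest` is hypothetical.

```python
from typing import Any


def nest(flat: dict[str, Any]) -> dict[str, Any]:
    """Expand dotted keys such as "a.b.c" into nested dictionaries."""
    root: dict[str, Any] = {}
    for dotted, value in flat.items():
        node = root
        *parents, leaf = dotted.split(".")
        for key in parents:
            # Reuse an existing branch or create it on first use.
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


overrides = nest({
    "jupyterhub.cull.enabled": True,
    "jupyterhub.cull.timeout": 3600,
    "hub.allowNamedServers": False,
})
print(overrides)
```

Each level of the printed dictionary corresponds to one level of indentation in the YAML configuration.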

Best Practices

Tip: Actively manage resource consumption by enabling and configuring the idle culling feature (cull.enabled: true). This prevents idle notebooks from consuming valuable cluster CPU and memory.

Use External Databases in Production: For any production or critical use case, always configure JupyterHub with an external, managed PostgreSQL database that has robust backup, recovery, and high-availability policies.
Version Your Configurations: When using the YAML editor for customization, save your configurations in a version control system like Git. This practice enables you to track changes, roll back if necessary, and maintain repeatable deployments.
Leverage Data Store Integration: Before deploying, configure your required data sources in the Data Store section of xDP. This allows you to easily mount them into JupyterHub, providing users with secure and simplified data access.
