Store MLflow Artifacts in HDFS

This page shows how to configure and use MLflow to log experiments and store artifacts on HDFS, using both local Python scripts and JupyterHub notebooks.

Requirements

To run MLflow with HDFS artifact storage, ensure the following:

  • Hadoop and HDFS are installed and accessible from the machine running MLflow.

  • Environment variables are set correctly:

    • HADOOP_HOME
    • HADOOP_CONF_DIR
    • CLASSPATH (should include HDFS client libraries)
  • MLflow tracking server is running with HDFS configured as the default artifact root.

  • Python environment includes the required packages:

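For example (the exact package set is an assumption; MLflow's HDFS artifact store relies on pyarrow, and the examples below also use scikit-learn, matplotlib, and a MySQL driver):

    pip install mlflow pyarrow scikit-learn matplotlib pymysql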

Part 1: Run MLflow Using a Local Python Script with HDFS Artifact Storage

This example shows how to run the MLflow tracking server and log experiment artifacts directly to HDFS using a local Python script.

  1. Set the required environment variables: Before starting the MLflow server, configure your environment to work with HDFS.
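A minimal sketch; the paths below are placeholders for your own installation, and ARROW_LIBHDFS_DIR is set because MLflow reaches HDFS through pyarrow's libhdfs bindings:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    # native libhdfs location used by pyarrow
    export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
    # HDFS client jars required by the JNI-based client
    export CLASSPATH=$($HADOOP_HOME/bin/hdfs classpath --glob)
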
  2. Start the MLflow Tracking server: Launch the MLflow server with MySQL as the backend store and HDFS as the artifact root.
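A sketch of the launch command; the MySQL credentials, database name, NameNode address, and ports are placeholders:

    mlflow server \
      --backend-store-uri mysql+pymysql://mlflow:mlflow_pass@localhost:3306/mlflow_db \
      --default-artifact-root hdfs://<namenode-host>:9000/tmp \
      --host 0.0.0.0 \
      --port 5000
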
  3. Create a sample Python script (hdfstest.py): Write a script that logs an MLflow experiment to the HDFS artifact store.
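A minimal sketch of such a script; the tracking URI, experiment name, and the small scikit-learn model are illustrative, not taken from the original page:

    # hdfstest.py - log a small run; artifacts go to the HDFS location
    # configured as the tracking server's default artifact root.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    mlflow.set_tracking_uri("http://localhost:5000")  # server started in the previous step
    mlflow.set_experiment("hdfs-artifact-test")

    with mlflow.start_run():
        X, y = load_iris(return_X_y=True)
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        # stored under the run's artifact path in HDFS
        mlflow.sklearn.log_model(model, "model")
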
  4. Run the script: Execute the Python script to log the experiment to MLflow and store artifacts in HDFS.
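Run it from the same shell in which the Hadoop environment variables were set:

    python hdfstest.py
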
  5. Validate artifacts in HDFS: After the script runs successfully, verify that the artifacts were written to HDFS.
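A sketch, assuming the hdfs://.../tmp artifact root and experiment ID 4 used in this example; substitute your own run ID:

    hdfs dfs -ls /tmp/4
    hdfs dfs -ls -R /tmp/4/<run_id>/artifacts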

You can see the logged model and metadata files in the specified directory (e.g., /tmp/4/<run_id>/artifacts).

Run MLflow with Kerberos-Enabled HDFS Using a Local Python File

To log experiments to HDFS in a secure (Kerberos-enabled) Hadoop environment, follow these steps:

  1. Set the ticket cache location for MLflow to use.
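MLflow's HDFS artifact store reads the ticket cache path from the MLFLOW_KERBEROS_TICKET_CACHE environment variable; a sketch using the current user's default cache location:

    export MLFLOW_KERBEROS_TICKET_CACHE=/tmp/krb5cc_$(id -u)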

Alternatively, if you know the UID:

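For example (the UID below is a placeholder):

    export MLFLOW_KERBEROS_TICKET_CACHE=/tmp/krb5cc_1000
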
  2. Authenticate with Kerberos using a keytab.

Run the following command to get a Kerberos ticket.

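A sketch; the keytab path, principal, and realm are placeholders for your environment:

    kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM
    # confirm the ticket and the principals available in the keytab
    klist
    klist -kt /etc/security/keytabs/hdfs.headless.keytab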

This authenticates as the HDFS user using a keytab file. Make sure the principal name matches one of the principals shown by klist -kt.

  3. Start MLflow Tracking server.
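A sketch of the launch command, mirroring Part 1; credentials and hosts are placeholders, and the Kerberos variables from the previous steps must be set in the same shell:

    mlflow server \
      --backend-store-uri mysql+pymysql://mlflow:mlflow_pass@localhost:3306/mlflow_db \
      --default-artifact-root hdfs://<namenode-host>:9000/tmp \
      --host 0.0.0.0 \
      --port 5000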

This launches the MLflow server with a MySQL backend and HDFS as the default artifact store.

  4. Set Hadoop and Java environment variables.
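The same kind of setup as in Part 1 (paths are placeholders):

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
    export CLASSPATH=$($HADOOP_HOME/bin/hdfs classpath --glob)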

This prepares the environment to allow Python (and MLflow) to interact with HDFS.

  5. Sample MLflow script: hdfs.py
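A sketch of such a script; the dataset, model, and plot are illustrative, chosen so that the verification step below finds a plot image and model files among the artifacts:

    # hdfs.py - train a small model, log a plot and the model to MLflow;
    # artifacts land in the Kerberos-secured HDFS artifact root.
    import matplotlib
    matplotlib.use("Agg")  # no display available on a server
    import matplotlib.pyplot as plt
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("hdfs-kerberos-test")

    with mlflow.start_run():
        X, y = load_iris(return_X_y=True)
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_metric("train_accuracy", model.score(X, y))

        # log a simple plot as an artifact
        plt.scatter(X[:, 0], X[:, 1], c=y)
        plt.savefig("iris_scatter.png")
        mlflow.log_artifact("iris_scatter.png")

        # log the fitted model
        mlflow.sklearn.log_model(model, "model")
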
  6. Run the script.
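With the Kerberos ticket and environment variables from the previous steps in place:

    python hdfs.py
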
  7. Verify artifacts in HDFS.
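A sketch; substitute your own experiment and run IDs:

    hdfs dfs -ls -R /tmp/<experiment_id>/<run_id>/artifacts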

This confirms that the artifacts (e.g., the plot image and model files) were successfully logged to HDFS.

Part 2: Use MLflow with Kerberos-Enabled HDFS from JupyterHub or Python

  1. Authenticate with Kerberos: Use a valid Kerberos principal and keytab to authenticate your session.
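In a Jupyter cell this can be run as a shell command; the keytab path, principal, and realm are placeholders:

    !kinit -kt /home/jovyan/user.keytab mlflow_user@EXAMPLE.COM
    !klist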

This step ensures MLflow and HDFS access is Kerberos-authenticated.

  2. Set up Hadoop environment for Python: Configure Hadoop and Java-related environment variables in Python.
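A sketch in Python; the paths and the Kerberos user name are placeholders for your own installation and principal:

    import os
    import subprocess

    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["HADOOP_HOME"] = "/usr/local/hadoop"
    os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"
    os.environ["ARROW_LIBHDFS_DIR"] = "/usr/local/hadoop/lib/native"
    os.environ["PATH"] += os.pathsep + "/usr/local/hadoop/bin"
    # HDFS client jars required by the JNI-based client
    os.environ["CLASSPATH"] = subprocess.check_output(
        ["/usr/local/hadoop/bin/hdfs", "classpath", "--glob"], text=True
    ).strip()
    # point MLflow's HDFS artifact store at the Kerberos ticket cache
    os.environ["MLFLOW_KERBEROS_TICKET_CACHE"] = "/tmp/krb5cc_%d" % os.getuid()
    os.environ["MLFLOW_KERBEROS_USER"] = "mlflow_user"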

This is required for Python to interact with HDFS for storing MLflow artifacts.

  3. Log MLflow experiment with HDFS artifacts: Track and log a scikit-learn model run with MLflow, saving artifacts to HDFS.
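A sketch along the same lines as hdfs.py above; the experiment name and model are illustrative:

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    mlflow.set_tracking_uri("http://localhost:5000")  # server with the HDFS artifact root
    mlflow.set_experiment("jupyterhub-hdfs-test")

    with mlflow.start_run():
        X, y = load_iris(return_X_y=True)
        model = RandomForestClassifier(n_estimators=50).fit(X, y)
        mlflow.log_param("n_estimators", 50)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")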

This automates model training and logging, and stores everything in HDFS for traceability and collaboration.

Part 3: Run MLflow in JupyterHub Using HDFS

  1. Set Hadoop environment in Notebook.
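The same environment setup as in the previous section, run in a notebook cell (paths are placeholders):

    import os
    import subprocess

    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["HADOOP_HOME"] = "/usr/local/hadoop"
    os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"
    os.environ["ARROW_LIBHDFS_DIR"] = "/usr/local/hadoop/lib/native"
    os.environ["CLASSPATH"] = subprocess.check_output(
        ["/usr/local/hadoop/bin/hdfs", "classpath", "--glob"], text=True
    ).strip()
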
  2. Adjust HDFS permissions.
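A sketch; the artifact root path matches the earlier examples, and the wide-open mode is only for testing, so adjust ownership and permissions to your own policy:

    # let the notebook user write under the artifact root (example path)
    hdfs dfs -chmod -R 777 /tmp
    # or hand ownership of an experiment directory to a specific user (placeholder names)
    hdfs dfs -chown -R jupyter_user:hadoop /tmp/4
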
  3. Run the MLflow experiment code in Jupyter.
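A sketch of a notebook cell, similar to the earlier scripts; the experiment name and model are illustrative:

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_wine
    from sklearn.tree import DecisionTreeClassifier

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("jupyter-hdfs-demo")

    with mlflow.start_run() as run:
        X, y = load_wine(return_X_y=True)
        model = DecisionTreeClassifier(max_depth=3).fit(X, y)
        mlflow.log_param("max_depth", 3)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")
        print("run_id:", run.info.run_id)  # used to locate the artifacts in HDFS
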
  4. Final HDFS validation.
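A sketch; use the run ID printed by the previous cell:

    hdfs dfs -ls -R /tmp/<experiment_id>/<run_id>/artifacts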

The listing output shows the logged model and artifact files under the run's artifact directory in HDFS.