Store MLflow Artifacts in HDFS
This page shows how to configure and use MLflow to log experiments and store artifacts on HDFS, using both local Python scripts and JupyterHub notebooks.
Requirements
To run MLflow with HDFS artifact storage, ensure the following:
- Hadoop and HDFS are installed and accessible from the machine running MLflow.
- Environment variables are set correctly:
  - HADOOP_HOME
  - HADOOP_CONF_DIR
  - CLASSPATH (should include the HDFS client libraries)
- The MLflow tracking server is running with HDFS configured as the default artifact root.
- The Python environment includes the required packages:
```bash
pip install mlflow scikit-learn pyarrow
```

Part 1: Run MLflow Using a Local Python Script with HDFS Artifact Storage
This example shows how to run the MLflow tracking server and log experiment artifacts directly to HDFS using a local Python script.
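A quick preflight check can catch a missing variable before the server is started. The sketch below is an illustration, not part of MLflow: the helper name `missing_hadoop_vars` is hypothetical, and it only checks the three variables named in the Requirements section. Run it in the same shell after the exports in the first step.

```python
import os

# Variables named in the Requirements section of this page.
REQUIRED_VARS = ("HADOOP_HOME", "HADOOP_CONF_DIR", "CLASSPATH")


def missing_hadoop_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_hadoop_vars(os.environ)
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All required Hadoop variables are set.")
```

The check takes the environment as an argument rather than reading `os.environ` directly, so it can also be pointed at a dictionary built inside a notebook.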
- Set the required environment variables: Before starting the MLflow server, configure your environment to work with HDFS.
```bash
export HADOOP_HOME=/usr/odp/3.3.6.3-1/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)
```

- Start the MLflow Tracking server: Launch the MLflow server with MySQL as the backend store and HDFS as the artifact root.
```bash
mlflow server \
  --backend-store-uri mysql+pymysql://<username>:<password>@<mysql-host>:<port>/<database-name> \
  --default-artifact-root hdfs://<hdfs-host>:<port>/<artifact-path> \
  --host <IP Address> \
  --port 5000
```

- Run a sample Python script (hdfstest.py): Create and run a script that logs an MLflow experiment to the HDFS artifact store.
```python
# hdfstest.py

# 1. Setup
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

# 2. Configure MLflow Tracking URI (point to your MLflow server)
mlflow.set_tracking_uri("http://10.100.11.39:5000")
mlflow.set_experiment("HDFS_LR_Experiment")

# 3. Generate dummy regression data
X, y = make_regression(n_samples=1000, n_features=2, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Start MLflow Run
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("fit_intercept", True)

    # Train model
    model = LinearRegression(fit_intercept=True)
    model.fit(X_train, y_train)

    # Predict and evaluate
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)

    # Log metrics
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2", r2)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Save and log a plot
    plt.figure(figsize=(6, 6))
    plt.scatter(y_test, predictions, alpha=0.7)
    plt.xlabel("Actual")
    plt.ylabel("Predicted")
    plt.title("Actual vs Predicted")
    plot_path = "actual_vs_predicted.png"
    plt.savefig(plot_path)
    mlflow.log_artifact(plot_path)

print("Run complete. Artifacts saved to HDFS.")
```

- Run the script: Execute the Python script to log the experiment to MLflow and store artifacts in HDFS.
```bash
python /usr/odp/3.3.6.3-1/mlflow/hdfstest.py
```

- Validate artifacts in HDFS: After the script runs successfully, verify that the artifacts were written to HDFS.
```bash
hdfs dfs -ls /tmp/4
```

The logged model and metadata files appear under the run's directory (for example, /tmp/4/<run_id>/artifacts).
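The directory checked in the validation step follows the pattern `<artifact-root>/<experiment_id>/<run_id>/artifacts`, which is the layout suggested by the /tmp/4/<run_id>/artifacts example. The sketch below only composes that path; the helper name, host name, and run ID are hypothetical values for illustration, and the layout is an assumption that holds when the experiment uses the server's default artifact root.

```python
def expected_artifact_dir(artifact_root, experiment_id, run_id):
    """Build <artifact_root>/<experiment_id>/<run_id>/artifacts."""
    return f"{artifact_root.rstrip('/')}/{experiment_id}/{run_id}/artifacts"


# Hypothetical values for illustration only.
path = expected_artifact_dir("hdfs://namenode.example.com:8020/tmp", 4, "0123abcd")
print(path)  # hdfs://namenode.example.com:8020/tmp/4/0123abcd/artifacts
```

When in doubt, the authoritative location for a given run is the value MLflow itself reports, for example `mlflow.get_run(run_id).info.artifact_uri`.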
Run MLflow with Kerberos-Enabled HDFS Using a Local Python File
To log experiments to HDFS in a secure (Kerberos-enabled) Hadoop environment, follow these steps:
- Set the ticket cache location for MLflow to use.
```bash
export MLFLOW_KERBEROS_TICKET_CACHE=/tmp/krb5cc_1025
```

Alternatively, derive the cache path from the current user's UID:
```bash
export MLFLOW_KERBEROS_TICKET_CACHE=/tmp/krb5cc_$(id -u)
```

- Authenticate with Kerberos using keytab.
Run the following command to get a Kerberos ticket.
```
kinit -kt /etc/security/keytabs/hdfs.headless.keytab
kinit: Cannot determine realm for host (principal host/kafkaingestion2.acceldata.ce@)
(mlflow) [root@kafkaingestion2 mlflow]# klist -kt /etc/security/keytabs/hdfs.headless.keytab
Keytab name: FILE:/etc/security/keytabs/hdfs.headless.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   2 2025-06-25T16:23:58 hdfs-kafkaingestion@ADSRE.COM
   2 2025-06-25T16:23:58 hdfs-kafkaingestion@ADSRE.COM
(mlflow) [root@kafkaingestion2 mlflow]#
(mlflow) [root@kafkaingestion2 mlflow]# kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-kafkaingestion@ADSRE.COM
```

Running kinit without a principal fails, as the first command shows. List the principals stored in the keytab with klist -kt, then pass the matching principal to kinit. The final command authenticates as the HDFS user using the keytab file.
- Start MLflow Tracking server.
```
mlflow server --backend-store-uri mysql+pymysql://mlflow:mlflow@10.100.11.70:3306/mlflow --default-artifact-root hdfs://kafkaingestion.acceldata.ce:8020/tmp --host 10.100.11.72 --port 5000
2025/07/04 18:07:59 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/07/04 18:07:59 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl MySQLImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
```

This launches the MLflow server with a MySQL backend and HDFS as the default artifact store.
- Activate the MLflow Python virtual environment.

```
ls
bin  include  lib  lib64  pyvenv.cfg  share
[root@kafkaingestion2 mlflow]# source bin/activate
```

- Set Hadoop and Java environment variables.
```bash
export HADOOP_HOME=/usr/odp/3.3.6.3-1/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)
```

This prepares the environment to allow Python (and MLflow) to interact with HDFS.
- Sample MLflow script: hdfs.py
```python
# hdfs.py
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import os
import time


def log(msg):
    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}", flush=True)


# 1. Setup MLflow tracking
log("Setting MLflow tracking URI...")
mlflow.set_tracking_uri("http://10.100.11.72:5000")

experiment_name = "HDFS_LR_Experiment_test"
artifact_location = "hdfs://kafkaingestion.acceldata.ce:8020/tmp/mlflow-jupyter"

# 2. Create experiment (safe creation)
log(f"Trying to create experiment '{experiment_name}' with artifact location '{artifact_location}'...")
try:
    mlflow.create_experiment(name=experiment_name, artifact_location=artifact_location)
    log("Experiment created successfully.")
except mlflow.exceptions.MlflowException as e:
    log(f"Experiment already exists or failed to create: {str(e)}")

log(f"Setting experiment to '{experiment_name}'...")
mlflow.set_experiment(experiment_name)

# 3. Generate data
log("Generating regression dataset...")
X, y = make_regression(n_samples=1000, n_features=2, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Start MLflow run
log("Starting MLflow run...")
with mlflow.start_run() as run:
    log("Logging parameters...")
    mlflow.log_param("fit_intercept", True)

    log("Training LinearRegression model...")
    model = LinearRegression(fit_intercept=True)
    model.fit(X_train, y_train)

    log("Generating predictions and calculating metrics...")
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)

    log(f"Logging metrics: mse={mse}, r2={r2}...")
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2", r2)

    log("Logging model...")
    mlflow.sklearn.log_model(model, "model")

    log("Generating and logging plot...")
    plt.figure(figsize=(6, 6))
    plt.scatter(y_test, predictions, alpha=0.7)
    plt.xlabel("Actual")
    plt.ylabel("Predicted")
    plt.title("Actual vs Predicted")
    plot_path = "actual_vs_predicted.png"
    plt.savefig(plot_path)
    mlflow.log_artifact(plot_path)

log("MLflow run complete. Artifacts should be saved to HDFS.")
```

- Run the script.
- Verify artifacts in HDFS.
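List the experiment's artifact location (hdfs://kafkaingestion.acceldata.ce:8020/tmp/mlflow-jupyter in the sample script) to confirm that the model and artifacts (e.g., plot image, model files) were successfully logged to HDFS.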
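The keytab step in this section requires the principal passed to kinit to match one listed by klist -kt. As a rough illustration, that comparison can be scripted by parsing the klist -kt text. The helper name `keytab_principals` is hypothetical, the sample text is copied from the transcript in that step, and the parser assumes entry rows start with a numeric KVNO and end with the principal.

```python
def keytab_principals(klist_output):
    """Extract unique principals from `klist -kt` text, in first-seen order."""
    principals = []
    for line in klist_output.splitlines():
        fields = line.split()
        # Entry rows look like: KVNO, timestamp, principal.
        if len(fields) >= 3 and fields[0].isdigit() and "@" in fields[-1]:
            if fields[-1] not in principals:
                principals.append(fields[-1])
    return principals


SAMPLE = """\
Keytab name: FILE:/etc/security/keytabs/hdfs.headless.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   2 2025-06-25T16:23:58 hdfs-kafkaingestion@ADSRE.COM
   2 2025-06-25T16:23:58 hdfs-kafkaingestion@ADSRE.COM
"""

principal = "hdfs-kafkaingestion@ADSRE.COM"
print(principal in keytab_principals(SAMPLE))  # True
```

In practice the text would come from running klist -kt against the keytab, for example through `subprocess.run`, rather than from a hard-coded sample.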
Part 2: Use MLflow with Kerberos-Enabled HDFS from JupyterHub or Python
- Authenticate with Kerberos: Use a valid Kerberos principal and keytab to authenticate your session.
This step ensures MLflow and HDFS access is Kerberos-authenticated.
- Set up Hadoop environment for Python: Configure Hadoop and Java-related environment variables in Python.
This is required for Python to interact with HDFS for storing MLflow artifacts.
- Log MLflow experiment with HDFS artifacts: Track and log a scikit-learn model run with MLflow, saving artifacts to HDFS.
This automates model training and logging, and stores the resulting artifacts in HDFS for traceability and collaboration.
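The Hadoop environment step above can be sketched in Python by mirroring the shell exports used earlier on this page. This is a minimal sketch under stated assumptions: the helper names `hadoop_env` and `apply_hadoop_env` are hypothetical, and the Hadoop home path in the usage comment is the one from the shell examples and may differ on your cluster.

```python
import os
import subprocess


def hadoop_env(hadoop_home, current_ld_path=""):
    """Return the Hadoop variables derived from hadoop_home (no CLASSPATH)."""
    native = os.path.join(hadoop_home, "lib", "native")
    return {
        "HADOOP_HOME": hadoop_home,
        "HADOOP_CONF_DIR": os.path.join(hadoop_home, "etc", "hadoop"),
        "LD_LIBRARY_PATH": f"{native}:{current_ld_path}" if current_ld_path else native,
    }


def apply_hadoop_env(hadoop_home):
    """Set the variables in os.environ, including CLASSPATH from the hadoop CLI."""
    os.environ.update(hadoop_env(hadoop_home, os.environ.get("LD_LIBRARY_PATH", "")))
    hadoop_bin = os.path.join(hadoop_home, "bin", "hadoop")
    # Same command as the shell export: $HADOOP_HOME/bin/hadoop classpath --glob
    result = subprocess.run(
        [hadoop_bin, "classpath", "--glob"],
        capture_output=True, text=True, check=True,
    )
    os.environ["CLASSPATH"] = result.stdout.strip()


# In a notebook cell, before any MLflow call that touches HDFS:
# apply_hadoop_env("/usr/odp/3.3.6.3-1/hadoop")
```

Changes to LD_LIBRARY_PATH made after the Python process has started may not affect how native libraries are located, so exporting that variable before launching Jupyter is the safer option.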
Part 3: Run MLflow in JupyterHub Using HDFS
- Set the Hadoop environment in the Notebook.

- Adjust HDFS permissions.
- MLflow Experiment Code in Jupyter.

- Final HDFS validation.
The logged artifacts should now be visible in HDFS.