ML template with dvc and MLflow

    Our machine learning template uses both dvc and MLflow to help you track your experiments and datasets. It is set up so that an experiment run can be started with a dvc pipeline. A pipeline splits the experiment run into a sequence of stages and is defined by a file located in src/experiments/dvc/dvc.yaml. The default template pipeline has 5 stages:

    1. Initialize
    2. Preprocess
    3. Train
    4. Test
    5. Finalize

    Each stage runs a corresponding Python file (found in the src/experiments/stages directory) and defines its input parameters, input artifacts, and output artifacts. Take a look at the dvc documentation to read more about how pipelines work, how to modify them, and why they are useful.
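    As a rough illustration, a single stage in that dvc.yaml could look like the sketch below. The stage name, paths, and parameter key are placeholders, not the template's actual contents; see the dvc.yaml shipped with your project for the real definition.

        stages:
          preprocess:
            cmd: python src/experiments/stages/preprocess.py   # command executed for this stage
            deps:
              - data/raw                                        # input artifacts the stage depends on
              - src/experiments/stages/preprocess.py
            params:
              - preprocess                                      # section of params.yaml this stage reads
            outs:
              - data/preprocessed                               # output artifacts produced by the stage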

    When you have opened your project, run the command sh /scripts/run_experiment.sh -c "my commit message" in the VS Code terminal. A prompt will appear, asking whether to initialize a Git repository for your project. Answer yes to this prompt to get your experiment running.

        sh /scripts/run_experiment.sh -c "my commit message"
        Emily could not find a .git directory necessary to run an experiment with DVC. Initialize git? [Y/N] y
        Initializing git repository...
        Initialized empty Git repository in /workspace/.git/
        [master (root-commit) 6457671] Emily - initial commit

    The run_experiment.sh script does a few things to make MLflow and dvc work seamlessly with Emily. It makes sure your project is a Git repository, makes an initial commit, starts the pipeline that runs your experiment, and lastly pushes your dataset to the dvc remote. To read more about why Git and commits are needed for running the experiments, see the section "Data versioning" later in this guide.
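    The actual script is more involved (argument parsing, error handling, caching), but the order of operations is roughly the following. This is only an illustration, not the script's real contents.

        # Rough order of operations in run_experiment.sh (illustrative, not the real script)
        git init                                  # only if the project is not a Git repository yet
        git add -A
        git commit -m "my commit message"         # the message passed with -c
        dvc repro src/experiments/dvc/dvc.yaml    # run the pipeline stages defined in dvc.yaml
        dvc push                                  # push the resulting data versions to the dvc remote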

    The template comes with a dummy model and a dummy dataset, so it can run straight out of the box.

    After running the script you can go to the server and port where your MLflow reporting server is running (i.e. http://<my-server-host>:<my-port>). If you are unsure where your reporting server is running, read the environment variable MLFLOW_TRACKING_URI defined in the /environments/dev directory. Once on the MLflow dashboard, click on the Experiments tab. Here you will see a list of all the experiments you have run. The name of the experiment you just ran will be emily_experiment_<your emily project name>. You can expand an experiment run to see the different steps of the pipeline and which metadata is tracked in each step. The train step shows the metrics of the training process. The test step also includes your model object, as well as the performance metrics related to it.

    To easily differentiate between runs, you can change the MLflow "run name" by changing the run_name variable in the /src/experiments/dvc/params.yaml file.
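    For example, the relevant part of params.yaml could look like this. Apart from run_name, which the template documents, the exact layout of the file depends on your project.

        # src/experiments/dvc/params.yaml (excerpt; layout may differ in your project)
        run_name: "baseline-model-v1"   # shown as the run name in the MLflow UI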

    Additional logging with MLflow

    The MLflow library integrates with many different machine learning libraries and can extract data from them. While running the training loop, MLflow sends model parameters and metrics calculated during the run to the reporting server.

    If you want more data logged to the reporting server, you can take a look at the MLflow logging documentation.

    For many use cases, the logging functions of most interest are mlflow.log_param(), mlflow.log_metric(), and mlflow.log_artifact().
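    Below is a minimal sketch of how these calls can be used inside a training stage. The run name, parameter names, metric names, and file path are made up for illustration, and the tracking server address is assumed to come from the MLFLOW_TRACKING_URI environment variable described above.

        import mlflow

        # MLflow reads the tracking server address from the MLFLOW_TRACKING_URI environment variable.
        with mlflow.start_run(run_name="logging-example"):
            mlflow.log_param("learning_rate", 0.01)               # a single hyperparameter
            mlflow.log_metric("train_loss", 0.42, step=1)         # a metric value for a given step
            mlflow.log_artifact("data/preprocessed/stats.json")   # any file you want attached to the run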

    You can see a definition of the different logging concepts in MLflow and read more about how MLflow is used here.

    Get started with your own model

    To get started with your own model, modify the template as described in the following steps (you can search for the keyword "TODO" in the template); a rough, illustrative model-class sketch follows the list:

    1. Place your data in the data folder
    2. Prepare training data for the model in src/experiments/stages/preprocess.py
    3. Define your model in the model class in src/experiments/model.py
    4. Define a training function for the model in src/experiments/train.py
    5. Define a test function in src/experiments/stages/test.py
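    The template's actual model class and its required methods are defined in src/experiments/model.py. The sketch below only illustrates the general shape, assuming a scikit-learn-style fit/predict interface; every name in it is a placeholder rather than the template's real API.

        # src/experiments/model.py (illustrative placeholder, not the template's real interface)
        from sklearn.linear_model import LogisticRegression

        class Model:
            """Thin wrapper so the train and test stages only depend on fit/predict."""

            def __init__(self, **params):
                self._clf = LogisticRegression(**params)

            def fit(self, features, labels):
                self._clf.fit(features, labels)
                return self

            def predict(self, features):
                return self._clf.predict(features)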

    Changing model parameters

    The parameters used for a training run are read from the src/experiments/dvc/params.yaml file, so change the model parameters in that file when needed. You can also add more parameters to the file and then update the parameter parser accordingly. Each stage of the pipeline has its own parameter parser function called parse_dvc_params().
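    As an illustration, such a parser could read the file with PyYAML along the lines of the sketch below. The section name and the exact signature of the template's parse_dvc_params() are assumptions, so check the stage files for the real implementation.

        import yaml

        # Sketch of a per-stage parameter parser; the real parse_dvc_params() in each stage may differ.
        def parse_dvc_params(path="src/experiments/dvc/params.yaml", section="train"):
            with open(path) as handle:
                params = yaml.safe_load(handle)
            return params.get(section, {})   # e.g. {"learning_rate": 0.01, "epochs": 10}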

    Code versioning and MLflow

    Along with the parameters and metrics, MLflow also keeps track of which Git commit you are on at the time of training. This gives you a convenient way of keeping track of which code led to which model. MLflow saves the current commit hash on the experiment run. You can see the commit of a run by going to the Experiments tab, clicking an experiment, and clicking a parent run. If a commit hash is not showing on the run, make sure that you have actually created a Git commit. This also means that ideally you should make a commit before you start a training run; otherwise, multiple potentially different runs will be logged with the same commit hash.
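    In practice that can be as simple as the commands below before starting a run by hand (the commit message is just an example); when you start runs through /scripts/run_experiment.sh, the script's -c flag makes a commit for you.

        # Commit your code changes so MLflow can attach a meaningful commit hash to the next run
        git add -A
        git commit -m "tweak learning rate before next run"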

    Data versioning

    As mentioned, dvc is used for dataset versioning. When choosing to use this template, the remote dvc repository is also set up during the project creation process. Dvc stores the different versions of the datasets in the remote repository (note that the dvc remote is separate from the Git repository). When you have made changes to your dataset, either manually or by saving a preprocessed version (see the data/raw and data/preprocessed directories), you can make dvc keep track of this version of the data in two steps:

    1. Make a Git commit to commit the dvc metadata describing the current state of your data. The metadata is kept in the .dvc folder in the project root.
    2. Run the command dvc push to push the changed dataset to the dvc remote for later retrieval.

    When you need to retrieve a previous version of the dataset, simply check out the Git commit containing the dvc metadata and run the command dvc pull. The data is then fetched into the project and is ready to use for a new training run.
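    Put together, the round trip looks roughly like this; the commit message and hash are placeholders.

        # Save the current data version
        git add -A
        git commit -m "add cleaned dataset"   # commits the updated dvc metadata
        dvc push                              # upload the data itself to the dvc remote

        # Later: restore that data version
        git checkout <commit-hash>            # check out the commit holding the dvc metadata
        dvc pull                              # fetch the matching data from the dvc remote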

    You do not need to worry about pushing your data to dvc when running experiments through the scripts in the scripts directory, either /scripts/run_experiment.sh or /scripts/run_experiment_with_caching.sh. The scripts automatically push your data to dvc when you run the experiment.

    Deploying the project

    As with all Emily projects, you can deploy this one easily with the command emily deploy <project name | project id>. However, with this template there is a bit more to it.

    There are a few relevant environment variables that determine which model is loaded on deployment.

    • MODEL_PATH
    • MODEL_NAME
    • MODEL_VERSION
    • MODEL_STAGE

    MODEL_PATH has priority when deploying: if MODEL_PATH is defined, the model is loaded from that path. Otherwise the deployment uses MODEL_NAME, MODEL_VERSION and MODEL_STAGE as follows (see the sketch after the list):

    • If MODEL_NAME is defined and MODEL_VERSION and MODEL_STAGE are not defined, the newest model is fetched from the MLflow tracker.
    • If both MODEL_NAME and MODEL_VERSION are defined, the defined version is fetched.
    • If both MODEL_NAME and MODEL_STAGE are defined, the latest model of that stage is fetched (e.g., latest model in the production stage).
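    The sketch below is not the template's actual deployment code, but an illustration of how such a priority rule could be implemented using the MLflow model registry URIs (models:/<name>/<version> and models:/<name>/<stage>); the function name and the use of mlflow.pyfunc as the loader are assumptions.

        import os
        import mlflow.pyfunc

        # Sketch of the priority rule described above; not the template's real deployment code.
        def load_deployment_model():
            model_path = os.getenv("MODEL_PATH")
            if model_path:                                   # MODEL_PATH wins if it is set
                return mlflow.pyfunc.load_model(model_path)

            name = os.getenv("MODEL_NAME")
            version = os.getenv("MODEL_VERSION")
            stage = os.getenv("MODEL_STAGE")
            if version:                                      # a specific registered version
                return mlflow.pyfunc.load_model(f"models:/{name}/{version}")
            if stage:                                        # latest model in a stage, e.g. Production
                return mlflow.pyfunc.load_model(f"models:/{name}/{stage}")
            return mlflow.pyfunc.load_model(f"models:/{name}/latest")   # newest version overall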

    You can read more about the stages of models in MLflow here.