ClearML — From Data to Model: Navigating ClearML for Efficient Dataset Management and Model Training with Pipelines

Ibrahim Halil Koyuncu
8 min read · Jan 28, 2024

In this article, I am going to provide a step-by-step guide, from preparing datasets to training a machine learning model. I will cover the following topics:

  • A quick introduction to ClearML
  • Creating a project on ClearML
  • Configuring MinIO storage for outputs
  • Managing datasets with ClearML
  • Creating tasks for preparing the dataset and training a model
  • Creating a pipeline from tasks

Prerequisites

Clone the sample project to follow along with the code examples.

git clone https://github.com/menendes/clearml-ml-pipeline-deployment.git

Quick Introduction About ClearML

What is ClearML?

ClearML is a comprehensive machine learning operations (MLOps) platform designed to streamline and enhance the end-to-end process of developing, training, and deploying machine learning models.

Why do we need it?

It provides a centralized platform for managing experiments, tracking model performance, and collaborating on projects, thereby improving team productivity and facilitating efficient model deployment.

What are the benefits of using ClearML?

The benefits of using ClearML include automated experiment tracking, version control for models and datasets, seamless integration with popular ML frameworks, and robust collaboration tools, ultimately enabling teams to accelerate the development and deployment of high-quality machine learning solutions.

Create Project on ClearML

A project is a logical container for organizing and managing machine learning experiments and related resources. It lets you share datasets, models, and configurations specific to a project, making it easier for team members to collaborate and reproduce results.

In the ClearML dashboard, you can create a project as shown below.

Create Project Example 1.0

Notice that I also set a default output destination. It refers to the location where ClearML will store the outputs and artifacts generated during the execution of your machine learning experiments, including model checkpoints, logs, visualizations, and any other relevant outputs. If you want to specify a custom output destination, such as an AWS S3 or MinIO address, you can replace the default output URI with your desired location.
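Besides setting it at the project level in the UI, you can also set the output destination per task in code via the output_uri argument. A minimal sketch, assuming the same MinIO bucket used later in this article (the project and task names are just illustrative):

from clearml import Task

# hypothetical task; output_uri points ClearML at a MinIO bucket
# (adjust the host and bucket name to your own setup)
task = Task.init(
    project_name="my-project",
    task_name="output-uri-example",
    output_uri="s3://192.168.1.22:9000/clearml-outputs",
)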

Configuring Storage

ClearML can store the outputs and artifacts generated during the execution of your machine learning experiments, including model checkpoints, logs, visualizations, and any other relevant outputs. ClearML supports many storage solutions, such as AWS S3, MinIO, Google Cloud Storage, and Azure Storage. In this article I am going to use MinIO, so to configure ClearML to use MinIO, update “clearml.conf” as shown below.

aws {
    s3 {
        # default, used for any bucket not specified below
        key: ""
        secret: ""
        region: ""

        credentials: [
            {
                # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                host: "my-minio-host:9000"  # your MinIO instance address
                key: ""      # your MinIO instance access key
                secret: ""   # your MinIO instance access secret
                multipart: false
                secure: false
            }
        ]
    }
}

A word of warning: if you are going to use clearml-agent to run your tasks remotely, you should also update the “clearml.conf” file on the agent machine. Your task runs on the agent machine and produces outputs, so the agent must be able to access your configured storage; otherwise you will get an access error. In a later article I will show how to create clearml-agents to scale workflows across multiple target machines.
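A quick way to verify that a machine's clearml.conf can actually reach the bucket is to push a small test file through StorageManager. This is only a sanity-check sketch; the file path and bucket address are illustrative:

from clearml import StorageManager

# upload a small local file to the configured MinIO bucket;
# if the credentials in clearml.conf are wrong, this fails with an access error
remote_url = StorageManager.upload_file(
    local_file="/tmp/connectivity_check.txt",
    remote_url="s3://192.168.1.22:9000/clearml-outputs/connectivity_check.txt",
)
print(remote_url)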

Managing Dataset with ClearML

ClearML simplifies dataset management by offering a comprehensive platform for organizing, versioning, and collaborating on datasets within machine learning workflows. It allows users to register datasets, automatically version them, explore data statistics, and synchronize datasets across tasks. I will dive deeper into dataset management in ClearML in another article, so here I will just show you how to create and upload a dataset with the ClearML SDK.

from clearml import Dataset, StorageManager

# download the data
local_iris_pkl = StorageManager.get_local_copy(
    remote_url='https://github.com/allegroai/events/raw/master/odsc20-east/generic/iris_dataset.pkl')

# create the dataset and upload the data
# you can set a version via the dataset_version parameter while creating the dataset
dataset = Dataset.create(
    dataset_name=dataset_name,       # defined earlier in the script
    dataset_project=project_name,    # defined earlier in the script
    output_uri="s3://192.168.1.22:9000/clearml-outputs",  # specify your storage address for the output
)
dataset.add_files(local_iris_pkl)
dataset.upload()
dataset.finalize()

The code above is from the “step1_dataset_artifact.py” script. Here I downloaded the data and then created a dataset to manage it. When we add files to the dataset and upload it, the data is stored in MinIO. When we start the pipeline and “step_1” is invoked, we expect two outcomes. The first is the successful creation of the dataset, which can be verified through the ClearML UI, as shown below.

Iris Dataset in ClearML UI Example 2.0

The second is that the dataset is successfully uploaded to MinIO storage. It should be stored under your bucket; in my case the bucket is named “clearml-outputs”.

The iris_dataset stored in the MinIO Storage Example 2.1

In the “step_2” task, implemented in the “step2_data_processing.py” script, we need to split the dataset to prepare train and test data. For that purpose we need to get a local copy of the data from our dataset. Let's take a closer look at the piece of code below.

import pickle

from clearml import Dataset
from sklearn.model_selection import train_test_split

# get the dataset instance
dataset = Dataset.get(
    dataset_project=project_name,
    dataset_name=args['dataset_name'],
    # dataset_version="1.0.1"  # you can retrieve a specific version of the dataset
)
# download the data via the dataset; it returns the path where the data was downloaded
dataset_directory = dataset.get_local_copy()
# open the local copy
iris = pickle.load(open(dataset_directory + '/' + dataset.list_files()[0], 'rb'))

# "process" the data by splitting it into train and test sets
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=args['test_size'], random_state=args['random_state'])
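The split data then has to be handed over to the downstream training step. One common pattern is to attach it to the task as artifacts (the “train_test_data” pipeline parameter shown later suggests the sample project does something along these lines); the exact mechanism in “step2_data_processing.py” may differ, so treat this as a sketch:

# attach the splits to the current task (created by Task.init in this script)
# so that a later pipeline step can fetch them
task.upload_artifact(name='X_train', artifact_object=X_train)
task.upload_artifact(name='X_test', artifact_object=X_test)
task.upload_artifact(name='y_train', artifact_object=y_train)
task.upload_artifact(name='y_test', artifact_object=y_test)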

Creating Tasks for Preparing the Dataset and Training a Model

Until now we have learned how to download data, create a dataset, upload data to the dataset, and get data from the dataset. But I have not yet addressed which component is responsible for managing and tracking the entire process. That is where the “Task” comes in. A “Task” is a fundamental concept that represents a single execution of a machine learning experiment or a computational job. A task encapsulates various components, including the code, configuration, input data, output models, and associated metadata.

By organizing experiments into tasks, ClearML enables users to manage and reproduce their machine learning workflows efficiently. Tasks can be monitored, compared, and shared among team members, contributing to better collaboration and reproducibility in machine learning projects.

In our example all of the steps are executed and tracked as tasks, so let's continue with how to create a task in the scripts.

from clearml import Task

# create a dataset experiment task
task = Task.init(project_name=project_name, task_name="step_1")

That's all! We just wrote one line of code to initialize a task :)

Since we want to execute our tasks via the pipeline controller, we only create each task in draft status and execute it later. We also connect the task arguments to the task so that they can be changed later from outside the code. Calling task.execute_remotely() without a queue stops the local run and leaves the task in draft status, ready to be enqueued later. Let's examine the code below to see how to implement this.

from clearml import Task

task = Task.init(project_name=project_name, task_name="step_2")
args = {
    'dataset_name': '',
    'random_state': 42,
    'test_size': 0.2,
}

# store the arguments; later we will be able to change them from outside the code
task.connect(args)
print('Arguments: {}'.format(args))

# only create the task in draft status; we will actually execute it later
task.execute_remotely()

Creating a Pipeline From Tasks

The PipelineController class can be used to create a pipeline from tasks. The PipelineController helps us manage and monitor the pipeline. Let's continue step by step through the code.

from clearml import PipelineController

# Create a new pipeline controller.
# The newly created object will launch and monitor the new experiments.
pipe = PipelineController(
    name="Pipeline demo", project=project_name, version="0.0.1", add_pipeline_tags=False
)

In the code above, we created the PipelineController. You can specify the pipeline name, the project, and its version. You can also pass Docker environment information to run the pipeline in Docker, but I will talk about these parameters in later articles.
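Note that pipeline steps are cloned and enqueued for clearml-agents to execute, so each step needs an execution queue. You can set one default queue for all steps (the queue name below is just an example), or override it per step with the execution_queue argument of add_step:

# all steps will be enqueued on this queue unless a step overrides it
pipe.set_default_execution_queue("default")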

pipe.add_parameter("dataset_name", "iris_dataset", "dataset",)
pipe.add_parameter("train_test_dataset_name", "train_test_data", "train and test data",)

Here you can add parameters to your pipeline and pass them to any of the tasks.

pipe.add_step(
    name="stage_data",
    base_task_project=project_name,
    base_task_name="step_1",
    parameter_override={"General/dataset_name": "${pipeline.dataset_name}"},
)

You can add a step to your pipeline as in the code above. Notice that the “base_task_name” value refers to the first task: while creating the first task we named it “step_1”. When this step is invoked, the pipeline takes the template task “step_1”, creates a copy of it, and executes the copy with the overridden parameters.

In task “step_1” we created an argument named “dataset_name”, and using the “parameter_override” parameter we override it with the pipeline parameter value, in this case “iris_dataset”.

def pre_execute_callback_example(a_pipeline, a_node, current_param_override):
    # type: (PipelineController, PipelineController.Node, dict) -> bool
    print(
        "Cloning Task id={} with parameters: {}".format(
            a_node.base_task_id, current_param_override
        )
    )
    # if we want to skip this node (and the subtree of this node) we return False
    # return True to continue DAG execution
    return True


def post_execute_callback_example(a_pipeline, a_node):
    # type: (PipelineController, PipelineController.Node) -> None
    print("Completed Task id={}".format(a_node.executed))
    # if we need the actual executed Task: Task.get_task(task_id=a_node.executed)
    return

pipe.add_step(
    name="stage_process",
    parents=["stage_data"],
    base_task_project=project_name,
    base_task_name="step_2",
    parameter_override={
        "General/dataset_name": "${pipeline.dataset_name}",
        "General/test_size": 0.25,
    },
    pre_execute_callback=pre_execute_callback_example,
    post_execute_callback=post_execute_callback_example,
)

If you want to run your own logic before or after a step executes, you can add callbacks like the ones above.
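A model-training step can be chained after “stage_process” in exactly the same way. The step name and base task name below are hypothetical (check the sample repository for the actual training script), but the pattern is identical:

# hypothetical training step that runs after stage_process
pipe.add_step(
    name="stage_train",
    parents=["stage_process"],
    base_task_project=project_name,
    base_task_name="step_3",  # assumed name of the training task
)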

That's all! After adding each task via the “add_step” method we are ready to start our pipeline.

pipe.start(queue="ihk") # start your pipeline
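The queue passed to start() is where the pipeline controller itself runs, so a clearml-agent must be listening on it. For quick debugging you can instead run the controller (and optionally the steps) on your own machine; a small sketch:

# run the pipeline controller locally; set run_pipeline_steps_locally=True
# to also execute the steps in the local process instead of on agents
pipe.start_locally(run_pipeline_steps_locally=True)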

After executing the “pipeline_from_task.py” script, you can monitor your pipeline from the ClearML UI, as shown below.

Pipeline Example 3.0

You can also see the plots of your model in the ClearML UI.

Plots of Model Example 4.0

In the next articles, I am going to explain how to run your experiments distributed over GPU machines and how to run each experiment in a Docker container.
