
Please respond to the below questions:

1. Has this content piece had a technical review by a TL (from a content accuracy and editorial perspective)? Technical / Product content only.
   • The editorial review must follow guidance from the Docs team; please use the resources linked at the bottom. (Yes - @Ben_Houghton)
2. If part of the Tech Lead team, please use the tag 'Tech-Lead-Content'.
3. Who is your piece aimed at? (Data Scientist)
4. Can the article be linked to another piece/area of the Community?
   1. A Specialist User Group (Data Management, FinCrime, KYC, or Insurance) - Don't think so
   2. Another related article in the Community
5. What is the purpose of your piece? Please include a summary line at the top of the blog.
6. Have you reviewed the content classification guidelines? Can the content be open to the public, i.e. Google?
   • Please note, the assumption is all content should be made public where possible due to the increased impact benefits.
   • For private content, please provide a rationale for why this must be private.
     • Create a short summary which can be posted in a public area of the Community; this short summary will link to the article and promote it.
7. Is there an urgency to publishing or preferential timeframe? No
8. Please specify the most suitable section of the Community Library for publishing. Unsure
9. Editorial checklist:
   1. Have you capitalized any Quantexa terms like 'Entity', 'Document' and 'Entity Resolution'? (Check the docs site Glossary if you aren't sure.)
   2. Have you included a short intro paragraph at the top of the article, outlining the purpose and target audience, and a concluding paragraph including any relevant links to associated content?
   3. Have you used American (US) English - for example, 'analyze' not 'analyse'?
   4. Have you used headings, bullet points and tables to break up the content and grey boxes for code snippets?
10. Please include aliases, otherwise known as alternate keywords, for any terms you have used that are known by multiple names. This helps people find your piece even if they have not used the exact terms you've included. E.g. 'Elasticsearch (or Elastic)'.
11. If you need further editorial guidance (beyond the guides below), tag James Parry or Ffion Owen.

This piece is designed to highlight how Quantexa is meeting ML Operations best practice and be seen as a thought leader in this space.

In the dynamic landscape of machine learning (ML) development, the need for effective experiment tracking is paramount. This is especially true as organizations scale their operations. As the complexity of ML projects grows, so does the necessity for comprehensive tools to manage experimentation, iterations, and model versions efficiently. At Quantexa, we encountered this challenge head-on and sought a robust solution to streamline our ML workflows. We found a big part of the solution was MLflow, a powerful platform designed to simplify the end-to-end machine learning lifecycle. In this blog, we delve into how we leverage MLflow at Quantexa.

What is MLflow?

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It's designed to help data scientists and machine learning engineers track experiments, package code into reproducible runs, and share and deploy models. It also comes with a graphical user interface that makes it easy to use.

At Quantexa, we primarily use MLflow for experiment tracking and reproducibility, which will be the focus of this post. If you are interested in learning more about how you can use MLflow for model deployment, we recommend the material on the MLflow website.

What issues is MLflow solving?

MLflow addresses several critical challenges in our machine learning workflow, enhancing our overall efficiency and effectiveness. These challenges will also be faced by any client building models on top of Quantexa data. Here's how it tackles three of our key issues:

Reproducibility and traceability

At Quantexa, we have a considerable number of machine learning models, and it is critical that the origin of each of these models is well documented and readily available. This is important from a compliance perspective but also for internal purposes. The whole team, as well as clients, should be aware of performance metrics along with the strengths and weaknesses of each model. The models are also valuable intellectual property for the company, so we need to be able to reproduce results.

MLflow helps us with these issues. When models are trained, we record the versions of code, model parameters, and the source of the original data. We can then use this information to exactly re-create a model if ever needed. Furthermore, we have a record of any experiments that were used in selecting models, which means that a data scientist in the future can understand why certain decisions were made during the model prototyping phase.

Collaborative insight

We want data scientists to be able to collaborate easily. MLflow fosters collaboration by providing a centralized repository for tracking and sharing experiments. This is particularly useful when it comes to gaining insight from non-technical stakeholders, who can use the MLflow UI to access all the information they need to assess and give feedback on models. This dramatically speeds up the feedback cycle, leads to better models, and saves time for all involved. MLflow also enables data scientists to collaborate better with one another, encouraging a culture of knowledge exchange and innovation within the team.

Data correctness

At Quantexa, we deal with very large datasets when running predictions using our models. This scale can make it difficult to assess the quality and correctness of data. Manual sense checks are always performed, but mistakes can be missed. This is particularly true if errors only affect a certain portion of the dataset (e.g. one country).

To avoid errors slipping through, we produce lots of metrics and plots when we receive new refreshes of data, and automatically compare them to what we have seen previously. We log all of these metrics in MLflow, making it easy for the whole team to audit the data and spot mistakes such as missing data. Large differences are highlighted automatically, even if they only affect one geographical region or data source. This information can then be easily forwarded to upstream teams to resolve issues. This process dramatically reduces the risk of erroneous predictions caused by incorrect data.
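As an illustration of this kind of automated comparison, here is a small sketch; the metric names and threshold are invented for the example, not our production values:

```python
# Compare a new data refresh's summary metrics against the previous refresh
# and flag any metric that moved by more than a relative threshold.
def flag_large_changes(previous, current, rel_threshold=0.2):
    """Return {metric: (old, new)} for metrics whose relative change
    exceeds rel_threshold. Metric names here are illustrative."""
    flagged = {}
    for name, old_value in previous.items():
        new_value = current.get(name)
        if new_value is None or old_value == 0:
            continue
        if abs(new_value - old_value) / abs(old_value) > rel_threshold:
            flagged[name] = (old_value, new_value)
    return flagged

# Hypothetical per-country metrics from two consecutive refreshes.
previous = {"uk_null_pct": 0.02, "uk_mean_txn": 120.0}
current = {"uk_null_pct": 0.35, "uk_mean_txn": 118.0}
print(flag_large_changes(previous, current))
```

Only the metric that jumped (the null percentage) is flagged; in practice the flagged output would itself be logged to MLflow as an artifact for the team to review.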

How do we use it?

In practice, we mainly use MLflow for three different workflows.

Feature generation

There is a pre-processing step that transforms data produced by Quantexa Entity Resolution into a different tabular format. This new format is then used by our models for training and prediction. We record aggregate information about data produced during this pre-processing step in the form of plots and CSVs. This aggregate information is compared to previous runs to automatically highlight big discrepancies in easy-to-read files. As mentioned earlier, this reduces the risk of downstream tasks using erroneous data, which is critical for model training and prediction.

Code versions and parameters for this pipeline are recorded. This allows us to always re-create the pre-processing step. It also provides an understanding of the source of the data we are training our models on.

Here is a list of some of the information we record at this stage:

• Mean, median and variance of features
• Data drift scores for each feature
• Percentage of null values in a column

A plot produced to visualize the top drifting features across countries. Country and feature names have been omitted.


A plot produced to visualize percentage of fields that have missing values in the Documents. These fields are used to produce features. Country and field names have been omitted.
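A small pure-Python sketch of how per-feature summary statistics like those above might be computed (illustrative only; in production these are computed at scale and logged to MLflow as plots and CSVs):

```python
import statistics

def summarize_feature(values):
    """Summary statistics for a single feature column, where None marks
    a missing value. Illustrative sketch, not our production pipeline."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        # Population variance over the observed values.
        "variance": statistics.pvariance(present),
        # Share of rows where this feature is missing.
        "null_pct": 100.0 * (len(values) - len(present)) / len(values),
    }

stats = summarize_feature([1.0, 2.0, 3.0, None])
```

Each run's statistics can then be compared against the previous refresh to surface drift.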

Model training

We use MLflow to record model training. This can be to track multiple experiments during the prototyping phase of a project or be used as reference for models in production. All the required information is recorded such that models can be re-produced if required.

Here is a list of some of the information we record when training models:

• Model hyperparameters
• Test metrics such as precision and recall
• Confusion matrix


An ROC curve (receiver operating characteristic curve) produced against the test set during training.


A cumulative gains curve produced against the test set during training.
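Precision and recall, two of the test metrics mentioned above, are derived directly from confusion-matrix counts. A minimal sketch, with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts:
    true positives, false positives, false negatives."""
    precision = tp / (tp + fp)  # of everything flagged, how much was right
    recall = tp / (tp + fn)     # of everything real, how much was found
    return precision, recall

# Hypothetical counts for a binary classifier evaluated on a test set.
p, r = precision_recall(tp=80, fp=20, fn=40)
```

In a training run, each of these values would be recorded with `mlflow.log_metric` so they appear alongside the hyperparameters in the MLflow UI.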

Staging model evaluation

We are often iterating on our production machine learning models to improve their performance. We do this by adding more labeled examples to their training datasets or by adding more features. When we train a new model that we think should replace the existing model, we want to ensure that the new model is superior to the model it replaces. We can look at the performance of the model across a test set, but sometimes this does not tell the full story. This is particularly true if the amount of labeled data we have is limited.

Before upgrading to any new model, we run an evaluation across the whole unlabeled dataset to see how the models differ in practice. This is important for understanding the real business impact of a change. This evaluation is recorded in MLflow such that it can be reviewed by the data scientists and any relevant stakeholders.

Here is a list of some of the information we record during the evaluation:

• Average size of differences in model score between the old model and the new model
• Individual examples with the biggest differences in model score between the old model and the new model
• Individual examples with SHAP explainability plots


Explainability plot produced when the model is used for inference across unlabeled data. Feature names have been omitted.


Visualization of the different scores produced by a model currently in production compared to the model selected to replace it.
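The first two items in the evaluation list above can be sketched as follows (the example identifiers and scores are invented):

```python
def score_differences(old_scores, new_scores, top_k=2):
    """Compare old vs. new model scores over the same examples, returning
    the mean absolute difference and the ids of the examples whose
    scores moved the most. Illustrative sketch only."""
    diffs = {eid: abs(new_scores[eid] - score) for eid, score in old_scores.items()}
    mean_abs_diff = sum(diffs.values()) / len(diffs)
    biggest_movers = sorted(diffs, key=diffs.get, reverse=True)[:top_k]
    return mean_abs_diff, biggest_movers

# Hypothetical scores for three examples under the old and new models.
old = {"a": 0.10, "b": 0.50, "c": 0.90}
new = {"a": 0.15, "b": 0.95, "c": 0.88}
mean_diff, movers = score_differences(old, new, top_k=1)
```

The biggest movers are exactly the examples worth inspecting with SHAP plots before promoting the new model.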

Conclusion

In the dynamic realm of machine learning, effective experiment tracking is indispensable, especially with growing organizational scale. At Quantexa, MLflow has emerged as our solution, simplifying the ML lifecycle and addressing critical challenges in reproducibility, collaboration, and data correctness. Leveraging MLflow not only enhances our internal workflows but also empowers clients building models on Quantexa data to navigate complexities seamlessly.

Editorial Resources

Become a better writer ↗

Content models ↗

Quantexa writing guide ↗

Editing tools: Grammarly and Hemingway

