Tool Description Links

mltrace - a Python package to make ML models observable

mltrace is a lightweight, open-source Python tool to get 'bolt-on' observability in ML pipelines. It offers an interface to define data and ML tests for components in pipelines, a Python API to log versions of data and pipeline components and a database to store information about component runs.

Amazon SageMaker Debugger

Amazon SageMaker Debugger automates the debugging process of machine learning training jobs. It lets you run your own training scripts, build customized Hooks and Rules for configuring tensors, and make the tensors available for analysis by saving them to an Amazon S3 bucket.

DataPrep - prepare your data with a few lines of code

DataPrep lets you collect data from common data sources, perform exploratory data analysis, and clean and standardize data.

AutoAugment - Learning Augmentation Policies from Data

AutoAugment is a reinforcement learning algorithm developed by Google Brain, which increases both the amount and diversity of data in an existing training dataset. It can be used to teach a model the image invariances of the data domain, making the neural network invariant to these symmetries and thus improving its performance.

Meerkat - handling data for Machine Learning models

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

CleanLab - clean dataset labels

Cleanlab is the data-centric ML-ops package for machine learning with noisy labels. Cleanlab cleans labels and supports finding, quantifying, and learning with label errors in datasets.
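
To make the idea of "finding label errors" concrete, here is a minimal pure-Python sketch of confidence-threshold label checking, loosely in the spirit of confident learning. It is not Cleanlab's API: the function name `find_label_issues_sketch`, the toy labels, and the toy probabilities are all hypothetical, and the real package uses a more careful procedure.

```python
def find_label_issues_sketch(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels:     list of integer class ids (the possibly noisy given labels)
    pred_probs: list of per-class probability lists from some trained model
    Assumes every class appears at least once in `labels`.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k,
    # measured over the examples that were labelled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            if likely != y:
                issues.append(i)
    return issues


# Toy data (hypothetical): example 2 is labelled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
suspects = find_label_issues_sketch(labels, pred_probs)
```

In this toy run only the third example is flagged; the last example is left alone because the model is not confident enough in either class to contradict its label.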

Knodle - Knowledge-supervised Deep Learning Framework

A framework for weak supervision with neural networks. It provides a modularization for separating weak data annotations, powerful deep learning models, and methods for improving weakly supervised training.

skweak - Weak supervision for NLP

A framework to define a set of labelling functions to automatically label documents, and then aggregate their results to obtain a labelled version of the corpus.
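
The labelling-function workflow described above can be sketched in a few lines of plain Python. This is not skweak's API: the three labelling functions, the label names, and the toy corpus are hypothetical, and simple majority voting stands in here for the library's own aggregation models.

```python
from collections import Counter

ABSTAIN = None  # returned when a labelling function has no opinion


def lf_contains_refund(text):
    return "COMPLAINT" if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    return "PRAISE" if "thank" in text.lower() else ABSTAIN


def lf_exclamation(text):
    return "COMPLAINT" if text.count("!") >= 2 else ABSTAIN


LABELLING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_exclamation]


def aggregate(text, lfs=LABELLING_FUNCTIONS):
    """Majority vote over the labelling functions that did not abstain."""
    votes = [label for label in (lf(text) for lf in lfs) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    # Ties are broken by whichever label was voted first.
    return Counter(votes).most_common(1)[0][0]


corpus = [
    "I want a refund now!!",
    "Thank you for the quick reply.",
    "The parcel arrived on Tuesday.",
]
labelled = [(doc, aggregate(doc)) for doc in corpus]
```

Documents on which every function abstains stay unlabelled, which is typical of weak supervision: coverage grows by adding more labelling functions rather than by hand-labelling.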

Albumentations - Image augmentation

A fast and flexible Python library for image augmentation with a large set of built-in augmentations. Widely used in the industry.

nlpaug - Augment NLP datasets

Allows you to generate synthetic data for improving model performance without manual effort. Supports both textual and audio input.

HoloClean: A ML System for Data Enrichment

Built on top of PyTorch and PostgreSQL, this is a statistical inference engine to impute, clean, and enrich data. It is a weakly supervised machine learning system that uses available quality rules, value correlations, reference data, and other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks.