Tool Description Links

AutoAugment - Learning Augmentation Policies from Data

AutoAugment, developed by Google Brain, uses reinforcement learning to search for data augmentation policies that increase both the amount and diversity of data in an existing training dataset. The learned policies teach a neural network the image invariances of its data domain, which improves its performance.
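
To make the notion of an augmentation policy concrete, here is a minimal sketch in plain Python. It assumes a policy is a list of sub-policies, each a sequence of (operation, probability, magnitude) triples. The operations, the `POLICY` values, and the `apply_policy` helper are all hypothetical; the search that AutoAugment performs to find good values is not shown.

```python
import random

# Hypothetical toy operations on a grayscale image stored as a list of rows.
def flip_horizontal(image, magnitude):
    # Magnitude is ignored: a flip has no strength parameter.
    return [list(reversed(row)) for row in image]

def brighten(image, magnitude):
    # Add `magnitude` to every pixel, clamped to the 0..255 range.
    return [[min(255, max(0, px + magnitude)) for px in row] for row in image]

# Made-up policy values for illustration; a real search would learn them.
POLICY = [
    [(flip_horizontal, 0.5, 0), (brighten, 0.8, 30)],
    [(brighten, 0.6, -20), (flip_horizontal, 0.3, 0)],
]

def apply_policy(image, policy, rng):
    """Pick one sub-policy at random and apply its operations in order."""
    sub_policy = rng.choice(policy)
    for operation, probability, magnitude in sub_policy:
        # Each operation fires independently with its own probability.
        if rng.random() < probability:
            image = operation(image, magnitude)
    return image

rng = random.Random(0)
image = [[0, 100, 200], [50, 150, 250]]
augmented = apply_policy(image, POLICY, rng)
```

Calling `apply_policy` once per training image yields a differently augmented copy each time, which is where the added diversity comes from.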

Meerkat - handling data for Machine Learning models

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Cleanlab - clean dataset labels

Cleanlab is the data-centric ML-ops package for machine learning with noisy labels. Cleanlab cleans labels and supports finding, quantifying, and learning with label errors in datasets.
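
The idea of finding label errors from model predictions can be sketched without any library. The example below is a simplified illustration, not Cleanlab's API: the function name `flag_label_issues` and the toy data are assumptions. It flags an example when the model is confident about a class other than the given label, using the mean predicted probability among examples labelled with a class as that class's confidence threshold.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks wrong."""
    num_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the examples that are labelled k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    issues = []
    for i, (probs, given) in enumerate(zip(pred_probs, labels)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue
        likely = max(confident, key=lambda k: probs[k])
        if likely != given:
            issues.append(i)
    return issues

labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labelled 0, but the model strongly predicts 1
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
suspects = flag_label_issues(labels, pred_probs)
```

In practice the predicted probabilities should come from held-out predictions (for example via cross-validation), so the model has not simply memorised the noisy labels.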

Knodle - Knowledge-supervised Deep Learning Framework

A framework for weak supervision with neural networks. Its modular design separates weak data annotations, powerful deep learning models, and methods for improving weakly supervised training.

skweak - Weak supervision for NLP

A framework to define a set of labelling functions to automatically label documents, and then aggregate their results to obtain a labelled version of the corpus.
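
The two steps described above, labelling functions followed by aggregation, can be sketched in plain Python. This is not skweak's API: the labelling functions are hypothetical, and a simple majority vote stands in for whatever aggregation model the framework actually uses.

```python
from collections import Counter

ABSTAIN = None

# Hypothetical labelling functions: each takes a token and returns a label,
# or ABSTAIN when it has no opinion.
def lf_capitalised(token):
    return "ENTITY" if token[:1].isupper() else ABSTAIN

def lf_known_company(token):
    return "ENTITY" if token in {"Google", "Acme"} else ABSTAIN

def lf_stopword(token):
    return "O" if token.lower() in {"the", "a", "at", "in", "works"} else ABSTAIN

LABELLING_FUNCTIONS = [lf_capitalised, lf_known_company, lf_stopword]

def aggregate(token, labelling_functions, default="O"):
    """Majority vote over the non-abstaining labelling functions."""
    votes = [lf(token) for lf in labelling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return default
    # On a tie, the label that was voted first wins.
    return Counter(votes).most_common(1)[0][0]

tokens = "the engineer works at Google".split()
labelled = [(t, aggregate(t, LABELLING_FUNCTIONS)) for t in tokens]
```

Majority voting treats every labelling function as equally reliable; a learned aggregation model can instead weight functions by their estimated accuracy.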

Albumentations - Image augmentation

A fast and flexible Python library for image augmentation with a large set of built-in augmentations. Widely used in the industry.

nlpaug - Augment NLP datasets

Allows you to generate synthetic data for improving model performance without manual effort. Supports both textual and audio input.
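
As a rough illustration of what generating synthetic text means, the sketch below produces variants of a sentence by swapping random word pairs. It uses only the standard library and is not nlpaug's API; `random_swap` and `augment` are hypothetical helpers, and word swapping is just one of many possible augmentation strategies.

```python
import random

def random_swap(text, n_swaps, rng):
    """Swap `n_swaps` randomly chosen pairs of words in `text`."""
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their words.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def augment(text, n_variants, seed=0):
    """Return `n_variants` augmented copies of `text`, one swap each."""
    rng = random.Random(seed)
    return [random_swap(text, 1, rng) for _ in range(n_variants)]

sentence = "a quick brown fox jumps over the lazy dog"
variants = augment(sentence, 3)
```

Each variant keeps the original vocabulary but changes word order, so it can share the original example's label for tasks that are not sensitive to small reorderings.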

HoloClean - An ML System for Data Enrichment

Built on top of PyTorch and PostgreSQL, this is a statistical inference engine to impute, clean, and enrich data. It is a weakly supervised machine learning system that uses available quality rules, value correlations, reference data, and other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks.
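
To illustrate how value correlations can drive imputation, here is a deliberately simple sketch. It is not HoloClean's API or model: instead of a probabilistic model it uses a plain co-occurrence count, and the `impute_by_cooccurrence` helper and the toy rows are assumptions made for the example.

```python
from collections import Counter, defaultdict

def impute_by_cooccurrence(rows, target, evidence):
    """Fill missing `target` values with the value that most often co-occurs
    with the row's `evidence` value in the complete rows."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is not None and row[evidence] is not None:
            counts[row[evidence]][row[target]] += 1

    repaired = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row[target] is None and row[evidence] in counts:
            row[target] = counts[row[evidence]].most_common(1)[0][0]
        repaired.append(row)
    return repaired

rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},  # a typo that is outvoted
    {"zip": "60608", "city": None},      # missing value to impute
    {"zip": "10001", "city": None},      # no evidence, stays missing
]
repaired = impute_by_cooccurrence(rows, target="city", evidence="zip")
```

A probabilistic system can combine many such signals (quality rules, reference data, several correlated columns) and report a confidence for each repair, which a single frequency count cannot do.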