
26. Renumics Spotlight: Explore unstructured datasets

Create interactive visualizations from unstructured datasets and explore them.
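
A minimal sketch of launching Spotlight on a pandas DataFrame, assuming the `renumics-spotlight` package and its documented `spotlight.show` entry point; the column names and image paths below are purely illustrative:

```python
import pandas as pd
from renumics import spotlight

df = pd.DataFrame(
    {
        "image": ["img/cat_01.png", "img/dog_01.png"],  # illustrative local image paths
        "label": ["cat", "dog"],
    }
)

# Opens an interactive browser UI for exploring the dataset;
# the dtype hint tells Spotlight to render the column as images.
spotlight.show(df, dtype={"image": spotlight.Image})
```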

25. Dialog Studio: A diverse collection of conversational datasets

A wide collection of conversational datasets and a Python-based library to load them.
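
As a sketch, the collections can be pulled through the Hugging Face `datasets` hub under the `Salesforce/dialogstudio` namespace; `TweetSumm` below is one sub-collection, and recent `datasets` versions may additionally require `trust_remote_code=True` for script-based datasets:

```python
from datasets import load_dataset

# Load one DialogStudio sub-collection; pick any config listed in the repo.
ds = load_dataset("Salesforce/dialogstudio", "TweetSumm")
print(ds["train"][0])  # one dialogue record with its annotations
```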

24. Simple offline image classification tool

A simple Python-based image classification tool that you can run locally.

23. Aurum: Discovering Data in Lakes, Clouds and Databases

Identify relevant content across multiple data sources, which may include tabular files such as CSVs and relational tables.

22. G-Mixup: Graph data augmentation

Mixup data augmentation for the classification of graph-structured data.
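
G-Mixup interpolates class-level graphons rather than raw inputs; as background, here is a minimal sketch of the vanilla mixup recipe on feature vectors that G-Mixup lifts to graphs (not the graphon estimation itself):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convexly combine two samples and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy example: blend one sample from each of two classes.
x_mix, y_mix = mixup(
    np.array([1.0, 0.0]), np.array([1.0, 0.0]),
    np.array([0.0, 1.0]), np.array([0.0, 1.0]),
)
```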

21. Fantastic Data and How to Query Them

A unified framework for combining different datasets in order to query them easily using SPARQL.
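
The framework's own API is not shown here; purely as an illustration of the query side, this sketch issues a SPARQL query from Python with the SPARQLWrapper library against a public endpoint:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia endpoint, used only for demonstration.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/SPARQL> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])
```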

20. PyHard: Generate hardness embeddings for easy data-centric analysis

PyHard employs Instance Space Analysis (ISA) to analyse performance at the instance level, providing an alternative method for visualizing algorithm performance on each data sample.

19. AlphaClean: Automatic generation of data cleaning pipelines

AlphaClean declaratively synthesizes data cleaning programs. Given a quality specification and a list of allowed data transformations, it searches for the sequence of transformations that best satisfies the specification.

18. Spotlight: A data curation tool for unstructured data

Understand unstructured datasets by exploring them interactively. Spotlight helps you identify and fix problematic data segments using data enrichments (e.g., features, embeddings, uncertainties).

17. Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling

Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations. This tool first proposes heuristics and then learns from user feedback given on each proposed heuristic.

16. Sim2Real Docs: Domain Randomization for Documents in Natural Scenes

Sim2Real Docs is a Python framework for synthesizing datasets and performing domain randomization of documents in natural scenes. It enables programmatic 3D rendering of documents using Blender.

15. Picket: Guarding Against Corrupted Data in Tabular Data

Picket is a system that safeguards against data corruptions during both training and deployment of machine learning models over tabular data.

14. Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail

A self-supervised named entity disambiguation (NED) system for English, built to improve disambiguation of entities that occur infrequently, or not at all, in training data (so-called tail entities).

13. Deequ - unit tests for data

A library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets. Deequ itself is written in Scala; PyDeequ provides a Python interface.
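
A minimal sketch of such a data unit test via PyDeequ, assuming an existing SparkSession `spark` and a DataFrame `df` with the (illustrative) columns `id` and `amount`:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Declare constraints the data must satisfy.
check = Check(spark, CheckLevel.Error, "Basic data quality checks")

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasSize(lambda n: n >= 1000)  # expect at least 1000 rows
        .isComplete("id")                   # no NULLs in `id`
        .isUnique("id")                     # `id` is a key
        .isNonNegative("amount")            # no negative amounts
    )
    .run()
)

# Inspect per-constraint outcomes as a Spark DataFrame.
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```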

12. Raha & Baran - configuration-free error detection tools

Raha and Baran use a library of error detectors, and treat the output of each as a feature in a holistic detection model. They then use clustering and active learning to train the holistic model with few labels.

11. mltrace - a Python package to make ML models observable

mltrace is a lightweight, open-source Python tool for 'bolt-on' observability in ML pipelines. It offers an interface to define data and ML tests for components in pipelines, a Python API to log versions of data and pipeline components, and a database to store information about component runs.

10. Amazon SageMaker Debugger

Amazon SageMaker Debugger automates the debugging of machine learning training jobs. It lets you run your own training scripts, build customized Hooks and Rules for configuring tensors, and make those tensors available for analysis by saving them to an Amazon S3 bucket.
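
A hedged sketch of attaching two of Debugger's built-in rules to a training job through the SageMaker Python SDK; the entry point, IAM role, instance type, and framework versions below are placeholders to adapt:

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                     # your own training script
    role="arn:aws:iam::<account>:role/<role>",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.13",
    py_version="py39",
    rules=[
        # Built-in rules that watch saved tensors during training.
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ],
)
estimator.fit("s3://<bucket>/train/")  # placeholder S3 input
```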

9. DataPrep - prepare your data with a few lines of code

DataPrep allows you to collect data from common data sources, perform exploratory data analysis, and clean and standardize data.
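
For instance, a few lines of its documented EDA API cover profiling end to end (the CSV path and column name are illustrative):

```python
import pandas as pd
from dataprep.eda import plot, create_report

df = pd.read_csv("data.csv")      # illustrative input
plot(df)                          # interactive overview of every column
plot(df, "price")                 # drill into a single column
create_report(df).show_browser()  # full HTML profiling report
```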

8. AutoAugment - Learning Augmentation Policies from Data

AutoAugment is a reinforcement learning algorithm developed by Google Brain that increases both the amount and diversity of data in an existing training dataset. It can be used to teach a model invariances in the image domain, making the neural network invariant to these important symmetries and thus improving its performance.
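
The learned policies are also available off the shelf in torchvision (the torchvision re-implementation, not Google Brain's original search code); a pre-searched ImageNet policy drops into a standard transform pipeline:

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Apply the pre-learned ImageNet policy to PIL images before tensor conversion.
train_transform = transforms.Compose([
    AutoAugment(AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```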

7. Meerkat - handling data for Machine Learning models

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

6. CleanLab - clean dataset labels

Cleanlab is the data-centric ML-ops package for machine learning with noisy labels. Cleanlab cleans labels and supports finding, quantifying, and learning with label errors in datasets.
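
A minimal sketch with cleanlab's v2-style `find_label_issues`: feed it the given labels plus out-of-sample predicted probabilities (e.g. from cross-validation); the tiny arrays below are toy data:

```python
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 1, 0])  # possibly noisy labels
pred_probs = np.array([          # out-of-sample P(class | x) per sample
    [0.9, 0.1],
    [0.2, 0.8],
    [0.95, 0.05],                # model strongly disagrees with label 1
    [0.7, 0.3],
])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # indices of samples whose labels look wrong
```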

5. Knodle - Knowledge-supervised Deep Learning Framework

A framework for weak supervision with neural networks. It provides a modularization for separating weak data annotations, powerful deep learning models, and methods for improving weakly supervised training.

4. skweak - Weak supervision for NLP

A framework to define a set of labelling functions to automatically label documents, and then aggregate their results to obtain a labelled version of the corpus.
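
A minimal sketch following the pattern in skweak's README: one heuristic labelling function over spaCy docs, aggregated with an HMM (the one-sentence corpus is a toy; real use aggregates many functions over many documents):

```python
import spacy
from skweak import heuristics, aggregation

def money_detector(doc):
    """Yield (start, end, label) spans for simple money mentions."""
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i - 1, tok.i + 1, "MONEY"

nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(["He paid $ 20 for the book."]))

# Apply the labelling function, then aggregate its (noisy) votes.
lf = heuristics.FunctionAnnotator("money", money_detector)
docs = list(lf.pipe(docs))

hmm = aggregation.HMM("hmm", ["MONEY"])
docs = hmm.fit_and_aggregate(docs)  # aggregated spans land in doc.spans["hmm"]
```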

3. Albumentations - Image augmentation

A fast and flexible Python library for image augmentation with a large set of built-in augmentations. Widely used in the industry.
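
A minimal pipeline with its documented `Compose` pattern; Albumentations operates on numpy arrays, e.g. images loaded with OpenCV (the file path is illustrative):

```python
import cv2
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Rotate(limit=15, p=0.5),
])

image = cv2.imread("example.jpg")            # BGR numpy array
augmented = transform(image=image)["image"]  # augmented numpy array
```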

2. nlpaug - Augment NLP datasets

Allows you to generate synthetic data to improve model performance without manual effort. Supports both textual and audio input.
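
A short sketch of the word-level synonym augmenter (WordNet-backed, per the documented API; the NLTK WordNet corpus must be downloaded first):

```python
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
text = "The quick brown fox jumps over the lazy dog"
# Recent nlpaug versions return a list of augmented strings.
print(aug.augment(text))
```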

1. HoloClean: An ML System for Data Enrichment

Built on top of PyTorch and PostgreSQL, HoloClean is a statistical inference engine for imputing, cleaning, and enriching data. As a weakly supervised machine learning system, it uses available quality rules, value correlations, reference data, and other signals to build a probabilistic model that accurately captures the data-generation process, then applies that model to a variety of data curation tasks.