
26. Renumics Spotlight: Explore unstructured datasets

Create interactive visualizations from unstructured datasets and explore them.
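
A minimal sketch of launching Spotlight on a pandas DataFrame, assuming the `renumics-spotlight` package and its documented `spotlight.show` entry point; the column names and image paths below are purely illustrative:

```python
import pandas as pd
from renumics import spotlight

df = pd.DataFrame(
    {
        "image": ["img/cat_01.png", "img/dog_01.png"],  # illustrative local image paths
        "label": ["cat", "dog"],
    }
)

# Opens an interactive browser UI for exploring the dataset;
# the dtype hint tells Spotlight to render the column as images.
spotlight.show(df, dtype={"image": spotlight.Image})
```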

25. Dialog Studio: A diverse collection of conversational datasets

A wide collection of conversational datasets and a Python-based library to load them.
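
As a sketch, the collections can be pulled through the Hugging Face `datasets` hub under the `Salesforce/dialogstudio` namespace; `TweetSumm` below is one sub-collection, and recent `datasets` versions may additionally require `trust_remote_code=True` for script-based datasets:

```python
from datasets import load_dataset

# Load one DialogStudio sub-collection; pick any config listed in the repo.
ds = load_dataset("Salesforce/dialogstudio", "TweetSumm")
print(ds["train"][0])  # one dialogue record with its annotations
```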

24. Simple offline image classification tool

A simple Python-based image classification tool that you can run locally.

23. Aurum: Discovering Data in Lakes, Clouds and Databases

Identify relevant content across multiple data sources, which may include tabular files such as CSVs and relational tables.

22. G-Mixup: Graph data augmentation

Mixup data augmentation for the classification of graph-structured data.
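
G-Mixup interpolates class-level graphons rather than raw inputs; as background, here is a minimal sketch of the vanilla mixup recipe on feature vectors that G-Mixup lifts to graphs (not the graphon estimation itself):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convexly combine two samples and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy example: blend one sample from each of two classes.
x_mix, y_mix = mixup(
    np.array([1.0, 0.0]), np.array([1.0, 0.0]),
    np.array([0.0, 1.0]), np.array([0.0, 1.0]),
)
```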

21. Fantastic Data and How to Query Them

A unified framework for combining different datasets in order to query them easily using SPARQL.
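
The framework's own API is not shown here; purely as an illustration of the query side, this sketch issues a SPARQL query from Python with the SPARQLWrapper library against a public endpoint:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia endpoint, used only for demonstration.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/SPARQL> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])
```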

20. PyHard: Generate hardness embeddings for easy data-centric analysis

PyHard employs Instance Space Analysis (ISA) to analyse performance at the instance level, providing an alternative method for visualizing algorithm performance on each data sample.

19. AlphaClean: Automatic generation of data cleaning pipelines

AlphaClean declaratively synthesizes data cleaning programs. Given a quality specification and a list of allowed data transformations, it searches for the sequence of transformations that best satisfies the specification.

18. Spotlight: A data curation tool for unstructured data

Understand unstructured datasets by exploring them interactively. Spotlight helps you identify and fix problematic data segments using data enrichments (e.g., features, embeddings, uncertainties).

17. Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling

Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations. This tool first proposes heuristics and then learns from user feedback given on each proposed heuristic.

16. Sim2Real Docs: Domain Randomization for Documents in Natural Scenes

Sim2Real Docs is a Python framework for synthesizing datasets and performing domain randomization of documents in natural scenes. It enables programmatic 3D rendering of documents using Blender.

15. Picket: Guarding Against Corrupted Data in Tabular Data

Picket is a system that safeguards against data corruptions during both training and deployment of machine learning models over tabular data.

14. Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail

A self-supervised named entity disambiguation (NED) system for English, built to improve disambiguation of entities that occur infrequently, or not at all, in training data (so-called tail entities).

13. Deequ - unit tests for data

A library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets. Deequ itself is written in Scala; PyDeequ provides a Python interface.
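
A minimal sketch of such a data unit test via PyDeequ, assuming an existing SparkSession `spark` and a DataFrame `df` with the (illustrative) columns `id` and `amount`:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Declare constraints the data must satisfy.
check = Check(spark, CheckLevel.Error, "Basic data quality checks")

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasSize(lambda n: n >= 1000)  # expect at least 1000 rows
        .isComplete("id")                   # no NULLs in `id`
        .isUnique("id")                     # `id` is a key
        .isNonNegative("amount")            # no negative amounts
    )
    .run()
)

# Inspect per-constraint outcomes as a Spark DataFrame.
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```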

12. Raha & Baran - configuration-free error detection tools

Raha and Baran use a library of error detectors, and treat the output of each as a feature in a holistic detection model. They then use clustering and active learning to train the holistic model with few labels.

11. mltrace - a Python package to make ML models observable

mltrace is a lightweight, open-source Python tool for 'bolt-on' observability in ML pipelines. It offers an interface to define data and ML tests for components in pipelines, a Python API to log versions of data and pipeline components, and a database to store information about component runs.

10. Amazon SageMaker Debugger

Amazon SageMaker Debugger automates the debugging of machine learning training jobs. It lets you run your own training scripts, build customized Hooks and Rules for configuring tensors, and make those tensors available for analysis by saving them to an Amazon S3 bucket.
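
A hedged sketch of attaching two of Debugger's built-in rules to a training job through the SageMaker Python SDK; the entry point, IAM role, instance type, and framework versions below are placeholders to adapt:

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                     # your own training script
    role="arn:aws:iam::<account>:role/<role>",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.13",
    py_version="py39",
    rules=[
        # Built-in rules that watch saved tensors during training.
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ],
)
estimator.fit("s3://<bucket>/train/")  # placeholder S3 input
```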

9. DataPrep - prepare your data with a few lines of code

DataPrep allows you to collect data from common data sources, perform exploratory data analysis, and clean and standardize data.
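
For instance, a few lines of its documented EDA API cover profiling end to end (the CSV path and column name are illustrative):

```python
import pandas as pd
from dataprep.eda import plot, create_report

df = pd.read_csv("data.csv")      # illustrative input
plot(df)                          # interactive overview of every column
plot(df, "price")                 # drill into a single column
create_report(df).show_browser()  # full HTML profiling report
```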

8. AutoAugment - Learning Augmentation Policies from Data

AutoAugment is a reinforcement learning algorithm developed by Google Brain that increases both the amount and diversity of data in an existing training dataset. It can be used to teach a model invariances in the image domain, making the neural network invariant to these important symmetries and thus improving its performance.
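
The learned policies are also available off the shelf in torchvision (the torchvision re-implementation, not Google Brain's original search code); a pre-searched ImageNet policy drops into a standard transform pipeline:

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Apply the pre-learned ImageNet policy to PIL images before tensor conversion.
train_transform = transforms.Compose([
    AutoAugment(AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```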

7. Meerkat - handling data for Machine Learning models

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

6. CleanLab - clean dataset labels

Cleanlab is the data-centric ML-ops package for machine learning with noisy labels. Cleanlab cleans labels and supports finding, quantifying, and learning with label errors in datasets.
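
A minimal sketch with cleanlab's v2-style `find_label_issues`: feed it the given labels plus out-of-sample predicted probabilities (e.g. from cross-validation); the tiny arrays below are toy data:

```python
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 1, 0])  # possibly noisy labels
pred_probs = np.array([          # out-of-sample P(class | x) per sample
    [0.9, 0.1],
    [0.2, 0.8],
    [0.95, 0.05],                # model strongly disagrees with label 1
    [0.7, 0.3],
])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # indices of samples whose labels look wrong
```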

5. Knodle - Knowledge-supervised Deep Learning Framework

A framework for weak supervision with neural networks. It provides a modularization for separating weak data annotations, powerful deep learning models, and methods for improving weakly supervised training.

4. skweak - Weak supervision for NLP

A framework to define a set of labelling functions to automatically label documents, and then aggregate their results to obtain a labelled version of the corpus.
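
A minimal sketch following the pattern in skweak's README: one heuristic labelling function over spaCy docs, aggregated with an HMM (the one-sentence corpus is a toy; real use aggregates many functions over many documents):

```python
import spacy
from skweak import heuristics, aggregation

def money_detector(doc):
    """Yield (start, end, label) spans for simple money mentions."""
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i - 1, tok.i + 1, "MONEY"

nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(["He paid $ 20 for the book."]))

# Apply the labelling function, then aggregate its (noisy) votes.
lf = heuristics.FunctionAnnotator("money", money_detector)
docs = list(lf.pipe(docs))

hmm = aggregation.HMM("hmm", ["MONEY"])
docs = hmm.fit_and_aggregate(docs)  # aggregated spans land in doc.spans["hmm"]
```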

3. Albumentations - Image augmentation

A fast and flexible Python library for image augmentation with a large set of built-in augmentations. Widely used in the industry.
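
A minimal pipeline with its documented `Compose` pattern; Albumentations operates on numpy arrays, e.g. images loaded with OpenCV (the file path is illustrative):

```python
import cv2
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Rotate(limit=15, p=0.5),
])

image = cv2.imread("example.jpg")            # BGR numpy array
augmented = transform(image=image)["image"]  # augmented numpy array
```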

2. nlpaug - Augment NLP datasets

Allows you to generate synthetic data to improve model performance without manual effort. Supports both textual and audio input.
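
A short sketch of the word-level synonym augmenter (WordNet-backed, per the documented API; the NLTK WordNet corpus must be downloaded first):

```python
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
text = "The quick brown fox jumps over the lazy dog"
# Recent nlpaug versions return a list of augmented strings.
print(aug.augment(text))
```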

1. HoloClean: An ML System for Data Enrichment

Built on top of PyTorch and PostgreSQL, HoloClean is a statistical inference engine for imputing, cleaning, and enriching data. As a weakly supervised machine learning system, it uses available quality rules, value correlations, reference data, and other signals to build a probabilistic model that accurately captures the data-generation process, then applies that model to a variety of data curation tasks.