Open Source Tools in Data-Centric AI
Open source tools in data-centric AI that you can start using right away.
ID | Tool | Description | Links |
---|---|---|---|
26 | Renumics Spotlight: Explore unstructured datasets | Create interactive visualizations from unstructured datasets and explore them. | |
25 | Dialog Studio: A diverse collection of conversational datasets | A wide collection of conversational datasets and a Python-based library to load them. | |
24 | Simple offline image classification tool | A simple Python-based image classification tool you can run locally. | |
23 | Aurum: Discovering Data in Lakes, Clouds and Databases | Identify relevant content across multiple data sources, including tabular files such as CSV and relational tables. | |
22 | g-mixup: Graph data augmentation | Mixup data augmentation for the classification of graph-structured data. | |
21 | Fantastic Data and How to Query Them | A unified framework for combining different datasets so that they can be queried easily using SPARQL. | |
20 | PyHard: Generate hardness embeddings for easy data-centric analysis | PyHard employs Instance Space Analysis (ISA) to analyse performance at the instance level, providing an alternative way to visualize algorithm performance on each data sample. | |
19 | AlphaClean: Automatic generation of data cleaning pipelines | AlphaClean declaratively synthesizes data cleaning programs. Given a quality specification and a list of allowed data transformations, it searches for the sequence of transformations that best satisfies the specification. | |
18 | Spotlight: A data curation tool for unstructured data | Understand unstructured datasets by exploring them interactively. Spotlight helps you identify and fix problematic data segments using data enrichments (e.g. features, embeddings, uncertainties). | |
17 | Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling | Weak supervision offers a promising alternative for producing labeled datasets without ground-truth annotations. This tool first proposes heuristics and then learns from user feedback given on each proposed heuristic. | |
16 | Sim2Real Docs: Domain Randomization for Documents in Natural Scenes | A Python framework for synthesizing datasets and performing domain randomization of documents in natural scenes. It enables programmatic 3D rendering of documents using Blender. | |
15 | Picket: Guarding Against Corrupted Data in Tabular Data | A system that safeguards against data corruptions during both training and deployment of machine learning models over tabular data. | |
14 | Bootleg: Self-Supervision for Named Entity Disambiguation at the Tail | A self-supervised named entity disambiguation (NED) system for English, built to improve disambiguation of entities that occur infrequently, or not at all, in the training data (so-called tail entities). | |
13 | Deequ: Unit tests for data | A library built on top of Apache Spark for defining “unit tests for data” that measure data quality in large datasets (written in Scala, with a Python interface available via PyDeequ). | |
12 | Raha & Baran: Configuration-free error detection tools | Raha and Baran use a library of error detectors and treat the output of each as a feature in a holistic detection model. They then use clustering and active learning to train the holistic model with few labels. | |
11 | mltrace: A Python package to make ML models observable | A lightweight, open-source Python tool for "bolt-on" observability in ML pipelines. It offers an interface to define data and ML tests for pipeline components, a Python API to log versions of data and pipeline components, and a database to store information about component runs. | |
10 | Amazon SageMaker Debugger | Automates the debugging of machine learning training jobs. It lets you run your own training scripts, build customized Hooks and Rules for configuring tensors, and make the tensors available for analysis by saving them to an Amazon S3 bucket. | |
9 | DataPrep: Prepare your data with a few lines of code | DataPrep lets you collect data from common data sources, perform exploratory data analysis, and clean and standardize data (usage sketch after the table). | |
8 | AutoAugment: Learning Augmentation Policies from Data | A reinforcement learning algorithm from Google Brain that increases both the amount and diversity of data in an existing training dataset. It can teach a model about image invariances in the data domain, making a neural network invariant to these important symmetries and thus improving its performance. | |
7 | Meerkat: Handling data for machine learning models | Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation, and model training, supported by efficient and robust IO under the hood. | |
6 | CleanLab: Clean dataset labels | Cleanlab is a data-centric MLOps package for machine learning with noisy labels. It cleans labels and supports finding, quantifying, and learning with label errors in datasets (usage sketch after the table). | |
5 | Knodle: Knowledge-supervised Deep Learning Framework | A framework for weak supervision with neural networks. It provides a modular separation of weak data annotations, powerful deep learning models, and methods for improving weakly supervised training. | |
4 | skweak: Weak supervision for NLP | A framework to define a set of labelling functions that automatically label documents, and then aggregate their results to obtain a labelled version of the corpus. | |
3 | Albumentations: Image augmentation | A fast and flexible Python library for image augmentation with a large set of built-in transforms. Widely used in industry (usage sketch after the table). | |
2 | nlpaug: Augment NLP datasets | Generates synthetic data to improve model performance without manual effort. Supports both textual and audio input (usage sketch after the table). | |
1 | HoloClean: An ML System for Data Enrichment | Built on top of PyTorch and PostgreSQL, HoloClean is a statistical inference engine to impute, clean, and enrich data. It is a weakly supervised machine learning system that uses available quality rules, value correlations, reference data, and other signals to build a probabilistic model that accurately captures the data generation process, and then uses that model in a variety of data curation tasks. | |
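
Below are a few hedged usage sketches for some of the libraries in the table; they use tiny, made-up data purely for illustration and are not the tools' official examples. First, DataPrep (entry 9): a minimal exploratory-data-analysis sketch, assuming a toy pandas DataFrame as stand-in data.

```python
# Minimal sketch: quick EDA with DataPrep's eda module (toy data, for illustration only).
import pandas as pd
from dataprep.eda import plot, create_report

df = pd.DataFrame({
    "age": [23, 35, None, 41, 29],
    "city": ["Berlin", "Paris", "Berlin", None, "Madrid"],
})

# Column-level overview (distributions, missing values); renders inline in notebooks.
plot(df)

# Full profiling report for the whole DataFrame.
report = create_report(df)
report.show_browser()  # opens the generated report in a web browser
```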
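CleanLab (entry 6): a minimal sketch of flagging likely label errors with cleanlab's 2.x `find_label_issues`. The `labels` and `pred_probs` arrays below are hand-made placeholders; in practice `pred_probs` should be out-of-sample predicted probabilities, e.g. from cross-validation.

```python
# Minimal sketch: flag suspicious labels with cleanlab (placeholder data).
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 1, 0, 2, 2])   # given (possibly noisy) class labels
pred_probs = np.array([                 # model's predicted class probabilities
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],                 # labeled 1, but the model favors class 2
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.80, 0.10],                 # labeled 2, but the model favors class 1
])

# Indices of samples whose labels look most suspicious, worst first.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)
```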
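Albumentations (entry 3): a minimal augmentation-pipeline sketch; the input image is random noise just to keep the snippet self-contained.

```python
# Minimal sketch: compose and apply an Albumentations augmentation pipeline.
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),             # flip left-right half the time
    A.RandomBrightnessContrast(p=0.3),   # jitter brightness and contrast
    A.Rotate(limit=15, p=0.5),           # small random rotations
])

# Stand-in image (random noise) with the usual HxWxC uint8 layout.
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
augmented = transform(image=image)["image"]
print(augmented.shape)
```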
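nlpaug (entry 2): a minimal word-level synonym-augmentation sketch, assuming the required NLTK resources (e.g. the WordNet corpus) have already been downloaded.

```python
# Minimal sketch: word-level synonym augmentation with nlpaug.
# SynonymAug with WordNet relies on NLTK data (e.g. nltk.download("wordnet")).
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
text = "The quick brown fox jumps over the lazy dog"

# Recent nlpaug versions return a list of augmented strings.
augmented = aug.augment(text, n=2)
print(augmented)
```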