DATA CENTRIC AI
Open Source tools in Data Centric AI
Open source tools you can start using right away.
Tool Description | Links |
---|---|
AutoAugment - Learning Augmentation Policies from Data. AutoAugment is a reinforcement learning algorithm developed by Google Brain that searches for data augmentation policies, increasing both the amount and diversity of data in an existing training dataset. It can be used to teach a model about image invariances in the data domain, making a neural network invariant to these symmetries and thus improving its performance. | |
Meerkat - handling data for Machine Learning models. Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood. | |
Cleanlab - clean dataset labels. Cleanlab is a data-centric MLOps package for machine learning with noisy labels. It cleans labels and supports finding, quantifying, and learning with label errors in datasets. | |
Knodle - Knowledge-supervised Deep Learning Framework. A framework for weak supervision with neural networks. It modularizes weakly supervised learning by separating weak data annotations, deep learning models, and methods for improving weakly supervised training. | |
skweak - Weak supervision for NLP. A framework for defining a set of labelling functions that automatically label documents, then aggregating their results to obtain a labelled version of the corpus. | |
Albumentations - Image augmentation. A fast and flexible Python library for image augmentation with a large set of built-in augmentations. Widely used in industry. | |
nlpaug - Augment NLP datasets. Allows you to generate synthetic data for improving model performance without manual effort. Supports both textual and audio input. | |
HoloClean: An ML System for Data Enrichment. Built on top of PyTorch and PostgreSQL, HoloClean is a statistical inference engine to impute, clean, and enrich data. It is a weakly supervised machine learning system that uses available quality rules, value correlations, reference data, and other signals to build a probabilistic model that captures the data generation process, and uses that model in a variety of data curation tasks. |
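
The labelling-function workflow mentioned in the skweak row above (write several noisy labellers, then aggregate their votes) can be sketched in plain Python. This is only a rough illustration and does not use skweak's API: the three labelling functions, the `aggregate` helper, and the toy sentiment labels are all hypothetical, and simple majority voting stands in for whatever aggregation model a real framework would use.

```python
import re
from collections import Counter

ABSTAIN = None  # a labelling function returns this when it has no opinion


def _tokens(text):
    """Lowercase word tokens, so matching is done on whole words."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Hypothetical labelling functions for a toy sentiment task.
def lf_positive_words(text):
    return "pos" if _tokens(text) & {"great", "love"} else ABSTAIN


def lf_negative_words(text):
    return "neg" if _tokens(text) & {"terrible", "hate"} else ABSTAIN


def lf_exclamation(text):
    return "pos" if text.rstrip().endswith("!") else ABSTAIN


LABELLING_FUNCTIONS = [lf_positive_words, lf_negative_words, lf_exclamation]


def aggregate(text, functions=LABELLING_FUNCTIONS):
    """Majority vote over the non-abstaining functions; ties yield ABSTAIN."""
    votes = [label for f in functions if (label := f(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave the document unlabelled
    return ranked[0][0]


corpus = [
    "I love this, it is great!",
    "I hate it",
    "What a terrible day!",
    "The sky is blue.",
]
labelled = [(doc, aggregate(doc)) for doc in corpus]
```

Documents where the functions disagree evenly or all abstain stay unlabelled, which is usually preferable to guessing when the labels will later be used for training.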