Evaluate Data Annotation Service Providers
Evaluating and choosing an annotation partner is no easy task. There are a lot of options, and it is not straightforward to know who will be the best fit for your project. The following tool, inspired by a paper by Andrew Marc Greene*, helps you quickly compare and contrast vendors.
Please note that not every metric may be relevant to your use case. Simply leave those marked as N/A and they won't be counted in the final score. For each criterion below, the practices are listed from best to worst.
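To make the scoring concrete, here is a minimal sketch of how such a total might be computed. It is not the tool's exact formula; it assumes each criterion is rated on a 1-4 scale (4 for the best practice listed) and that N/A criteria are simply excluded from the average.

```python
from statistics import mean

def vendor_score(ratings: dict):
    """Average the rated criteria, ignoring any marked N/A (None)."""
    scored = [r for r in ratings.values() if r is not None]
    if not scored:
        return None  # every criterion was left as N/A
    return round(mean(scored), 2)

# Example: two criteria rated on a 1-4 scale, one left as N/A.
print(vendor_score({
    "annotation_guide_versioning": 4,
    "annotation_guide_language": None,   # N/A, not counted
    "assessing_annotation_quality": 3,
}))  # -> 3.5
```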
Annotation Guide Versioning
- Uses semantic versioning for the instructions (see the sketch below).
- Uses a timestamp or other linear version number for the instructions. Maintains a change log.
- Lacks a formal process for tracking changes and ensuring client agreement on changes to the instructions.
- Makes unilateral changes; makes it difficult to keep client and provider versions "in sync".
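As an illustration of what semantic versioning of annotation guidelines can look like in practice, here is a small Python sketch. The bump rules and change categories are assumptions made for the example, not any provider's actual policy.

```python
from dataclasses import dataclass, field

@dataclass
class GuidelineChange:
    kind: str              # assumed categories: "breaking", "clarification", "typo"
    summary: str
    client_approved: bool = False

@dataclass
class AnnotationGuide:
    version: tuple = (1, 0, 0)
    changelog: list = field(default_factory=list)

    def apply_change(self, change: GuidelineChange) -> None:
        major, minor, patch = self.version
        if change.kind == "breaking":         # label definitions changed; earlier work may need re-annotation
            self.version = (major + 1, 0, 0)
        elif change.kind == "clarification":  # new examples or label classes added
            self.version = (major, minor + 1, 0)
        else:                                 # typo or formatting fix
            self.version = (major, minor, patch + 1)
        self.changelog.append(change)

guide = AnnotationGuide()
guide.apply_change(GuidelineChange("clarification", "Added examples for the 'sarcasm' label", client_approved=True))
print(guide.version)  # -> (1, 1, 0)
```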
Annotation Guide Language
- Manually translates instructions into the annotators' native language, when appropriate. The client is given the opportunity to review the translation.
- When annotators are not fluent in the language in which the client has written the guidelines, ensures that the instructions are easy to understand, with client collaboration and approval.
- Does not review instructions for readability by non-fluent speakers of the language in which the guidelines are written, even when that is needed.
- Relies on machine translation of the instructions without manual verification.
Annotation Guide Questions
- Has a UI in the annotation tool for annotator questions and client responses. Considers annotator feedback important and includes that time in their pay.
- Can collect annotators' questions in the UI, but responses are delivered externally. Provides the client with an opportunity to test-drive the annotators' experience.
- Uses an out-of-context system (e.g., a shared spreadsheet) for annotator queries.
- Lacks any way for annotators to ask questions.
Assessing Annotation Quality
- Provider's quality team manually reviews a random sample of data frequently. Provides a frequently updated dashboard for the client to monitor detailed metrics. In multi-stage workflows, annotators can flag bad results from previous stages.
- Identifies high-risk items for additional manual review. Sets and meets higher quality goals for test data. Provides a UI for "acceptance testing" by the client on a rapid cadence.
- Uses only simplistic inter-annotator agreement (IAA) to monitor dataset consistency (see the kappa sketch below). Fails to revisit the earliest annotations once annotators have gained experience.
- Discards flawed data (which can introduce bias) instead of correcting it. Task-level accuracy is not reported, or lacks an explanation of how it is computed.
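To make the "simplistic inter-annotator agreement" point concrete, a common baseline is Cohen's kappa between two annotators: it corrects raw agreement for chance, but by itself it says nothing about which items or which annotators are problematic. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators who labeled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators agree on 3 of 4 items; kappa discounts the agreement expected by chance.
print(cohens_kappa(["pos", "neg", "pos", "pos"],
                   ["pos", "neg", "neg", "pos"]))  # -> 0.5
```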
Assessing Annotator Reliability
- Uses statistical tests to identify outlier annotators for each question (one such check is sketched below).
- Regularly adds items whose correct answer is known ("gold" items) to monitor annotator quality throughout the project. Uses statistical tests to monitor and identify questions with high disagreement.
- Uses simplistic IAA to identify outlier annotators. Rejects minority opinions out of hand instead of trying to understand the cause of the disagreement.
- Lacks IAA or statistical monitoring. Fails to exclude or review data previously obtained from annotators who turn out to be unreliable.
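One example of the kind of statistical check described above, sketched under the assumption that the provider seeds the task with known-answer ("gold") items: compare each annotator's gold-item accuracy against the pool average with a one-sided binomial test, and flag annotators who fall significantly below it. This is an illustration, not any specific provider's method.

```python
from scipy.stats import binomtest

def flag_unreliable(gold_results: dict, pool_accuracy: float, alpha: float = 0.05) -> list:
    """gold_results maps annotator ID -> (correct, attempted) on known-answer items.

    Flags annotators whose gold-item accuracy is significantly below the pool's
    accuracy, using a one-sided binomial test."""
    flagged = []
    for annotator, (correct, attempted) in gold_results.items():
        test = binomtest(correct, attempted, pool_accuracy, alternative="less")
        if test.pvalue < alpha:
            flagged.append(annotator)
    return flagged

# Hypothetical gold-item tallies: ann_02 is far below the 90% pool accuracy.
print(flag_unreliable({"ann_01": (48, 50), "ann_02": (31, 50)}, pool_accuracy=0.90))
# -> ['ann_02']
```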
Merging Individual Annotations
- The merging strategy accounts for individual annotators' previous accuracy (see the sketch after this list). Annotators can review and revisit their work before it is finalized.
- The client can specify the merging strategy (e.g., median or priority voting). The number of annotators per task is clearly specified and statistically justified.
- Decisions depend only on a majority vote of annotators.
- Some decisions are a single annotator's opinion. (Note: when the annotator has demonstrated high reliability or is a designated SME, this may be defensible.)
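To illustrate the difference between a plain majority vote and a merge that accounts for each annotator's track record, here is a sketch using log-odds vote weighting. The weighting scheme is an assumption made for the example, not a specific provider's algorithm.

```python
from collections import defaultdict
from math import log

def weighted_merge(votes: dict, accuracy: dict) -> str:
    """Merge one item's labels, weighting each vote by the annotator's historical accuracy.

    Log-odds weights let one highly reliable annotator outweigh a slim majority of
    unreliable ones, which a plain majority vote can never do."""
    totals = defaultdict(float)
    for annotator, label in votes.items():
        acc = min(max(accuracy.get(annotator, 0.5), 0.01), 0.99)  # clamp to avoid infinite weights
        totals[label] += log(acc / (1.0 - acc))
    return max(totals, key=totals.get)

votes = {"ann_01": "cat", "ann_02": "dog", "ann_03": "dog"}
accuracy = {"ann_01": 0.98, "ann_02": 0.60, "ann_03": 0.55}
print(weighted_merge(votes, accuracy))
# -> "cat": the 98%-accurate annotator outweighs the 2-to-1 majority of weaker annotators.
```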
Data Delivery
- Provides detailed raw data, including the annotator ID, date and time of annotation, elapsed time for the annotation, the annotator's location and/or locale (if relevant), previous versions of this item from this annotator, and the version number of the instructions under which each datum was collected and annotated, plus the merged annotation data (an example record layout is sketched below).
- Provides merged data and individual responses, but with incomplete metadata.
- Provides merged data plus individual responses, but with minimal metadata.
- Provides merged data only. Does not exclude or revisit previously gathered data from annotators who turn out to be unreliable.
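As a concrete picture of what "detailed raw data" might look like, here is a sketch of a possible per-annotation record. The field names are illustrative, not a standard or provider-specific schema.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AnnotationRecord:
    """One annotator's response to one item, with provenance metadata."""
    item_id: str
    annotator_id: str             # pseudonymous ID, stable across the project
    label: str
    annotated_at: datetime
    elapsed_seconds: float        # time spent on this item
    annotator_locale: str | None  # only if relevant to the task
    guide_version: str            # version of the instructions in force at annotation time
    previous_label: str | None    # this annotator's earlier answer, if the item was revisited

@dataclass
class MergedRecord:
    """The final label for an item, delivered alongside the raw responses."""
    item_id: str
    final_label: str
    merge_strategy: str           # e.g., "majority" or "accuracy-weighted"
    source_annotations: list[AnnotationRecord]
```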
Data Ingestion Methods
- Offers a variety of methods for data ingestion, including API/SDK support and a direct connection to the client's storage, with the option to choose the geographical region in which the data is stored (sketched below).
- Offers API/SDK support as well as the ability to connect to the client's storage, but no option to choose the geographical region in which the data is stored.
- Offers only one of the three data ingestion methods.
- Requires the client to share the data through Google Drive or other shared storage.
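To make the ingestion options concrete, here is a rough sketch of pushing items through a provider API versus registering the client's own cloud bucket. The endpoint, bucket URI, and field names are hypothetical placeholders, not a real provider's API.

```python
import requests  # the endpoint below is a hypothetical placeholder, not a real vendor API

def push_via_api(items: list, api_token: str) -> None:
    """Option 1: push items through the provider's ingestion API."""
    resp = requests.post(
        "https://api.annotation-vendor.example/v1/items",   # placeholder URL
        headers={"Authorization": f"Bearer {api_token}"},
        json={"items": items},
        timeout=30,
    )
    resp.raise_for_status()

def register_client_bucket(bucket_uri: str, region: str) -> dict:
    """Option 2: point the provider at the client's own cloud storage via a manifest,
    so the data never has to be copied into shared folders. Region pinning (if offered)
    keeps the data in a chosen geography."""
    return {"bucket": bucket_uri, "region": region, "format": "jsonl"}

# Option 3, the lowest-scoring in the rubric, is manually sharing files through
# Google Drive or similar, which gives up auditability and access control.
```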
Annotator Payment
- Full-time, salaried, with benefits.
- Part-time, paid hourly, including time spent in training and on breaks.
- Paid per annotation; works out to a living wage in the annotator's location.
- Paid per annotation; works out to below a living wage in the annotator's location.
Work Conditions
- The annotation interface is accessible. Coaches underperforming annotators, firing them only as a last resort.
- Provides annotators with appropriate equipment (e.g., large monitors). Provides breaks and variety to avoid fatigue.
- Neglects the ergonomic needs of annotators. Sets unrealistic quotas.
- Lacks appropriate pandemic safety precautions.
Total score: N/A
Want to find out how much your annotation project will cost? Estimate the cost using our free tool.
Get in touch with us