Evaluate data annotation service providers

Evaluating and choosing an annotation partner is no easy task. There are a lot of options, and it's not straightforward to know who will be the best fit for your project. The following tool, inspired by a paper by Andrew Marc Greene*, helps you quickly compare and contrast vendors.

Please note that not every metric will be relevant to your use case. Simply leave those that don't apply marked as N/A, and they won't be considered in the final score.
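
For concreteness, here is a minimal sketch of how such a total might be computed, assuming each metric is rated on a 0–3 scale matching its four levels and that N/A metrics are simply skipped; the scale and the averaging are assumptions made for illustration, not part of the original tool.

```python
# Minimal sketch of a vendor scorecard total.
# Assumption (not from the original tool): each metric is rated 0-3,
# matching its four rubric levels, and N/A metrics are recorded as None.

from typing import Optional

def scorecard_total(ratings: dict[str, Optional[int]]) -> Optional[float]:
    """Average the non-N/A ratings; return None if everything is N/A."""
    scored = [r for r in ratings.values() if r is not None]  # drop N/A entries
    if not scored:
        return None  # corresponds to "Total score: N/A"
    return sum(scored) / len(scored)

# Example: N/A metrics (None) neither raise nor lower the score.
vendor_a = {
    "guide_versioning": 3,
    "guide_language": None,   # N/A for a monolingual project
    "annotator_reliability": 2,
}
print(scorecard_total(vendor_a))  # 2.5
```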

Annotation Guide Versioning
Uses semantic versioning for the instructions.
Uses timestamp or other linear version number for the instructions. Maintains a change log.
Lacks a formal process for tracking changes and ensuring client agreement on changes to the instructions.
Makes unilateral changes; makes it difficult to keep client and provider versions “in sync”.
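
To make the top two levels more concrete, here is a sketch of what versioned instructions with a change log and client sign-off might look like, assuming semantic versioning for the guide; all names and fields are illustrative.

```python
# Sketch of versioned annotation instructions with a change log, assuming
# semantic versioning: MAJOR for changes that alter how items should be
# labeled, MINOR for added clarifications, PATCH for typo fixes.

from dataclasses import dataclass, field

@dataclass
class GuideVersion:
    version: str           # e.g. "1.1.0"
    date: str              # ISO timestamp keeps the history linear
    summary: str           # human-readable change-log entry
    client_approved: bool  # both sides sign off, keeping versions in sync

@dataclass
class AnnotationGuide:
    title: str
    changelog: list[GuideVersion] = field(default_factory=list)

    @property
    def current_version(self) -> str:
        return self.changelog[-1].version if self.changelog else "0.0.0"

guide = AnnotationGuide(title="Sentiment labeling instructions")
guide.changelog.append(GuideVersion("1.0.0", "2024-03-01", "Initial guidelines", True))
guide.changelog.append(GuideVersion("1.1.0", "2024-03-15", "Clarified handling of sarcasm", True))
print(guide.current_version)  # "1.1.0"
```
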
Annotation Guide Language
Manually translates instructions into the annotators’ native language, when appropriate. The client is given the opportunity to review the translation.
When annotators are not fluent in the language in which the client has written guidelines, ensures that the instructions are easy to understand, with client collaboration and approval.
Does not review instructions for readability by non-fluent speakers of the language in which the guidelines are written, even when that is needed.
Relies on machine translation of instructions without manual verification.
Annotation Guide Questions
Has a UI in the annotation tool for annotator questions and client responses. Considers annotator feedback important and includes that time in their pay.
Can collect annotators’ questions in the UI, but responses are delivered externally. Provides client with opportunity to test-drive the annotators’ experience.
Uses an out-of-context system (e.g., a shared spreadsheet) for annotator queries.
Lacks any way for annotators to ask questions.
Assessing annotation quality
Provider’s quality team manually reviews a random sample of data frequently. Provides dashboard (updated frequently) for client to monitor detailed metrics. In multi-stage workflows, annotators can flag bad results from previous stages.
Identifies high-risk items for additional manual review. Sets and meets higher quality goals for test data. Provides UI for “acceptance testing” by client on a rapid cadence.
Uses only simplistic inter-annotator agreement to monitor dataset consistency. Fails to revisit earliest annotations once annotators have gained experience.
Flawed data is discarded (which can introduce bias) instead of being corrected. Task-level accuracy not reported or lacks an explanation of how it is computed.
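
As an illustration of the review practices described above, here is a sketch of a random-sample quality pass together with a task-level accuracy figure that comes with an explicit definition, assuming "accuracy" means the share of sampled items whose delivered label matches the reviewer's label; the field names and sample rate are invented for the example.

```python
# Sketch of a simple quality-assurance pass: draw a random sample for
# manual review, then report task-level accuracy as the fraction of
# reviewed items whose delivered label matches the reviewer's label.

import random

def sample_for_review(items: list[dict], rate: float = 0.05, seed: int = 0) -> list[dict]:
    """Draw a random sample (default 5%) for the quality team to re-check."""
    rng = random.Random(seed)
    k = max(1, int(len(items) * rate))
    return rng.sample(items, k)

def task_level_accuracy(reviewed: list[dict]) -> float:
    """Fraction of reviewed items where the delivered label matched the review."""
    correct = sum(1 for item in reviewed if item["delivered_label"] == item["review_label"])
    return correct / len(reviewed)

batch = [{"delivered_label": "pos", "review_label": "pos"} for _ in range(95)]
batch += [{"delivered_label": "pos", "review_label": "neg"} for _ in range(5)]
reviewed = sample_for_review(batch, rate=0.2)
print(f"task-level accuracy on sample: {task_level_accuracy(reviewed):.2f}")
```
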
Assessing annotator reliability
Uses statistical tests to identify outlier annotators for each question.
Regularly adds items whose correct answer is known, to monitor annotator quality throughout the project. Uses statistical tests to monitor/identify questions with high disagreement.
Uses simplistic IAA to identify outlier annotators. Rejects minority opinions out of hand (instead of trying to understand the cause for the disagreement).
Lacks IAA or statistical monitoring. Fails to exclude or review data previously obtained from annotators who turn out to be unreliable.
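
The criteria above mention known-answer ("gold") items and statistical tests for outlier annotators. Here is a minimal sketch of both ideas, assuming accuracy on seeded gold items as the reliability signal and a z-score threshold for flagging outliers; both choices are illustrative, not a prescription.

```python
# Sketch of monitoring annotator reliability with seeded gold items,
# flagging annotators whose accuracy falls well below the group
# (here: a z-score threshold, an illustrative choice).

from statistics import mean, pstdev

def gold_accuracy(answers: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """answers maps annotator id -> list of (given_label, gold_label) pairs."""
    return {
        annotator: mean(1.0 if given == gold else 0.0 for given, gold in pairs)
        for annotator, pairs in answers.items()
    }

def flag_outliers(accuracy: dict[str, float], z: float = 2.0) -> list[str]:
    """Flag annotators whose gold-item accuracy is more than z SDs below the mean."""
    values = list(accuracy.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [a for a, acc in accuracy.items() if (mu - acc) / sigma > z]

answers = {
    "ann_01": [("cat", "cat"), ("dog", "dog"), ("cat", "cat")],
    "ann_02": [("cat", "cat"), ("dog", "dog"), ("dog", "cat")],
    "ann_03": [("dog", "cat"), ("cat", "dog"), ("dog", "cat")],  # consistently off
}
acc = gold_accuracy(answers)
print(acc, flag_outliers(acc, z=1.0))  # flags ann_03
```
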
Merging individual annotations
Merging strategy accounts for individual annotators’ previous accuracy. Annotators can review/revisit their work before it is finalized.
Client can specify the merging strategy (e.g., median, priority voting, etc.). Number of annotators per task is clearly specified and statistically justified.
Decisions depend only on majority vote of annotators.
Some decisions are a single annotator’s opinion. (Note: When an annotator has demonstrated high reliability or is a designated SME, this may be defensible.)
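
To illustrate the difference between the merging strategies mentioned above, here is a sketch contrasting a plain majority vote with a vote weighted by each annotator's previous accuracy; the weighting scheme and the example accuracies are assumptions made for illustration.

```python
# Sketch of two merging strategies for a single item's labels:
# plain majority vote vs. a vote weighted by each annotator's previous
# accuracy, so more reliable annotators count for more.

from collections import Counter, defaultdict

def majority_vote(labels: list[str]) -> str:
    """Most common label wins; ties are resolved arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels: dict[str, str], accuracy: dict[str, float]) -> str:
    """labels maps annotator id -> label; weight each vote by past accuracy."""
    totals: dict[str, float] = defaultdict(float)
    for annotator, label in labels.items():
        totals[label] += accuracy.get(annotator, 0.5)  # unknown annotators get a neutral weight
    return max(totals, key=totals.get)

labels = {"ann_01": "spam", "ann_02": "not_spam", "ann_03": "not_spam"}
accuracy = {"ann_01": 0.99, "ann_02": 0.45, "ann_03": 0.50}
print(majority_vote(list(labels.values())))  # not_spam (2 of 3 votes)
print(weighted_vote(labels, accuracy))       # spam (the reliable annotator outweighs the others)
```
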
Data Delivery
Provides detailed raw data including annotator ID, date and time of annotation, elapsed time for the annotation, annotator’s location and/or locale (if relevant), previous versions of this item from this annotator, and the version number of the instructions under which each datum was collected and annotated (plus merged annotation data).
Provides merged data and individual responses but with incomplete metadata.
Provides merged data plus individual responses but with minimal metadata.
Provides merged data only. Does not exclude or revisit previously gathered data from annotators who turn out to be unreliable.
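
For a sense of what the most detailed delivery level could look like in practice, here is a sketch of a single raw-data record carrying the metadata listed above; all field names are illustrative rather than any provider's actual schema.

```python
# Sketch of a detailed raw-data record: merged result plus each individual
# annotation with its metadata. Field names are illustrative only.

import json

record = {
    "item_id": "img_000123",
    "merged_annotation": {"label": "pedestrian", "box": [14, 22, 85, 190]},
    "individual_annotations": [
        {
            "annotator_id": "ann_07",
            "annotated_at": "2024-05-02T14:31:08Z",   # date and time of annotation
            "elapsed_seconds": 42,                     # time spent on this item
            "locale": "pt-BR",                         # if relevant to the task
            "instructions_version": "2.3.1",           # guide version in effect
            "previous_versions": [                     # earlier attempts by this annotator
                {"annotated_at": "2024-04-28T09:12:55Z", "label": "cyclist"}
            ],
            "label": "pedestrian",
            "box": [15, 21, 84, 191],
        }
    ],
}
print(json.dumps(record, indent=2))
```
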
Data ingestion methods
Offers a variety of methods for data ingestion, including:
  • API/SDK to upload data to the provider's servers
  • Server locations in various regions around the globe
  • Direct connection to the client's cloud storage
Offers API/SDK support as well as the ability to connect to the client's storage, but no option to choose which geographical region the data is stored in.
Offers only one of the three data ingestion methods.
Needs the client to share the data through Google Drive or other shared storage.
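
As one example of the "connect to the client's cloud storage" path, here is a sketch assuming the client's data lives in an S3 bucket to which the provider has been granted scoped read access; the bucket and prefix names are invented, and the call uses boto3 (the AWS SDK for Python).

```python
# Sketch of a provider ingesting directly from the client's S3 bucket,
# assuming the client has granted scoped read access. Names are illustrative.

import boto3

def list_client_batch(bucket: str, prefix: str, region: str) -> list[str]:
    """List the objects the provider is allowed to ingest from the client's bucket.

    The region is set to where the client's bucket lives, so the raw data
    stays in a geography the client controls until it is explicitly copied.
    """
    s3 = boto3.client("s3", region_name=region)
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]

# Example (illustrative names):
# keys = list_client_batch("client-raw-data", "batch-2024-05/", "eu-central-1")
```
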
Annotator Payment
Full-time, paid salary, with benefits.
Part-time, paid hourly, including time spent in training and on breaks.
Paid per annotation, works out to a living wage in the annotator’s location.
Paid per annotation, works out to below living wage in the annotator’s location.
Work conditions
Annotation interface is accessible. Coaches underperforming annotators, firing them only as a last resort.
Provides annotators with appropriate equipment (e.g., large monitors). Provides breaks and variety to avoid fatigue.
Neglects ergonomic needs of annotators. Sets unrealistic quotas.
Lacks appropriate pandemic safety precautions.
Total score: N/A