Specimen Data Best Practices

Overview

Specimen data should provide information about why data fields have been left incomplete, and should tell the user if Optical Character Recognition (OCR) was applied. More information on why this is important can be found in the Manual Transcription Guidance.

Recommendations

Specimen Data (DD1, DD2)

Level: Advanced

Use Case: As a researcher, I want to know whether data is reliable and complete so that I can determine whether it can be included in my research.

Recommendation:

DD1: When data is extracted from the digitisation platform to the Collection Management System (CMS), make sure information is available about why a data field is missing: (1) the field was marked empty/missing by the digitisation operator, or (2) the field was not databased at all by the operator.

DD2: If Optical Character Recognition (OCR) is applied during the Extract, Transform, Load (ETL) process, the CMS should support marking the data field as "automatically filled", and the ETL process should make sure to fill in this information.
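
As a hedged illustration of DD1 and DD2, the sketch below shows how an ETL step might record why a field has no value and whether a value came from OCR. The export layout (a `values` mapping plus an `ocr_fields` list), the status labels, and the names `FieldValue` and `transform_record` are all assumptions made for this example; they are not part of any particular digitisation platform or CMS.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical status labels; a real CMS would define its own vocabulary.
MARKED_EMPTY = "marked_empty"                  # DD1 (1): operator marked the field empty/missing
NOT_DATABASED = "not_databased"                # DD1 (2): operator never databased the field
AUTOMATICALLY_FILLED = "automatically_filled"  # DD2: value was produced by OCR
ENTERED = "entered"                            # value was entered by the operator


@dataclass
class FieldValue:
    value: Optional[str]
    status: str


def transform_record(raw: dict[str, Any], expected_fields: list[str]) -> dict[str, FieldValue]:
    """Attach a status to every expected field of one exported record.

    Assumed export layout (hypothetical):
      raw["values"]     maps field name -> value; None or "" means "marked empty"
      raw["ocr_fields"] lists the field names whose values were filled by OCR
    A field that does not appear in raw["values"] is treated as never databased.
    """
    values = raw.get("values", {})
    ocr_fields = set(raw.get("ocr_fields", []))
    result: dict[str, FieldValue] = {}
    for name in expected_fields:
        if name not in values:
            result[name] = FieldValue(None, NOT_DATABASED)
        elif values[name] is None or values[name] == "":
            result[name] = FieldValue(None, MARKED_EMPTY)
        elif name in ocr_fields:
            result[name] = FieldValue(values[name], AUTOMATICALLY_FILLED)
        else:
            result[name] = FieldValue(values[name], ENTERED)
    return result


record = {
    "values": {"collector": "A. Collector", "locality": None, "label_text": "Helsinki"},
    "ocr_fields": ["label_text"],
}
loaded = transform_record(record, ["collector", "locality", "label_text", "date"])
for field_name, field in loaded.items():
    print(field_name, field.status)
```

Keeping the status next to the value, rather than inferring it later from an empty cell, is what lets a researcher tell "the operator saw nothing on the label" apart from "nobody has looked yet".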

Discussion

The value of a data field can have one of the following statuses:

  • Absent: information was not documented at the time of the collection event and cannot be resolved later
  • Unknown: information is documented but is not yet databased
  • Unknown:missing: the information could have been databased but is absent
  • Unknown:indecipherable: the information appears to be present but failed to be captured
  • Automatically filled: information has been databased using automated methods (OCR) but not yet cleaned/verified by a human
  • Default: information is present and has no known problems
  • Erroneous: information is present but contains errors/marked as unreliable by a human
  • Unknown:withheld: information is databased but has been withheld by the provider (Note: not a factor for ETL processes; this is a data publishing problem)
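
The statuses above could be modelled as a closed vocabulary. The following is a minimal sketch, assuming the labels are stored as lower-case strings; the names `FieldStatus`, `parse_status`, and `has_no_known_problems` are hypothetical and chosen only for this example.

```python
from enum import Enum


class FieldStatus(Enum):
    """Status vocabulary taken from the list above (string forms are assumed)."""
    ABSENT = "absent"
    UNKNOWN = "unknown"
    UNKNOWN_MISSING = "unknown:missing"
    UNKNOWN_INDECIPHERABLE = "unknown:indecipherable"
    UNKNOWN_WITHHELD = "unknown:withheld"
    AUTOMATICALLY_FILLED = "automatically filled"
    DEFAULT = "default"
    ERRONEOUS = "erroneous"


def parse_status(label: str) -> FieldStatus:
    """Map a stored label to a FieldStatus, ignoring case and outer whitespace.

    Raises ValueError for labels outside the vocabulary.
    """
    return FieldStatus(label.strip().lower())


def has_no_known_problems(status: FieldStatus) -> bool:
    """Illustrative policy only: treat a value as ready to use when its status is
    Default. Automatically filled values are excluded because they have not yet
    been verified by a human."""
    return status is FieldStatus.DEFAULT


for label in ["Default", "Automatically filled", "Unknown:missing"]:
    status = parse_status(label)
    print(label, "->", status.name, has_no_known_problems(status))
```

Whether automatically filled or erroneous values are acceptable for a given analysis is a decision for the researcher; the filter here only shows how an explicit status makes that decision possible.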

Implementation

See the Manual Transcription Guidance for more information.

References

Dillen M, Groom Q, & Hardisty A. (2019). Interoperability of Collection Management Systems. Zenodo. https://doi.org/10.5281/zenodo.3361598 (p5 recommendation #8)

Groom Q et al. (2019). Improved standardization of transcribed digital specimen data. Database, 2019, baz129. https://doi.org/10.1093/database/baz129 (table 2)

Authors

Zhengzhe Wu and Esko Piirainen
Finnish Museum of Natural History (Luomus)

Contributors

Lisa French, Laurence Livermore

Document Control

Version:
Changes since last version: N/A
Last Updated: 28 June 2022
