Specimen Data Best Practices
Overview
Specimen data should record why a data field has been left incomplete, and should tell the user whether Optical Character Recognition (OCR) was applied. The Manual Transcription Guidance explains why this is important.
Recommendations
Specimen Data (DD1, DD2)
Level: Advanced
Use Case: As a researcher, I want to know whether data is reliable and complete so that I can determine whether it can be included in my research.
Recommendation:
DD1: When data is extracted from the digitisation platform to the Collection Management System (CMS), make sure information is available about why a data field is missing: (1) the field was marked empty/missing by the digitisation operator, or (2) the field was not databased at all by the operator.
DD2: If Optical Character Recognition (OCR) is applied during the extract, transform, load (ETL) process, the CMS should support marking a data field as "automatically filled", and the ETL process should set this marker for every field it fills in that way.
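As an illustration of DD1 and DD2, the following minimal Python sketch shows one way an ETL step might carry both pieces of information into the CMS. It is not a prescribed implementation: the status labels, the CmsField class, the transform_field function and the input keys ("value", "marked_empty", "source") are all hypothetical names chosen for this example, and a real digitisation platform will expose this information differently.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical status labels; the recommendation does not prescribe a vocabulary.
PRESENT = "present"
MARKED_EMPTY = "marked_empty_by_operator"  # DD1, case (1)
NOT_DATABASED = "not_databased"            # DD1, case (2)


@dataclass
class CmsField:
    """One data field as it would be stored in the CMS (illustrative only)."""
    value: Optional[str]
    status: str
    automatically_filled: bool = False  # DD2 marker


def transform_field(raw: Optional[dict]) -> CmsField:
    """Map one field exported by the digitisation platform to a CMS field.

    Assumed input shape: None if the operator never touched the field,
    otherwise a dict with optional keys "value", "marked_empty" and "source".
    """
    if raw is None:
        return CmsField(None, NOT_DATABASED)
    if raw.get("marked_empty"):
        return CmsField(None, MARKED_EMPTY)
    value = raw.get("value")
    if not value:
        # Nothing was entered and nothing was flagged: treat as not databased.
        return CmsField(None, NOT_DATABASED)
    # DD2: remember that the value came from OCR and has not been verified.
    return CmsField(value, PRESENT, automatically_filled=raw.get("source") == "ocr")
```

For example, transform_field({"value": "Helsinki", "source": "ocr"}) yields a field whose value is kept and whose automatically_filled marker is set, while transform_field({"marked_empty": True}) yields an empty field that records the operator's decision rather than silently dropping it.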
Discussion
A data field value can be in one of the following states:
- Absent: information was not documented at the time of the collection event and cannot be resolved later
- Unknown: information is documented but is not yet databased
- Unknown:missing: the information could have been databased but is absent
- Unknown:indecipherable: the information appears to be present but failed to be captured
- Automatically filled: information has been databased using automated methods (OCR) but not yet cleaned/verified by a human
- Default: information is present and has no known problems
- Erroneous: information is present but contains errors/marked as unreliable by a human
- Unknown:withheld: information is databased but has been withheld by the provider (Note: not a factor for ETL processes; this is a data publishing problem)
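The states listed above can be sketched as an enumeration, which makes it easier for downstream code to answer the researcher's question from the use case. In this Python sketch the enum values are the labels from the list; the FieldState class, the triage function and the three outcome labels ("use", "review", "unavailable") are assumptions made for illustration and are not part of the recommendation.

```python
from enum import Enum


class FieldState(Enum):
    """States of a data field value, using the labels from the list above."""
    ABSENT = "Absent"
    UNKNOWN = "Unknown"
    UNKNOWN_MISSING = "Unknown:missing"
    UNKNOWN_INDECIPHERABLE = "Unknown:indecipherable"
    UNKNOWN_WITHHELD = "Unknown:withheld"
    AUTOMATICALLY_FILLED = "Automatically filled"
    DEFAULT = "Default"
    ERRONEOUS = "Erroneous"


# Assumed grouping: a value exists but should be checked before use.
NEEDS_REVIEW = {FieldState.AUTOMATICALLY_FILLED, FieldState.ERRONEOUS}


def triage(state: FieldState) -> str:
    """Classify a field state for a researcher (hypothetical helper).

    Returns "use" when the value has no known problems, "review" when a
    value exists but is unverified or flagged, and "unavailable" otherwise.
    """
    if state is FieldState.DEFAULT:
        return "use"
    if state in NEEDS_REVIEW:
        return "review"
    return "unavailable"
```

A state label stored with a record can be turned back into an enum member with FieldState("Unknown:missing") and passed to triage, which keeps the vocabulary in one place instead of scattering string comparisons through the code.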
Implementation
See the Manual Transcription Guidance for more information.
References
Dillen M, Groom Q, & Hardisty A. (2019). Interoperability of Collection Management Systems. Zenodo. https://doi.org/10.5281/zenodo.3361598 (p5 recommendation #8)
Groom Q et al. (2019). Improved standardization of transcribed digital specimen data. Database, 2019, baz129. https://doi.org/10.1093/database/baz129 (table 2)
Authors
Zhengzhe Wu and Esko Piirainen
Finnish Museum of Natural History (Luomus)
Contributors
Lisa French, Laurence Livermore
Citation
Licence
Document Control
Version:
Changes since last version: N/A
Last Updated: 28 June 2022