Internet of Samples: Creating and Mapping Controlled Vocabularies for
Specimen Type, Material Type, and Sampled Feature
Abstract
Material samples are vital across multiple scientific disciplines, with
samples collected for one project often proving valuable for additional
studies. The Internet of Samples (iSamples) project aims to integrate
large, diverse, cross-discipline sample repositories and enable access
and discovery of material samples as FAIR data (Findable, Accessible,
Interoperable, and Reusable). Here we report our recent progress in
controlled vocabulary development and mapping. In addition to a core
metadata schema to integrate SESAR, GEOME, Open Context, and Smithsonian
natural history collections, three small but important controlled
vocabularies (CVs) describing specimen type, material type, and sampled
feature were created. The new CVs provide consistent semantics for
high-level integration of existing vocabularies used in the source
collections. Two methods were used to map source record properties to
terms in the new CVs: Keyword-based heuristic rules were manually
created where existing terminologies were similar to the new CVs, such
as in records from SESAR, GEOME, and Open Context and some aspects of
Smithsonian Darwin Core records. For example, specimen type =
liquid>aqueous in SESAR records was mapped to specimen type = liquid or
gas sample and material type = liquid water. A machine learning approach
was applied to Smithsonian Darwin
Core records to infer sampled feature terms from record text describing
habitat, locality, higher geography, and higher classification fields.
Using fastText with a general-domain corpus of 600 billion tokens, we
gave the model a degree of “understanding” of English words. With
training sets of 200 and 995 records, precision was 87% and 94% and
recall was 85% and 92%, respectively, yielding performance sufficient
for production use. Applying these approaches, more than
3 × 10⁶ records of the four large collections have been
mapped successfully to a common core data model facilitating
cross-domain discovery and retrieval of the sample records.
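
The keyword-based mapping described above might be sketched as follows. This is a minimal illustration, not the project's implementation: the rule table, the function name map_specimen_type, and the normalization step are assumptions, and the only rule shown is the SESAR example given in the abstract.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule table: (keyword in source value, specimen type, material type).
# Only the SESAR example from the abstract is included; a real table would
# hold one entry per source term.
RULES: List[Tuple[str, str, str]] = [
    ("liquid>aqueous", "liquid or gas sample", "liquid water"),
]


def map_specimen_type(source_value: str) -> Dict[str, Optional[str]]:
    """Map a source specimen-type string to terms in the new vocabularies."""
    # Assumed normalization: ignore case and whitespace differences.
    normalized = source_value.strip().lower().replace(" ", "")
    for keyword, specimen_type, material_type in RULES:
        if keyword in normalized:
            return {"specimen_type": specimen_type, "material_type": material_type}
    # No rule matched: leave both terms unassigned for manual review.
    return {"specimen_type": None, "material_type": None}


if __name__ == "__main__":
    print(map_specimen_type("Liquid > Aqueous"))
```

Records that match no rule are returned with empty terms rather than a guessed value, so that unmapped source terms remain visible for later curation.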