Advancing the allergenicity assessment of new proteins using a text
mining resource
Abstract
BACKGROUND: With society increasingly demanding alternative protein
food sources, new strategies for evaluating protein safety issues, such
as their allergenic potential, are needed. Large-scale and systemic
studies on allergenic proteins are hindered by the limited and
non-harmonized clinical information available for these substances in
dedicated databases. A key piece of missing information is the
symptomatology of the allergens, especially expressed in terms of
standard vocabularies, which would allow linking to other biomedical
resources to carry out studies related to human health. In this work,
we have generated the first resource with a
comprehensive annotation of allergens’ symptomatology, using a
text-mining approach that extracts significant co-mentions between these
entities from the scientific literature. METHODS: The main resource of
biomedical literature (PubMed, ~36 million abstracts)
was mined to automatically extract relationships between allergens and
clinical symptoms. The annotations are given in terms of standard
vocabularies in widely used biomedical databases. The method identifies
statistically significant co-mentions between the textual descriptions
of the two types of entities in the literature as an indication of a
relationship. RESULTS: A total of 1,180 clinical signs, extracted from
the Human Phenotype Ontology (HPO) and the Medical Subject Headings
(MeSH) terms of PubMed together with other allergen-specific symptoms,
were linked to 1,036 unique allergens annotated in the two main
allergen-related public databases via 14,009 relationships.
CONCLUSIONS: This resource could
serve as a starting point for a future manually curated compilation of
allergen symptomatology. The annotations are publicly available through
an interactive web interface at
[https://csbg.cnb.csic.es/CoMent_allergen/](https://csbg.cnb.csic.es/CoMent_allergen/).
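
As an illustration of what a "statistically significant co-mention" in METHODS can mean in practice, the following is a minimal sketch and not the authors' implementation. The abstract does not name the statistical test used; a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) is assumed here. The function name `comention_pvalue` and the toy counts are hypothetical.

```python
from math import comb


def comention_pvalue(n_total, n_allergen, n_symptom, n_both):
    """One-sided hypergeometric p-value for a co-mention count.

    ASSUMPTION: the test is a swapped-in, commonly used choice; the source
    abstract does not specify which test was applied.

    n_total    -- number of abstracts in the corpus
    n_allergen -- abstracts mentioning the allergen
    n_symptom  -- abstracts mentioning the symptom
    n_both     -- abstracts mentioning both

    Returns P(X >= n_both) if the two mentions were independent.
    Exact integer arithmetic: intended for small illustrative counts,
    not for a corpus of PubMed size.
    """
    upper = min(n_allergen, n_symptom)
    denom = comb(n_total, n_symptom)
    tail = sum(
        comb(n_allergen, k) * comb(n_total - n_allergen, n_symptom - k)
        for k in range(n_both, upper + 1)
    )
    return tail / denom


# Toy example (hypothetical counts): 1,000 abstracts, 50 mention the
# allergen, 40 mention the symptom, 10 mention both. Under independence
# about 2 co-mentions would be expected.
p = comention_pvalue(1000, 50, 40, 10)
print(f"p-value for 10 observed co-mentions: {p:.3g}")
```

In a real setting, the p-values for all allergen-symptom pairs would also need a multiple-testing correction before any pair is reported as a relationship.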