Supporting findability of COVID-19 research with large-scale text mining of scientific publications

When the COVID-19 pandemic hit in early 2020, many research efforts were quickly redirected towards studies on SARS-CoV-2 and COVID-19, from the sequencing and assembly of viral genomes to the elaboration of robust testing methodologies and the development of treatment and vaccination strategies. At the same time, a flurry of scientific publications on SARS-CoV-2 and COVID-19 began to appear, making it increasingly difficult for researchers to stay up to date with the latest trends and developments in this rapidly evolving field. The BioKB platform is a pipeline which, by exploiting text mining and semantic technologies, helps researchers easily access the semantic content of thousands of abstracts and full-text articles. The content of the articles is analysed, and concepts from a range of contexts, including proteins, species, chemicals, diseases and biological processes, are tagged based on existing dictionaries of controlled terms. Co-occurring concepts are classified based on their asserted relationship, and the resulting subject-relation-object triples are stored in a publicly accessible human- and machine-readable knowledge base. All concepts in the BioKB dictionaries are linked to stable, persistent identifiers: either a resource accession such as an Ensembl, UniProt or PubChem ID for genes, proteins and chemicals, or an ontology term ID for diseases, phenotypes and other ontology terms. To improve COVID-19-related text mining, we extended the underlying dictionaries to include many additional viral species (via NCBI Taxonomy identifiers), phenotypes from the Human Phenotype Ontology (HPO), COVID-related concepts including clinical and laboratory tests from the COVID-19 ontology, as well as additional diseases from the Disease Ontology (DO) and biological processes from the Gene Ontology (GO). We also added all viral proteins found in UniProt and gene entries from Entrez Gene to increase the sensitivity of the text mining pipeline to viral data.
To date, BioKB has indexed over 270,000 sentences from 21,935 publications relating to coronavirus infections, with publications dating from 1963 to 2021, 3,863 of which were published this year. We are currently working to further refine the text mining pipeline by training it to extract increasingly complex relations, such as protein-phenotype relationships. We are also regularly adding new terms to our dictionaries for areas where coverage is currently low, such as clinical and laboratory tests and procedures, and novel drug treatments.
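
The dictionary-based tagging and triple extraction described in this abstract can be illustrated with a minimal Python sketch. It is not the BioKB implementation: the dictionary entries, the `EXAMPLE:` identifiers, the function names and the `co-occurs_with` relation label are all hypothetical placeholders, and the relation classification step is left out entirely, so every co-occurring pair simply receives the same placeholder relation.

```python
import re
from itertools import combinations

# Toy dictionary: surface term -> (concept type, identifier).
# The identifiers are made-up placeholders, not real accessions.
DICTIONARY = {
    "sars-cov-2": ("species", "EXAMPLE:0001"),
    "ace2": ("protein", "EXAMPLE:0002"),
    "fever": ("phenotype", "EXAMPLE:0003"),
}


def tag_concepts(sentence):
    """Return (start, matched text, type, identifier) for each dictionary hit."""
    hits = []
    for term, (concept_type, identifier) in DICTIONARY.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), concept_type, identifier))
    return sorted(hits)


def cooccurrence_triples(sentence):
    """Pair up concepts tagged in one sentence as subject-relation-object triples.

    A real pipeline would classify the asserted relationship; this sketch
    uses a fixed placeholder relation instead.
    """
    hits = tag_concepts(sentence)
    return [(a[3], "co-occurs_with", b[3]) for a, b in combinations(hits, 2)]


if __name__ == "__main__":
    for triple in cooccurrence_triples("SARS-CoV-2 binds ACE2 and causes fever."):
        print(triple)
```

Storing identifiers rather than surface strings in the triples mirrors the point made above: each concept resolves to a stable identifier, so synonyms and spelling variants of a term can map to the same entry.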

Research centre:

- Luxembourg Centre for Systems Biomedicine (LCSB): Bioinformatics Core (R. Schneider Group)

Disciplines:

Life sciences: Multidisciplinary, general & others

Author, co-author:

WELTER, Danielle ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

VEGA MORENO, Carlos Gonzalo ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

BIRYUKOV, Maria ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

GROUES, Valentin ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

GHOSH, Soumyabrata ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

SCHNEIDER, Reinhard ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

SATAGOPAM, Venkata ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

External co-authors:

Document language:

English

Title:

Supporting findability of COVID-19 research with large-scale text mining of scientific publications

Publication/release date:

27 November 2020

Event name:

International FAIR Convergence Symposium

Event organiser:

CODATA and GO FAIR

Event date:

27 November 2020 to 4 December 2020

Event scope:

International

Focus area:

Computational Sciences
Systems Biomedicine

Additional URL:

https://zenodo.org/record/4300199#.X8ZPXqpKhMY

Available on ORBilu:

since 01 January 2021
