Supporting findability of COVID-19 research with large-scale text mining of scientific publications

WELTER, Danielle; VEGA MORENO, Carlos Gonzalo; BIRYUKOV, Maria; GROUES, Valentin; GHOSH, Soumyabrata; SCHNEIDER, Reinhard; SATAGOPAM, Venkata

doi:10.5281/zenodo.4300199

Poster (Scientific congresses, symposiums and conference proceedings)

2020 • International FAIR Convergence Symposium

Permalink
https://hdl.handle.net/10993/45287

Files

Full Text

FAIRBioKB_poster.pdf

Author postprint (2.03 MB)

Poster PDF

Details

Keywords :

FAIR principles; COVID-19; text mining

Abstract :

[en] When the COVID-19 pandemic hit in early 2020, many research efforts were quickly redirected towards studies of SARS-CoV-2 and COVID-19, from the sequencing and assembly of viral genomes to the elaboration of robust testing methodologies and the development of treatment and vaccination strategies. At the same time, a flurry of scientific publications on SARS-CoV-2 and COVID-19 began to appear, making it increasingly difficult for researchers to stay up to date with the latest trends and developments in this rapidly evolving field. The BioKB platform is a pipeline that, by exploiting text mining and semantic technologies, helps researchers easily access the semantic content of thousands of abstracts and full-text articles. The content of the articles is analysed, and concepts from a range of contexts, including proteins, species, chemicals, diseases and biological processes, are tagged based on existing dictionaries of controlled terms. Co-occurring concepts are classified based on their asserted relationship, and the resulting subject-relation-object triples are stored in a publicly accessible human- and machine-readable knowledge base. All concepts in the BioKB dictionaries are linked to stable, persistent identifiers: either a resource accession such as an Ensembl, UniProt or PubChem ID for genes, proteins and chemicals, or an ontology term ID for diseases, phenotypes and other ontology terms. To improve COVID-19-related text mining, we extended the underlying dictionaries to include many additional viral species (via NCBI Taxonomy identifiers), phenotypes from the Human Phenotype Ontology (HPO), COVID-related concepts including clinical and laboratory tests from the COVID-19 ontology, as well as additional diseases from the Disease Ontology (DO) and biological processes from the Gene Ontology (GO). We also added all viral proteins found in UniProt and gene entries from Entrez Gene to increase the sensitivity of the text mining pipeline to viral data.
To date, BioKB has indexed over 270’000 sentences from 21’935 publications relating to coronavirus infections, with publication dates ranging from 1963 to 2021; 3’863 of these were published this year. We are currently working to further refine the text mining pipeline by training it to extract increasingly complex relations such as protein-phenotype relationships. We are also regularly adding new terms to our dictionaries for areas where coverage is currently low, such as clinical and laboratory tests and procedures, and novel drug treatments.
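The tagging and triple-extraction steps described in the abstract can be illustrated with a minimal Python sketch. This is not BioKB's implementation: the three-entry dictionary, the cue-word lookup used as a stand-in for relation classification, and the names `DICTIONARY`, `RELATION_CUES`, `tag_concepts` and `extract_triples` are all assumptions made for illustration. The identifiers are shown only as examples of the kinds of persistent IDs the abstract mentions.

```python
import re

# Hypothetical mini-dictionary: surface form -> (concept type, persistent ID).
# A real dictionary would hold many thousands of controlled terms.
DICTIONARY = {
    "sars-cov-2": ("species", "NCBITaxon:2697049"),
    "ace2": ("protein", "UniProt:Q9BYF1"),
    "fever": ("phenotype", "HP:0001945"),
}

# Hypothetical cue words standing in for a trained relation classifier.
RELATION_CUES = {"binds": "binds", "causes": "causes", "inhibits": "inhibits"}


def tag_concepts(sentence):
    """Return (start, surface form, type, ID) for each dictionary hit, by position."""
    hits = []
    for term, (concept_type, concept_id) in DICTIONARY.items():
        # Match whole terms only, treating hyphens as part of a token.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), concept_type, concept_id))
    return sorted(hits)


def extract_triples(sentence):
    """Build subject-relation-object triples from neighbouring tagged concepts."""
    hits = tag_concepts(sentence)
    triples = []
    for left, right in zip(hits, hits[1:]):
        # Text between the two mentions decides the relation label.
        between = sentence[left[0] + len(left[1]):right[0]].lower()
        relation = "co-occurs_with"  # fallback when no cue word is present
        for cue, label in RELATION_CUES.items():
            if re.search(r"\b" + cue + r"\b", between):
                relation = label
                break
        triples.append((left[3], relation, right[3]))
    return triples


if __name__ == "__main__":
    for text in ("SARS-CoV-2 binds ACE2.", "SARS-CoV-2 infection causes fever."):
        print(extract_triples(text))
```

Pairing only neighbouring mentions and matching cue words is deliberately naive; it misreads sentences with several relations or negation, which is where a trained classifier of the kind the abstract refers to would be needed.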

Research center :

- Luxembourg Centre for Systems Biomedicine (LCSB): Bioinformatics Core (R. Schneider Group)

Disciplines :

Life sciences: Multidisciplinary, general & others

Author, co-author :

WELTER, Danielle ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

VEGA MORENO, Carlos Gonzalo ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

BIRYUKOV, Maria ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

GROUES, Valentin ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

GHOSH, Soumyabrata ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

SCHNEIDER, Reinhard ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

SATAGOPAM, Venkata ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

Language :

English

Title :

Supporting findability of COVID-19 research with large-scale text mining of scientific publications

Publication date :

27 November 2020

Event name :

International FAIR Convergence Symposium

Event organizer :

CODATA and GO FAIR

Event date :

27 November 2020 to 4 December 2020

Audience :

International

Focus Area :

Computational Sciences
Systems Biomedicine

Additional URL :

https://zenodo.org/record/4300199#.X8ZPXqpKhMY

Available on ORBilu :

since 01 January 2021
