Sirajzade, Joshgun [University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)]
Bouvry, Pascal [University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)]
Schommer, Christoph [University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)]
2022
Applied Informatics, 5th International Conference, ICAI 2022, Arequipa, Peru, October 27–29, 2022, Proceedings
Springer Cham
121–133
Yes
International
978-3-031-19647-8
International Conference on Applied Informatics (ICAI 2022)
from 27-10-2022 to 29-10-2022
[en] Covid-19 ; Text Mining ; Topic Modeling ; CORD19 ; Medical Publication ; Latent Dirichlet Allocation
[en] In this paper we investigate how scientific and medical papers about Covid-19 can be mined effectively. For this purpose we use the CORD19 dataset, a huge collection of all papers published about and around the SARS-CoV-2 virus and the pandemic it caused. We discuss how classical text mining algorithms such as Latent Semantic Analysis (LSA) or its modern version, Latent Dirichlet Allocation (LDA), can be used for this purpose. We also touch on more recent approaches such as word2vec, which arrived with the deep learning wave, and show the advantages and disadvantages of each. We conclude by showing some example topics from the corpus and by answering questions such as which topics are the most prominent in the corpus and what percentage of the corpus is dedicated to them. We also discuss how topics around RNA research in connection with Covid-19 can be examined.
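
As an illustration of the question of what percentage of the corpus is dedicated to each topic, the following minimal sketch aggregates per-document topic proportions (such as those estimated by an LDA model) into corpus-level shares, weighting each document by its token count. It is not taken from the paper: the function names `topic_shares` and `rank_topics`, the length weighting, and the toy numbers are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def topic_shares(doc_topic: Sequence[Sequence[float]],
                 doc_lengths: Sequence[int]) -> List[float]:
    """Return the percentage of corpus tokens attributed to each topic.

    doc_topic[d][k] is the proportion of topic k in document d, for
    example as estimated by an LDA model; each row is assumed to sum to 1.
    doc_lengths[d] is the token count of document d.
    """
    if len(doc_topic) != len(doc_lengths):
        raise ValueError("one length per document is required")
    total_tokens = sum(doc_lengths)
    n_topics = len(doc_topic[0])
    weighted = [0.0] * n_topics
    for row, length in zip(doc_topic, doc_lengths):
        for k, proportion in enumerate(row):
            # Expected number of tokens in this document drawn from topic k.
            weighted[k] += proportion * length
    return [100.0 * w / total_tokens for w in weighted]


def rank_topics(shares: Sequence[float]) -> List[Tuple[int, float]]:
    """Sort (topic index, share) pairs by descending corpus share."""
    return sorted(enumerate(shares), key=lambda pair: pair[1], reverse=True)


# Toy input (made-up numbers): three documents, two topics.
toy_doc_topic = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
toy_lengths = [100, 100, 200]
toy_shares = topic_shares(toy_doc_topic, toy_lengths)
toy_ranking = rank_topics(toy_shares)
```

Weighting by document length keeps long papers from counting the same as short ones; an unweighted mean over documents is an equally plausible choice and would give different percentages.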