Deep Mining Covid-19 Literature

SIRAJZADE, Joshgun; BOUVRY, Pascal; SCHOMMER, Christoph

doi:10.1007/978-3-031-19647-8_9

Paper published in a book (Scientific congresses, symposiums and conference proceedings)

2022 • In Applied Informatics, 5th International Conference, ICAI 2022, Arequipa, Peru, October 27–29, 2022, Proceedings

Peer reviewed

Permalink
https://hdl.handle.net/10993/52855

DOI
10.1007/978-3-031-19647-8_9

Files

Full Text

DeepMining_ICAI2022.pdf

Publisher postprint (360.24 kB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

Covid-19; Text Mining; Topic Modeling; CORD19; Medical Publication; Latent Dirichlet Allocation

Abstract :

[en] In this paper we investigate how scientific and medical papers about Covid-19 can be mined effectively. For this purpose we use the CORD19 dataset, which is a huge collection of all papers published about and around the SARS-CoV-2 virus and the pandemic it caused. We discuss how classical text mining algorithms such as Latent Semantic Analysis (LSA) or its modern version, Latent Dirichlet Allocation (LDA), can be used for this purpose. We also touch on more modern variants of these algorithms, such as word2vec, which arrived with the deep learning wave, and show the advantages and disadvantages of each. We finish the paper by showing some topic examples from the corpus and answering questions such as which topics are the most prominent in the corpus and what percentage of the corpus is dedicated to them. We also discuss how topics around RNA research in connection with Covid-19 can be examined.
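
To illustrate the kind of analysis the abstract describes, the following is a minimal sketch of LDA topic modeling, fitted with collapsed Gibbs sampling in plain Python. It is not the authors' pipeline: the toy corpus, the function names (`fit_lda`, `top_words`, `topic_shares`) and the hyperparameter values are all assumptions chosen for illustration. It shows how top words per topic and each topic's share of the corpus could be derived once a model is fitted.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling on tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, topic_total


def top_words(topic_word, n=3):
    """Most frequent words per topic, ignoring words with zero count."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


def topic_shares(topic_total):
    """Percentage of all corpus tokens assigned to each topic."""
    total = sum(topic_total)
    return [100.0 * n / total for n in topic_total]


# Hypothetical toy corpus standing in for tokenised paper abstracts.
docs = [
    "rna virus genome sequence rna mutation".split(),
    "vaccine trial dose vaccine immune response".split(),
    "rna sequence genome mutation virus".split(),
    "vaccine dose immune trial response".split(),
]
NUM_TOPICS = 2
doc_topic, topic_word, topic_total = fit_lda(docs, NUM_TOPICS)
words = top_words(topic_word)
shares = topic_shares(topic_total)
for k in range(NUM_TOPICS):
    print(f"topic {k}: {words[k]} ({shares[k]:.1f}% of tokens)")
```

The share here is computed per token; a per-document share (for example, counting each paper under its dominant topic) is an equally plausible reading of "percentage of the corpus" and would use `doc_topic` instead. For a corpus the size of CORD19, an established library implementation would be the practical choice over this sketch.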

Disciplines :

Computer science

Author, co-author :

SIRAJZADE, Joshgun ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

BOUVRY, Pascal ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

SCHOMMER, Christoph ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

External co-authors :

Language :

English

Title :

Deep Mining Covid-19 Literature

Publication date :

2022

Event name :

International Conference on Applied Informatics (ICAI 2022)

Event date :

from 27-10-2022 to 29-10-2022

Audience :

International

Main work title :

Applied Informatics, 5th International Conference, ICAI 2022, Arequipa, Peru, October 27–29, 2022, Proceedings

Publisher :

Springer Cham

ISBN/EAN :

978-3-031-19647-8

Pages :

121–133

Peer reviewed :

Peer reviewed

Focus Area :

Computational Sciences

Additional URL :

https://link.springer.com/chapter/10.1007/978-3-031-19647-8_9

Available on ORBilu :

since 23 November 2022

Statistics

Number of views

230 (6 by Unilu)

Number of downloads

126 (0 by Unilu)
