Covid-19; Text Mining; Topic Modeling; CORD19; Medical Publication; Latent Dirichlet Allocation
Abstract :
[en] In this paper we investigate how scientific and medical papers about Covid-19 can be mined effectively. For this purpose we use the CORD19 dataset, which is a huge collection of all papers published about and around the SARS-CoV-2 virus and the pandemic it caused. We discuss how classical text mining algorithms such as Latent Semantic Analysis (LSA) or its modern version Latent Dirichlet Allocation (LDA) can be used for this purpose, touch on more recent variants of these algorithms such as word2vec, which arrived with the deep learning wave, and show the advantages and disadvantages of each. We finish the paper by showing some example topics from the corpus and answering questions such as which topics are the most prominent in the corpus and what percentage of the corpus is dedicated to them. We also discuss how topics around RNA research in connection with Covid-19 can be examined.
Disciplines :
Computer science
Author, co-author :
Sirajzade, Joshgun ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Bouvry, Pascal ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Schommer, Christoph ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Deep Mining Covid-19 Literature
Publication date :
2022
Event name :
International Conference on Applied Informatics (ICAI 2022)
Event date :
from 27-10-2022 to 29-10-2022
Audience :
International
Main work title :
Applied Informatics, 5th International Conference, ICAI 2022, Arequipa, Peru, October 27–29, 2022, Proceedings