No full text
Eprint already available on another site (E-prints, Working papers and Research blog)
Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
MURUGARAJ, Keerthana; LAMSIYAH, Salima; DURING, Marten et al.
2025
 

Details



Keywords :
Computer Science - Computation and Language; Computer Science - Artificial Intelligence; Computer Science - Information Retrieval
Abstract :
[en] Extracting coherent, human-understandable themes from large collections of unstructured historical newspaper archives is challenging because of topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic, a neural topic-modeling approach that leverages transformer-based embeddings to extract and classify topics and that, despite its growing popularity, remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to explore patterns in that discourse more accurately, including the co-occurrence of themes related to nuclear power and nuclear weapons and shifts in their importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
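As a rough illustration of what tracing the temporal evolution of topics can involve, the sketch below computes each topic's share of articles per year. It is not the authors' pipeline: it assumes every article has already been assigned a single topic label (for example by a topic model such as BERTopic), and the function name `topic_shares_by_year`, the topic labels, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def topic_shares_by_year(assignments):
    """Return {year: {topic: share}} from (year, topic) pairs, one per article.

    The share is the fraction of that year's articles assigned to the topic.
    """
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1

    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {topic: n / total for topic, n in counts[year].items()}
    return shares


# Hypothetical toy data: (publication year, assigned topic label).
toy_assignments = [
    (1986, "nuclear safety"),
    (1986, "nuclear safety"),
    (1986, "nuclear weapons"),
    (2011, "nuclear safety"),
    (2011, "nuclear power"),
]

shares = topic_shares_by_year(toy_assignments)
for year, dist in shares.items():
    print(year, dist)
```

Normalizing by the number of articles per year, rather than reporting raw counts, keeps years with heavier newspaper coverage from dominating the trend; whether the study normalizes in this way is not stated in the abstract.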
Disciplines :
Computer science
Author, co-author :
MURUGARAJ, Keerthana; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
LAMSIYAH, Salima; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
DURING, Marten; University of Luxembourg > Luxembourg Centre for Contemporary and Digital History (C2DH) > Digital History and Historiography
THEOBALD, Martin; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Language :
English
Title :
Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
Publication date :
2025
Commentary :
This is a preprint of a manuscript submitted to Digital Scholarship in the Humanities (Oxford University Press). The paper is currently under peer review.
Available on ORBilu :
since 19 December 2025
