Keywords :
Extractive Text Summarization; Historical Domain; Pre-trained Language Models; HistBERT; Transfer Learning
Abstract :
[en] In recent years, pre-trained language models (PLMs) have achieved remarkable advances in extractive summarization
across diverse domains. However, research specifically targeting the historical domain remains scarce. In this paper, we propose a
novel method for extractive historical single-document summarization that leverages a domain-aware historical
bidirectional language model pre-trained on a large-scale historical corpus. We then fine-tune this model
for the task of extractive historical single-document summarization. One major challenge for this task is the lack of annotated datasets
for historical summarization. To address this issue, we construct a dataset by collecting archived historical documents from the Centre
Virtuel de la Connaissance sur l’Europe (CVCE) group at the University of Luxembourg. Furthermore, to better capture the structural
features of the input documents, we use a sentence position embedding mechanism that encodes the position of each sentence
within a document. Experimental results on this dataset show that our method outperforms
recent state-of-the-art methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the
first work on extractive historical text summarization.
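
To make the sentence position embedding idea concrete, the minimal PyTorch sketch below illustrates one common way such a mechanism can be wired into a BERT-style extractive summarizer: each sentence representation produced by the encoder is augmented with a learned position embedding before a binary keep/drop classifier scores it for extraction. The class name, hidden size, maximum sentence count, and classifier head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SentencePositionExtractiveScorer(nn.Module):
    """Hypothetical sketch: scores sentences for extraction by adding a
    learned sentence-position embedding to each sentence representation
    produced by a (historical-domain) BERT-style encoder."""

    def __init__(self, hidden_size: int = 768, max_sentences: int = 128):
        super().__init__()
        # Learned embedding for the position of each sentence in the document.
        self.position_embedding = nn.Embedding(max_sentences, hidden_size)
        # Binary keep/drop classifier applied to each sentence.
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, sentence_vectors: torch.Tensor) -> torch.Tensor:
        # sentence_vectors: (batch, num_sentences, hidden_size), e.g. the
        # per-sentence [CLS] states returned by the fine-tuned encoder.
        num_sentences = sentence_vectors.size(1)
        positions = torch.arange(num_sentences, device=sentence_vectors.device)
        enriched = sentence_vectors + self.position_embedding(positions)
        # One extraction score per sentence: (batch, num_sentences).
        return self.classifier(enriched).squeeze(-1)


if __name__ == "__main__":
    scorer = SentencePositionExtractiveScorer()
    dummy = torch.randn(2, 10, 768)   # 2 documents, 10 sentences each
    print(scorer(dummy).shape)        # torch.Size([2, 10])
```

In this kind of setup, the highest-scoring sentences are selected to form the extractive summary, and the selection is typically evaluated against reference summaries with ROUGE-1, ROUGE-2, and ROUGE-L F1, as reported in the abstract.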
Disciplines :
Computer science
Author, co-author :
LAMSIYAH, Salima ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
MURUGARAJ, Keerthana ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
SCHOMMER, Christoph ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
Publication date :
04 August 2023
Event name :
8th International Conference on Computer and Information Science and Technology (CIST 2023)
Event date :
3-5 August 2023
Main work title :
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization