Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
LAMSIYAH, Salima; MURUGARAJ, Keerthana; SCHOMMER, Christoph
2023, in: Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
Peer reviewed
 

Files
Full Text: CIST_152.pdf (Publisher postprint, 414.83 kB)

Details



Keywords :
Extractive Text Summarization; Historical Domain; Pre-trained Language Models; HistBERT; Transfer Learning
Abstract :
[en] In recent years, pre-trained language models (PLMs) have achieved remarkable advances in extractive summarization across diverse domains. However, research specifically targeting the historical domain remains scarce. In this paper, we propose a novel method for extractive historical single-document summarization that leverages a domain-aware historical bidirectional language model, pre-trained on a large-scale historical corpus, which we subsequently fine-tune for the extractive summarization task. One major challenge for this task is the lack of annotated datasets for historical summarization. To address this issue, we construct a dataset from archived historical documents provided by the Centre Virtuel de la Connaissance sur l’Europe (CVCE) group at the University of Luxembourg. Furthermore, to better capture the structural features of the input documents, we use a sentence position embedding mechanism that lets the model exploit the positional information of sentences. Experimental results on this dataset show that our method outperforms recent state-of-the-art methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on extractive historical text summarization.
Disciplines :
Computer science
Author, co-author :
LAMSIYAH, Salima; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
MURUGARAJ, Keerthana; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
SCHOMMER, Christoph; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
Publication date :
04 August 2023
Event name :
8th International Conference on Computer and Information Science and Technology (CIST 2023)
Event date :
3-5 August 2023
Main work title :
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
Publisher :
Avestia Publishing (https://avestia.com/), London, United Kingdom
Peer reviewed :
Peer reviewed
Available on ORBilu :
since 16 October 2023
