Keywords :
Extractive Text Summarization; Historical Domain; Pre-trained Language Models; HistBERT; Transfer Learning
Abstract :
[en] In recent years, pre-trained language models (PLMs) have achieved remarkable advances in extractive summarization
across diverse domains. However, research specifically targeting the historical domain remains scarce. In this paper, we propose a
novel method for extractive historical single-document summarization that leverages a domain-aware historical
bidirectional language model pre-trained on a large-scale historical corpus. We then fine-tune this model
for the task of extractive historical single-document summarization. One major challenge for this task is the lack of annotated datasets
for historical summarization. To address this issue, we construct a dataset by collecting archived historical documents from the Centre
Virtuel de la Connaissance sur l’Europe (CVCE) group at the University of Luxembourg. Furthermore, to better capture the structural
features of the input documents, we use a sentence position embedding mechanism that encodes the position of each sentence
within a document. Experimental results on this dataset show that our method outperforms
recent state-of-the-art methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the
first work on extractive historical text summarization.
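
To make the sentence position embedding idea concrete, the minimal PyTorch sketch below illustrates one common way such a mechanism can be wired into a BERT-style extractive summarizer: each sentence representation produced by the encoder is augmented with a learned position embedding before a binary keep/drop classifier scores it for extraction. The class name, hidden size, maximum sentence count, and classifier head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SentencePositionExtractiveScorer(nn.Module):
    """Hypothetical sketch: scores sentences for extraction by adding a
    learned sentence-position embedding to each sentence representation
    produced by a (historical-domain) BERT-style encoder."""

    def __init__(self, hidden_size: int = 768, max_sentences: int = 128):
        super().__init__()
        # Learned embedding for the position of each sentence in the document.
        self.position_embedding = nn.Embedding(max_sentences, hidden_size)
        # Binary keep/drop classifier applied to each sentence.
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, sentence_vectors: torch.Tensor) -> torch.Tensor:
        # sentence_vectors: (batch, num_sentences, hidden_size), e.g. the
        # per-sentence [CLS] states returned by the fine-tuned encoder.
        num_sentences = sentence_vectors.size(1)
        positions = torch.arange(num_sentences, device=sentence_vectors.device)
        enriched = sentence_vectors + self.position_embedding(positions)
        # One extraction score per sentence: (batch, num_sentences).
        return self.classifier(enriched).squeeze(-1)


if __name__ == "__main__":
    scorer = SentencePositionExtractiveScorer()
    dummy = torch.randn(2, 10, 768)   # 2 documents, 10 sentences each
    print(scorer(dummy).shape)        # torch.Size([2, 10])
```

In this kind of setup, the highest-scoring sentences are selected to form the extractive summary, and the selection is typically evaluated against reference summaries with ROUGE-1, ROUGE-2, and ROUGE-L F1, as reported in the abstract.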
Disciplines :
Computer science
Author, co-author :
LAMSIYAH, Salima ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
MURUGARAJ, Keerthana ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
SCHOMMER, Christoph ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
Publication date :
04 August 2023
Event name :
8th International Conference on Computer and Information Science and Technology (CIST 2023)
Event date :
3-5 August 2023
Main work title :
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization