Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Biographical Semi-Supervised Relation Extraction Dataset
PLUM, Alistair; Ranasinghe, Tharindu; Jones, Spencer et al.
2022. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Peer reviewed
 

Files


Full Text
3477495.3531742.pdf
Publisher postprint (1.06 MB)

Details



Keywords :
Biographical information extraction; Relation extraction; Transformers; Annotated datasets; Digital humanities; Neural modelling; On-line documents; Research topics; Semi-supervised; Wikipedia articles; Computer Graphics and Computer-Aided Design; Information Systems; Software
Abstract :
[en] Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.
Disciplines :
Languages & linguistics
Computer science
Author, co-author :
PLUM, Alistair;  University of Wolverhampton > Research Group in Computational Linguistics
Ranasinghe, Tharindu;  University of Wolverhampton, Wolverhampton, United Kingdom
Jones, Spencer;  University of Wolverhampton, Wolverhampton, United Kingdom
Orasan, Constantin;  University of Surrey, Guildford, United Kingdom
Mitkov, Ruslan;  University of Wolverhampton, Wolverhampton, United Kingdom
External co-authors :
yes
Language :
English
Title :
Biographical Semi-Supervised Relation Extraction Dataset
Publication date :
06 July 2022
Event name :
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Event place :
Madrid, Spain
Event date :
11 July 2022 to 15 July 2022
By request :
Yes
Audience :
International
Main work title :
SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher :
Association for Computing Machinery, Inc
ISBN/EAN :
978-1-4503-8732-3
Peer reviewed :
Peer reviewed
Funders :
ACM SIGIR
Available on ORBilu :
since 21 November 2023

Statistics


Number of views
70 (1 by Unilu)
Number of downloads
0 (0 by Unilu)

Scopus citations®
13
Scopus citations® without self-citations
10
OpenCitations
4
OpenAlex citations
8
WoS citations
9
