WikiDoMiner: Wikipedia Domain-Specific Miner

EZZINI, Saad; ABUALHAIJA, Sallam; Sabetzadeh, Mehrdad

doi:10.1145/3540250.3558916

Paper published in a book (Scientific congresses, symposiums and conference proceedings)

2022 • In ACM SIGSOFT CONFERENCE ON THE FOUNDATIONS OF SOFTWARE ENGINEERING

Peer reviewed

Permalink
https://hdl.handle.net/10993/52638

Files

Full Text

WikiDoMiner.pdf

Publisher postprint (807.64 kB)

Details

Keywords :

Requirements Engineering; Natural-language Requirements; Natural Language Processing; Domain-specific Corpus Generation; Wikipedia

Abstract :

We introduce WikiDoMiner, a tool for automatically generating domain-specific corpora by crawling Wikipedia. WikiDoMiner helps requirements engineers create an external knowledge resource that is specific to the underlying domain of a given requirements specification (RS). Being able to build such a resource is important since domain-specific datasets are scarce. WikiDoMiner generates a corpus by first extracting a set of domain-specific keywords from a given RS, and then querying Wikipedia for these keywords. The output of WikiDoMiner is a set of Wikipedia articles relevant to the domain of the input RS. Mining Wikipedia for domain-specific knowledge can be beneficial for multiple requirements engineering tasks, e.g., ambiguity handling, requirements classification, and question answering. WikiDoMiner is publicly available on Zenodo under an open-source license (https://doi.org/10.5281/zenodo.6672682).
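
The two-step pipeline described in the abstract (extract keywords from an RS, then query Wikipedia for them) can be illustrated with a minimal sketch. This is not the authors' implementation: the frequency-based ranking, the tiny stopword list, and the function names `extract_keywords` and `build_search_url` are assumptions made for illustration only. The URL builder targets the public MediaWiki Action API search endpoint and performs no network request itself.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# Tiny illustrative stopword list (an assumption, not taken from WikiDoMiner).
STOPWORDS = {
    "the", "a", "an", "of", "to", "and", "shall", "be", "is", "in",
    "for", "with", "by", "on", "that", "each", "must", "when", "at", "from",
}

WIKI_API = "https://en.wikipedia.org/w/api.php"


def extract_keywords(rs_text, top_k=5):
    """Rank candidate keywords by raw frequency.

    A simple stand-in for the keyword-extraction step; the actual tool
    may use a different ranking method.
    """
    tokens = re.findall(r"[a-z][a-z-]+", rs_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def build_search_url(keyword, limit=3):
    """Build a MediaWiki search query URL for one keyword (no request is sent)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keyword,
        "srlimit": limit,
        "format": "json",
    }
    return f"{WIKI_API}?{urlencode(params)}"


# Hypothetical two-sentence requirements specification.
rs = (
    "The satellite shall transmit telemetry to the ground station. "
    "The ground station shall store telemetry from each satellite."
)
for kw in extract_keywords(rs, top_k=3):
    print(kw, build_search_url(kw))
```

Fetching each URL and collecting the returned article titles would yield a small domain-specific corpus seed. A realistic pipeline would also need to filter generic terms and follow related articles, which this sketch omits.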

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV - Software Verification and Validation

Disciplines :

Computer science

Author, co-author :

EZZINI, Saad ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

ABUALHAIJA, Sallam ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

Sabetzadeh, Mehrdad

External co-authors :

yes

Language :

English

Publication date :

2022

Event name :

ACM SIGSOFT CONFERENCE ON THE FOUNDATIONS OF SOFTWARE ENGINEERING

Event date :

from 14-11-2022 to 18-11-2022

Main work title :

ACM SIGSOFT CONFERENCE ON THE FOUNDATIONS OF SOFTWARE ENGINEERING

Publisher :

Association for Computing Machinery

Peer reviewed :

Peer reviewed

FnR Project :

FNR12632261 - Early Quality Assurance Of Critical Systems, 2018 (01/01/2019-31/12/2021) - Mehrdad Sabetzadeh

Funders :

FNR - Fonds National de la Recherche

Available on ORBilu :

since 07 November 2022

Statistics

Number of views

84 (15 by Unilu)

Number of downloads

54 (3 by Unilu)
