Software repositories are one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, which aims to extract knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables that may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study. We illustrate our approach with use cases based on Hugging Face repositories.
Research center:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Other
Disciplines:
Computer science
Author, co-author:
Gorostidi, June ; Universitat Oberta de Catalunya (UOC), IN3, Barcelona, Spain
AIT-MIMOUNE FONOLLA, Adem ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > PI Cabot
CABOT, Jordi ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > PI Cabot ; Luxembourg Institute of Science and Technology (LIST), Esch-sur-Alzette, Luxembourg
Canovas Izquierdo, Javier Luis ; Universitat Oberta de Catalunya (UOC), IN3, Barcelona, Spain
External co-authors:
yes
Document language:
English
Title:
On the Creation of Representative Samples of Software Repositories
Publication date:
24 October 2024
Event name:
Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
Event location:
Barcelona, Spain
Event date:
24-10-2024 to 25-10-2024
Main work title:
Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2024
MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR FNR - Luxembourg National Research Fund
Funding (details):
This work is part of the project TED2021-130331B-I00 funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR; and BESSER, funded by the Luxembourg National Research Fund (FNR) PEARL program, grant agreement 16544475.