Mining software repositories; Data analysis; Empirical study; ML; Hugging Face Hub
Abstract:
Context. Empirical studies in software engineering mainly rely on the data available on code-hosting platforms, with GitHub being the most representative. In recent years, however, the emergence of Machine Learning (ML) has led to the development of platforms specifically designed for hosting ML-based projects, with Hugging Face Hub (HFH) as the most popular one. So far, no study has evaluated the potential of HFH as a data source for such empirical studies.
Objective. We aim to perform an exploratory study of the current state of HFH and its suitability as a source platform for empirical studies.
Method. We conduct a qualitative and quantitative analysis of HFH. The former will be performed by comparing the features of HFH with those of other code-hosting platforms, such as GitHub and GitLab. The latter will be performed by analyzing the data available in HFH.
Results. We propose a feature framework to characterize HFH and report on the current usage of the platform, in terms of both the number and types of projects (and their surrounding community) and the features they rely on most.
Conclusions. The results confirm that HFH offers enough features and diverse enough data to be the source of relevant empirical studies on the development, evolution and usage of AI-related projects. The results also triggered a discussion on aspects of HFH that should be considered when performing such empirical studies.
Ministerio de Ciencia e Innovación; Fonds National de la Recherche Luxembourg
Funding (details):
This work is part of the project TED2021-130331B-I00 funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR; and BESSER, funded by the Luxembourg National Research Fund (FNR) PEARL program, grant agreement 16544475.