[en] Applying deep learning (DL) to science has become a prominent trend in recent years, making DL engineering an important problem. Building a DL model normally involves training data preparation, model architecture design, and model training, all of which are complex and costly. Reusing open-source pre-trained models is therefore a practical way for developers to bypass this hurdle. Given a specific task, developers can collect numerous pre-trained deep neural networks (DNNs) from public sources for reuse. However, testing the performance (e.g., accuracy and robustness) of multiple DNNs and recommending which model to use is challenging because labeled data is scarce and domain expertise is required. In this article, we propose a labeling-free (LaF) model selection approach that removes the need for labeling effort in automated model reuse. The main idea is to statistically learn a Bayesian model that infers each model's specialty based solely on its predicted labels. We evaluate LaF on nine benchmark datasets covering image, text, and source code, and on 165 DNNs, considering both model accuracy and robustness. The experimental results demonstrate that LaF outperforms the baseline methods by up to 0.74 in Spearman's correlation and 0.53 in Kendall's τ.
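The evaluation compares an estimated model ranking against the ground-truth ranking using Spearman's correlation and Kendall's τ. The sketch below (not the authors' code) shows how these two rank-correlation metrics are computed, using hypothetical accuracy scores for five candidate DNNs; it assumes no tied scores.

```python
# Minimal sketch: Spearman's rho and Kendall's tau between a ground-truth
# model ranking and a labeling-free estimate. All scores are hypothetical.

def ranks(scores):
    """Map each score to its 1-based rank (1 = smallest), assuming no ties."""
    order = sorted(scores)
    return [order.index(s) + 1 for s in scores]

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (no ties)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall(x, y):
    """Kendall's tau: (concordant - discordant) / total pairs (no ties)."""
    n = len(x)
    s = sum(
        (1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1)
        for i in range(n) for j in range(i + 1, n)
    )
    return s / (n * (n - 1) / 2)

# Ground-truth accuracies of five candidate DNNs vs. scores inferred
# without labels (both hypothetical).
true_scores = [0.91, 0.88, 0.79, 0.85, 0.62]
estimated   = [0.89, 0.90, 0.75, 0.80, 0.65]

print(spearman(true_scores, estimated))  # 0.9
print(kendall(true_scores, estimated))   # 0.8
```

Both metrics lie in [-1, 1], where 1 means the estimated ranking orders the models exactly as the ground truth does; the differences reported in the abstract are gains on these scales.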
Disciplines:
Computer science
Author, co-author:
HU, Qiang ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
GUO, Yuejun ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > SerVal > Team Yves LE TRAON ; Luxembourg Institute of Science and Technology, Luxembourg