Keywords:
deep learning testing; test selection; data distribution
Abstract:
Similar to traditional software, which is constantly under evolution, deep neural networks (DNNs) need to evolve with the rapid growth of test data for continuous enhancement, e.g., to adapt to distribution shift in a new deployment environment. However, manually labeling all the collected test data is labor-intensive. Test selection addresses this problem by strategically choosing a small set of data to label; by retraining with the selected set, DNNs can achieve competitive accuracy. Unfortunately, existing selection metrics suffer from three main limitations: 1) they rely on different retraining processes; 2) they ignore data distribution shifts; 3) they are insufficiently evaluated. To fill this gap, we first conduct a systematic empirical study that reveals the impact of the retraining process and the data distribution on model enhancement. Based on our findings, we then propose a novel distribution-aware test (DAT) selection metric. Experimental results show that retraining with both the training data and the selected data outperforms retraining with the selected data alone, and that none of the existing selection metrics performs best across all data distributions. By contrast, DAT effectively alleviates the impact of distribution shifts and outperforms the compared metrics by up to 5 times and by up to a 30.09% accuracy improvement for model enhancement in simulated and in-the-wild distribution shift scenarios, respectively.
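The selection-then-retrain workflow summarized above can be sketched in a few lines of Python. The sketch below is illustrative only and is not the paper's DAT metric: it scores unlabeled test inputs with a Gini-style uncertainty measure, picks a small labeling budget, and, following the study's first finding, retrains on the union of the original training set and the newly labeled inputs. The model interface, label_by_hand, and the data arrays are assumptions for illustration.

import numpy as np

def uncertainty_scores(probs: np.ndarray) -> np.ndarray:
    # Gini-impurity-style uncertainty per input: 1 - sum_c p_c^2.
    # Higher means the model is less confident (illustrative, not DAT).
    return 1.0 - np.sum(probs ** 2, axis=1)

def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
    # Indices of the `budget` most uncertain test inputs to label.
    return np.argsort(-uncertainty_scores(probs))[:budget]

# Assumed workflow around the two helpers above:
#   probs = model.predict(x_test_unlabeled)           # softmax outputs
#   idx = select_for_labeling(probs, budget=100)
#   y_new = label_by_hand(x_test_unlabeled[idx])      # manual labeling
#   # Per the study's finding: retrain on the training data PLUS the
#   # selected data, not on the selected data alone.
#   model.fit(np.concatenate([x_train, x_test_unlabeled[idx]]),
#             np.concatenate([y_train, y_new]))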
Disciplines:
Computer science
Author, co-author:
HU, Qiang ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
GUO, Yuejun ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
CORDY, Maxime ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
Xie, Xiaofei ; Singapore Management University
Ma, Lei ; University of Alberta
PAPADAKIS, Mike ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
LE TRAON, Yves ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
External co-authors:
Yes
Document language:
English
Title:
An Empirical Study on Data Distribution-Aware Test Selection for Deep Learning Enhancement
Publication date:
2022
Journal title:
ACM Transactions on Software Engineering and Methodology