Black-Box Testing of Deep Neural Networks through Test Case Diversity

2023 • In *IEEE Transactions on Software Engineering*


Author preprint: TSE_BlackBox_diversity_R2__Final_version_.pdf (6.29 MB)


Keywords :

Deep Neural Network; Testing

Abstract :

[en] Deep Neural Networks (DNNs) have been extensively used in many areas, including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNNs. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box, as they require access to the internals or training data of DNNs, which is often not feasible or convenient. Measuring such coverage also requires executing DNNs on candidate inputs, which is not an option in many practical contexts.
In this paper, we investigate diversity metrics as an alternative to white-box coverage criteria. For the reasons above, we require such metrics to be black-box, relying on neither the execution nor the outputs of the DNNs under test. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyze their statistical association with fault detection using four datasets and five DNNs, and further compare diversity with state-of-the-art white-box coverage criteria. As a mechanism to enable such analysis, we also propose a novel way to estimate fault detection in DNNs.
Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria for effectively guiding DNN testing. Indeed, one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. The results also confirm that state-of-the-art coverage criteria are not adequate to guide the construction of test input sets that detect as many faults as possible using natural inputs.
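The abstract's core idea — scoring a test set by the diversity of its image feature embeddings, without running the DNN under test — can be sketched as a determinant-based (DPP-style) diversity score over feature vectors. The function name, normalization, and jitter term below are illustrative assumptions for a minimal NumPy sketch, not the authors' exact implementation:

```python
import numpy as np

def geometric_diversity(features: np.ndarray) -> float:
    """Diversity of a test set as the log-volume spanned by its
    feature vectors (rows): log det of the similarity kernel
    S = F F^T, with L2-normalized rows F.

    Larger values mean inputs are more spread out in feature space;
    near-duplicate inputs drive the determinant toward zero.
    """
    # L2-normalize each feature vector so similarity is cosine-like.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    f = features / np.clip(norms, 1e-12, None)
    kernel = f @ f.T
    # Small jitter keeps the log-determinant finite for rank-deficient
    # kernels; slogdet is numerically safer than log(det(...)).
    _, logdet = np.linalg.slogdet(kernel + 1e-9 * np.eye(len(f)))
    return float(logdet)

# Toy check: a spread-out set scores higher than a near-duplicate set.
rng = np.random.default_rng(0)
diverse = rng.normal(size=(5, 16))
dupes = np.tile(rng.normal(size=(1, 16)), (5, 1)) + 1e-3 * rng.normal(size=(5, 16))
assert geometric_diversity(diverse) > geometric_diversity(dupes)
```

In the paper's black-box setting, the feature rows would come from a pre-trained feature extractor (e.g., a VGG-style network) applied to the candidate inputs, never from the DNN under test.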

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV - Software Verification and Validation

Disciplines :

Computer science

Aghababaeyan, Zohreh; University of Ottawa

Abdellatif, Manel; École de Technologie Supérieure

BRIAND, Lionel; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

S., Ramesh; General Motors

Bagherzadeh, Mojtaba; University of Ottawa

External co-authors :

yes

Language :

English

Title :

Black-Box Testing of Deep Neural Networks through Test Case Diversity

Publication date :

2023

Journal title :

IEEE Transactions on Software Engineering

ISSN :

0098-5589

eISSN :

1939-3520

Publisher :

Institute of Electrical and Electronics Engineers (IEEE), New York, NY, United States

Peer reviewed :

Peer Reviewed verified by ORBi

Focus Area :

Security, Reliability and Trust

Funders :

General Motors

Available on ORBilu :

since 03 March 2023

Scopus citations^{®} :

15 (12 without self-citations)

OpenCitations :

2
