Khan, Zanis Ali, in Proceedings of the 44th International Conference on Software Engineering (ICSE ’22) (2022, July)

Log message template identification aims to convert raw logs containing free-formed log messages into structured logs that can be processed by automated log-based analyses, such as anomaly detection and model inference. While many techniques have been proposed in the literature, only two recent studies provide a comprehensive evaluation and comparison of the techniques using an established benchmark composed of real-world logs. Nevertheless, we argue that both studies have the following issues: (1) they used different accuracy metrics without comparing them, (2) some ground-truth (oracle) templates are incorrect, and (3) the accuracy evaluation results do not provide any information regarding incorrectly identified templates. In this paper, we address the above issues by providing three guidelines for assessing the accuracy of log template identification techniques: (1) use appropriate accuracy metrics, (2) perform oracle template correction, and (3) perform analysis of incorrect templates. We then assess the application of these guidelines through a comprehensive evaluation of 14 existing template identification techniques on the established benchmark logs. The results provide very different insights than existing studies and, in particular, a much less optimistic outlook on existing techniques.
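As a rough illustration of what template identification does, the sketch below (not taken from the paper or from any of the 14 evaluated techniques) masks likely variable parts of each free-formed message and groups messages by the resulting template; the regular expressions and example log lines are invented for the example.

```python
import re
from collections import defaultdict

# Toy template identifier: mask likely variable parts of a log message
# and use the masked form as its template. Masks and logs are made up.
MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),   # IPv4 addresses
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),      # hex values
    (re.compile(r"\b\d+\b"), "<NUM>"),                 # plain integers
]

def identify_template(message: str) -> str:
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

def group_by_template(messages):
    groups = defaultdict(list)
    for msg in messages:
        groups[identify_template(msg)].append(msg)
    return groups

if __name__ == "__main__":
    logs = [
        "Connection from 10.0.0.5 closed after 120 ms",
        "Connection from 10.0.0.9 closed after 87 ms",
        "Worker 3 started",
    ]
    for template, msgs in group_by_template(logs).items():
        print(f"{template}  ({len(msgs)} messages)")
```

The structured output (template plus grouped messages) is what downstream analyses such as anomaly detection consume, which is why the accuracy of this step matters.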
Ul Haq, Fitash, in Proceedings of the 44th International Conference on Software Engineering (ICSE ’22) (2022, May)

With the recent advances of Deep Neural Networks (DNNs) in real-world applications, such as Automated Driving Systems (ADS) for self-driving cars, ensuring the reliability and safety of such DNN-enabled systems has emerged as a fundamental topic in software testing. One of the essential testing phases of such DNN-enabled systems is online testing, where the system under test is embedded into a specific, often simulated, application environment (e.g., a driving environment) and tested in a closed-loop mode in interaction with that environment. However, despite the importance of online testing for detecting safety violations, automatically generating new and diverse test data that lead to safety violations presents the following challenges: (1) many safety requirements may need to be considered at the same time, (2) running a high-fidelity simulator is often very computationally intensive, and (3) the space of all possible test data that may trigger safety violations is too large to be exhaustively explored. In this paper, we address these challenges by proposing a novel approach, called SAMOTA (Surrogate-Assisted Many-Objective Testing Approach), which extends existing many-objective search algorithms for test suite generation to efficiently utilize surrogate models that mimic the simulator but are much less expensive to run. Empirical evaluation results on Pylot, an advanced ADS composed of multiple DNNs, using CARLA, a high-fidelity driving simulator, show that SAMOTA is significantly more effective and efficient at detecting unknown safety requirement violations than state-of-the-art many-objective test suite generation algorithms and random search. In other words, SAMOTA appears to be a key enabling technology for online testing in practice.
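The following sketch illustrates the general surrogate-assisted idea that SAMOTA builds on, not SAMOTA itself: previously simulated test cases train a cheap surrogate (here a 1-nearest-neighbour lookup), which ranks new candidates so that only the most promising ones are sent to the expensive simulator. The fitness function, candidate encoding, and budgets are all invented for illustration.

```python
import random
import math

def expensive_simulation(candidate):
    # Stand-in for a high-fidelity simulator run; lower fitness means
    # "closer to a safety violation" in this toy setup.
    x, y = candidate
    return (x - 0.7) ** 2 + (y - 0.2) ** 2

def surrogate_predict(candidate, archive):
    # Cheap surrogate: fitness of the nearest previously simulated candidate.
    nearest = min(archive, key=lambda item: math.dist(item[0], candidate))
    return nearest[1]

random.seed(0)
archive = []
for _ in range(10):                       # seed the archive with real runs
    c = (random.random(), random.random())
    archive.append((c, expensive_simulation(c)))

for _ in range(5):                        # search iterations
    candidates = [(random.random(), random.random()) for _ in range(50)]
    candidates.sort(key=lambda c: surrogate_predict(c, archive))
    for c in candidates[:3]:              # simulate only the most promising
        archive.append((c, expensive_simulation(c)))

best = min(archive, key=lambda item: item[1])
print("best candidate:", best)
```

The design point is the ratio of cheap surrogate evaluations (hundreds) to expensive simulator runs (a handful per iteration), which is where the efficiency gain comes from.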
Shin, Donghwan, in Empirical Software Engineering (2022)

Behavioral software models play a key role in many software engineering tasks; unfortunately, these models either are not available during software development or, if available, quickly become outdated as implementations evolve. Model inference techniques have been proposed as a viable solution to extract finite-state models from execution logs. However, existing techniques do not scale well when processing the very large logs that are common in practice. In this paper, we address the scalability problem of inferring the model of a component-based system from large system logs, without requiring any extra information. Our model inference technique, called PRINS, follows a divide-and-conquer approach. The idea is to first infer a model of each system component from the corresponding logs; then, the individual component models are merged together, taking into account the flow of events across components as reflected in the logs. We evaluated PRINS in terms of scalability and accuracy, using nine datasets composed of logs extracted from publicly available benchmarks and from a personal computer running desktop business applications. The results show that PRINS can process large logs much faster than a publicly available and well-known state-of-the-art tool, without significantly compromising the accuracy of the inferred models.
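A drastically simplified sketch of the divide-and-conquer idea follows; it is not the PRINS algorithm. The per-component "model" is reduced to a set of observed event transitions, and the cross-component hand-offs stand in for the information used when merging component models. The example log is hypothetical.

```python
from collections import defaultdict

system_log = [  # hypothetical (component, event) pairs in execution order
    ("ui", "click"), ("ui", "validate"),
    ("db", "open"), ("db", "query"), ("db", "close"),
    ("ui", "render"),
]

def infer_component_models(log):
    # Step 1: project the system log onto each component.
    per_component = defaultdict(list)
    for component, event in log:
        per_component[component].append(event)
    # Step 2: infer a tiny per-component "model" (observed event transitions).
    return {c: set(zip(events, events[1:])) for c, events in per_component.items()}

def cross_component_flow(log):
    # Record where control passes from one component to another; a real
    # technique would use such hand-off points to stitch component models.
    flow = set()
    for (c1, e1), (c2, e2) in zip(log, log[1:]):
        if c1 != c2:
            flow.add((c1, e1, c2, e2))
    return flow

print(infer_component_models(system_log))
print(cross_component_flow(system_log))
```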
Shin, Donghwan, in Proceedings of the 21st International Conference on Runtime Verification (2021, October)

Log-based anomaly detection identifies a system's anomalous behaviors by analyzing the runtime information recorded in its logs. While many approaches have been proposed, all of them share an essential pre-processing step called log parsing. This step is needed because automated log analysis requires structured input logs, whereas original logs contain semi-structured text printed by logging statements. Log parsing bridges this gap by converting the original logs into structured input logs fit for anomaly detection. Despite the intrinsic dependency between log parsing and anomaly detection, no existing work has investigated the impact of the "quality" of log parsing results on anomaly detection. In particular, the concept of "ideal" log parsing results with respect to anomaly detection has not been formalized yet. This makes it difficult to determine, upon obtaining inaccurate anomaly detection results, whether (and why) the root cause lies in the log parsing step. In this short paper, we lay the theoretical foundations for defining the concept of "ideal" log parsing results for anomaly detection. Based on these foundations, we discuss practical implications regarding the identification and localization of root causes when dealing with inaccurate anomaly detection, and the identification of irrelevant log messages.

Ul Haq, Fitash, in Empirical Software Engineering (2021), 26(5)

We distinguish two general modes of testing for Deep Neural Networks (DNNs): offline testing, where DNNs are tested as individual units based on test datasets obtained without involving the DNNs under test, and online testing, where DNNs are embedded into a specific application environment and tested in a closed-loop mode in interaction with that environment. Typically, DNNs are subjected to both types of testing during their development life cycle: offline testing is applied immediately after DNN training, and online testing follows offline testing once a DNN is deployed within a specific application environment. In this paper, we study the relationship between offline and online testing. Our goal is to determine how offline testing and online testing differ or complement one another, and whether offline testing results can be used to help reduce the cost of online testing. Though these questions are generally relevant to all autonomous systems, we study them in the context of automated driving systems where, as study subjects, we use DNNs automating end-to-end control of the steering functions of self-driving vehicles. Our results show that offline testing is less effective than online testing: many safety violations identified by online testing could not be identified by offline testing, while large prediction errors generated by offline testing always led to severe safety violations detectable by online testing. Further, we cannot exploit offline testing results to reduce the cost of online testing in practice, since we were not able to identify specific situations where offline testing could be as accurate as online testing in identifying safety requirement violations.

Messaoudi, Salma, in Proceedings of ISSTA '21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (2021, July)

Regression testing is arguably one of the most important activities in software testing. However, its cost-effectiveness and usefulness can be largely impaired by complex system test cases that are poorly designed (e.g., test cases containing multiple test scenarios combined into a single test case) and that require a large amount of time and resources to run. One way to mitigate this issue is to decompose such system test cases into smaller, separate test cases, each containing only one test scenario and its corresponding assertions, so that the execution time of the decomposed test cases is lower than that of the original test cases, while their test effectiveness is preserved. This decomposition can be achieved with program slicing techniques, since test cases are software programs too. However, existing static and dynamic slicing techniques exhibit limitations when (1) the test cases use external resources, (2) code instrumentation is not a viable option, and (3) test execution is expensive. In this paper, we propose a novel approach, called DS3 (Decomposing System teSt caSe), which automatically decomposes a complex system test case into separate test case slices. The idea is to use test case execution logs, obtained from past regression testing sessions, to identify "hidden" dependencies in the slices generated by static slicing. Since logs include run-time information about the system under test, we can use them to extract the access and usage of global resources and refine the slices generated by static slicing. We evaluated DS3 in terms of slicing effectiveness and compared it with a vanilla static slicing tool. We also compared the slices obtained by DS3 with the corresponding original system test cases in terms of test efficiency and effectiveness. The evaluation results on one proprietary system and one open-source system show that DS3 is able to accurately identify the dependencies related to the usage of global resources, which vanilla static slicing misses. Moreover, the generated test case slices are, on average, 3.56 times faster than the original system test cases and exhibit no significant loss in terms of fault detection effectiveness.
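The sketch below illustrates the log-based dependency idea in a toy form, not DS3 itself: execution log lines (tagged here with the scenario that produced them, an assumption made for the example) are scanned for accesses to global resources, and scenario pairs that touch the same resource are reported as dependent, so they cannot safely be split into independent slices.

```python
import re
from collections import defaultdict

execution_log = [  # hypothetical (scenario, log line) pairs
    ("scenario_1", "INFO opened file /tmp/shared.cfg"),
    ("scenario_1", "INFO wrote 42 bytes to /tmp/shared.cfg"),
    ("scenario_2", "INFO opened file /tmp/other.dat"),
    ("scenario_3", "INFO read settings from /tmp/shared.cfg"),
]

RESOURCE = re.compile(r"(/[\w./-]+)")    # crude file-path extractor

def resources_per_scenario(log):
    usage = defaultdict(set)
    for scenario, line in log:
        for path in RESOURCE.findall(line):
            usage[scenario].add(path)
    return usage

def dependent_pairs(usage):
    # Two scenarios are "hidden" dependents if they access a common resource.
    scenarios = sorted(usage)
    return [(a, b) for i, a in enumerate(scenarios) for b in scenarios[i + 1:]
            if usage[a] & usage[b]]

print(dependent_pairs(resources_per_scenario(execution_log)))
# -> [('scenario_1', 'scenario_3')]
```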
Ul Haq, Fitash, in 2021 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) (2021, July)

Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Point detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation remain challenging because KP-DNNs predict many independent key-points at the same time, each of which may be critical in the targeted application, and because images can vary a great deal according to many factors. In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-point detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites that cause severe mispredictions for, on average, more than 93% of all key-points. In comparison, random search-based test data generation can only cause severe mispredictions for 41% of them. Many of these mispredictions, however, are not avoidable and should therefore not be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining.
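The sketch below shows one plausible way to frame the problem as many objectives, not the paper's exact fitness definitions: each key-point's prediction error is treated as its own objective, and a key-point counts as severely mispredicted when its error exceeds a threshold (the 10-pixel value is a made-up assumption).

```python
import math

SEVERITY_THRESHOLD = 10.0   # pixels; hypothetical value chosen for the example

def keypoint_errors(predicted, ground_truth):
    # One objective per key-point: Euclidean distance between the predicted
    # and the ground-truth position.
    return [math.dist(p, g) for p, g in zip(predicted, ground_truth)]

def severely_mispredicted(errors, threshold=SEVERITY_THRESHOLD):
    # Indices of key-points whose error exceeds the severity threshold.
    return [i for i, e in enumerate(errors) if e > threshold]

predicted    = [(100.0, 50.0), (130.0, 52.0), (162.0, 90.0)]
ground_truth = [(101.0, 49.0), (128.0, 53.0), (140.0, 75.0)]

errors = keypoint_errors(predicted, ground_truth)
print("per-key-point errors:", [round(e, 1) for e in errors])
print("severely mispredicted key-points:", severely_mispredicted(errors))
```

A many-objective search would try to maximize several of these per-key-point errors at once, rather than collapsing them into a single aggregate score.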
Ben Abdessalem (helali), Raja, in 2021 IEEE 14th International Conference on Software Testing, Validation and Verification (ICST) (2021, May 25)

The increasing levels of software- and data-intensive driving automation call for an evolution of automotive software testing. As a recommended practice of the Verification and Validation (V&V) process of ISO/PAS 21448, a candidate standard for the safety of the intended functionality of road vehicles, simulation-based testing has the potential to reduce both risks and costs. There is a growing body of research on devising test automation techniques using simulators for Advanced Driver-Assistance Systems (ADAS). However, how similar are the results if the same test scenarios are executed in different simulators? We conduct a replication study of applying a Search-Based Software Testing (SBST) solution to a real-world ADAS (PeVi, a pedestrian vision detection system) using two different commercial simulators, namely TASS/Siemens PreScan and ESI Pro-SiVIC. Based on a minimalistic scene, we compare critical test scenarios generated using our SBST solution in these two simulators. We show that SBST can be used to effectively and efficiently generate critical test scenarios in both simulators, and that the test results obtained from the two simulators can reveal several weaknesses of the ADAS under test. However, executing the same test scenarios in the two simulators leads to notable differences in the details of the test outputs, in particular related to (1) the safety violations revealed by the tests and (2) the dynamics of cars and pedestrians. Based on our findings, we recommend that future V&V plans include multiple simulators to support robust simulation-based testing, and that test objectives be based on measures that are less dependent on the internals of the simulators.

Ul Haq, Fitash, in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST) (2020, August 05)

There is a growing body of research on developing testing techniques for Deep Neural Networks (DNNs). We distinguish two general modes of testing for DNNs: offline testing, where DNNs are tested as individual units based on test datasets obtained independently from the DNNs under test, and online testing, where DNNs are embedded into a specific application and tested in a closed-loop mode in interaction with the application environment. In addition, we identify two sources for generating test datasets for DNNs: datasets obtained from real life and datasets generated by simulators. While offline testing can be used with datasets obtained from either source, online testing is largely confined to using simulators, since online testing within real-life applications can be time-consuming, expensive, and dangerous. In this paper, we study the following two important questions aiming to compare test datasets and testing modes for DNNs: First, can we use simulator-generated data as a reliable substitute for real-world data for the purpose of DNN testing? Second, how do online and offline testing results differ and complement each other? Though these questions are generally relevant to all autonomous systems, we study them in the context of automated driving systems where, as study subjects, we use DNNs automating end-to-end control of cars' steering actuators. Our results show that simulator-generated datasets are able to yield DNN prediction errors that are similar to those obtained by testing DNNs with real-life datasets. Further, offline testing is more optimistic than online testing: many safety violations identified by online testing could not be identified by offline testing, while large prediction errors generated by offline testing always led to severe safety violations detectable by online testing.
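As a minimal sketch of the two measurements being contrasted (with invented data and thresholds, not the study's actual metrics or subjects), offline testing can be summarized as a prediction-error score over a labelled dataset, while online testing checks a closed-loop safety property observed during a simulated drive:

```python
def offline_prediction_error(predicted_angles, labelled_angles):
    # Offline view: mean absolute error of predicted vs. recorded steering angles.
    return sum(abs(p - l) for p, l in zip(predicted_angles, labelled_angles)) / len(labelled_angles)

def online_safety_violation(lane_offsets, max_offset=1.5):
    # Online view: a violation occurs if the car ever departs too far from the
    # lane centre during the drive (max_offset in metres is a made-up threshold).
    return any(abs(o) > max_offset for o in lane_offsets)

print(offline_prediction_error([0.10, -0.05, 0.20], [0.12, -0.01, 0.15]))  # small error
print(online_safety_violation([0.2, 0.4, 1.8, 0.9]))                        # True: violation
```

The contrast illustrates why a low offline error does not guarantee the absence of online safety violations: the two checks observe different things.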