Abstract: Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated tests with high code coverage could be ineffective, i.
Regression testing is arguably one of the most important activities in software testing. However, its cost-effectiveness and usefulness can be largely impaired by complex system test cases that are poorly designed (e.g., test cases containing multiple test scenarios combined into a single test case) and that require a large amount of time and resources to run. One way to mitigate this issue is decomposing such system test cases into smaller, separate test cases---each of them with only one test scenario and with its corresponding assertions---so that the execution time of the decomposed test cases is lower than the original test cases, while the test effectiveness of the original test cases is preserved. This decomposition can be achieved with program slicing techniques, since test cases are software programs too. However, existing static and dynamic slicing techniques exhibit limitations when (1) the test cases use external resources, (2) code instrumentation is not a viable option, and (3) test execution is expensive. In this paper, we propose a novel approach, called DS3 (Decomposing System teSt caSe), which automatically decomposes a complex system test case into separate test case slices. The idea is to use test case execution logs, obtained from past regression testing sessions, to identify ``hidden'' dependencies in the slices generated by static slicing. Since logs include run-time information about the system under test, we can use them to extract access and usage of global resources and refine the slices generated by static slicing. We evaluated DS3 in terms of slicing effectiveness and compared it with a vanilla static slicing tool. We also compared the slices obtained by DS3 with the corresponding original system test cases, in terms of test efficiency and effectiveness. The evaluation results on one proprietary system and one open-source system show that DS3 is able to accurately identify the dependencies related to the usage of global resources, which vanilla static slicing misses. Moreover, the generated test case slices are, on average, 3.56 times faster than original system test cases and they exhibit no significant loss in terms of fault detection effectiveness.
Mutation testing is a well-established technique for assessing a test suite's effectiveness by injecting artificial faults into production code. In recent years, mutation testing has been extended to machine learning (ML) systems and deep learning (DL) in particular. Researchers have proposed approaches, tools, and statistically sound heuristics to determine whether mutants in DL systems are killed or not. However, as we will argue in this work, questions can be raised to what extent currently used mutation testing techniques in DL are actually in line with the classical interpretation of mutation testing. As we will discuss, in current approaches, the distinction between production and test code is blurry, the realism of mutation operators can be challenged, and generally, the degree to which the hypotheses underlying classical mutation testing (competent programmer hypothesis and coupling effect hypothesis) are followed lacks focus and explicit mappings. In this paper, we observe that ML model development follows a test-driven development (TDD) process, where data points (test data) with labels (implicit assertions) correspond to test cases in traditional software. Based on this perspective, we critically revisit existing mutation operators for ML, the mutation testing paradigm for ML, and its fundamental hypotheses. Based on our observations, we propose several action points for better alignment of mutation testing techniques for ML with paradigms and vocabularies of classical mutation testing.
Mutation testing is well-known for its efficacy in assessing test quality, and starting to be applied in the industry. However, what should a developer do when confronted with a low mutation score? Should the test suite be plainly reinforced to increase the mutation score, or should the production code be improved as well, to make the creation of better tests possible? In this paper, we aim to provide a new perspective to developers that enables them to understand and reason about the mutation score in the light of testability and observability. First, we investigate whether testability and observability metrics are correlated with the mutation score on six open-source Java projects. We observe a correlation between observability metrics and the mutation score, e.g., test directness, which measures the extent to which the production code is tested directly, seems to be an essential factor. Based on our insights from the correlation study, we propose a number of ''mutation score anti-patterns''', enabling software engineers to refactor their existing code or add tests to improve the mutation score. In doing so, we observe that relatively simple refactoring operations enable an improvement or increase in the mutation score.
Serverless architecture is an emerging design style for cloud-based software systems. Testing serverless applications plays an important role in software quality assurance. However, currently, there is no consensus on how to test and debug such systems properly. Moreover, the current lack of mature tooling is a central challenge. We designed and conducted three interviews among two tools vendor leaders in the serverless domain (Epsagon and Thundra) and one expert in the field (Yan Cui), investigating the good and bad practices and several open issues. The current status of testing and debugging in serverless-based applications depicted by the experts helped us to highlight issues and challenges that need to be deeply investigated.
Evolutionary-based crash reproduction techniques aid developers in their debugging practices by generating a test case that reproduces a crash given its stack trace. In these techniques, the search process is typically guided by a single search objective called Crash Distance. Previous studies have shown that current approaches could only reproduce a limited number of crashes due to a lack of diversity in the population during the search. In this study, we address this issue by applying Multi-Objectivization using Helper-Objectives (MO-HO) on crash reproduction. In particular, we add two helper-objectives to the Crash Distance to improve the diversity of the generated test cases and consequently enhance the guidance of the search process. We assessed MO-HO against the single-objective crash reproduction. Our results show that MO-HO can reproduce two additional crashes that were not previously reproducible by the single-objective approach.