Metamorphic testing

Searching for Quality: Genetic Algorithms and Metamorphic Testing for Software Engineering ML

More machine learning (ML) models are being introduced to the field of Software Engineering (SE) and have reached a stage of maturity where they can be considered for real-world use. But the real world is complex, and testing these models often lacks explainability, feasibility, and computational capacity. Existing research has introduced metamorphic testing to gain additional insight and certainty about a model by applying semantics-preserving changes to the input data while observing the model's output. As these changes are currently applied at random places, they can lead to potentially unrealistic data points and high computational costs. With this work, we introduce genetic search as an additional aid for metamorphic testing in SE ML. Using the delta in output as a fitness function, the evolutionary search optimizes the transformations to produce larger deltas with fewer changes. We perform a case study minimizing F1-score and MRR for Code2Vec on a representative sample of java-small with both genetic and random search. Our results show that within the same amount of time, genetic search achieved a 10% decrease in F1, while random search produced a 3% drop.
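To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of genetic search over metamorphic transformations, where the fitness rewards a large drop in a model metric achieved with few edits. All names here (the transformation catalogue, apply_transformations, model_score) are placeholders for whatever operators and model the actual framework uses.

```python
import random

# Hypothetical catalogue of semantics-preserving edits; real operators would
# rewrite the program's AST (rename variables, insert neutral statements, ...).
TRANSFORMATIONS = ["rename_variable", "add_neutral_statement", "reorder_independent_stmts"]


def apply_transformations(snippet: str, genome: list[str]) -> str:
    """Placeholder: apply each transformation in `genome` to the code snippet."""
    # A real implementation would perform AST rewrites; here we only tag the snippet.
    return snippet + "".join(f" /*{t}*/" for t in genome)


def model_score(snippet: str) -> float:
    """Placeholder for the model-quality metric on one snippet (e.g. F1 or MRR)."""
    return random.random()  # stand-in for an actual model call


def fitness(original: str, genome: list[str]) -> float:
    """Delta in model output; shorter transformation chains are preferred."""
    delta = model_score(original) - model_score(apply_transformations(original, genome))
    return delta - 0.01 * len(genome)  # penalise long chains of edits


def genetic_search(original: str, pop_size: int = 20, generations: int = 30) -> list[str]:
    """Evolve a chain of transformations that maximises the output delta."""
    population = [[random.choice(TRANSFORMATIONS)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: fitness(original, g), reverse=True)
        parents = ranked[: pop_size // 2]
        # Mutation: each surviving genome spawns a child with one extra edit.
        children = [p + [random.choice(TRANSFORMATIONS)] for p in parents]
        population = parents + children
    return max(population, key=lambda g: fitness(original, g))
```

Random search, by contrast, would sample transformation chains independently each iteration; the evolutionary variant keeps and extends the chains that already cause the largest metric drop.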

Assessing Robustness of ML-Based Program Analysis Tools using Metamorphic Program Transformations

Metamorphic testing is a well-established testing technique that has been successfully applied in various domains, including testing deep learning models to assess their robustness against data noise or malicious input. Current metamorphic testing approaches for machine learning (ML) models have focused on image processing and object recognition tasks; hence, they cannot be applied to ML models targeting program analysis tasks. In this paper, we extend metamorphic testing approaches to ML models targeting software programs. We present Lampion, a novel testing framework that applies semantics-preserving metamorphic transformations to the test datasets. Lampion produces new code snippets that are equivalent to the original test set but differ in their identifiers or syntactic structure. We evaluate Lampion against CodeBERT, a state-of-the-art ML model for Code-To-Text tasks that creates Javadoc summaries for given Java methods. Our results show that simple transformations significantly impact the target model's behavior, providing additional information on the model's reasoning beyond classic performance metrics.
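As an illustration of the kind of transformation described above, here is a minimal sketch (under simplifying assumptions, not Lampion's actual code) of an identifier rename applied to a Java method. The whole-word regex rename is only safe for simple cases; a real framework would rewrite the AST. The `summarize` call in the comments is a hypothetical stand-in for the Code-To-Text model.

```python
import re


def rename_identifier(java_method: str, old: str, new: str) -> str:
    """Semantics-preserving rename of a local identifier (whole-word matches only)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, java_method)


original = """
public int sum(int[] values) {
    int total = 0;
    for (int v : values) { total += v; }
    return total;
}
"""

variant = rename_identifier(original, "total", "acc")

# A robust Code-To-Text model should produce (near-)identical Javadoc summaries
# for `original` and `variant`; a large divergence flags brittle behaviour.
# summary_a = summarize(original)   # hypothetical model call
# summary_b = summarize(variant)
print(variant)
```

Comparing the model's outputs on the original and transformed snippets is what yields the additional insight into the model's reasoning beyond aggregate performance metrics.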