Test Case Selection for Deep Neural Networks: A Replication Study on LLMs for Code (Replicability Study)

Ali Asgari, Mitchell Olsthoorn, Annibale Panichella

June 2026

Abstract

Recently, test case selection (TCS) techniques have been explored to support the operational evaluation of deep neural networks (DNNs) under limited testing budgets, where labeling cost is a primary concern and uncovering model failures early is a key objective. Although prior studies report promising results, existing empirical evaluations focus almost exclusively on vision-based DNNs and datasets. As observed in recent surveys, models and datasets specifically designed for software engineering tasks have not been considered, leaving it unclear whether prior findings generalize to LLM code models. This paper presents a large-scale replication study of TCS techniques in the context of LLM code models. We re-examine established TCS strategies originally proposed for DNNs and complement them with statistical sampling strategies that have not previously been evaluated for TCS. We assess their effectiveness on three code-related classification tasks: clone detection, vulnerability detection, and technical debt prediction. The study spans 17 task-specific fine-tuned model instances, 7 predictive features, and 13 selection strategies, including 12 feature-aware strategies and simple random sampling (SRS) as a feature-agnostic baseline. We evaluate performance along two dimensions: operational accuracy estimation and early failure discovery. The results indicate that only a subset of findings reported for vision-based DNNs generalize when TCS is applied to LLMs for code. In particular, uncertainty-based features are effective for early failure discovery, while representation-based features are more robust for accuracy estimation. At the same time, performance varies substantially across tasks and models, indicating that the effectiveness of TCS techniques is context-dependent. 
Overall, this study provides empirical evidence on the replicability of TCS techniques beyond vision-based deep learning and offers insights into their use for the operational evaluation of LLMs for code.
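
As a rough illustration of the two evaluation dimensions named in the abstract, the following minimal sketch contrasts simple random sampling (SRS) with one uncertainty-based strategy under a fixed labeling budget. It is not the study's implementation: predictive entropy is assumed here as one possible uncertainty feature, and the function names, the toy prediction data, and the budget of 3 are all hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_srs(n_items, budget, seed=0):
    """Feature-agnostic baseline: sample `budget` indices uniformly, without replacement."""
    return random.Random(seed).sample(range(n_items), budget)


def select_most_uncertain(pred_probs, budget):
    """Feature-aware strategy (assumed): take the `budget` inputs with highest entropy."""
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]),
                    reverse=True)
    return ranked[:budget]


def estimate_accuracy(selected, is_correct):
    """Unweighted share of correct predictions among the labeled selection."""
    return sum(is_correct[i] for i in selected) / len(selected)


def failures_found(selected, is_correct):
    """Number of mispredicted inputs uncovered by the selection."""
    return sum(1 for i in selected if not is_correct[i])


# Hypothetical toy data: softmax outputs of a binary classifier and whether
# each prediction matched the (normally unknown) ground-truth label.
pred_probs = [[0.99, 0.01], [0.55, 0.45], [0.90, 0.10],
              [0.51, 0.49], [0.97, 0.03], [0.60, 0.40]]
is_correct = [True, False, True, False, True, True]

srs = select_srs(len(pred_probs), budget=3)
unc = select_most_uncertain(pred_probs, budget=3)

print("SRS accuracy estimate:", estimate_accuracy(srs, is_correct))
print("Failures found by uncertainty selection:", failures_found(unc, is_correct))
```

On this toy data the uncertainty-based selection surfaces both mispredictions within the budget, but the unweighted accuracy computed on that same selection is skewed toward low values, because the selection deliberately favors inputs the model is unsure about. This is one way to see why a strategy can serve early failure discovery well without being suited to accuracy estimation.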

Publication

The ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2026)