There are many pitfalls that can compromise or even invalidate the scientific findings and conclusions of a controlled experiment that uses mutation testing.
It can be a daunting challenge for experimenters and researchers to be sure they have catered for all of the potential threats to validity that have accrued over four decades of literature recording the development of mutation testing.
Therefore, to address this challenge, we provide a simple seven-step checklist that aims to give experimenters the confidence that they are compliant with best practice reporting of results. Ensuring that all seven steps are met is relatively straightforward, because it simply involves explaining and justifying choices that may affect conclusion validity.
Nevertheless, experimenters who follow these seven steps help other researchers to replicate the study and to properly investigate the influence of such potentially confounding factors, thereby contributing to the overall experimental robustness of their study.
- Mutant Selection: Explain the choice of mutant operators. In particular, experimenters need to justify that the chosen mutant operators are appropriate for the programming language used.
- Mutation Testing Tool: Justify the choice of mutation testing tool. This choice needs to be made carefully because, at the current state of practice, mutation testing tools differ significantly [Kintis:EMSE:2017]. To support the reproducibility and comprehension of the experimental results, researchers should also clearly state the exact version of the mutation testing tool employed. If the tool is not publicly available, researchers should list the exact transformation rules (the mutant instances supported by each operator [Kintis:EMSE:2017]) implemented by the selected mutant operators; a minimal sketch of how operators, rules and tool version can be recorded is given after this checklist. Unfortunately, our survey found that more than a quarter of the empirical studies do not report such details. The objective is to provide readers with the low-level details that may vary from one study to another, so that these can be accounted for in subsequent studies.
- Mutant redundancy: Justify the steps taken to control mutant redundancy. Mutant redundancy may have a large impact on the validity of the assessment. Therefore, it is important to explain how mutant redundancy is handled (perhaps in the threats to validity section). Where possible, experimenters are advised to additionally use techniques such as TCE [7882714] to remove duplicate mutants (when the interest is in the score achieved by a technique), or a dynamic approximation of the disjoint mutation score [PapadakisHHJT16], [7927997] (when the interest is in comparing test techniques); a sketch of this approximation is given after the checklist. Please consult Section 9.4 of the survey [PapadakisKZJTH18] for Algorithm 1, which provides the approximation of the disjoint mutants. If these techniques are too expensive to apply throughout, researchers are advised to state this and to contrast their findings on a (small) sample of cases where mutant redundancy is controlled.
- Test suite choice and size: Explain the choice of test suite and any steps taken to account for the effects of test suite size, where appropriate. Ideally, an experimenter would like to have large, diverse (i.e., mutants are killed by multiple test cases) and high-strength (i.e., killing the majority of the mutants) test suites. As such test suites are rare in most open-source projects, researchers are advised to demonstrate and contrast their findings on a (small) sample of subjects with strong and diverse test suites (perhaps in addition to the chosen subjects). Alternatively, experimenters may consider using automated tools to augment their test suites. Overall, the objective is to allow other researchers to create a similar test suite and/or to experiment with different choices of suite and to measure the effects of such choices; one simple way to control for suite size is sketched after this checklist.
- Clean Program Assumption: Explain whether the study relies on the Clean Program Assumption (CPA).
Ideally, where possible, the CPA should not be relied upon; testing should be applied to the faulty programs (instead of the clean, non-faulty ones). If this is not possible (potentially due to execution cost or lack of resources), researchers are advised to note the reliance on the CPA. Its effects may be small in some cases, justifying reliance on this assumption.
Either way, explicitly stating whether or not it is relied upon will aid clarity and facilitate subsequent studies.
- Multiple experimental repetitions: Clarify the number of experimental repetitions. Ideally, when techniques make stochastic choices, they should be assessed over multiple experimental repetitions [Harman2012], [arcuri_STVR_11]. In practice, this might not be possible due to the required execution time or other constraints. In this case, researchers have to choose between experiments with many subjects but few repetitions and experiments with few subjects but many repetitions; research suggests that the second option is preferable [DelamaroO14]. Of course, this choice needs to be clarified according to the specific context and goals of the study; a sketch of how repeated runs can be compared statistically is given after the checklist.
- Presentation of the results: Clarify the granularity level of the empirical results. Many empirical studies compute mutation scores over the whole subject projects they use (one score per project). Since this practice may not generalise to other granularity levels, such as the unit level [LaurentVPHT16] (for example, two methods can kill a similar overall number of mutants on a project, but quite different numbers of mutants on the project's individual units), researchers should report and explain the suitability of the chosen granularity level for the given application context; a sketch that computes scores at both the project and the unit level is given after the checklist.
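The sketches below illustrate several of the checklist items. First, for the mutant-selection and tool items, the following minimal Python sketch shows one possible way to record the selected mutant operators, their exact transformation rules and the tool version alongside the study artefacts. The operator names and rules shown are illustrative examples only and are not tied to any particular mutation testing tool.

```python
# Illustrative (hypothetical) record of the mutant operators used in a study,
# together with the exact transformation rules each operator applies and the
# exact tool version. The names and rules below are examples, not those of
# any specific mutation testing tool.
MUTATION_TOOL = {"name": "<tool name>", "version": "<exact release or commit hash>"}

MUTANT_OPERATORS = {
    "AOR": {  # Arithmetic Operator Replacement
        "description": "replace one arithmetic operator with another",
        "rules": ["a + b -> a - b", "a * b -> a / b", "a % b -> a * b"],
    },
    "ROR": {  # Relational Operator Replacement
        "description": "replace one relational operator with another",
        "rules": ["a < b -> a <= b", "a == b -> a != b"],
    },
    "SDL": {  # Statement Deletion
        "description": "delete a single statement",
        "rules": ["stmt; -> ;"],
    },
}
```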
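For the mutant-redundancy item, the following sketch outlines one common dynamic approximation of the disjoint (subsuming) mutants, computed from a kill matrix that records which tests kill which mutants. It is a minimal illustration of the idea behind Algorithm 1 of the survey [PapadakisKZJTH18], not a faithful reimplementation, and the kill matrix in the example is made up.

```python
def disjoint_mutants(kill_matrix):
    """Approximate the disjoint (subsuming) mutants from a kill matrix.

    kill_matrix maps each mutant id to the set of test ids that kill it.
    Returns one representative mutant per disjoint group.
    """
    # Live mutants carry no dynamic subsumption information, so drop them.
    killed = {m: frozenset(ts) for m, ts in kill_matrix.items() if ts}

    # Mutants with identical kill sets are mutually redundant: keep one each.
    representative = {}
    for m, ts in killed.items():
        representative.setdefault(ts, m)

    # A mutant is subsumed if another killed mutant has a strictly smaller
    # kill set; the disjoint mutants are those whose kill sets are minimal.
    kill_sets = list(representative)
    return [
        representative[ks]
        for ks in kill_sets
        if not any(other < ks for other in kill_sets)
    ]


# Example: m1 and m2 are killed by the same tests, m3 subsumes both, m4 is live.
matrix = {"m1": {"t1", "t2"}, "m2": {"t1", "t2"}, "m3": {"t1"}, "m4": set()}
print(disjoint_mutants(matrix))  # -> ['m3']
```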
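For the test-suite item, a simple way to separate the effect of a technique from the effect of suite size is to compare techniques on randomly sampled suites of a fixed size. The sketch below assumes the same kill-matrix representation as above; the function names are illustrative.

```python
import random


def mutation_score(selected_tests, kill_matrix):
    """Fraction of mutants killed by at least one of the selected tests."""
    selected = set(selected_tests)
    killed = sum(1 for tests in kill_matrix.values() if tests & selected)
    return killed / len(kill_matrix)


def mean_score_at_size(all_tests, kill_matrix, size, repetitions=30, seed=0):
    """Average mutation score over randomly sampled suites of a fixed size,
    so that comparisons are not confounded by differences in suite size."""
    rng = random.Random(seed)
    scores = [
        mutation_score(rng.sample(all_tests, size), kill_matrix)
        for _ in range(repetitions)
    ]
    return sum(scores) / len(scores)
```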
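For the repetitions item, stochastic techniques are typically compared by running each one several times and applying a non-parametric test and an effect-size measure to the per-run mutation scores, in the spirit of the guidelines in [Harman2012], [arcuri_STVR_11]. The sketch below uses SciPy's Mann-Whitney U test together with a hand-rolled Vargha-Delaney A12; the score values are invented for illustration.

```python
from scipy.stats import mannwhitneyu


def vargha_delaney_a12(scores_a, scores_b):
    """Probability that a run of technique A scores higher than a run of B
    (0.5 indicates no difference between the two techniques)."""
    greater = sum(1 for a in scores_a for b in scores_b if a > b)
    equal = sum(1 for a in scores_a for b in scores_b if a == b)
    return (greater + 0.5 * equal) / (len(scores_a) * len(scores_b))


# One mutation score per repetition of each stochastic technique (illustrative).
scores_a = [0.71, 0.74, 0.69, 0.73, 0.75]
scores_b = [0.66, 0.70, 0.68, 0.67, 0.69]

statistic, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(p_value, vargha_delaney_a12(scores_a, scores_b))
```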
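Finally, for the presentation item, the sketch below computes the mutation score both for the whole project and per unit from the same data, making it straightforward to report both granularity levels. The data layout and unit identifiers are assumptions made for the sake of the example.

```python
from collections import defaultdict


def scores_by_granularity(mutants):
    """mutants: iterable of (unit, killed) pairs, where `unit` names the
    class or method a mutant belongs to and `killed` is a boolean.
    Returns the project-level mutation score and one score per unit."""
    per_unit = defaultdict(lambda: [0, 0])  # unit -> [killed, total]
    for unit, killed in mutants:
        per_unit[unit][0] += int(killed)
        per_unit[unit][1] += 1
    total_killed = sum(k for k, _ in per_unit.values())
    total = sum(t for _, t in per_unit.values())
    unit_scores = {u: k / t for u, (k, t) in per_unit.items()}
    return total_killed / total, unit_scores


# A single project-level score can hide large differences between units.
project_score, unit_scores = scores_by_granularity(
    [("Foo.bar", True), ("Foo.bar", True), ("Baz.qux", False), ("Baz.qux", True)]
)
print(project_score, unit_scores)  # 0.75 {'Foo.bar': 1.0, 'Baz.qux': 0.5}
```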