Prudent Practices for Designing Malware Experiments: Status Quo and Outlook
Rossow et al, IEEE S&P 2012
Let's begin by defining some terminology. Malware is malicious software that deliberately causes harm to people, systems, and networks. Malware can be grouped into families of related samples based on their behaviour. Malware-execution-driven experiments, also known as dynamic analysis, analyse malware by executing samples in a controlled environment and observing their behaviour.
Rossow et al. [1] propose a set of guidelines for malware-execution-driven experiments, covering both the construction of datasets and the design of the experiments. They do so because they identified a number of pitfalls in the research community that make it difficult to replicate experiments and generalise from results. For example, the presence of environment artefacts (such as username strings and IP addresses) within a malware dataset can degrade performance.
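To make the artefact example concrete, here is a minimal sketch of how environment-specific values might be scrubbed from behaviour traces before they enter a dataset. This is my own illustration and not a method from the paper: the trace format, the `scrub_artefacts` helper, the username `analyst01`, and the placeholder tokens are all hypothetical.

```python
import re

# Assumption: artefacts of interest are IPv4 addresses and the sandbox username.
# A real analysis environment would need its own, longer list of patterns.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_artefacts(line: str, username: str) -> str:
    """Replace environment-specific values with fixed placeholders."""
    line = IPV4.sub("<IP>", line)
    # Case-insensitive match, since paths may be reported in mixed case.
    line = re.sub(re.escape(username), "<USER>", line, flags=re.IGNORECASE)
    return line


# Hypothetical trace lines, as a sandbox might record them.
trace = [
    r"CreateFile C:\Users\analyst01\AppData\Local\Temp\a.tmp",
    "connect 192.168.56.101:445",
]
scrubbed = [scrub_artefacts(entry, "analyst01") for entry in trace]
for entry in scrubbed:
    print(entry)
```

Replacing values with fixed tokens, rather than deleting them, keeps the structure of each event intact while removing the part that identifies the analysis environment.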
These guidelines fall under four categories: correctness, transparency, realism, and safety.
Correctness is concerned with the construction of malware datasets that limit the introduction of biases. This is achieved by ensuring relevant sampling, separation and membership of the dataset, while also putting controls within the analysis environment to protect data sensors and mitigate the influence of system artefacts upon observed malware behaviour.
Transparency aims to provide more description and detail to the malware analysis process by encouraging authors to identify malware samples used, provide system/network configuration information, and provide interpretation and reasoning for observed results.
Realism seeks to edge authors towards designing experiments that are reflective of how malware behaves ‘in the wild’ to allow generalisation of findings.
Finally, safety promotes mitigation from harm by highlighting the need for both implementation and discussion of containment policies within a malware experiment.
To assess the applicability of these guidelines, the authors surveyed 36 papers (40% from top-tier venues) and interpreted the results in three stages.
They first performed a per-guideline analysis, in which they investigated the extent to which each guideline was met, and found numerous violations across all the categories. For example, in the safety category most papers did not deploy or adequately describe their containment policy.
The per-paper analysis examined how many papers could have been significantly improved by applying these guidelines, observing a correlation between the number of violated criteria and the number of applicable criteria. This demonstrated that these guidelines become increasingly important when designing more complex experiments.
Finally, a top-venue analysis compared papers appearing in top-tier venues with those appearing in other venues, finding that papers in top-tier venues tended to include real-world scenarios (although these are potentially based on biased datasets) and to interpret false positives. However, violations generally remained comparable across these groups, suggesting that all papers in the community could equally benefit from the suggested guidelines.
In conclusion, this paper identifies a number of pitfalls within the malware-execution research community that affect the scientific method. This situation could be improved by putting more effort into the presentation (transparency) of research methodology and the interpretation of results. The guidelines of correctness and realism are, however, difficult to control, as it is not always obvious that certain practices lead to incorrect datasets or unrealistic scenarios. That being said, the guidelines presented here do help to establish a common set of criteria to ensure future prudent experimentation with malware datasets.
The questions that remain are ‘Did this lead to a change in the state-of-the-art within the malware research community?’ and ‘How did people make use of these guidelines?’. I will defer discussion of these to a later post.
This review is based on a presentation I gave at a reading group - linked is a copy of the slide deck if you would like to find out more.
[1] Rossow et al. paper link: https://ieeexplore.ieee.org/abstract/document/6234405