Software Bertillonage

TitleSoftware Bertillonage
Publication TypeJournal Article
Year of Publication2012
AuthorsDavies, J, German, D, Godfrey, M, Hindle, A
Secondary TitleEmpirical Software Engineering
PublisherSpringer Netherlands
ISSN Number1382-3256

Deployed software systems are typically composed of many pieces, not all of which may have been created by the main development team. Often, the provenance of included components—such as external libraries or cloned source code—is not clearly stated, and this uncertainty can introduce technical and ethical concerns that make it difficult for system owners and other stakeholders to manage their software assets. In this work, we motivate the need for the recovery of the provenance of software entities by a broad set of techniques that could include signature matching, source code fact extraction, software clone detection, call flow graph matching, string matching, historical analyses, and other techniques. We liken our provenance goals to that of Bertillonage, a simple and approximate forensic analysis technique based on bio-metrics that was developed in 19th century France before the advent of fingerprints. As an example, we have developed a fast, simple, and approximate technique called anchored signature matching for identifying the source origin of binary libraries within a given Java application. This technique involves a type of structured signature matching performed against a database of candidates drawn from the Maven2 repository, a 275 GB collection of open source Java libraries. To show the approach is both valid and effective, we conducted an empirical study on 945 jars from the Debian GNU/Linux distribution, as well as an industrial case study on 81 jars from an e-commerce application.



Full Text