Mining source code to automatically split identifiers for software analysis

Submitted by msquire on Wed, 2011-04-13 08:36

Title	Mining source code to automatically split identifiers for software analysis
Publication Type	Conference Paper
Year of Publication	2009
Authors	Enslen, E, Hill, E, Pollock, L, Vijay-Shanker, K
Secondary Title	2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
Pagination	71 - 80
Publisher	IEEE
Place Published	Vancouver, BC, Canada
ISBN Number	978-1-4244-3493-0
Keywords	java, samurai, sourceforge
Abstract	Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where spaces and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state-of-the-art techniques.
DOI	10.1109/MSR.2009.5069482
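
Illustration	The abstract describes two splitting steps: splitting on naming conventions such as camel case, and splitting concatenated words by word frequency. The Python sketch below illustrates both ideas in a simplified form. It is not the Samurai algorithm or its scoring function: the frequency table WORD_FREQ and its counts are hypothetical (the paper mines frequencies from source code), and the scoring used here is a plain unigram log-probability segmentation chosen for brevity.

```python
import re
from math import log

# Hypothetical word frequencies (assumption); a real table would be
# mined from a large body of source code.
WORD_FREQ = {"get": 900, "file": 700, "name": 800, "user": 600,
             "list": 500, "str": 300, "len": 250}
TOTAL = sum(WORD_FREQ.values())


def split_on_conventions(identifier):
    """Split on non-alphabetic characters and camel-case boundaries."""
    parts = []
    for chunk in re.split(r"[^A-Za-z]+", identifier):
        # An uppercase run not followed by a lowercase letter (e.g. "HTML"),
        # or an optional capital followed by lowercase letters (e.g. "Parser").
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return parts


def split_same_case(token):
    """Return the most probable segmentation of token into known words.

    Dynamic programming over prefixes; each word scores log(freq / TOTAL).
    If no segmentation consists only of known words, the token is
    returned unsplit.
    """
    n = len(token)
    best = [None] * (n + 1)  # best[i] = (score, words) for token[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = token[start:end]
            freq = WORD_FREQ.get(word.lower())
            if best[start] is None or freq is None:
                continue
            candidate = (best[start][0] + log(freq / TOTAL),
                         best[start][1] + [word])
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[n][1] if best[n] else [token]


def split_identifier(identifier):
    """Apply convention-based splitting, then frequency-based splitting."""
    words = []
    for token in split_on_conventions(identifier):
        words.extend(split_same_case(token))
    return words


print(split_identifier("getFileName"))    # ['get', 'File', 'Name']
print(split_identifier("userlist"))       # ['user', 'list']
print(split_identifier("get_filename2"))  # ['get', 'file', 'name']
```

Because each word's log-probability is negative, this scoring prefers segmentations with fewer, more frequent words; tokens with no known decomposition, such as "HTML" in "HTMLParser", are left intact.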