Title: Language entropy: A metric for characterization of author programming language distribution
Publication Type: Conference Paper
Year of Publication: 2009
Authors: Krein, JL, MacLean, AC, Delorey, DP, Knutson, CD, Eggett, DL
Secondary Title: 4th Workshop on Public Data about Software Development (WoPDaSD 2009)
Date Published: 2009
Keywords: contributions, developers, language entropy, lines of code, loc, multiple languages, programming languages, sourceforge

Programmers are often required to develop in multiple languages. In an effort to study the effects of programming language fragmentation on productivity (and ultimately on a programmer’s problem-solving abilities), we propose a metric, language entropy, for characterizing the distribution of an individual’s development efforts across multiple programming languages. To evaluate this metric, we present an observational study examining all project contributions (through August 2006) of a random sample of 500 SourceForge developers. Using a random coefficients model, we found a statistically significant correlation (alpha level of 0.05) between language entropy and the size of monthly project contributions (measured in lines of code added). Our results indicate that language entropy is a good candidate for characterizing author programming language distribution.
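
The abstract names the metric but does not give its formula. As a minimal sketch, assuming language entropy is the Shannon entropy (in bits) of the share of lines of code a developer adds in each language, it could be computed as follows. The function name `language_entropy` and the dictionary input format are illustrative assumptions, not taken from the paper.

```python
import math


def language_entropy(loc_by_language):
    """Shannon entropy (in bits) of a developer's lines of code across languages.

    Assumption: the metric is Shannon entropy over per-language LOC shares.
    `loc_by_language` is a hypothetical input format mapping a language name
    to the lines of code added in that language.
    """
    total = sum(loc_by_language.values())
    if total <= 0:
        return 0.0  # no contributions, so no spread to measure

    entropy = 0.0
    for loc in loc_by_language.values():
        if loc > 0:  # languages with zero lines contribute nothing
            share = loc / total
            entropy -= share * math.log2(share)
    return entropy


single = language_entropy({"C": 1200})
even_split = language_entropy({"C": 500, "Java": 500})
```

Under this assumption, a developer working in a single language scores 0, and the score grows as effort is spread more evenly over more languages, reaching log2(n) for an even split across n languages.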


"The data set used in this study was previously collected for a separate but related work. It was originally extracted from the SourceForge Research Archive (SFRA), August 2006. For a detailed discussion of the data source, collection tools and processes, and summary statistics, see [6]."

"From the initial data set we extracted a random sample of 500 developers along with descriptive details of all revisions that those developers made since the inception of the projects on which they worked. We then condensed this sample by totaling the lines of code added by each developer for each month in which that developer made at least one code submission."
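
The condensing step described in the excerpt can be sketched as a simple group-and-sum. This is an illustration only: the record layout (developer, ISO date string, lines added) and the function name `monthly_loc_added` are hypothetical, and the authors' actual tooling is described in their reference [6], not here.

```python
from collections import defaultdict


def monthly_loc_added(revisions):
    """Total lines of code added per developer per calendar month.

    Assumption: each revision is a (developer, "YYYY-MM-DD", lines_added)
    tuple. This layout is hypothetical, chosen for illustration.
    """
    totals = defaultdict(int)
    for developer, date, lines_added in revisions:
        month = date[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[(developer, month)] += lines_added
    return dict(totals)


# Made-up sample records, not data from the study.
sample = [
    ("dev_a", "2005-03-02", 40),
    ("dev_a", "2005-03-19", 10),
    ("dev_a", "2005-05-01", 7),
    ("dev_b", "2005-03-11", 25),
]
condensed = monthly_loc_added(sample)
```

Months in which a developer made no submission never appear as keys, which matches the excerpt's restriction to months with at least one code submission.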
