%0 Conference Proceedings %B 10th Working Conference on Mining Software Repositories %D 2013 %T Apache Commits: Social Network Dataset %A MacLean, Alexander C. %A Knutson, Charles D. %X Building non-trivial software is a social endeavor. Therefore, understanding the social network of developers is key to the study of software development organizations. We present a graph representation of the commit behavior of developers within the Apache Software Foundation for 2010 and 2011. Relationships between developers in the network represent collaborative commit behavior. Several similarity and summary metrics have been pre-calculated. The data, along with the tools that were used to create it and some further discussion, can be found at: http://sequoia.cs.byu.edu/lab/?page=artifacts/apacheGraphs %B 10th Working Conference on Mining Software Repositories %8 05/2013 %0 Conference Proceedings %B Open Source Systems: Grounding Research (OSS 2011) %D 2011 %T An Analysis of Author Contribution Patterns in Eclipse Foundation Project Source Code %A Taylor, Quinn C. %A Krein, Jonathan L. %A MacLean, Alexander C. %A Knutson, Charles D. %K COLLABORATION %K contribution %K eclipse %K entropy %K java %X Collaborative development is a key tenet of open source software, but if not properly understood and managed, it can become a liability. We examine author contribution data for the newest revision of 251,633 Java source files in 592 Eclipse projects. We use this observational data to analyze collaboration patterns within files, and to explore relationships between file size, author count, and code authorship. We calculate author entropy to characterize the contributions of multiple authors to a given file, with an eye toward understanding the degree of collaboration and the most common interaction patterns. 
%B Open Source Systems: Grounding Research (OSS 2011) %I Springer %P 269-281 %8 10/2011 %0 Conference Proceedings %B Open Source Systems: Grounding Research (OSS 2011) %D 2011 %T Cliff Walls: An Analysis of Monolithic Commits Using Latent Dirichlet Allocation %A Pratt, Landon J. %A MacLean, Alexander C. %A Knutson, Charles D. %A Ringger, Eric K. %K artifacts %K commit %K cvs %K LDA %K lines of code %K log files %K scm %K sloc %K sourceforge %K version control %X Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. However, there are still many challenges that can affect the quality of artifact-based studies, especially those studies examining software evolution. Large commits, which we refer to as “Cliff Walls,” are one significant threat to studies of software evolution because they do not appear to represent incremental development. We used Latent Dirichlet Allocation to extract topics from over 2 million commit log messages, taken from 10,000 SourceForge projects. The topics generated through this method were then analyzed to determine the causes of over 9,000 of the largest commits. We found that branch merges, code imports, and auto-generated documentation were significant causes of large commits. We also found that corrective maintenance tasks, such as bug fixes, did not play a significant role in the creation of large commits. %B Open Source Systems: Grounding Research (OSS 2011) %I Springer %P 282-298 %8 10/2011 %0 Conference Proceedings %B Open Source Systems: Grounding Research (OSS 2011) %D 2011 %T Knowledge Homogeneity and Specialization in the Apache HTTP Server Project %A MacLean, Alexander C. %A Pratt, Landon J. %A Knutson, Charles D. %A Ringger, Eric K. 
%K apache %K commits %K developer %K email %K email archive %K LDA %K mailing list %K revision control %K revision history %K scm %K social network analysis %K specialization %K subversion %K svn %X We present an analysis of developer communication in the Apache HTTP Server project. Using topic modeling techniques we expose latent conceptual sub-communities arising from developer specialization within the greater developer population. However, we found that among the major contributors to the project, very little specialization exists. We present theories to explain this phenomenon, and suggest further research. %B Open Source Systems: Grounding Research (OSS 2011) %I Springer %P 106-122 %8 10/2011 %U http://sequoia.cs.byu.edu/lab/files/pubs/MacLean2011a.pdf %> https://flosshub.org/sites/flosshub.org/files/MacLean2011a.pdf %0 Journal Article %J International Journal of Open Source Software and Processes %D 2010 %T Impact of Programming Language Fragmentation on Developer Productivity %A Krein, Jonathan L. %A MacLean, Alexander C. %A Knutson, Charles D. %A Delorey, Daniel P. %A Eggett, Dennis L. %K commits %K entropy %K language entropy %K programming languages %K sourceforge %K srda %X Programmers often develop software in multiple languages. In an effort to study the effects of programming language fragmentation on productivity—and ultimately on a developer’s problem-solving abilities—the authors present a metric, language entropy, for characterizing the distribution of a developer’s programming efforts across multiple programming languages. This paper presents an observational study examining the project contributions of a random sample of 500 SourceForge developers. Using a random coefficients model, the authors find a statistically (alpha level of 0.001) and practically significant correlation between language entropy and the size of monthly project contributions. 
Results indicate that programming language fragmentation is negatively related to the total amount of code contributed by developers within SourceForge, an open source software (OSS) community. %B International Journal of Open Source Software and Processes %V 2 %P 41-61 %8 32/2010 %N 2 %R 10.4018/jossp.2010040104 %0 Conference Paper %B 5th Workshop on Public Data about Software Development (WoPDaSD 2010) %D 2010 %T Trends That Affect Temporal Analysis Using SourceForge Data %A MacLean, Alexander C. %A Pratt, Landon J. %A Krein, Jonathan L. %A Knutson, Charles D. %K cliff walls %K committers %K cvs %K evolution %K growth %K source code %K sourceforge %K time %K time series %X SourceForge is a valuable source of software artifact data for researchers who study project evolution and developer behavior. However, the data exhibit patterns that may bias temporal analyses. Most notable are cliff walls in project source code repository timelines, which indicate large commits that are out of character for the given project. These cliff walls often hide significant periods of development and developer collaboration—a threat to studies that rely on SourceForge repository data. We demonstrate how to identify these cliff walls, discuss reasons for their appearance, and propose preliminary measures for mitigating their effects in evolution-oriented studies. %B 5th Workshop on Public Data about Software Development (WoPDaSD 2010) %> https://flosshub.org/sites/flosshub.org/files/wopdasd001.pdf %0 Conference Paper %B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR) %D 2009 %T Author entropy vs. file size in the GNOME suite of applications %A Casebolt, Jason R. %A Krein, Jonathan L. %A MacLean, Alexander C. %A Knutson, Charles D. %A Delorey, Daniel P. 
%K author entropy %K contributions %K gnome %K msr challenge %X We present the results of a study in which author entropy was used to characterize author contributions per file. Our analysis reveals three patterns: banding in the data, uneven distribution of data across bands, and file size dependent distributions within bands. Our results suggest that when two authors contribute to a file, large files are more likely to have a dominant author than smaller files. %B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR) %I IEEE %C Vancouver, BC, Canada %P 91-94 %@ 978-1-4244-3493-0 %R 10.1109/MSR.2009.5069484 %0 Conference Paper %B 4th Workshop on Public Data about Software Development (WoPDaSD 2009) %D 2009 %T Language entropy: A metric for characterization of author programming language distribution %A Krein, Jonathan L. %A MacLean, Alexander C. %A Delorey, Daniel P. %A Knutson, Charles D. %A Eggett, Dennis L. %K contributions %K developers %K language entropy %K lines of code %K loc %K multiple languages %K programming languages %K sourceforge %X Programmers are often required to develop in multiple languages. In an effort to study the effects of programming language fragmentation on productivity—and ultimately on a programmer’s problem solving abilities—we propose a metric, language entropy, for characterizing the distribution of an individual’s development efforts across multiple programming languages. To evaluate this metric, we present an observational study examining all project contributions (through August 2006) of a random sample of 500 SourceForge developers. Using a random coefficients model, we found a statistically significant correlation (alpha level of 0.05) between language entropy and the size of monthly project contributions (measured in lines of code added). 
Our results indicate that language entropy is a good candidate for characterizing author programming language distribution. %B 4th Workshop on Public Data about Software Development (WoPDaSD 2009) %8 2009 %> https://flosshub.org/sites/flosshub.org/files/LanguageEntropy-JonathanKrein.pdf %0 Conference Proceedings %B 21st Annual Psychology of Programming Interest Group Conference %D 2009 %T Mining Programming Language Vocabularies from Source Code %A Delorey, Daniel P. %A Knutson, Charles D. %A Davies, Mark %X We can learn much from the artifacts produced as the by-products of software development and stored in software repositories. Of all such potential data sources, one of the most important from the perspective of program comprehension is the source code itself. While other data sources give insight into what developers intend a program to do, the source code is the most accurate human-accessible description of what it will do. However, the ability of an individual developer to comprehend a particular source file depends directly on his or her familiarity with the specific features of the programming language being used in the file. This is not unlike the difficulties second-language learners may encounter when attempting to read a text written in a new language. We propose that by applying the techniques used by corpus linguists in the study of natural language texts to a corpus of programming language texts (i.e., source code repositories), we can gain new insights into the communication medium that is programming language. In this paper we lay the foundation for applying corpus linguistic methods to programming language by 1) defining the term “word” for programming language, 2) developing data collection tools and a data storage schema for the Java programming language, and 3) presenting an initial analysis of an example linguistic corpus based on version 1.5 of the Java Developers Kit. 
%B 21st Annual Psychology of Programming Interest Group Conference %P 12 pp %> https://flosshub.org/sites/flosshub.org/files/21st-delorey.pdf %0 Conference Paper %B 3rd Workshop on Public Data about Software Development (WoPDaSD 2008) %D 2008 %T Author Entropy: A Metric for Characterization of Software Authorship Patterns %A Taylor, Quinn C. %A Stevenson, James E. %A Delorey, Daniel P. %A Knutson, Charles D. %K developers %K entropy %K flossmole %K sourceforge %X We propose the concept of author entropy and describe how file-level entropy measures may be used to understand and characterize authorship patterns within individual files, as well as across an entire project. As a proof of concept, we compute author entropy for 28,955 files from 33 open-source projects. We explore patterns of author entropy, identify techniques for visualizing author entropy, and propose avenues for further study. %B 3rd Workshop on Public Data about Software Development (WoPDaSD 2008) %P 42-47 %8 2008 %> https://flosshub.org/sites/flosshub.org/files/entropy2008.pdf %0 Conference Paper %B 2nd Workshop on Public Data about Software Development (WoPDaSD 2007) %D 2007 %T Programming Language Trends in Open Source Development: An Evaluation Using Data from All Production Phase SourceForge Projects %A Delorey, Daniel P. %A Knutson, Charles D. %A Giraud-Carrier, C. %K cvs %K cvs2mysql %K programming languages %K sfra %K sourceforge %K srda %X In this work, we analyze data collected from the CVS repositories of 9,997 Open Source projects hosted on SourceForge in an effort to understand trends in programming language usage in the Open Source community between 2000 and 2005. The trends we consider include: 1) the relative popularity of the ten most popular programming languages over time, 2) the use of multiple programming languages by individual programmers and by individual projects, and 3) the programming languages most often used in combination. 
%B 2nd Workshop on Public Data about Software Development (WoPDaSD 2007) %> https://flosshub.org/sites/flosshub.org/files/Delorey2007b.pdf %0 Conference Paper %B 2nd Workshop on Public Data about Software Development (WoPDaSD 2007) %D 2007 %T Studying Production Phase SourceForge Projects: An Exploratory Analysis Using cvs2mysql and SFRA %A Delorey, Daniel P. %A Knutson, Charles D. %A MacLean, Alexander C. %K Data Collection %K forge %K repositories %K sourceforge %X A wealth of data can be extracted from the natural by-products of software development processes and used in empirical studies of software engineering. However, the size and accuracy of such studies depend in large part on the availability of tools that facilitate the collection of data from individual projects and the combination of data from multiple projects. To demonstrate this point, we present our experience gathering and analyzing data from nearly 10,000 open source projects hosted on SourceForge. We describe the tools we developed to collect the data and the ways in which these tools and data may be used by other researchers. We also provide examples of statistics that we have calculated from these data to describe interesting author- and project-level behaviors of the SourceForge community. %B 2nd Workshop on Public Data about Software Development (WoPDaSD 2007) %8 2007 %> https://flosshub.org/sites/flosshub.org/files/Delorey2007c.pdf