%0 Conference Paper
%B Proceedings of the 34th IEEE/ACM International Conference On Software Engineering (ICSE 2012)
%D 2012
%T Content classification of developer emails
%A Bacchelli, Alberto
%A Dal Sasso, Tommaso
%A D'Ambros, Marco
%A Lanza, Michele
%K email
%K Emails
%K Empirical software engineering
%K mailing list
%K natural language
%K Unstructured Data Mining
%X Emails related to the development of a software system contain information about design choices and issues encountered during the development process. Exploiting the knowledge embedded in emails with automatic tools is challenging, due to the unstructured, noisy and mixed language nature of this communication medium. Natural language text is often not well-formed and is interleaved with languages with other syntaxes, such as code or stack traces. We present an approach to classify email content at line level. Our technique classifies email lines in five categories (i.e., text, junk, code, patch, and stack trace) to allow one to subsequently apply ad hoc analysis techniques for each category. We evaluated our approach on a statistically significant set of emails gathered from mailing lists of four unrelated open source systems.
%8 06/2012
%U http://www.inf.usi.ch/phd/bacchelli/publications.php
%> https://flosshub.org/sites/flosshub.org/files/icse2012.pdf

%0 Conference Paper
%B Proceedings of ICPC 2010 (18th IEEE International Conference on Program Comprehension)
%D 2010
%T Extracting source code from e-mails
%A Bacchelli, Alberto
%A D'Ambros, Marco
%A Lanza, Michele
%K argouml
%K email
%K freenet
%K jmeter
%K mailing lists
%K mina
%K natural language
%K openjpa
%K source code
%X E-mails, used by developers and system users to communicate over a broad range of topics, offer a valuable source of information.
If archived, e-mails can be mined to support program comprehension activities and to provide views of a software system that are alternative and complementary to those offered by the source code. However, e-mails are written in natural language, and therefore contain noise that makes it difficult to retrieve the important data. Thus, before conducting an effective system analysis and extracting data for program comprehension, it is necessary to select the relevant messages, and to expose only the meaningful information. In this work we focus both on classifying e-mails that hold fragments of the source code of a system, and on extracting the source code pieces inside the e-mail. We devised and analyzed a number of lightweight techniques to accomplish these tasks. To assess the validity of our techniques, we manually inspected and annotated a statistically significant number of e-mails from five unrelated open source software systems written in Java. With such a benchmark in place, we measured the effectiveness of each technique in terms of precision and recall.
%P 24-33
%U http://www.inf.usi.ch/phd/bacchelli/publications.php
%> https://flosshub.org/sites/flosshub.org/files/icpc2010.pdf

%0 Conference Paper
%B Proceedings of the 2008 international working conference on Mining software repositories
%D 2008
%T AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools
%A Hill, Emily
%A Fry, Zachary P.
%A Boyd, Haley
%A Sridhara, Giriprasad
%A Novikova, Yana
%A Pollock, Lori
%A Vijay-Shanker, K.
%K automatic abbreviation expansion
%K azureus
%K itext.net
%K liferay
%K maintenance
%K natural language
%K openoffice.org
%K program comprehension
%K source code
%K tiger envelopes
%K tools
%X When writing software, developers often employ abbreviations in identifier names.
In fact, some abbreviations may never occur with the expanded word, or occur more often in the code. However, most existing program comprehension and search tools do little to address the problem of abbreviations, and therefore may miss meaningful pieces of code or relationships between software artifacts. In this paper, we present an automated approach to mining abbreviation expansions from source code to enhance software maintenance tools that utilize natural language information. Our scoped approach uses contextual information at the method, program, and general software level to automatically select the most appropriate expansion for a given abbreviation. We evaluated our approach on a set of 250 potential abbreviations and found that our scoped approach provides a 57% improvement in accuracy over the current state of the art.
%S MSR '08
%I ACM
%C New York, NY, USA
%P 79–88
%8 05/2008
%@ 978-1-60558-024-1
%U http://doi.acm.org/10.1145/1370750.1370771
%R 10.1145/1370750.1370771
%> https://flosshub.org/sites/flosshub.org/files/p79-hill.pdf

%0 Conference Paper
%B Proceedings of the 30th international conference on Software engineering
%D 2008
%T An approach to detecting duplicate bug reports using natural language and execution information
%A Wang, Xiaoyin
%A Zhang, Lu
%A Xie, Tao
%A Anvik, John
%A Sun, Jiasu
%K bug report
%K duplicate bug report
%K execution information
%K information retrieval
%K natural language
%X An open source project typically maintains an open bug repository so that bug reports from all over the world can be gathered. When a new bug report is submitted to the repository, a person, called a triager, examines whether it is a duplicate of an existing bug report. If it is, the triager marks it as DUPLICATE and the bug report is removed from consideration for further work.
In the literature, there are approaches exploiting only natural language information to detect duplicate bug reports. In this paper we present a new approach that further involves execution information. In our approach, when a new bug report arrives, its natural language information and execution information are compared with those of the existing bug reports. Then, a small number of existing bug reports are suggested to the triager as the most similar bug reports to the new bug report. Finally, the triager examines the suggested bug reports to determine whether the new bug report duplicates an existing bug report. We calibrated our approach on a subset of the Eclipse bug repository and evaluated our approach on a subset of the Firefox bug repository. The experimental results show that our approach can detect 67%-93% of duplicate bug reports in the Firefox bug repository, compared to 43%-72% using natural language information alone.
%S ICSE '08
%I ACM
%C New York, NY, USA
%P 461–470
%@ 978-1-60558-079-1
%U http://doi.acm.org/10.1145/1368088.1368151
%R 10.1145/1368088.1368151

%0 Conference Paper
%B Proceedings of the 2008 international workshop on Mining software repositories - MSR '08
%D 2008
%T Extracting structural information from bug reports
%A Premraj, Rahul
%A Zimmermann, Thomas
%A Kim, Sunghun
%A Bettenburg, Nicolas
%Y Hassan, Ahmed E.
%Y Lanza, Michele
%Y Godfrey, Michael W.
%K bug reports
%K eclipse
%K enumerations
%K infozilla
%K natural language
%K patches
%K source code
%K stack trace
%X In software engineering experiments, the description of bug reports is typically treated as natural language text, although it often contains stack traces, source code, and patches. Neglecting such structural elements is a loss of valuable information; structure usually leads to a better performance of machine learning approaches.
In this paper, we present a tool called infoZilla that detects structural elements from bug reports with near perfect accuracy and allows us to extract them. We anticipate that infoZilla can be used to leverage data from bug reports at a different granularity level that can facilitate interesting research in the future.
%I ACM Press
%C New York, New York, USA
%P 27-30
%8 05/2008
%@ 9781605580241
%! MSR '08
%R 10.1145/1370750.1370757
%> https://flosshub.org/sites/flosshub.org/files/p27-bettenburg.pdf

%0 Conference Paper
%B International Workshop on Mining Software Repositories (MSR 2004)
%D 2004
%T LASER: a lexical approach to analogy in software reuse
%A Amin, R.
%A O Cinneide, Mel
%A Veale, Tony
%K class
%K developers
%K functions
%K jrefactory
%K method
%K naming
%K natural language
%K reuse
%K source code
%K wordnet
%X Software reuse is the process of creating a software system from existing software components, rather than creating it from scratch. With the increase in size and complexity of existing software repositories, the need to provide intelligent support to the programmer becomes more pressing. An analogy is a comparison of certain similarities between things which are otherwise unlike. This concept has been shown to be valuable in developing UML-level reuse techniques. In the LASER project we apply lexically-driven Analogy at the code level, rather than at the UML-level, in order to retrieve matching components from a repository of existing components. Using the lexical ontology WordNet, we have conducted a case study to assess if class and method names in open source applications are used in a semantically meaningful way. Our results demonstrate that both hierarchical reuse and parallel reuse can be enhanced through the use of lexically-driven Analogy.
%I IEE
%C Edinburgh, Scotland, UK
%V 2004
%P 112-116
%R 10.1049/ic:20040487
%> https://flosshub.org/sites/flosshub.org/files/112LASER.pdf