Content classification of developer emails

Title: Content classification of developer emails
Publication Type: Conference Paper
Year of Publication: 2012
Authors: Bacchelli, A, Dal Sasso, T, D'Ambros, M, Lanza, M
Secondary Title: Proceedings of the 34th IEEE/ACM International Conference On Software Engineering (ICSE 2012)
Date Published: 06/2012
Keywords: email, Emails, Empirical software engineering, mailing list, natural language, Unstructured Data Mining

Emails related to the development of a software system contain information about design choices and issues encountered during the development process. Exploiting the knowledge embedded in emails with automatic tools is challenging, due to the unstructured, noisy, and mixed-language nature of this communication medium. Natural language text is often not well-formed and is interleaved with content in other syntaxes, such as code or stack traces.

We present an approach to classify email content at line level. Our technique classifies email lines into five categories (i.e., text, junk, code, patch, and stack trace) to allow one to subsequently apply ad hoc analysis techniques for each category. We evaluated our approach on a statistically significant set of emails gathered from mailing lists of four unrelated open source systems.
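To make the idea of line-level classification concrete, the following is a minimal sketch that assigns each email line to one of the five categories using a few regular expressions. It is not the approach presented in the paper, which fuses parsing and ML techniques; the patterns, the precedence order, the working definition of "junk" (signature delimiters and reply headers), and the name `classify_line` are all assumptions made for illustration.

```python
import re

# Hypothetical, illustrative patterns; they are not taken from the paper.
STACK_TRACE = re.compile(
    r"^\s*(at\s+[\w$.]+\(.*\)|Caused by:|[\w.]+(Exception|Error)\b)"
)
PATCH = re.compile(r"^(diff\s|---\s|\+\+\+\s|@@\s.*@@)")
# Assumption: "junk" is approximated by signature delimiters and reply headers.
JUNK = re.compile(r"^(--\s*$|On\s.*wrote:\s*$)")
# Assumption: a line ending in ';', '{' or '}' looks like Java-style code.
CODE = re.compile(r"[;{}]\s*$")


def classify_line(line: str) -> str:
    """Return one of: text, junk, code, patch, stack_trace."""
    # The first matching pattern wins, so the more specific ones come first.
    if STACK_TRACE.search(line):
        return "stack_trace"
    if PATCH.search(line):
        return "patch"
    if JUNK.search(line):
        return "junk"
    if CODE.search(line):
        return "code"
    return "text"


if __name__ == "__main__":
    email = [
        "I think we should refactor the parser.",
        "int x = foo();",
        "    at org.example.Foo.bar(Foo.java:42)",
        "@@ -1,3 +1,4 @@",
        "-- ",
    ]
    for line in email:
        print(f"{classify_line(line):12} | {line}")
```

A stateless, per-line heuristic like this ignores context (for example, added or removed lines inside a patch hunk), which is one reason a combination of parsing and learned models is attractive for this task.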


We created a web application to manually classify email content in the chosen categories. We classified a statistically significant set of emails from four Java open source software (OSS) systems and used it to evaluate the accuracy of our approach.
The contributions of this paper are:
1) a novel approach that fuses parsing and ML techniques for classification of email lines;
2) a web application to manually classify email content;
3) the manual classification of a statistically significant sample set of emails (for a total of 67,792 lines) from mailing lists of four different software systems, in the form of a freely available benchmark; and
4) the empirical evaluation of our approach against the benchmark.
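
As an illustration of how predictions could be compared against a manually classified benchmark, the sketch below computes per-category precision and recall from two parallel lists of line labels. The text above does not state which metrics were used, so precision and recall are an assumption here, and the function name `per_category_scores` and the sample labels are hypothetical.

```python
from collections import Counter


def per_category_scores(gold, predicted):
    """Return {label: (precision, recall)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but the line was not p
            fn[g] += 1  # line was g, but g was not predicted

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        scores[label] = (precision, recall)
    return scores


if __name__ == "__main__":
    # Made-up labels for four lines, purely to exercise the function.
    gold = ["text", "code", "code", "junk"]
    predicted = ["text", "code", "text", "junk"]
    for label, (p, r) in per_category_scores(gold, predicted).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Reporting scores per category rather than a single accuracy figure keeps rare categories, such as stack traces, from being hidden by the far more common text lines.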
