Using Latent Dirichlet Allocation for automatic categorization of software

Title: Using Latent Dirichlet Allocation for automatic categorization of software
Publication Type: Conference Paper
Year of Publication: 2009
Authors: Tian, K, Revelle, M, Poshyvanyk, D
Secondary Title: 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
Pagination: 163 - 166
Publisher: IEEE
Place Published: Vancouver, BC, Canada
ISBN Number: 978-1-4244-3493-0
Keywords: categorization, category mining, lact, mudablue, multiple languages, repository
Abstract

In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation, an information retrieval method that is used to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages such as C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, which is a promising improvement as compared to manual categorization and existing techniques.
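
The general idea in the abstract, treating each software system as a document and modeling it as a mixture of topics, can be illustrated with a minimal sketch. This is not the authors' LACT implementation: the identifier tokenizer, the collapsed Gibbs sampler, the hyperparameters (`alpha`, `beta`, `iters`), the toy corpus, and the function names `tokenize`, `lda_gibbs`, and `top_words` are all assumptions chosen for illustration. Stop-word and keyword filtering, which a practical tool would likely need, is omitted.

```python
import random
import re
from collections import Counter


def tokenize(source):
    """Split identifiers in source text into lowercase word tokens."""
    words = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        # Break camelCase / snake_case identifiers into their parts.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", ident):
            words.append(part.lower())
    return words


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    Returns (theta, topic_word): per-document topic mixtures and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


def top_words(word_counts, n):
    """Most frequent words of a topic, usable as a rough category label."""
    return [w for w, c in word_counts.most_common(n) if c > 0]


if __name__ == "__main__":
    # Hypothetical miniature "systems", each reduced to a few identifiers.
    systems = {
        "web_server": "parseHttpRequest sendHttpResponse http_socket_listen",
        "http_client": "openSocket send_http_request readHttpResponse",
        "game_engine": "drawSprite renderFrame sprite_texture_load",
        "sprite_editor": "loadTexture render_sprite drawFrame",
    }
    names = list(systems)
    docs = [tokenize(systems[name]) for name in names]
    theta, topic_word = lda_gibbs(docs, num_topics=2)
    for name, mix in zip(names, theta):
        k = max(range(len(mix)), key=mix.__getitem__)
        print(name, "-> topic", k, top_words(topic_word[k], 3))
```

In this sketch a system is assigned to its dominant topic, and the topic's most frequent words stand in for a category name. Because the sampler is stochastic, the grouping on such a small corpus depends on the seed and the number of iterations.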

DOI: 10.1109/MSR.2009.5069496