Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report
Title | Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report |
Publication Type | Journal Article |
Year of Publication | 2012 |
Authors | Shang, W, Adams, B, Hassan, AE |
Secondary Title | Journal of Systems and Software |
Volume | 85 |
Issue | 10 |
Pagination | 2195 - 2204 |
Date Published | 10/2012 |
ISSN Number | 0164-1212 |
Keywords | flossmole cited |
Abstract | The Mining Software Repositories (MSR) field analyzes software repository data to uncover knowledge and assist development of ever growing, complex systems. However, existing approaches and platforms for MSR analysis face many challenges when performing large-scale MSR studies. Such approaches and platforms rarely scale easily out of the box. Instead, they often require custom scaling tricks and designs that are costly to maintain and that are not reusable for other types of analysis. We believe that the web community has faced many of these software engineering scaling challenges before, as web analyses have to cope with the enormous growth of web data. In this paper, we report on our experience in using a web-scale platform (i.e., Pig) as a data preparation language to aid large-scale MSR studies. Through three case studies, we carefully validate the use of this web platform to prepare (i.e., Extract, Transform, and Load, ETL) data for further analysis. Despite several limitations, we still encourage MSR researchers to leverage Pig in their large-scale studies because of Pig's scalability and flexibility. Our experience report will help other researchers who want to scale their analyses. |
Notes | "For example, FLOSSMole (Howison et al., 2006) is a public relational database that contains data extracted from a large number of software repositories. Many researchers use FLOSSMole as a platform. For example, Herraiz et al. (2008) used data in FLOSSMole (Howison et al., 2006) to perform analysis to illustrate that most of the software projects are governed by short term goals rather than long term goals." |
URL | http://www.sciencedirect.com/science/article/pii/S0164121211002007 |
DOI | 10.1016/j.jss.2011.07.034 |
Short Title | Journal of Systems and Software |