%0 Journal Article
%J Journal of Systems and Software
%D 2012
%T Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report
%A Shang, Weiyi
%A Adams, Bram
%A Hassan, Ahmed E.
%K flossmole cited
%X The Mining Software Repositories (MSR) field analyzes software repository data to uncover knowledge and assist development of ever growing, complex systems. However, existing approaches and platforms for MSR analysis face many challenges when performing large-scale MSR studies. Such approaches and platforms rarely scale easily out of the box. Instead, they often require custom scaling tricks and designs that are costly to maintain and that are not reusable for other types of analysis. We believe that the web community has faced many of these software engineering scaling challenges before, as web analyses have to cope with the enormous growth of web data. In this paper, we report on our experience in using a web-scale platform (i.e., Pig) as a data preparation language to aid large-scale MSR studies. Through three case studies, we carefully validate the use of this web platform to prepare (i.e., Extract, Transform, and Load, ETL) data for further analysis. Despite several limitations, we still encourage MSR researchers to leverage Pig in their large-scale studies because of Pig's scalability and flexibility. Our experience report will help other researchers who want to scale their analyses.
%B Journal of Systems and Software
%V 85
%P 2195 - 2204
%8 10/2012
%U http://www.sciencedirect.com/science/article/pii/S0164121211002007
%N 10
%! Journal of Systems and Software
%R 10.1016/j.jss.2011.07.034

%0 Conference Paper
%B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
%D 2009
%T MapReduce as a general framework to support research in Mining Software Repositories (MSR)
%A Shang, Weiyi
%A Jiang, Zhen Ming
%A Adams, Bram
%A Hassan, Ahmed E.
%K hadoop
%K mapreduce
%X Researchers continue to demonstrate the benefits of Mining Software Repositories (MSR) for supporting software development and research activities. However, as the mining process is time and resource intensive, they often create their own distributed platforms and use various optimizations to speed up and scale up their analysis. These platforms are project-specific, hard to reuse, and offer minimal debugging and deployment support. In this paper, we propose the use of MapReduce, a distributed computing platform, to support research in MSR. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open source implementation of MapReduce. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we demonstrate that the migration effort to MapReduce is minimal and that the benefits are significant, as running time of the migrated J-REX is only 30% to 50% of the original J-REX's. This paper documents our experience with the migration, and highlights the benefits and challenges of the MapReduce framework in the MSR community.
%I IEEE
%C Vancouver, BC, Canada
%P 21 - 30
%@ 978-1-4244-3493-0
%R 10.1109/MSR.2009.5069477
%> https://flosshub.org/sites/flosshub.org/files/21MSR2009-MSR-0114-Shang-Weiyi.pdf