%0 Conference Paper
%B Proceedings of the 11th Working Conference on Mining Software Repositories
%D 2014
%T A Dataset for Maven Artifacts and Bug Patterns Found in Them
%A Saini, Vaibhav
%A Sajnani, Hitesh
%A Ossher, Joel
%A Lopes, Cristina V.
%K Empirical Research
%K Empirical software engineering
%K findbugs
%K maven
%K software quality
%X In this paper, we present data downloaded from Maven, one of the most popular component repositories. The data includes the binaries of 186,392 components, along with source code for 161,025. We identify and organize these components into groups where each group contains all the versions of a library. In order to assess the quality of these components, we make available reports generated by the FindBugs tool on 64,574 components. The information is also made available in the form of a database which stores total number, type, and priority of bug patterns found in each component, along with its defect density. We also describe how this dataset can be useful in software engineering research.
%B Proceedings of the 11th Working Conference on Mining Software Repositories
%S MSR 2014
%I ACM
%C New York, NY, USA
%P 416–419
%@ 978-1-4503-2863-0
%U http://doi.acm.org/10.1145/2597073.2597134
%R 10.1145/2597073.2597134

%0 Journal Article
%J Science of Computer Programming
%D 2014
%T Sourcerer: An infrastructure for large-scale collection and analysis of open-source code
%A Bajracharya, Sushil
%A Ossher, Joel
%A Lopes, Cristina
%X A large amount of open source code is now available online, presenting a great potential resource for software developers. This has motivated software engineering researchers to develop tools and techniques to allow developers to reap the benefits of these billions of lines of source code. However, collecting and analyzing such a large quantity of source code presents a number of challenges.
Although the current generation of open source code search engines provides access to the source code in an aggregated repository, they generally fail to take advantage of the rich structural information contained in the code they index. This makes them significantly less useful than Sourcerer for building state-of-the-art software engineering tools, as these tools often require access to both the structural and textual information available in source code. We have developed Sourcerer, an infrastructure for large-scale collection and analysis of open source code. By taking full advantage of the structural information extracted from source code in its repository, Sourcerer provides a foundation upon which state-of-the-art search engines and related tools can easily be built. We describe the Sourcerer infrastructure, present the applications that we have built on top of it, and discuss how existing tools could benefit from using Sourcerer.
%B Science of Computer Programming
%V 79
%P 241–259
%8 1/2014
%! Science of Computer Programming
%R 10.1016/j.scico.2012.04.008

%0 Conference Paper
%B 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010)
%D 2010
%T Automated dependency resolution for open source software
%A Ossher, Joel
%A Bajracharya, Sushil
%A Lopes, Cristina
%K dependencies
%K java
%K source code
%K sourcerer
%X Opportunities for software reuse are plentiful, thanks in large part to the widespread adoption of open source processes and the availability of search engines for locating relevant artifacts. One challenge presented by open source software reuse is simply getting a newly downloaded artifact to build/run in the first place. The artifact itself likely reuses other artifacts, and so depends on their being located to function properly.
While merely tedious in the individual case, this can cause serious difficulties for those seeking to study open source software. It is simply not feasible to manually resolve dependencies for thousands of projects, and many forms of analysis require declarative completeness. In this paper we present a method for automatically resolving dependencies for open source software. It works by cross-referencing a project's missing type information with a repository of candidate artifacts. We have implemented this method on top of Sourcerer, an infrastructure for the large-scale indexing and analysis of open source code. The performance of our resolution algorithm was evaluated in two parts. First, for a small number of popular open source projects, we manually examined the artifacts suggested by our system to determine if they were appropriate. Second, we applied the algorithm to the 13,241 projects in the Sourcerer managed repository to evaluate the rate of resolution success. The results demonstrate the feasibility of this approach, as the algorithm located all of the required artifacts needed by 3,904 additional projects, increasing the percentage of declaratively complete projects in Sourcerer from 39% to 69%.
%B 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010)
%I IEEE
%C Cape Town, South Africa
%P 130–140
%@ 978-1-4244-6802-7
%R 10.1109/MSR.2010.5463346

%0 Conference Paper
%B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
%D 2009
%T SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects
%A Ossher, Joel
%A Bajracharya, Sushil
%A Linstead, Erik
%A Baldi, Pierre
%A Lopes, Cristina
%K apache
%K integration
%K java
%K java.net
%K project
%K repository
%K sourceforge
%K SourcererDB
%X The open source movement has made vast quantities of source code available online for free, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in exploiting this potential fully is that the data are currently scattered between competing source code repositories, none of which are structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from Sourceforge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, allowing for cross-project usage information to be accessed easily.
This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
%B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
%I IEEE
%C Vancouver, BC, Canada
%P 183–186
%@ 978-1-4244-3493-0
%R 10.1109/MSR.2009.5069501

%0 Conference Paper
%B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
%D 2009
%T SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects
%A Ossher, Joel
%A Bajracharya, Sushil
%A Linstead, Erik
%A Baldi, Pierre
%A Lopes, Cristina
%K apache
%K java
%K java.net
%K source code
%K sourceforge
%K sourcerer
%X The open source movement has made vast quantities of source code available online for free, providing an extremely large dataset for empirical study and potential re-use. A major difficulty in exploiting this potential fully is that the data are currently scattered between competing source code repositories, none of which are structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from Sourceforge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database.
References to entities in the 16,058 external jars are resolved and grouped, allowing for cross-project usage information to be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
%B 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR)
%I IEEE
%C Vancouver, BC, Canada
%P 183–186
%@ 978-1-4244-3493-0
%R 10.1109/MSR.2009.5069501