%0 Journal Article
%J Science of Computer Programming
%D 2014
%T Sourcerer: An infrastructure for large-scale collection and analysis of open-source code
%A Bajracharya, Sushil
%A Ossher, Joel
%A Lopes, Cristina
%X A large amount of open source code is now available online, presenting a great potential resource for software developers. This has motivated software engineering researchers to develop tools and techniques to allow developers to reap the benefits of these billions of lines of source code. However, collecting and analyzing such a large quantity of source code presents a number of challenges. Although the current generation of open source code search engines provides access to the source code in an aggregated repository, they generally fail to take advantage of the rich structural information contained in the code they index. This makes them significantly less useful than Sourcerer for building state-of-the-art software engineering tools, as these tools often require access to both the structural and textual information available in source code. We have developed Sourcerer, an infrastructure for large-scale collection and analysis of open source code. By taking full advantage of the structural information extracted from source code in its repository, Sourcerer provides a foundation upon which state-of-the-art search engines and related tools can easily be built. We describe the Sourcerer infrastructure, present the applications that we have built on top of it, and discuss how existing tools could benefit from using Sourcerer.
%B Science of Computer Programming
%V 79
%P 241 - 259
%8 1/2014
%! Science of Computer Programming
%R 10.1016/j.scico.2012.04.008