A scalable crawler framework for FLOSS data

Title: A scalable crawler framework for FLOSS data
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Zou, Y, Zie, B, Zhang, L
Tertiary Authors: Mei, H, Lv, J, Mao, X
Secondary Title: Proceedings of the 5th Asia-Pacific Symposium on Internetware - Internetware '13
Pagination: 1 - 7
Publisher: ACM Press
Place Published: Changsha, China
ISBN Number: 9781450323697
Keywords: flossmole cited
Abstract

Free / Libre / Open Source Software (FLOSS) data, such as bug reports, mailing lists and related webpages, contains valuable information for reusing open source software projects. Before conducting further experiments on FLOSS data, researchers often need to download the data into a local storage system. We refer to this preprocessing step as FLOSS data retrieval, which in many cases can be a challenging task. In this paper, we propose a crawler framework to ease the process of FLOSS data retrieval. To cope with the various types of FLOSS data scattered across the Internet, we designed the framework in a scalable manner, so that a crawler program can be easily plugged into the system to extend its functionality. Researchers can perform the retrieval process on datasets of various types and sources simply by adding new configurations to the system. We have implemented the framework and provided basic functions via web-based interfaces. We present the usage of the system through a detailed case study in which we retrieved various types of datasets related to the Apache Lucene project using our framework.

Notes

FLOSSmole [8] and FLOSSmetrics [9] retrieved FLOSS data of
various types from well-known software forges such as SourceForge and
Google Code; they also provide interfaces for sharing and analyzing
the data.

"Typically,
Howison et al. [8] proposed a system called FLOSSmole.
FLOSSmole is a large collection of datasets extracted from
famous software forges such as SourceForge, GitHub, and Google
Code. Datasets in FLOSSmole are mainly metadata describing
various facts about the development of FLOSS projects.
FLOSSmole manages its datasets in an open and collaborative
manner. Most of the data is collected by the FLOSSmole research
team, yet they also accept data donation from other research
groups or similar projects. The scripts and programs that collects
the datasets from the Internet is also open for download and
donation."

"Using
FLOSSmole [8] and FLOSSmetrics [9] as case studies, similar
systems as such are called “repository of repositories (RoR)” and
basic requirements of these systems are proposed."

DOI: 10.1145/2532443.2532454