SourceForge.net Research Data Archive
SourceForge.net is the world's largest Open Source software development web site, with the largest repository of Open Source code and applications available on the Internet. Owned and operated by OSTG, Inc. ("OSTG"), SourceForge.net provides free services to Open Source developers. The SourceForge.net web site is database driven, and the supporting database includes historical and current status statistics on more than 140,000 projects and on the activities of more than 1.5 million registered users of the project management web site. OSTG has shared certain SourceForge.net data with the University of Notre Dame for the sole purpose of supporting academic and scholarly research on the Free/Open Source Software phenomenon. OSTG has also given Notre Dame permission to share this data with other academic researchers studying the Free/Open Source Software phenomenon.
Release of the SourceForge.net Research Data
To advance the understanding of, and research on, the Free/Open Source Software phenomenon, portions of the data that may support such research will be made available to academic or scholarly researchers. All requests for data must be submitted in writing (by e-mail) to the Notre Dame PI, Greg Madey. Only academic and scholarly researchers are eligible to receive the data. To receive the data, a researcher must complete, sign, and return a short questionnaire and agreement. There is also a wiki for SRDA users.
Description of Data Available
SourceForge.net uses relational databases to store project management activity and statistics. There are over 100 relations (tables) in the data dumps provided to Notre Dame. Some of the data have been removed for security and privacy reasons: SourceForge.net cleanses the data of personal information and strips out all information specific to OSTG or to site functionality. On a monthly basis, a complete dump of the databases (minus the data dropped for privacy and security reasons) is shared with Notre Dame. The Notre Dame researchers have built a data warehouse composed of these monthly dumps, with each dump stored in a separate schema. Thus, each monthly dump is a snapshot of the status of all the SourceForge.net projects at that point in time. As of March 2007, the data warehouse was almost 500 GBytes in size and was growing at about 25 GBytes per month. Much of the data is duplicated among the monthly dumps, but trends or changes in project activity and structure can be discovered by comparing data from different monthly dumps. Queries across the monthly schemas may be used to discover when changes took place, to estimate trends in project activity and participation, or even to establish that no activity, events, or changes have taken place.
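As an illustration of comparing monthly snapshots, the following minimal sketch joins the same table across consecutive schemas and reports what changed. Everything specific in it is an assumption made for the example: the schema names (dump_2007_01 and so on), the projects table, its columns, and the sample rows are hypothetical and do not reflect the actual SRDA layout. SQLite's ATTACH DATABASE is used only as a lightweight stand-in for a warehouse that keeps one schema per month.

```python
import sqlite3


def build_snapshots():
    """Create an in-memory stand-in for the warehouse: one schema per monthly dump."""
    # Autocommit mode, because SQLite refuses ATTACH inside an open transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    # Hypothetical schema names and project rows, for illustration only.
    snapshots = {
        "dump_2007_01": [(1, "alpha", 3), (2, "beta", 1)],
        "dump_2007_02": [(1, "alpha", 5), (2, "beta", 1)],
        "dump_2007_03": [(1, "alpha", 5), (2, "beta", 1), (3, "gamma", 2)],
    }
    for schema, rows in snapshots.items():
        conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
        conn.execute(
            f"CREATE TABLE {schema}.projects ("
            "project_id INTEGER PRIMARY KEY, name TEXT, developers INTEGER)"
        )
        conn.executemany(f"INSERT INTO {schema}.projects VALUES (?, ?, ?)", rows)
    return conn, sorted(snapshots)


def developer_changes(conn, schemas):
    """Return (schema, project_id, old_count, new_count) for each change
    between consecutive monthly snapshots; old_count is None for new projects."""
    changes = []
    for prev, curr in zip(schemas, schemas[1:]):
        query = f"""
            SELECT c.project_id, p.developers, c.developers
            FROM {curr}.projects AS c
            LEFT JOIN {prev}.projects AS p ON p.project_id = c.project_id
            WHERE p.project_id IS NULL OR p.developers <> c.developers
            ORDER BY c.project_id
        """
        for project_id, old, new in conn.execute(query):
            changes.append((curr, project_id, old, new))
    return changes


if __name__ == "__main__":
    conn, schemas = build_snapshots()
    for change in developer_changes(conn, schemas):
        print(change)
```

A project that never appears in the result (project 2 in the sample data) is one for which no change was recorded between any pair of consecutive snapshots, which corresponds to the "no activity" case described above. This sketch does not detect projects that disappear from a later dump; that would need a second join in the opposite direction.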