Mining StackOverflow to Filter out Off-topic IRC Discussion

Title: Mining StackOverflow to Filter out Off-topic IRC Discussion
Publication Type: Conference Proceedings
Year of Publication: 2015
Authors: Chowdhury, SA, Hindle, A
Refereed Designation: Refereed
Secondary Title: International Working Conference on Mining Software Repositories
Pagination: 4 pages
Date Published: 05/2015
Keywords: irc, topics

Internet Relay Chat (IRC) is commonly used by open-source developers. Developers use IRC channels to discuss programming-related problems, but much of the discussion is irrelevant and off-topic. If we treat IRC discussions like email messages and apply spam filtering, we can try to filter out the spam (the off-topic discussions) from the ham (the programming discussions). Yet this approach needs labelled data, which unfortunately takes time to curate.
To filter out off-topic discussions without costly curation, we need positive and negative data sources. Online discussion forums, such as StackOverflow, are very effective for solving programming problems. Because StackOverflow releases its data openly, it becomes a powerful source of labelled text about programming. This work shows that we can train classifiers using StackOverflow posts as positive examples of on-topic programming discussion. YouTube video comments, notorious for their lack of quality, serve as a training set of off-topic discussion. By exploiting these datasets, accurate classifiers can be built, tested, and evaluated that require very little effort for end-users to deploy and exploit.
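
The training setup described in the abstract can be illustrated with a minimal sketch. The abstract does not say which classifier or features the authors used, so the multinomial naive Bayes model below is an assumption chosen because it is a common spam-filtering baseline. The example sentences are hypothetical stand-ins for StackOverflow posts (on-topic) and YouTube comments (off-topic), and the names `tokenize`, `NaiveBayes`, and `clf` are introduced only for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Assumption: the paper's actual classifier may differ; this is a baseline.
    """

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical stand-ins for StackOverflow posts (positive examples).
on_topic = [
    "How do I fix a null pointer exception in Java?",
    "Python list comprehension throws a syntax error",
    "Segmentation fault when calling malloc in C function",
    "How to merge two git branches without conflict",
]
# Hypothetical stand-ins for YouTube comments (negative examples).
off_topic = [
    "lol this video is so funny",
    "first comment, who is watching in 2015",
    "love this song so much",
    "this is the best video ever lol",
]

clf = NaiveBayes().fit(
    on_topic + off_topic,
    ["on_topic"] * len(on_topic) + ["off_topic"] * len(off_topic),
)

# Classify unseen IRC-style messages using only the borrowed training data.
for message in ["null pointer exception in my java function",
                "lol this song is the best"]:
    print(message, "->", clf.predict(message))
```

The point of the sketch is that no IRC message is ever labelled by hand: both classes come from existing external corpora, and the trained model is then applied directly to chat messages.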

Full Text
PDF: shaiful-mining_so.pdf (144.71 KB)