
Mining StackOverflow to Filter out Off-topic IRC Discussion

Submitted by msquire on Mon, 2015-05-25 13:36

Title	Mining StackOverflow to Filter out Off-topic IRC Discussion
Publication Type	Conference Proceedings
Year of Publication	2015
Authors	Chowdhury, SA, Hindle, A
Refereed Designation	Refereed
Secondary Title	International Working Conference on Mining Software Repositories
Pagination	4 pages
Date Published	05/2015
Keywords	irc, topics
Abstract	Internet Relay Chat (IRC) is a tool commonly used by open source developers. Developers use IRC channels to discuss programming-related problems, but much of the discussion is irrelevant and off-topic. Essentially, if we treat IRC discussions like email messages and apply spam filtering, we can try to filter out the spam (the off-topic discussions) from the ham (the programming discussions). Yet we need labelled data, which unfortunately takes time to curate. To avoid costly curation in order to filter out off-topic discussions, we need positive and negative data sources. Online discussion forums, such as StackOverflow, are very effective for solving programming problems. By engaging in open data, StackOverflow data becomes a powerful source of labelled text regarding programming. This work shows that we can train classifiers using StackOverflow posts as positive examples of on-topic programming discussion. YouTube video comments, notorious for their lack of quality, serve as a training set of off-topic discussion. By exploiting these datasets, accurate classifiers can be built, tested and evaluated that require very little effort for end users to deploy and exploit.
URL	http://webdocs.cs.ualberta.ca/~hindle1/2015/shaiful-mining_so.pdf
Full Text

Attachment	Size
shaiful-mining_so.pdf	144.71 KB
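
The abstract describes training a spam-style filter on two ready-made corpora: StackOverflow posts as on-topic examples and YouTube comments as off-topic examples. The sketch below illustrates that general idea with a small multinomial naive Bayes classifier written against the Python standard library. It is an assumption made for illustration, not the authors' implementation: this record does not say which classifiers or features the paper uses, and the class name `NaiveBayesFilter`, the tokenizer, and the handful of training sentences are all hypothetical placeholders rather than real StackOverflow or YouTube data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


class NaiveBayesFilter:
    """Hypothetical two-class multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"on": Counter(), "off": Counter()}
        self.doc_counts = {"on": 0, "off": 0}

    def train(self, texts, label):
        """Add labelled texts ("on" or "off") to the model."""
        for text in texts:
            self.word_counts[label].update(tokenize(text))
            self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        """Log prior plus summed log likelihoods of the tokens for one class."""
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        return score

    def is_on_topic(self, message):
        """Return True when the on-topic class scores higher than off-topic."""
        tokens = tokenize(message)
        vocab = set(self.word_counts["on"]) | set(self.word_counts["off"])
        on = self._log_score(tokens, "on", len(vocab))
        off = self._log_score(tokens, "off", len(vocab))
        return on > off


# Placeholder sentences standing in for StackOverflow posts (positive class).
on_topic_examples = [
    "how do I fix a null pointer exception in java",
    "python list index out of range error in loop",
    "segmentation fault when calling malloc in c function",
]
# Placeholder sentences standing in for YouTube comments (negative class).
off_topic_examples = [
    "lol this video is so funny",
    "first comment love this song",
    "who is watching this in 2015 lol",
]

clf = NaiveBayesFilter()
clf.train(on_topic_examples, "on")
clf.train(off_topic_examples, "off")

# Classify two made-up IRC messages.
print(clf.is_on_topic("getting an exception in my python function"))
print(clf.is_on_topic("lol love this song"))
```

With training sets this small the result is only a toy. The point is that neither corpus needs hand labelling, because the source of each text supplies the label.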
