Coping with noise in a real-world weblog crawler and retrieval system

Publication Type  Conference Poster
Year of Publication  2010
Authors  Lanagan, J.; Ferguson, P.; O'Hare, N.; Smeaton, A.F.
Conference Name  Fourth International AAAI Conference on Weblogs and Social Media
Conference Date(s)  23-26 May 2010
Conference Location  Washington, DC, USA.
Key Words  RP5
Abstract  

In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and discover that the time-interval between crawls is more important to the successful application of noise-removal algorithms within the blog context than any additional improvements to the removal algorithm itself.

URL  http://doras.dcu.ie/15439/