Optimized Content Extraction from web pages using Composite Approaches

International Journal of Computer Trends and Technology (IJCTT)          
 
© 2013 by IJCTT Journal
Volume-4 Issue-3                           
Year of Publication : 2013
Authors : Sheba Gaikwad, G. Naveen Sundar

MLA

Sheba Gaikwad, G. Naveen Sundar. "Optimized Content Extraction from web pages using Composite Approaches." International Journal of Computer Trends and Technology (IJCTT), V4(3):450-453, Issue 2013. ISSN 2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract: The amount of information available on the web today is tremendous, and it brings greater challenges. Content extraction identifies the main content of a web page and removes the surrounding clutter. The main difficulties in extracting content are the newer architecture of web pages and the diversity of their structure. This paper proposes a hybrid model that operates on the Document Object Model (DOM) tree of an HTML document to extract its content accurately. The model combines techniques such as statistical feature extraction and the analysis of formatting characteristics. Content type identification is used together with this collective approach to cope with the variety of web pages and to achieve higher accuracy in extracting content.
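
The abstract does not spell out the hybrid model, so the following is only a minimal sketch of the general idea of scoring DOM blocks with a statistical feature. It is not the authors' method. It assumes a single feature, link density (the share of a block's words that sit inside `<a>` elements), as a stand-in for the paper's statistical features. The names `Node`, `TreeBuilder`, `collect`, `text_of`, and `extract_main_content`, and the tag sets, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real system would tune these.
BLOCK_TAGS = {"div", "p", "td", "article", "section"}
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A tiny DOM node: children is a mixed list of Node and str."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def collect(node, results, in_link=False):
    """Post-order walk returning (words, link_words) for a subtree.

    Appends (score, node) to results for every non-empty block element.
    """
    if node.tag in SKIP_TAGS:
        return 0, 0
    in_link = in_link or node.tag == "a"
    words = link_words = 0
    for child in node.children:
        if isinstance(child, str):
            w = len(child.split())
            lw = w if in_link else 0
        else:
            w, lw = collect(child, results, in_link)
        words += w
        link_words += lw
    if node.tag in BLOCK_TAGS and words:
        density = link_words / words
        # Non-link words, discounted by how link-heavy the block is.
        results.append(((words - link_words) * (1 - density), node))
    return words, link_words


def text_of(node):
    """Concatenate the visible text of a subtree, normalizing whitespace."""
    if node.tag in SKIP_TAGS:
        return ""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(" ".join(parts).split())


def extract_main_content(html):
    """Return the text of the highest-scoring block, or "" if none."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results = []
    collect(builder.root, results)
    if not results:
        return ""
    best = max(results, key=lambda r: r[0])[1]
    return text_of(best)
```

Calling `extract_main_content(page_html)` on a page whose navigation bar consists mostly of links and whose article body is plain text should favor the article block, because link-heavy blocks score close to zero. A hybrid model of the kind the abstract describes would combine several such features, including formatting characteristics and content type, rather than relying on link density alone.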

Keywords — Data mining, Information Extraction, Content extraction, HTML, Open source intelligence, Information filtering.