Detection algorithm for content on Internet web portals

Krzysztof Ulman

Krzysztof Rzecki

Abstract

This paper presents the steps taken in designing and implementing an algorithm for the automatic recognition of web page content, based on analysis of the HTML structure. Here, the content of a web page means the article text together with its headline, excluding all other text such as menus, advertisements, user comments, and image captions.

Keywords: web page content recognition, data mining, web scraping, data collection, web page structure analysis, HTML