Mechanism for Change Detection in HTML Web Pages as XML Documents

Name
Kaarel Tõnisson
Abstract
Change detection of web pages is an important aspect of web monitoring. Automated web monitoring can be used for the collection of specifc information, for example for detecting public announcements, news posts and changes of prices. If we store the HTML code of a page, we can compare the current and previous codes when we revisit the page, allowing us to find their changes. HTML code can be compared using ordinary text comparison, but this brings the risk of losing information about the structure of the page. HTML code is treelike in structure and it is a desirable property to preserve when finding changes. In this work we describe a mechanism that can be applied to collected HTML pages to find their changes by transforming HTML pages into XML documents and comparing the resulting XML trees. We give a general list of the components needed for this task, describe our implementation which uses NutchWAX, NekoHTML, XMLUnit, Jena and MongoDB, and show the results of applying the program to a dataset. We analyse the results of measurements collected when running our program on 1.1 million HTML pages. To our knowledge this mechanism has not been tested in previous works. We show that the mechanism is usable on real world data.
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Peep Küngas
Defence year
2015
 
PDF