Institute of Computer Science - Graduation Theses Registry

Mechanism for Change Detection in HTML Web Pages as XML Documents

Name

Kaarel Tõnisson

Abstract

Change detection in web pages is an important aspect of web monitoring. Automated web monitoring can be used to collect specific information, for example to detect public announcements, news posts and price changes. If we store the HTML code of a page, we can compare the stored version with the current one when we revisit the page and so find what has changed. HTML code can be compared using ordinary text comparison, but this risks losing information about the structure of the page. HTML code has a tree-like structure, and it is desirable to preserve that structure when finding changes. In this work we describe a mechanism that can be applied to collected HTML pages to find changes by transforming the pages into XML documents and comparing the resulting XML trees. We give a general list of the components needed for this task, describe our implementation, which uses NutchWAX, NekoHTML, XMLUnit, Jena and MongoDB, and show the results of applying the program to a dataset. We analyse the measurements collected when running our program on 1.1 million HTML pages. To our knowledge, this mechanism has not been tested in previous work. We show that the mechanism is usable on real-world data.
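
The core idea of the abstract, turning two versions of an HTML page into trees and comparing them node by node, can be illustrated with a minimal sketch. This is not the thesis implementation, which uses NekoHTML and XMLUnit in a Java pipeline. It is a simplified Python stand-in that relies only on the standard library. The names HtmlTreeBuilder, html_to_tree and diff_trees are hypothetical, the synthetic "document" root element is an assumption of this sketch, and the comparison is a naive positional one that assumes mostly well-nested input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Builds an XML element tree from HTML (illustrative, not robust)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # synthetic root (assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the node, do not open it.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        parent = self.stack[-1]
        if len(parent):  # text that follows a child goes into its tail
            parent[-1].tail = (parent[-1].tail or "") + text
        else:
            parent.text = (parent.text or "") + text


def html_to_tree(html):
    builder = HtmlTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def diff_trees(old, new, path="/document"):
    """Returns a list of (path, kind, old_value, new_value) tuples.

    Children are paired by position; the index in a path is the position
    among all siblings, not among siblings with the same tag.
    """
    if old.tag != new.tag:
        return [(path, "tag", old.tag, new.tag)]
    changes = []
    if old.attrib != new.attrib:
        changes.append((path, "attributes", dict(old.attrib), dict(new.attrib)))
    if (old.text or "") != (new.text or ""):
        changes.append((path, "text", old.text or "", new.text or ""))
    if (old.tail or "") != (new.tail or ""):
        changes.append((path, "tail", old.tail or "", new.tail or ""))
    old_kids, new_kids = list(old), list(new)
    for i in range(max(len(old_kids), len(new_kids))):
        if i >= len(old_kids):
            child_path = f"{path}/{new_kids[i].tag}[{i + 1}]"
            changes.append((child_path, "added", None, new_kids[i].tag))
        elif i >= len(new_kids):
            child_path = f"{path}/{old_kids[i].tag}[{i + 1}]"
            changes.append((child_path, "removed", old_kids[i].tag, None))
        else:
            child_path = f"{path}/{old_kids[i].tag}[{i + 1}]"
            changes.extend(diff_trees(old_kids[i], new_kids[i], child_path))
    return changes


old_page = "<html><body><h1>Prices</h1><p class='price'>10 EUR</p></body></html>"
new_page = ("<html><body><h1>Prices</h1><p class='price'>12 EUR</p>"
            "<p>New offer</p></body></html>")

for change in diff_trees(html_to_tree(old_page), html_to_tree(new_page)):
    print(change)
```

Unlike a plain text diff, each reported change carries the path of the element it belongs to, so a price change inside a specific paragraph can be told apart from a newly added element. A production system would need a more tolerant HTML parser and a tree comparison that copes with inserted or reordered nodes, which is the role the dedicated libraries play in the thesis.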

Graduation Thesis language

English

Graduation Thesis type

Bachelor - Computer Science

Supervisor(s)

Peep Küngas

Defence year

2015
