Using URL Templates to Find Hidden Entity Pages
Name
Ago Allikmaa
Abstract
This thesis describes a method for finding hidden entity pages from a list of URLs visited by a web crawler. The method derives a list of URL templates from the input URLs and uses them to predict the addresses of further entity pages. In the initial template generation phase, templates are generated by detecting numeric path elements and treating all other elements as static text. So that each set of entities is represented by only one template, the templates are deduplicated in the unused path element detection phase: templates that represent the same set of entities via an alternative path are merged, which is determined by comparing the contents of the pages they represent. The templates are then split so that each has only one changing variable, the numeric entity identifier, known as its index. New URLs are generated from the gaps in the index values observed for a template.
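The template generation and gap-filling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names `build_templates` and `predict_urls`, the `{id}` placeholder, and the example.com URLs are assumptions made for illustration. The sketch also simplifies the method: it keeps only URLs with exactly one numeric path element instead of splitting multi-variable templates, it skips the content-based deduplication phase, and it ignores query strings and leading zeros in identifiers.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_templates(urls):
    """Map each URL template to the set of numeric index values seen for it.

    A numeric path element is replaced by the hypothetical placeholder
    "{id}"; every other path element is kept as static text.
    """
    templates = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        segments = parts.path.split("/")
        numeric = [i for i, s in enumerate(segments)
                   if s.isascii() and s.isdigit()]
        # Simplification (assumption): only single-variable URLs are kept.
        if len(numeric) != 1:
            continue
        position = numeric[0]
        value = int(segments[position])
        segments[position] = "{id}"
        template = f"{parts.scheme}://{parts.netloc}{'/'.join(segments)}"
        templates[template].add(value)
    return templates


def predict_urls(templates):
    """Generate candidate URLs from gaps between the observed index values."""
    candidates = []
    for template, seen in sorted(templates.items()):
        for value in range(min(seen), max(seen) + 1):
            if value not in seen:
                candidates.append(template.replace("{id}", str(value)))
    return candidates


if __name__ == "__main__":
    crawled = [
        "http://example.com/item/1",
        "http://example.com/item/2",
        "http://example.com/item/5",
        "http://example.com/about",
    ]
    for candidate in predict_urls(build_templates(crawled)):
        print(candidate)
```

With the sample input, the identifiers 3 and 4 are missing from the observed index of the single template, so those two addresses are proposed as candidate entity pages. Whether such candidates actually exist would still have to be checked by requesting them.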
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Peep Küngas
Defence year
2016