Heuristics for Crawling WSDL Descriptions of Web Service Interfaces - the Heritrix Case

Name
Taniel Põld
Abstract
The goal of this thesis is to configure and modify Heritrix web crawler to add the support for finding WSDL description URIs. Heritrix is an open-source spider that has been written in Java programming language and has been designed to help Internet Archive store the contents of Internet. It already includes most of the common heuristics used for spidering and it has a modular architecture design which makes it easy to alter. We gathered a collection of strategies and crawler job configuration options to be used on Heritrix. These originated from the published works that the other teams had done on the topic. In addition to it, we created a new module to the crawler’s source code, that allows logging of search results without any excessive data. With the job configuration changes mentioned, it was possible to spider the web for WSDL description URIs, but as Heritrix does not support focused crawling, the spider would explore all the web sites it happens to stumble upon. Most of these sites would accommodate no information relevant to finding web services. To guide the course of the spider's job to the resources potentially containing “interesting” data, we implemented support for focused crawling of WSDL URIs. The change required the creation of a new module in Heritrix’s source code, the algorithm used as basis for our solution was described in one of the articles. To see if our enhancement provided any improvement in the crawl’s process, a series of experiments were conducted. In them we compared performance and accuracy of two crawlers. Both of which were configured for WSDL descriptions crawling, but one of them was also fitted with module providing support for focused crawling. From the analysis of the experiments' results we deduced that although the crawler job set for the experiments' baseline processed URIs a bit faster, the spider with the improvements found WSDL descriptions more accurately and was able to find more of them.
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Information Technology
Supervisor(s)
Peep Küngas, Meelis Kull
Defence year
2012
 
PDF