Finding Non-Trivially Similar Documents in a Large Document Corpus

Oskar Gross
This thesis introduces methods for measuring the similarity between documents. Document similarity measures are an important topic in information retrieval and document classification. Finding similar documents in a corpus is useful in many fields, including web search engines, news aggregation services, and advertising systems. An important requirement for a document similarity measure is that its score should agree with human judgement of similarity. This raises the problem of semantic similarity. The standard way to measure similarity between documents is to compare the co-occurrence of words in them. As a result, two documents that are contextually very similar but do not contain the same words may be marked as dissimilar by standard document similarity measures. Semantic similarity measures aim to take the context of the documents into account and use this information when measuring similarity. The first goal of this thesis is to give an overview of the methods used for standard and semantic document similarity. The second goal is to experiment with document similarity measures on a news portal dataset and to explore whether these measures have any interesting properties. The motivation for the topic comes from an idea to create a new advertising network that targets advertisements better than the networks currently on the market. The aim was to analyse whether a simple, intuitive, yet effective method could be found for finding non-trivial similarity between documents.
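The limitation of word-based measures described in the abstract can be illustrated with a minimal sketch. The abstract does not name a specific measure, so cosine similarity over raw word counts is assumed here as a common stand-in. The function names and example sentences are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: doc_a and doc_b are close in meaning but share no words.
doc_a = bag_of_words("the physician treated the patient")
doc_b = bag_of_words("a doctor cured an ill person")
doc_c = bag_of_words("the physician treated the flu")

print(cosine_similarity(doc_a, doc_b))  # no shared words, so the score is 0.0
print(cosine_similarity(doc_a, doc_c))  # shared words give a positive score
```

In this sketch the first pair scores zero even though a human reader would likely judge the two sentences to be related. This is the gap that semantic similarity measures try to close.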
Graduation Thesis language
Graduation Thesis type
Master - Computer Science
Sven Laur, D.Sc. (Tech), Prof. Hannu Toivonen, PhD
Defence year