Natural language processing in information retrieval

Raigo Kodasmaa
Information retrieval is a field of natural language processing whose main task is to search for, find and retain text documents that are relevant to a user's query. An information retrieval system can make search more effective by using many natural language processing techniques. This document contains four main chapters: an introduction to information retrieval, document pre-processing, indexing of terms and documents, and techniques used in formulating queries. The first chapter gives an overview of the recent history of information retrieval and of the architecture of information retrieval systems. The second chapter describes the processing done before terms and documents are sent to indexing, including lexical analysis, stop word removal and stemming. Lexical analysis identifies the words in a text. Stop words are words that carry little semantic information, and stemming finds the stems of words. The main purpose of document pre-processing is to reduce the set of words in order to accelerate indexing. The next major process introduced is indexing, where term and document frequencies are used in weighting schemes. In addition to frequencies, term positions in documents are also considered. The fourth and longest chapter shows how relevance feedback and query expansion are used in query formulation. In relevance feedback, users judge whether the results are relevant or not, and the information retrieval system then creates a new query based on the users' feedback; the new results are then shown to the user. Query expansion does not require any user activity during query processing. This technique uses thesauri and relations between words to expand the query automatically and show the final results to the user. The aim of this document is to introduce techniques used in information retrieval and to create scripts that illustrate some of these techniques for Estonian.
At the end of the document there are some additional parts, such as a vocabulary of terms, a list of Estonian stop words, and scripts that compute term frequency, build a term-document matrix and stem Estonian words.
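To make the pre-processing and indexing steps described above more concrete, the following is a minimal Python sketch of lexical analysis, stop word removal, term frequency counting and a term-document matrix. It is not taken from the thesis scripts: the function names, the four-word stop word list and the two sample sentences are assumptions made only for illustration, and stemming is left out because no particular Estonian stemmer is specified here.

```python
import re
from collections import Counter

# Hypothetical, very small stop word list used only for this sketch;
# it is not the Estonian stop word list from the thesis appendix.
STOP_WORDS = {"ja", "on", "ei", "see"}


def tokenize(text):
    """Lexical analysis: lowercase the text and extract runs of letters."""
    # [^\W\d_] matches any Unicode letter, so õ, ä, ö and ü are kept.
    return re.findall(r"[^\W\d_]+", text.lower())


def preprocess(text):
    """Tokenize the text and drop stop words (no stemming in this sketch)."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


def term_frequency(text):
    """Count how often each remaining term occurs in one document."""
    return Counter(preprocess(text))


def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term, one column per document."""
    counts = [term_frequency(doc) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix


if __name__ == "__main__":
    # Two made-up sample documents.
    docs = ["Kass on puu otsas", "Koer ja kass"]
    vocabulary, matrix = term_document_matrix(docs)
    for term, row in zip(vocabulary, matrix):
        print(term, row)
```

In this sketch each row of the matrix holds the raw counts of one term across all documents; a weighting scheme based on term and document frequencies would be applied on top of these counts.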
Graduation Thesis language
Graduation Thesis type
Bachelor - Computer Science
Mare Koit
Defence year