Counting word frequencies based on limited regular expressions

Name
Anti Alman
Abstract
This bachelor's thesis concentrates on developing and implementing an algorithm for a subtask in a biomarker discovery pipeline. The pipeline itself is being developed at the BIIT group in the University of Tartu as part of an industrial collaboration. The input of this algorithm is data about a large number of different biological samples. The data about these samples is represented by using short words and corresponding frequencies, which allow us to find significant differences between samples. It is also known that in some cases a limited regular expression would be a much better representation of these differences. However the frequencies that correspond to any given regular expression need to be calculated based on words and the frequencies of these words. This problem can be divided into two parts. First we need to find all of the words that match the given regular expression, this is achieved by using large bitvectors that will be constantly stored in memory. The second part concentrates on calculating the frequencies based on matching words. Speed is here achieved by storing frequencies in memory as a sparse array in format that allows fast adding of rows. The resulting algorithm is implemented in both Python and C++. The details of these implementations are given and finally the speed of both of these implementations is measured against a naive solution. The bachelors thesis results in an program that is able to find the frequencies of input regular expressions with sufficient speed.
Graduation Thesis language
Estonian
Graduation Thesis type
Bachelor - Information Technology
Supervisor(s)
Meelis Kull, Sven Laur
Defence year
2013
 
PDF