Clustering-based motif discovery from short peptides

Name
Mari-Liis Kruup
Abstract
New sequencing technologies make it possible to generate large amounts of biological data of many kinds, and these data must be analysed to extract the most important information from them. In this work we develop a method for extracting motifs from a large number of short amino acid sequences, called peptides, that contain information about the antibodies in an organism. Motifs found in these peptides could be linked to diseases that a person has had. Since none of the existing methods we tested was suitable for this problem, we developed our own method, which consists of two parts. The first part, finding groups of similar peptides, is based on hierarchical clustering and offers two options for automatically extracting clusters from the hierarchical clustering tree. The second part reads motifs from the groups of similar peptides. Because the true motifs in real data are not known, the method cannot be validated on real data, so we generate synthetic datasets and validate the method on them. The method identified between 50% and 100% of the motifs in synthetic data with different properties, and 86% on the data that should be most similar to the real data. The method that reads motifs from a group of similar peptides also worked very well: it identified 100% of the motifs in peptide groups with no added noise and 90% in noisier peptide groups. The developed method could also be used for motif discovery on other biological datasets, although some parameters that were chosen specifically for this problem would then have to be changed. Future work could test how well the method performs on such datasets.
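The two-part idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the Hamming distance, the average linkage, the fixed distance cutoff `max_dist`, the agreement threshold `min_share`, the `.` wildcard and the example peptides are all assumptions made for illustration, and the abstract does not specify how its two cluster-extraction options work.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(c1, c2):
    """Mean pairwise Hamming distance between two clusters."""
    return sum(hamming(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster_peptides(peptides, max_dist):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    until no pair is closer than max_dist (a hypothetical fixed cutoff)."""
    clusters = [[p] for p in peptides]
    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), average_linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if d > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def read_motif(cluster, min_share=0.8):
    """Consensus motif: keep a residue where at least min_share of the
    peptides agree, otherwise emit the wildcard '.'."""
    motif = []
    for column in zip(*cluster):
        residue, count = Counter(column).most_common(1)[0]
        motif.append(residue if count / len(cluster) >= min_share else ".")
    return "".join(motif)


# Made-up peptides: two groups of three similar sequences each.
peptides = ["ACDEFGH", "ACDEFGK", "ACDEYGH", "WWPLMNQ", "WWPLMNR", "WWPIMNQ"]
clusters = cluster_peptides(peptides, max_dist=3)
motifs = sorted(read_motif(c) for c in clusters)
print(motifs)  # ['ACDE.G.', 'WWP.MN.']
```

A fixed cutoff is the simplest way to extract clusters from the merge sequence. Real peptide data would likely need a distance that tolerates shifts and varying lengths, which this sketch does not handle.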
Graduation Thesis language
Estonian
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Meelis Kull, Jaak Vilo
Defence year
2015
 