Predicting the Impact of Non-Coding Genetic Variants on Transcription Factor Binding with Machine Learning

Yurii Toma
Understanding how the human organism works is one of the most important problems in science. A great deal of research effort has gone into the analysis of deoxyribonucleic acid (DNA) since the first human genome was sequenced. Despite these efforts, many processes in the human organism remain poorly understood. One open problem is determining the functional consequences of non-coding genetic variants in the human DNA sequence. These variants, if functional, are likely to influence the binding of transcription factors: regulatory proteins that control the expression of other genes by binding to regulatory elements across the genome. A diverse set of methods has been developed to predict the effect of genetic variants on transcription factor binding. However, all of these methods have been limited by the lack of high-quality test data with which to evaluate their accuracy. Here I combine and re-analyse three large genetic studies to identify a high-quality set of likely causal genetic variants that regulate the binding of the CTCF and PU.1 transcription factors. I then use these variants to evaluate the accuracy of three state-of-the-art prediction algorithms.
My results indicate that while the impact of some genetic variants with large effects can be readily predicted, most variants with smaller effects are missed by current prediction algorithms. My approach is generalisable to other transcription factors and can be used to benchmark the accuracy of novel prediction algorithms developed in the future.
Graduation Thesis type
Master - Computer Science
Kaur Alasoo, Dmytro Fishman