Estimating Concordance Between Measured and Predicted Genetic Variant Effects on Chromatin Accessibility
Name
Kristiina Kuningas
Abstract
Many genome-wide association studies (GWAS) have identified genetic variants associated with human traits or diseases. However, understanding the molecular mechanisms underlying those associations has been challenging. Chromatin accessibility is one molecular trait that has been associated with disease risk, as several chromatin accessibility quantitative trait locus (caQTL) studies have shown. If chromatin is not accessible, transcription factors cannot bind to it, and gene expression or protein synthesis cannot be initiated. This can lead to an altered risk for some diseases. Therefore, it is essential to study caQTLs.
One approach to finding such variants is caQTL mapping, which uses open chromatin data and genotype imputation to find associations between genetic variants and chromatin accessibility. Subsequent fine-mapping distinguishes the potentially causal variants. In addition, deep learning models that predict the effects of genetic variants on molecular traits have been integrated into such studies to better understand the biological mechanisms behind the associations between genetic variants and phenotypic traits. However, the predictive accuracy of these models is still unclear. In this thesis, we created five caQTL datasets for five different cell types based on the fine-mapping results. These datasets were then used to validate the performance of Enformer, a state-of-the-art deep learning model, in predicting genetic variant effects on chromatin accessibility. Although other studies have already evaluated Enformer predictions, they have done so from a gene expression perspective, based on effects measured from RNA-seq data. This thesis instead compares the effects of genetic variants on chromatin accessibility measured from ATAC-seq data with the effects predicted by Enformer. It compares both the size and the direction of the effects, providing an initial overview of how Enformer performs on chromatin accessibility. The results showed that Enformer performs well, especially on variants for which it predicts stronger effects. In addition, it behaved as expected when the cell type of the measured variant differed from the cell type of the prediction: more predicted effects were in the opposite direction than when the cell types were similar. On the other hand, Enformer also predicted very low, near-zero effect scores in many cases where the measured effect was larger.
Graduation Thesis language
English
Graduation Thesis type
Master - Data Science
Supervisor(s)
Kaur Alasoo
Defence year
2023