[Preliminarily taken] Machine learning / Statistical modeling: Predicting human health from genomic and life history information

Organization
ATI
Abstract
An ambitious goal of personalized medicine is to predict human health. Estonia is unusual in having electronic health records, Estonian Genome Project data, and additional assays available for the same set of individuals, which provides abundant, high-quality information. In addition, genomic measurements have been gathered for many more non-Estonians and could be mined for associations that also hold in the Estonian population. For the first time, we are in a position to test how well health can be forecast from these diverse sources of data.

We previously addressed the problem of predicting individual traits from genomic information in yeast [1] and found that traits can be predicted surprisingly well, with 91% accuracy on average, when information about DNA variation is combined with other measurements of the same individual. Importantly, data from close relatives greatly aided prediction. This demonstrated that there are no fundamental limitations to accurate prediction, and we now ask whether the same holds for human health information.

The aim of this project is to predict elements of a person's electronic health record from all the other available data on that person, including DNA sequence and the phenotypes of closely related individuals. The methods will initially follow those of [1], starting with standard linear mixed models to combine information from the genome and other traits, and then expanding to random-forest-based methods for a more flexible model class. If desired, other approaches, such as deep neural networks, can also be tested. The project is a collaboration with the Estonian Genome Center (Geenivaramu) and its scientists.
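
To make the first modelling step concrete, here is a minimal sketch of kinship-based prediction under a linear mixed model (often called GBLUP), written with NumPy on simulated data. It is an illustration under stated assumptions, not the pipeline used in [1]: the function names, the simulated genotypes, and the fixed heritability `h2` are all hypothetical choices made for the example, and in practice the variance ratio would be estimated from the data (for example by REML) rather than assumed.

```python
import numpy as np


def standardize_genotypes(G):
    """Center and scale each variant column; constant columns become all zeros."""
    G = np.asarray(G, dtype=float)
    centered = G - G.mean(axis=0)
    sd = G.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for monomorphic variants
    return centered / sd


def kinship(Z):
    """Genomic relationship matrix K = Z Z^T / m from standardized genotypes Z."""
    return Z @ Z.T / Z.shape[1]


def gblup_predict(K, y_train, train_idx, test_idx, h2=0.5):
    """Predict test phenotypes under y = mu + g + e, g ~ N(0, s_g^2 K), e ~ N(0, s_e^2 I).

    h2 = s_g^2 / (s_g^2 + s_e^2) is ASSUMED known here; it is not estimated.
    """
    lam = (1.0 - h2) / h2  # ratio s_e^2 / s_g^2
    mu = y_train.mean()
    K_tt = K[np.ix_(train_idx, train_idx)]  # train-train relatedness
    K_st = K[np.ix_(test_idx, train_idx)]   # test-train relatedness
    alpha = np.linalg.solve(K_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + K_st @ alpha


def simulate(n=400, m=200, h2=0.8, seed=0):
    """Simulate standardized genotypes and an additive trait (toy data only)."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)       # allele frequencies per variant
    G = rng.binomial(2, freqs, size=(n, m))     # genotypes coded 0/1/2
    Z = standardize_genotypes(G)
    effects = rng.normal(0.0, np.sqrt(h2 / m), size=m)
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return Z, Z @ effects + noise


if __name__ == "__main__":
    Z, y = simulate()
    K = kinship(Z)
    train_idx = np.arange(320)
    test_idx = np.arange(320, 400)
    pred = gblup_predict(K, y[train_idx], train_idx, test_idx, h2=0.8)
    r = np.corrcoef(pred, y[test_idx])[0, 1]
    print(f"held-out correlation between predicted and observed trait: {r:.2f}")
```

Because the prediction for a new individual is a relatedness-weighted combination of training phenotypes, closely related individuals in the training set contribute most, which matches the observation above that relatives aid prediction. Other measured traits could enter as fixed-effect covariates, and a random forest could be compared on the same train/test split.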

References

[1] Kaspar Märtens, Johan Hallin, Jonas Warringer, Gianni Liti, Leopold Parts. "Predicting quantitative traits from genome and phenome with near perfect accuracy". Nature Communications, 2016. http://www.nature.com/ncomms/2016/160510/ncomms11512/full/ncomms11512.html

[2] "Eesti teadlased ennustavad pärilikke tunnuseid täpsemalt kui kunagi varem" (in Estonian; "Estonian scientists predict heritable traits more accurately than ever before"). Novaator, 2016. http://novaator.err.ee/v/tervis/66e8ceb2-c82c-441f-b272-b53be541c5e6/eesti-teadlased-ennustavad-parilikke-tunnuseid-tapsemalt-kui-kunagi-varem
Graduation Theses defence year
2016-2017
Supervisor
Leopold Parts
Spoken language(s)
Estonian, English
Requirements for candidates
This data science project is well suited to someone with experience in (or a desire to learn) machine learning or statistical modeling, and with basic data science skills: obtaining, cleaning, and visualising data. Knowledge of genomics is beneficial.
Level
Bachelor, Masters
Keywords
#genetics #medicine #health #statistics #machine_learning

Contact for applications

Name
Leopold Parts
Phone
E-mail
leopold.parts@ut.ee