Classification of Human Y Chromosome Haplogroups Based on Dense and Sparse Genetic Data Using Machine Learning Approaches

Name
Jose Rodrigo Flores Espinosa
Abstract
Genetic data from human Y chromosomes are classified into haplogroup categories based on the underlying phylogenetic tree, where a haplogroup represents a monophyletic clade of the tree. Current methods for assigning these categories represent a known human Y chromosome phylogeny as a tree data structure. To assign a haplogroup to an individual Y chromosome using this representation, strategies based on breadth-first search (BFS) are often used. The tree is traversed so that paths supported by observed mutations are explored further, eventually leading to a leaf node and a final classification. This strategy is highly efficient when dense genotyping or sequencing data are available. However, with lower-density genetic data such as genotyping arrays or ancient DNA data, BFS-based strategies often fail to reach a leaf node because too little information is available to decide which branch to follow next.
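
The traversal described above can be illustrated with a minimal sketch. The toy tree, the marker names (m1 to m6) and the function assign_haplogroup are hypothetical and chosen only for illustration; they are not taken from the thesis or from any published tool.

```python
from collections import deque

# Hypothetical toy phylogeny: each node has one defining marker and a list of children.
TREE = {
    "ROOT": {"marker": None, "children": ["R", "Q"]},
    "R": {"marker": "m1", "children": ["R1", "R2"]},
    "Q": {"marker": "m5", "children": []},
    "R1": {"marker": "m2", "children": ["R1a", "R1b"]},
    "R2": {"marker": "m6", "children": []},
    "R1a": {"marker": "m3", "children": []},
    "R1b": {"marker": "m4", "children": []},
}


def assign_haplogroup(tree, root, derived):
    """Breadth-first traversal that only follows branches whose defining
    marker was observed in the derived state for this sample.
    Returns the deepest node reached and whether it is a leaf."""
    best, best_depth = root, 0
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth > best_depth:
            best, best_depth = node, depth
        for child in tree[node]["children"]:
            if tree[child]["marker"] in derived:
                queue.append((child, depth + 1))
    return best, not tree[best]["children"]


# Dense data: every marker on the path is observed, so a leaf is reached.
dense_result = assign_haplogroup(TREE, "ROOT", {"m1", "m2", "m3"})
# Sparse data: m2 was not genotyped, so the traversal stalls at an internal node.
sparse_result = assign_haplogroup(TREE, "ROOT", {"m1", "m3"})
print(dense_result, sparse_result)
```

In the sparse case the evidence for m3 is never used, because the intermediate branch carrying m2 cannot be entered. This is the kind of stalled traversal the paragraph above refers to.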

In this work we leverage the increasing availability of worldwide panels of Y chromosome data with curated haplogroup categories. We present a novel method that applies a K-nearest neighbors (KNN) classifier to both low-density and high-density data. The main goal is to assess the extent to which this approach can be useful in the challenging cases where BFS-based methods fail to produce a tractable and meaningful result. To achieve this, we employed different DNA sequence encodings together with dimensionality reduction techniques. We also investigated a novel method of DNA representation using Word2vec contextual embeddings, in which DNA snippets are represented as text words and the whole DNA sequence as a text sentence. Encoding the DNA sequences in this manner gives rich contextual information that helps in haplogroup classification and can be extended to other applications in genomics.
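
As a rough illustration of these two ideas, the sketch below splits a sequence into overlapping snippets ("words") and runs a plain KNN majority vote. The snippet length, the use of overlapping windows, the Hamming distance that skips missing calls, the value of k and all sample data are assumptions made for this example; the thesis's actual encodings, dimensionality reduction and Word2vec training are not reproduced here.

```python
from collections import Counter


def to_words(sequence, k=3):
    """Split a DNA string into overlapping k-mers, so the sequence can be
    treated as a sentence of words (e.g. as input to an embedding model)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def hamming(a, b):
    """Count mismatching positions, ignoring positions with a missing call ('N')."""
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))


def knn_predict(train, query, k=3):
    """train is a list of (sequence, label) pairs.
    Returns the majority label among the k nearest training sequences."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical reference panel with curated labels.
panel = [
    ("ACGTAC", "R1a"),
    ("ACGTAA", "R1a"),
    ("ACGTAT", "R1a"),
    ("TCGTTC", "Q1"),
    ("TCGTTG", "Q1"),
]

words = to_words("ACGTA")                 # sentence of 3-mers
predicted = knn_predict(panel, "ACNTAC")  # query with one missing call
print(words, predicted)
```

Unlike a tree traversal, the vote uses every position that is available in the query, so a missing call reduces the evidence rather than blocking the classification.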

The results show that classification accuracy is high (>98%) for next-generation sequencing (NGS) and genotyping array data, the high-density and lower-density data classes respectively. Performance, however, is low (<60% on average) when classifying ancient DNA data, which have the lowest level of resolution and higher levels of error. We observe that in many of the challenging cases KNN fails to predict the label at its finest degree of resolution but does classify correctly at the main category level, which can be useful in practice.
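
The difference between the two levels of resolution can be made concrete with a small sketch. It assumes, for illustration only, that the main category is the leading letter of the haplogroup label; the labels and predictions below are invented and do not correspond to results from the thesis.

```python
def main_category(label):
    """Assumed convention: the main category is the first letter of the label."""
    return label[:1]


def accuracy(true_labels, predicted_labels, level=lambda label: label):
    """Fraction of samples whose labels agree after mapping both through `level`."""
    hits = sum(level(t) == level(p) for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Invented example labels.
truth = ["R1a1", "R1b1", "Q1a", "I2a"]
preds = ["R1a2", "R1b1", "Q1b", "J2a"]

fine = accuracy(truth, preds)                   # exact label match
coarse = accuracy(truth, preds, main_category)  # match at main category level
print(fine, coarse)
```

Here only one of four predictions matches exactly, while three of four agree at the main category level.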
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Dr. Kallol Roy, Dra. Monika Karmin
Defence year
2022
 