Re-using public RNA-Seq data

Name
Tõnis Tasa
Abstract
Next Generation Sequencing (NGS) methods are rapidly becoming the most popular paradigm for exploring genomic data. RNA-Seq is an NGS method that enables gene expression analyses. Researchers actively submit raw sequencing data to public databases as part of the requirements for publishing in academic journals. Raw sequencing data is large and the analysis of each experiment is time-consuming, so published raw files are currently rarely re-used. Re-analysis of uploaded data is further complicated by careless descriptions of the experimental set-up and by the lack of clear standards for the analysis process. Publicly available analysis results have been obtained using varying sets of tools and parameters. Algorithmic differences between tools introduce biases, which greatly decrease the comparability of results between experiments. This is due to the lack of gold-standard analysis procedures. Comprehensive collections of expression data have to account for computational expense and time limits. Setting up such a collection therefore requires an effective pipeline implementation with automatic parameter estimation, a defined subset of tools and a robust handling mechanism to ensure minimal required user input. Aggregating expression data from individual experiments with varying experimental conditions creates many new opportunities for data mining. Pattern discovery over larger collections generalises local tendencies. One such analysis sub-field is assessing gene co-expression over a broader set of experiments. In this thesis, we have designed and implemented a framework for performing large-scale analysis of publicly available RNA-Seq experiments. No separate configuration file for analysis is required; instead, a pre-built database is employed. User intervention is minimal and the process is self-guiding. All parameters within the analysis process are determined automatically.
This enables unsupervised sequential analysis of numerous experiments. Analysed datasets can be used as input for the co-expression analysis tool MEM, which was developed by the BIIT research group and was originally designed for public microarray data. RNA-Seq data adds a new application field for the tool. Besides co-expression analysis with MEM, the data can also be used in other downstream analysis applications.
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Priit Adler
Defence year
2015
 