TY - GEN
T1 - Accurate estimation of expression levels of homologous genes in RNA-seq experiments
AU - Paşaniuc, Bogdan
AU - Zaitlen, Noah
AU - Halperin, Eran
PY - 2010
Y1 - 2010
N2 - Next generation high throughput sequencing (NGS) is poised to replace array based technologies as the experiment of choice for measuring RNA expression levels. Several groups have demonstrated the power of this new approach (RNA-seq), making significant and novel contributions and simultaneously proposing methodologies for the analysis of RNA-seq data. In a typical experiment, millions of short sequences (reads) are sampled from RNA extracts and mapped back to a reference genome. The number of reads mapping to each gene is used as proxy for its corresponding RNA concentration. A significant challenge in analyzing RNA expression of homologous genes is the large fraction of the reads that map to multiple locations in the reference genome. Currently, these reads are either dropped from the analysis, or a näive algorithm is used to estimate their underlying distribution. In this work, we present a rigorous alternative for handling the reads generated in an RNA-seq experiment within a probabilistic model for RNA-seq data; we develop maximum likelihood based methods for estimating the model parameters. In contrast to previous methods, our model takes into account the fact that the DNA of the sequenced individual is not a perfect copy of the reference sequence. We show with both simulated and real RNA-seq data that our new method improves the accuracy and power of RNA-seq experiments.
AB - Next generation high throughput sequencing (NGS) is poised to replace array based technologies as the experiment of choice for measuring RNA expression levels. Several groups have demonstrated the power of this new approach (RNA-seq), making significant and novel contributions and simultaneously proposing methodologies for the analysis of RNA-seq data. In a typical experiment, millions of short sequences (reads) are sampled from RNA extracts and mapped back to a reference genome. The number of reads mapping to each gene is used as proxy for its corresponding RNA concentration. A significant challenge in analyzing RNA expression of homologous genes is the large fraction of the reads that map to multiple locations in the reference genome. Currently, these reads are either dropped from the analysis, or a näive algorithm is used to estimate their underlying distribution. In this work, we present a rigorous alternative for handling the reads generated in an RNA-seq experiment within a probabilistic model for RNA-seq data; we develop maximum likelihood based methods for estimating the model parameters. In contrast to previous methods, our model takes into account the fact that the DNA of the sequenced individual is not a perfect copy of the reference sequence. We show with both simulated and real RNA-seq data that our new method improves the accuracy and power of RNA-seq experiments.
UR - http://www.scopus.com/inward/record.url?scp=78650288768&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-12683-3_26
DO - 10.1007/978-3-642-12683-3_26
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:78650288768
SN - 3642126820
SN - 9783642126826
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 397
EP - 409
BT - Research in Computational Molecular Biology - 14th Annual International Conference, RECOMB 2010, Proceedings
T2 - 14th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2010
Y2 - 25 April 2010 through 28 April 2010
ER -