TY - JOUR

T1 - Large scale sequencing by hybridization

AU - Shamir, Ron

AU - Tsur, Dekel

PY - 2002

Y1 - 2002

N2 - Sequencing by hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all 4^9 9-mers and assuming no hybridization errors, only about 400-bases-long sequences can be reconstructed unambiguously. Drmanac et al. (1989) suggested sequencing long DNA targets by obtaining spectra of many short overlapping fragments of the target, inferring their relative positions along the target, and then computing spectra of subfragments that are short enough to be uniquely recoverable. Drmanac et al. do not treat the realistic case of errors in the hybridization process. In this paper, we study the effect of such errors. We show that the probability of ambiguous reconstruction in the presence of (false negative) errors is close to the probability in the errorless case. More precisely, the ratio between these probabilities is 1 + O(p/(1 - p)^4 · 1/d), where d is the average length of subfragments and p is the probability of a false negative. We also obtain lower and upper bounds for the probability of unambiguous reconstruction based on an errorless spectrum. For realistic chip sizes, these bounds are tighter than those given by Arratia et al. (1996). Finally, we report results on simulations with real DNA sequences, showing that even in the presence of 50% false negative errors, a target of cosmid length can be recovered with less than 0.1% miscalled bases.

AB - Sequencing by hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all 4^9 9-mers and assuming no hybridization errors, only about 400-bases-long sequences can be reconstructed unambiguously. Drmanac et al. (1989) suggested sequencing long DNA targets by obtaining spectra of many short overlapping fragments of the target, inferring their relative positions along the target, and then computing spectra of subfragments that are short enough to be uniquely recoverable. Drmanac et al. do not treat the realistic case of errors in the hybridization process. In this paper, we study the effect of such errors. We show that the probability of ambiguous reconstruction in the presence of (false negative) errors is close to the probability in the errorless case. More precisely, the ratio between these probabilities is 1 + O(p/(1 - p)^4 · 1/d), where d is the average length of subfragments and p is the probability of a false negative. We also obtain lower and upper bounds for the probability of unambiguous reconstruction based on an errorless spectrum. For realistic chip sizes, these bounds are tighter than those given by Arratia et al. (1996). Finally, we report results on simulations with real DNA sequences, showing that even in the presence of 50% false negative errors, a target of cosmid length can be recovered with less than 0.1% miscalled bases.

KW - DNA sequencing with errors

KW - Sequencing by hybridization

UR - http://www.scopus.com/inward/record.url?scp=0036100836&partnerID=8YFLogxK

U2 - 10.1089/10665270252935548

DO - 10.1089/10665270252935548

M3 - Article

AN - SCOPUS:0036100836

VL - 9

SP - 413

EP - 428

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 2

ER -