TY - GEN
T1 - An efficient and accurate graph-based approach to detect population substructure
AU - Sridhar, Srinath
AU - Rao, Satish
AU - Halperin, Eran
PY - 2007
Y1 - 2007
N2 - Currently, large-scale projects are underway to perform whole genome disease association studies. Such studies involve the genotyping of hundreds of thousands of SNP markers. One of the main obstacles in performing such studies is that the underlying population substructure could artificially inflate the p-values, thereby generating a lot of false positives. Although existing tools cope well with very distinct sub-populations, closely related population groups remain a major cause of concern. In this work, we present a graph-based approach to detect population substructure. Our method is based on a distance measure between individuals. We show analytically that when the allele frequency differences between the two populations are large enough (in the l2-norm sense), our algorithm is guaranteed to find the correct classification of individuals to sub-populations. We demonstrate the empirical performance of our algorithms on simulated and real data and compare it against existing methods, namely the widely used software method STRUCTURE and the recent method EIGENSTRAT. Our new technique is highly efficient (in particular it is hundreds of times faster than STRUCTURE), and overall it is more accurate than the two other methods in classifying individuals into sub-populations. We demonstrate empirically that unlike the other two methods, the accuracy of our algorithm consistently increases with the number of SNPs genotyped. Finally, we demonstrate that the efficiency of our method can be used to assess the significance of the resulting clusters. Surprisingly, we find that the different methods find population sub-structure in each of the homogeneous populations of the HapMap project. We use our significance score to demonstrate that these substructures are probably due to over-fitting.
AB - Currently, large-scale projects are underway to perform whole genome disease association studies. Such studies involve the genotyping of hundreds of thousands of SNP markers. One of the main obstacles in performing such studies is that the underlying population substructure could artificially inflate the p-values, thereby generating a lot of false positives. Although existing tools cope well with very distinct sub-populations, closely related population groups remain a major cause of concern. In this work, we present a graph-based approach to detect population substructure. Our method is based on a distance measure between individuals. We show analytically that when the allele frequency differences between the two populations are large enough (in the l2-norm sense), our algorithm is guaranteed to find the correct classification of individuals to sub-populations. We demonstrate the empirical performance of our algorithms on simulated and real data and compare it against existing methods, namely the widely used software method STRUCTURE and the recent method EIGENSTRAT. Our new technique is highly efficient (in particular it is hundreds of times faster than STRUCTURE), and overall it is more accurate than the two other methods in classifying individuals into sub-populations. We demonstrate empirically that unlike the other two methods, the accuracy of our algorithm consistently increases with the number of SNPs genotyped. Finally, we demonstrate that the efficiency of our method can be used to assess the significance of the resulting clusters. Surprisingly, we find that the different methods find population sub-structure in each of the homogeneous populations of the HapMap project. We use our significance score to demonstrate that these substructures are probably due to over-fitting.
UR - http://www.scopus.com/inward/record.url?scp=34547493338&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-71681-5_35
DO - 10.1007/978-3-540-71681-5_35
M3 - Conference contribution
AN - SCOPUS:34547493338
SN - 3540716807
SN - 9783540716808
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 503
EP - 517
BT - Research in Computational Molecular Biology - 11th Annual International Conference, RECOMB 2007, Proceedings
PB - Springer Verlag
T2 - 11th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2007
Y2 - 21 April 2007 through 25 April 2007
ER -