A rigorous analysis of population stratification with limited data

Kamalika Chaudhuri, Eran Halperin, Satish Rao, Shuheng Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations


Finding the genetic factors of complex diseases such as cancer, currently a major effort of the international community, will potentially lead to better treatment of these diseases. One of the major difficulties in these studies is that the genetic components of an individual depend not only on the disease but also on the individual's ethnicity. Therefore, it is crucial to find methods that reduce the effects of population structure on these studies. This can be formalized as a clustering problem, where the individuals are clustered according to their genetic information. Mathematically, we consider the problem of clustering bit "feature" vectors, where each vector represents the genetic information of an individual. Our model assumes that this bit vector is generated according to a prior probability distribution specified by the individual's membership in a population. We present methods that can cluster the vectors while attempting to optimize the number of features required. The focus of the paper is not on the algorithms, but on showing that optimizing certain objective functions on the data yields the right clustering under the random generative model. In particular, we prove that some of the previous formulations for clustering are effective. We consider two different clustering approaches. The first approach forms a graph and then clusters the data using a connected components algorithm or a max cut algorithm. The second approach tries to estimate simultaneously the feature frequencies in each of the populations and the classification of vectors into populations. We show that using the first approach, Θ(log N/γ²) data (i.e., total number of features times number of vectors) is sufficient to find the correct classification, where N is the number of vectors of each population and γ is the average ℓ₂² distance between the feature probability vectors of the two populations. Using the second approach, we show that O(log N/α⁴) data is enough, where α is the average ℓ₁ distance between the populations. We also present polynomial time algorithms for the resulting max margin which, for now, needs only slightly more data than stated above. Our methods can also be used to give a simple combinatorial algorithm for finding a bisection in a random graph that matches Boppana's convex programming approach (and McSherry's spectral results).
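The generative model and the first (graph-based) approach can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not say how the graph is built, so the edge rule here (link two individuals when their Hamming distance falls below a threshold), the threshold value, the feature frequencies 0.1 and 0.9, and all function names are assumptions chosen only to make the idea concrete.

```python
import random

def sample_population(freqs, n, rng):
    """Draw n bit vectors; bit j is 1 with probability freqs[j]."""
    return [[1 if rng.random() < p else 0 for p in freqs] for _ in range(n)]

def hamming(u, v):
    """Number of positions where two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def connected_components(vectors, threshold):
    """Label each vector by its connected component in the graph that
    links pairs with Hamming distance below `threshold` (assumed rule)."""
    n = len(vectors)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search from `start`
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and hamming(vectors[i], vectors[j]) < threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

rng = random.Random(0)
K, N = 200, 10  # K features, N individuals per population (illustrative values)
pop_a = sample_population([0.1] * K, N, rng)  # hypothetical feature frequencies
pop_b = sample_population([0.9] * K, N, rng)
labels = connected_components(pop_a + pop_b, threshold=K // 2)
print(labels)
```

With frequencies this far apart, within-population distances concentrate near 36 and between-population distances near 164, so a threshold of 100 separates the two groups. The regime the paper analyzes is the harder one, where the populations are close and the question is how little data suffices.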

Original language: English
Title of host publication: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007
Publisher: Association for Computing Machinery
Number of pages: 10
ISBN (Electronic): 9780898716245
State: Published - 2007
Externally published: Yes
Event: 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007 - New Orleans, United States
Duration: 7 Jan 2007 → 9 Jan 2007

Publication series

Name: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms


Conference: 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007
Country/Territory: United States
City: New Orleans


Funders                        Funder number
Army Research Office           CNF-0435382, DAAD19-02-1-0389
National Science Foundation    IIS-0513599, EF-0331494, CCF-0105304

