TY - JOUR
T1 - Scalable probabilistic PCA for large-scale genetic variation data
AU - Agrawal, Aman
AU - Chiu, Alec M.
AU - Le, Minh
AU - Halperin, Eran
AU - Sankararaman, Sriram
N1 - Publisher Copyright: © 2020 Agrawal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2020/5
Y1 - 2020/5
N2 - Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
UR - http://www.scopus.com/inward/record.url?scp=85086346261&partnerID=8YFLogxK
U2 - 10.1371/journal.pgen.1008773
DO - 10.1371/journal.pgen.1008773
M3 - Article
C2 - 32469896
AN - SCOPUS:85086346261
SN - 1553-7390
VL - 16
JO - PLoS Genetics
JF - PLoS Genetics
IS - 5
M1 - e1008773
ER -