Scalable probabilistic PCA for large-scale genetic variation data

Aman Agrawal, Alec M. Chiu, Minh Le, Eran Halperin, Sriram Sankararaman*
*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

27 Scopus citations

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
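The abstract does not spell out the algorithm, so the following is only a rough illustration of how a probabilistic formulation of PCA can recover the top PCs without ever forming a full covariance matrix. It is a minimal sketch of the generic textbook EM iteration for PCA in the zero-noise limit, not ProPCA's implementation. The function name `em_pca`, the features-by-samples orientation of the input, and the fixed iteration count are all assumptions made for this example.

```python
import numpy as np

def em_pca(Y, k, n_iter=100, seed=0):
    """Generic EM iteration for PCA (zero-noise limit); illustrative only.

    Y is assumed to be (n_features, n_samples) with each row mean-centered.
    Returns (components, variances): an orthonormal (n_features, k) basis
    ordered by decreasing variance, and the variance along each component.
    """
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    C = rng.standard_normal((d, k))  # random initial loading matrix
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: loadings C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # C spans the top-k subspace but is not orthonormal or ordered,
    # so orthonormalize it and run a small k-dimensional SVD inside it.
    Q, _ = np.linalg.qr(C)
    U, s, _ = np.linalg.svd(Q.T @ Y, full_matrices=False)
    components = Q @ U
    variances = s**2 / (n - 1)
    return components, variances
```

Each iteration costs only a few products of the data matrix with a thin `k`-column matrix, which is what makes this family of methods attractive when the number of individuals and SNPs is large.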

Original language: English
Article number: e1008773
Journal: PLoS Genetics
Volume: 16
Issue number: 5
State: Published - May 2020
Externally published: Yes

Funding

Funder: Funder number
National Institutes of Health
Alfred P. Sloan Foundation
National Human Genome Research Institute: T32HG002536
National Institute on Minority Health and Health Disparities: R56MD013312
National Science Foundation: DGE-1829071, 1829071, III-1705121
National Institute of General Medical Sciences: R35GM125055, R00GM111744, R25GM112625
National Institute of Mental Health: R01MH122569, R01MH115979
National Center for Advancing Translational Sciences: UL1TR001881
Okawa Foundation: 33127
