TY - JOUR
T1 - EmbedGEM
T2 - a framework to evaluate the utility of embeddings for genetic discovery
AU - Mukherjee, Sumit
AU - Mccaw, Zachary R.
AU - Pei, Jingwen
AU - Merkoulovitch, Anna
AU - Soare, Tom
AU - Tandon, Raghav
AU - Amar, David
AU - Somineni, Hari
AU - Klein, Christoph
AU - Satapati, Santhosh
AU - Lloyd, David
AU - Probert, Christopher
AU - Koller, Daphne
AU - O'dushlaine, Colm
AU - Karaletsos, Theofanis
N1 - Publisher Copyright:
© 2024 The Author(s).
PY - 2024
Y1 - 2024
N2 - Summary: Machine learning-derived embeddings are a compressed representation of high content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear if genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean χ2 statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) a real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
AB - Summary: Machine learning-derived embeddings are a compressed representation of high content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear if genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean χ2 statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) a real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
UR - http://www.scopus.com/inward/record.url?scp=85206323191&partnerID=8YFLogxK
U2 - 10.1093/bioadv/vbae135
DO - 10.1093/bioadv/vbae135
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
C2 - 39664859
AN - SCOPUS:85206323191
SN - 2635-0041
VL - 4
JO - Bioinformatics Advances
JF - Bioinformatics Advances
IS - 1
M1 - vbae135
ER -