TY - JOUR
T1 - A supervised approach for identifying discriminating genotype patterns and its application to breast cancer data
AU - Yosef, Nir
AU - Yakhini, Zohar
AU - Tsalenko, Anya
AU - Kristensen, Vessela
AU - Børresen-Dale, Anne Lise
AU - Ruppin, Eytan
AU - Sharan, Roded
N1 - Funding Information: The authors thank Oded Regev for proving the hardness of DP, Dani Yekutieli for his help with the FDR analysis and Shai Kaplan for help with the excess hypergeometric score. N.Y. is supported by the Tel-Aviv University president and rector scholarship. E.R. is supported by grants from the Yeshaya Horowitz Complexity Center, from the Tauber Foundation and the Israeli Science Foundation. R.S. is supported by an Alon Fellowship.
PY - 2007
Y1 - 2007
N2 - Motivation: Large-scale association studies, investigating the genetic determinants of a phenotype of interest, are producing increasing amounts of genomic variation data on human cohorts. A fundamental challenge in these studies is the detection of genotypic patterns that discriminate individuals exhibiting the phenotype under study from individuals that do not possess it. The difficulty stems from the large number of single nucleotide polymorphism (SNP) combinations that have to be tested. The discrimination problem becomes even more involved when additional high-throughput data, such as gene expression data, are available for the same cohort. Results: We have developed a graph-theoretic approach for identifying discriminating patterns (DPs) for a given phenotype in a genotyped population. The method is based on representing the SNP data as a bipartite graph of individuals and their SNP states, and identifying fully connected subgraphs of this graph that relate individuals enriched for a given phenotypic group. The method can handle additional data types such as expression profiles of the genotyped population. It is reminiscent of biclustering approaches, with the crucial difference that its search process is guided by the phenotype under consideration in a supervised manner. We tested our approach in simulations and on real data. In simulations, our method was able to retrieve planted patterns with a high success rate. We then applied our approach to a dataset of 72 breast cancer patients with available gene expression profiles, genotyped over 695 SNPs. We detected several DPs that were highly significant with respect to various clinical phenotypes, and investigated the groups of patients and the groups of genes they defined. We found the patient groups to be highly enriched for other phenotypes and to display expression coherency among their profiles. The gene groups displayed functional coherency and involved genes with known roles in cancer, providing additional support to their involvement.
AB - Motivation: Large-scale association studies, investigating the genetic determinants of a phenotype of interest, are producing increasing amounts of genomic variation data on human cohorts. A fundamental challenge in these studies is the detection of genotypic patterns that discriminate individuals exhibiting the phenotype under study from individuals that do not possess it. The difficulty stems from the large number of single nucleotide polymorphism (SNP) combinations that have to be tested. The discrimination problem becomes even more involved when additional high-throughput data, such as gene expression data, are available for the same cohort. Results: We have developed a graph-theoretic approach for identifying discriminating patterns (DPs) for a given phenotype in a genotyped population. The method is based on representing the SNP data as a bipartite graph of individuals and their SNP states, and identifying fully connected subgraphs of this graph that relate individuals enriched for a given phenotypic group. The method can handle additional data types such as expression profiles of the genotyped population. It is reminiscent of biclustering approaches, with the crucial difference that its search process is guided by the phenotype under consideration in a supervised manner. We tested our approach in simulations and on real data. In simulations, our method was able to retrieve planted patterns with a high success rate. We then applied our approach to a dataset of 72 breast cancer patients with available gene expression profiles, genotyped over 695 SNPs. We detected several DPs that were highly significant with respect to various clinical phenotypes, and investigated the groups of patients and the groups of genes they defined. We found the patient groups to be highly enriched for other phenotypes and to display expression coherency among their profiles. The gene groups displayed functional coherency and involved genes with known roles in cancer, providing additional support to their involvement.
UR - http://www.scopus.com/inward/record.url?scp=33846699305&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btl298
DO - 10.1093/bioinformatics/btl298
M3 - Article
AN - SCOPUS:33846699305
SN - 1367-4803
VL - 23
SP - e91-e98
JO - Bioinformatics
JF - Bioinformatics
IS - 2
ER -