TY - JOUR

T1 - Linear separability of gene expression data sets

AU - Unger, Giora

AU - Chor, Benny

N1 - Funding Information:
The authors are happy to thank many friends and colleagues for helpful discussions and advice in the course of working on this project: Noga Alon, Eleazar Eskin, Nir Friedman, Irit Gat-Viks, Danny Halperin, Naftali Kaminski, Isaaco Meilijson, Amnon Peled, Tal Pupko, Amos Tanay, and Zohar Yakhini. They would also like to thank Koby Crammer for letting them use his SVM package and the anonymous referees for their useful suggestions. This research was supported by the Israel Science Foundation under Grant 418/00.

PY - 2010

Y1 - 2010

N2 - We study simple geometric properties of gene expression data sets, where samples are taken from two distinct classes (e.g., two types of cancer). Specifically, the problem of linear separability for pairs of genes is investigated. If a pair of genes exhibits linear separation with respect to the two classes, then the joint expression level of the two genes is strongly correlated with the phenomenon of the sample being taken from one class or the other. This may indicate an underlying molecular mechanism relating the two genes and the phenomenon (e.g., a specific cancer). We developed and implemented novel efficient algorithmic tools for finding all pairs of genes that induce a linear separation of the two sample classes. These tools are based on computational geometric properties and were applied to 10 publicly available cancer data sets. For each data set, we computed the number of actual separating pairs and compared it to an upper bound on the number expected by chance and to the numbers obtained empirically by randomly shuffling the labels of the data. Seven out of these 10 data sets are highly separable. This phenomenon is statistically highly significant: it is very unlikely to occur at random. It is therefore reasonable to expect that it manifests a functional association between separating genes and the underlying phenotypic classes.

AB - We study simple geometric properties of gene expression data sets, where samples are taken from two distinct classes (e.g., two types of cancer). Specifically, the problem of linear separability for pairs of genes is investigated. If a pair of genes exhibits linear separation with respect to the two classes, then the joint expression level of the two genes is strongly correlated with the phenomenon of the sample being taken from one class or the other. This may indicate an underlying molecular mechanism relating the two genes and the phenomenon (e.g., a specific cancer). We developed and implemented novel efficient algorithmic tools for finding all pairs of genes that induce a linear separation of the two sample classes. These tools are based on computational geometric properties and were applied to 10 publicly available cancer data sets. For each data set, we computed the number of actual separating pairs and compared it to an upper bound on the number expected by chance and to the numbers obtained empirically by randomly shuffling the labels of the data. Seven out of these 10 data sets are highly separable. This phenomenon is statistically highly significant: it is very unlikely to occur at random. It is therefore reasonable to expect that it manifests a functional association between separating genes and the underlying phenotypic classes.

KW - Bioinformatics (genome or protein) databases

KW - Biology and genetics

KW - DNA microarrays

KW - Data mining

KW - Diagnosis

KW - Gene expression analysis

KW - Geometrical problems and computations

KW - Heuristic methods

KW - Information filtering

KW - Linear separation

UR - http://www.scopus.com/inward/record.url?scp=77952132845&partnerID=8YFLogxK

U2 - 10.1109/TCBB.2008.90

DO - 10.1109/TCBB.2008.90

M3 - Article

AN - SCOPUS:77952132845

SN - 1545-5963

VL - 7

SP - 375

EP - 381

JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

IS - 2

M1 - 4604654

ER -