TY - JOUR
T1 - CSN: unsupervised approach for inferring biological networks based on the genome alone
AU - Galili, Maya
AU - Tuller, Tamir
PY - 2020/5/15
Y1 - 2020/5/15
N2 - BACKGROUND: Most organisms cannot be cultivated, as they live in unique ecological conditions that cannot be mimicked in the lab. Understanding the functionality of those organisms' genes and their interactions by performing large-scale measurements of transcription levels, protein-protein interactions or metabolism, is extremely difficult and, in some cases, impossible. Thus, efficient algorithms for deciphering genome functionality based only on the genomic sequences with no other experimental measurements are needed. RESULTS: In this study, we describe a novel algorithm that infers gene networks that we name Common Substring Network (CSN). The algorithm enables inferring novel regulatory relations among genes based only on the genomic sequence of a given organism and partial homolog/ortholog-based functional annotation. It can specifically infer the functional annotation of genes with unknown homology. This approach is based on the assumption that related genes, not necessarily homologs, tend to share sub-sequences, which may be related to common regulatory mechanisms, similar functionality of encoded proteins, common evolutionary history, and more. We demonstrate that CSNs, which are based on S. cerevisiae and E. coli genomes, have properties similar to 'traditional' biological networks inferred from experiments. Highly expressed genes tend to have higher degree nodes in the CSN, genes with similar protein functionality tend to be closer, and the CSN graph exhibits a power-law degree distribution. Also, we show how the CSN can be used for predicting gene interactions and functions. CONCLUSIONS: The reported results suggest that 'silent' code inside the transcript can help to predict central features of biological networks and gene function. This approach can help researchers to understand the genome of novel microorganisms, analyze metagenomic data, and can help to decipher new gene functions. AVAILABILITY: Our MATLAB implementation of CSN is available at https://www.cs.tau.ac.il/~tamirtul/CSN-Autogen.
KW - Biological networks
KW - E. coli
KW - Gene expression
KW - Gene function annotation
KW - S. cerevisiae
KW - Transcripts comparison
UR - http://www.scopus.com/inward/record.url?scp=85084786838&partnerID=8YFLogxK
U2 - 10.1186/s12859-020-3479-9
DO - 10.1186/s12859-020-3479-9
M3 - Article
C2 - 32414319
AN - SCOPUS:85084786838
SN - 1471-2105
VL - 21
SP - 190
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
ER -