TY - JOUR
T1 - Unsupervised learning of natural languages
AU - Solan, Zach
AU - Horn, David
AU - Ruppin, Eytan
AU - Edelman, Shimon
PY - 2005/8/16
Y1 - 2005/8/16
N2 - We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
KW - Computational linguistics
KW - Grammar induction
KW - Language acquisition
KW - Machine learning
KW - Protein classification
UR - http://www.scopus.com/inward/record.url?scp=23844541694&partnerID=8YFLogxK
U2 - 10.1073/pnas.0409746102
DO - 10.1073/pnas.0409746102
M3 - Article
AN - SCOPUS:23844541694
SN - 0027-8424
VL - 102
SP - 11629
EP - 11634
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
IS - 33
ER -