Unsupervised context sensitive language acquisition from a large corpus

Zach Solan, David Horn, Eytan Ruppin, Shimon Edelman

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be "intermediate" for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
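The graph representation mentioned in the abstract can be illustrated with a minimal sketch. It covers only the idea of storing sentences as paths over word vertices and counting shared edges. It does not implement the recursive context-sensitive statistical inference that selects significant patterns, which the abstract does not specify. The names `build_graph` and `successors`, the boundary markers, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sentence-boundary vertices (not taken from the paper).
BEGIN, END = "<BEGIN>", "<END>"


def build_graph(sentences):
    """Represent each sentence as a path of word vertices.

    Returns the list of paths and a Counter mapping each directed
    edge (word, next_word) to the number of times paths traverse it.
    """
    paths = []
    edge_counts = Counter()
    for sentence in sentences:
        # Naive whitespace tokenization; a real system might split
        # words into smaller parts, as the abstract allows.
        path = [BEGIN] + sentence.lower().split() + [END]
        paths.append(path)
        edge_counts.update(zip(path, path[1:]))
    return paths, edge_counts


def successors(edge_counts, word):
    """Return {next_word: count} for all edges leaving `word`."""
    return {dst: n for (src, dst), n in edge_counts.items() if src == word}


# Toy corpus, invented for this example.
corpus = [
    "the cat is sleeping",
    "the dog is sleeping",
    "the cat is eating",
]
paths, edge_counts = build_graph(corpus)
```

In this toy graph, the paths for the three sentences overlap on edges such as `("the", "cat")` and `("is", "sleeping")`. Bundles of overlapping paths like these are the kind of raw material from which a pattern-acquisition procedure could propose candidate patterns and equivalence classes (here, "cat"/"dog" or "sleeping"/"eating"), although the criterion for accepting them is left to the paper itself.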

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 16 - Proceedings of the 2003 Conference, NIPS 2003
Publisher: Neural Information Processing Systems Foundation
ISBN (Print): 0262201526, 9780262201520
State: Published - 2004
Event: 17th Annual Conference on Neural Information Processing Systems, NIPS 2003 - Vancouver, BC, Canada
Duration: 8 Dec 2003 to 13 Dec 2003

Publication series

Name: Advances in Neural Information Processing Systems
ISSN (Print): 1049-5258

Conference

Conference: 17th Annual Conference on Neural Information Processing Systems, NIPS 2003
Country/Territory: Canada
City: Vancouver, BC
Period: 8/12/03 to 13/12/03
