|
Unsupervised context sensitive language acquisition from a large corpus
David Horn
Abstract
A central tenet of generative linguistics is that extensive innate knowledge of grammar is essential to explain the acquisition of language from positive-only data. Here, we explore an alternative hypothesis, according to which syntax is an abstraction that emerges from exposure to language. The incremental acquisition of patterns is driven both by structural similarities and by statistical information inherent in the data, so that frequent strings of similar composition come to be represented by the same pattern. Our algorithm, ADIOS, represents a corpus by a graph whose nodes are words. Sentences appear as paths, or strings, within the graph. Significant patterns (SPs) are extracted from the data and assigned the role of new nodes, thus restructuring the graph. In the process we uncover the existence of equivalence classes (ECs). The syntactic content of the corpus is represented by trees of SPs and ECs. Our model allows for structure-sensitive generalization in the production and the assimilation of unseen examples. We demonstrate good generalization ability of this algorithm on both an artificial context-free grammar and a natural language corpus.
|
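The graph representation described in the abstract can be illustrated with a minimal Python sketch. It is not the ADIOS implementation: the abstract does not specify the criterion used to judge a pattern significant, so the sketch only builds the word graph and proposes candidate equivalence classes with a deliberately naive rule (words that share the same left and right neighbour). The boundary markers `<begin>` and `<end>`, the function names, and the toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sentence-boundary markers (not taken from the source).
BEGIN, END = "<begin>", "<end>"


def build_graph(sentences):
    """Map each directed edge (word, next_word) to the set of
    sentence indices (paths) that traverse it."""
    edges = defaultdict(set)
    for idx, sentence in enumerate(sentences):
        path = [BEGIN] + sentence.split() + [END]
        for a, b in zip(path, path[1:]):
            edges[(a, b)].add(idx)
    return edges


def candidate_equivalence_classes(sentences):
    """Group words that fill the same (left, right) slot.

    This is a naive stand-in for the equivalence classes mentioned
    in the abstract, not the paper's actual procedure.
    """
    slots = defaultdict(set)
    for sentence in sentences:
        path = [BEGIN] + sentence.split() + [END]
        for left, word, right in zip(path, path[1:], path[2:]):
            slots[(left, right)].add(word)
    # Keep only slots filled by more than one distinct word.
    return {slot: words for slot, words in slots.items() if len(words) > 1}


# Toy corpus, invented for this example.
corpus = ["the cat sleeps", "the dog sleeps", "the cat eats"]
edges = build_graph(corpus)
classes = candidate_equivalence_classes(corpus)
```

In this toy corpus, sentences 0 and 2 share the edge from "the" to "cat", and "cat" and "dog" both fill the slot between "the" and "sleeps", so they form one candidate class. A faithful implementation would replace the shared-slot rule with the statistical significance test the paper defines and would add each accepted pattern to the graph as a new node.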