SEL292: COMPUTATIONAL LINGUISTICS

Course Manual


INTRODUCTION

All the sciences share a tool which allows them to pursue their research and present their results with precision, to frame their theories, and to communicate with researchers in related disciplines: mathematics. Between the 1930s and the the mid-1950s a mathematical theory of language was developed, and this has since then evolved along three main lines:

1. Generative grammar, whose aim is the statement of explanatory theories of natural language. Chomsky's Transformational Grammar was an early attempt at this; more recent examples are Generalised Phrase Structure Grammar, Lexical-Functional Grammar, Government & Binding, and Minimalism.

2. Theory of computation, whose aim is to state a theory of computation in terms of language processing devices. This line of development is the scientific and engineering basis of all current computer technology.

3. Natural language processing, whose aim is to develop computer-based systems for applications involving speech and text forms of natural language. These applications range from word processing through statistical text analysis to artificial intelligence systems that interact with humans using spoken or written language.

Theoretical linguistics is concerned with (1), and computational linguistics with (2) and (3). We shall consequently be looking at (2) and (3). The structure of the discussion is as follows:

1. Mathematical characterization of language

2. The nature of computation

3. Formal language and automata theory

  • Phrase structure grammars

  • Automata

  • Finite state automata for formal language definition

  • Limitations of finite state automata

  • Pushdown automata

  • Parsing

  • Natural language parsing with pushdown automata


Lecture 1

MATHEMATICAL CHARACTERIZATION OF LANGUAGE

1. Sets

A set is any collection of objects --of trains, teeth, policemen, numbers. Indeed, so general is the concept that the members of the set do not even have to have any obvious connection with one another: ships and snails and sealing wax, cabbages and kings. Intuitively, the notion of a set is simple and straightforward; it is also the foundation on which mathematics in general, and the theory of formal languages and automata in particular, are built. A well developed set theory exists. For present purposes, however, only the following observations need to be made:

a) Sets are given labels for ease of reference in discussion. By convention, these labels are uppercase letters: A, B, C...

b) There are two ways to define a set, that is, to specify which objects are included in a set:

i. The objects belonging to a set may be listed. To make it clear that such a list is meant to constitute a set, it is enclosed in curly brackets. For example, the days of the week may be specified as:

{Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}

ii. Some criterion or criteria may be specified which can be used to determine which objects belong to set. This is the only practical course when the number of objects in a set is very large. It would be effectively impossible to list all the green objects in the world, for example, but one could easily state a rule for set membership: that any green object belongs to the set. Then, for any object, it would be possible to decide whether or not it belonged to the set. The usual way of writing such rules in mathematics is on the following pattern:

G = {x | x is green}

Note that this set is given an uppercase letter label, and that the definition is enclosed in curly brackets as before. The bar | is new: it is read as 'such that', and the above definition as a whole reads: 'G is the set of all x such that x is green'. The character x here is a variable, a place holder, which can be used to stand for any object we might want to test for set membership. Thus, if we let x = 'tree', then x is a member of the set, but not if x = 'sky'.

This style of set definition becomes indispensible when an infinite number of objects belongs to a set. The set of all green objects in the world is very large, but given enough time, manpower, and lunatic motivation a list could be made; the number of green objects is finite. But there are infinite sets, that is, sets whose members could never be listed, even given world enough and time. A familiar example is the set of positive numbers {1,2,3...}. No matter how large the last number in this sequence, it is always possible to add 1, thus creating a yet larger number, and the sequence consequently never ends. In such a situation it is not only impractical to make a list, but impossible, and as such the 'set definition by rule' approach is the only one available.

c) To indicate that a given object belongs to a set, the symbol ε which is read as 'belongs to', is used. Thus, for a set A = {x | x is an animal}, cat ε A.

d) It is often necessary to refer not to a whole set, but to some part of it. A part of a set is called a subset. In the set A below, B is a subset of A: very object in B is necessarily in A, but not vice versa.

The notion of a subset is precise. There is no such thing as 'more or less a subset': if there is even one object in B that is not also in A, then B is not a subset of A.


2. Cartesian product

Sets are not much use on their own. Things can be done with them, however. One operation on them is particularly important: multiplication. Assume the sets {a,b,c} and {x,y,z}. Multiplying them yields:

{(a,x), (a,y), (a,z), (b,x), (b,y), (b,z), (c,x), (c,y), (c,z)}

Note that:

i. The result of set multiplication is itself a set.

ii. Multiplication has yielded a set of pairs, such that each element from the first set is paired with each element from the second.

iii. The pairs are ordered pairs. That is, the order of the components of each pair matters: (a,x) is not the same as (x,a). The order of pairs is determined by the order in which the sets are multiplied. Thus {a,b,c} x {x,y,z} yields the above set of pairs, while {x,y,z} x {a,b,c} yields:

{(x,a), (x,b), (x,c), (y,a)...etc}

Multiplication of sets is not, moreover, limited to only two. Any number of sets may be multiplied. For example, {a,b} x {i,j} x {x,y} yields:

{(a,i,x), (a,i,y), (a,j,x), (a,j,y), (b,i,x), (b,i,y), (b,j,x), (b,j,y)}

The components of this set are ordered triples. Multiples of four sets are called 4-tuples, of five sets 5-tuples, and so on. In general, given n sets A1...An, the set of all n-tuples is called the Cartesian product.


3. Relations

In the language of love, a 'relationship' refers to the state in which two people find themselves. One can easily formalize this concept, and so gain understanding at the cost of losing its beauty. Take two sets: W = {x | x is a woman}, and M = {y | y is a man}. Then take the Cartesian product, which results in a set which contains all possible pairings of men and women. Now think of a man and a woman who are in love: they are one pair in the Cartesian product. Extend this idea, and think of all the men and women you know who love each other. These, too, will be pairs in the Cartesian product. And now go even further, and try to imagine all the men and women in Newcastle who are in love. By now we are dealing with a reasonably large set of pairs. Nevertheless, it is still a tiny subset of the Cartesian product of all the men and women in Newcastle --after all, any given man or woman will be in love with, usually, one or at most two or three others, but hardly with all the other women or men in the city. If one were now to be in a position to identify and pick out all the relevant love pairs from the Cartesian product of all men and women in Newcastle, and put them into a set, one would have what a mathematician would call a relation, a subset of a Cartesian product of sets such that members of the set are 'related' in some way, that they share certain specified characteristics, such as being in love.

Another example is the 'father of' relation. The set of father-son and father-daughter pairs is a subset of the Cartesian product of F x C, where F = {x | x is a man and x has offspring} and C = {y | y is human and y is not equal to x}. In this example, the significance of pairs being ordered emerges clearly: to say that John is the father of Mary is not the same as saying that Mary is the father of John, that is, (John, Mary) is not equivalent to (Mary, John).

A relation, therefore, is simply a subset of some Cartesian product of sets. In principle, the tuples in a relation can be of any length (corresponding to n-fold Cartesian product). In practice, a very important kind of relation is the binary relation, which consists of ordered pairs such as those discussed above.


4. Alphabet

In everyday usage, 'alphabet' refers to the 26 letters which comprise our writing system; the erudite will also know of other alphabets like Greek and Cyrillic. In formal language and automata theory, however, the notion of an alphabet is generalised to include any set of symbols representable in any medium: knots in a string, flashes of light, coloured flags, marks on paper, sounds. An alphabet can therefore be defined as a finite set of symbols.


5. Strings

A string over an alphabet is a finite sequence of symbols taken from an alphabet and written from left to right; repetition of symbols is allowed. Thus, assuming an alphabet A= {a,b,c}, some strings over this alphabet are:

aaaaabbbbb

abccba

ab

c

bbbbbbbbbbbbbbbbb

Clearly, many others are possible. Indeed, for any alphabet, it is possible to construct an infinite number of strings. If this seems outlandish, take the words of the English language as an alphabet --that is, each word as a separate symbol: A = {x | x is an English word}. Now we can start making strings:

This is the number one

This is the number two

...etc...

This is the number four trillion and one

...etc...

There is no end to the sequence. No matter what number is named at the end of the most recent sentence, it is always possible to name the next. There are, in short, at least as many sentences as there are positive numbers, and, as we saw, there are infinitely many positive numbers. And, of course, we haven't even begun to consider all the other possible English sentences yet. The set of all possible strings over an alphabet A is denoted A*.

So far, we have been talking about 'strings' as 'finite sequences of symbols' without asking where the notion of 'sequence' comes from. Intuitively, we all know what sequence means, but mathematics is not --and cannot be-- built on intuitions. Rather, mathematics is built on set theory; we have to base a definition of 'sequence' on set theory as well. The way to do this is via the Cartesian product described earlier. It was noted that there is no limit to the number of sets that may be multiplied: for n sets, the result is a set of n-tuples. One further point about set multiplication now needs to be made: a single set may be multiplied by itself any number of times. For example, given the set {a,b}, the operation {a,b} x {a,b} x {a,b} yields:

{(a,a,a,), (a,a,b), (a,b,a), (a,b,b), (b,a,a), (b,a,b), (b,b,a), (b,b,b)}

Each of these triples is in fact a sequence over the alphabet {a,b}, that is, each is a string. Thus we have definition of sequence, and thereby of string, in terms of set theory. To obtain strings of length 4 over this alphabet, multiply the set {a,b} four times, and so on up to arbitrary n. In general, then, a string is a member of the n-fold Cartesian product of an alphabet.

Two notes on strings:

i. It is possible to subdivide a string into substrings. Substrings are in effect chunks taken out of a string: if aaabbb is a string, then aa and abbb are substrings, as are aaabb, b, etc.

ii. There is such a thing in formal language and automata theory as the null (or empty) string, that is, a string consisting of no symbols at all. This is the kind of apparent fatuousness that gives mathematics its bad name --it's like calling an open field a null city, or an area of table a null cup of coffee-- but it has its uses in the theory of languages, and has to be tolerated.

 


6. Languages

Having defined sets, alphabets, and strings, we are now in a position to give a precise definition of what a language is. The definition is simple, even disappointingly so: a language is a subset of all the strings over an alphabet. Stated more formally, a language L is some subset of A*, where A is some alphabet.

Note that, though a language is a subset of an infinite set, it may itself be infinite. This appears nonsensical, but common sense fails where infinity is involved. It is easy to show that a subset of an infinite set can itself be infinite. Earlier we noted that the set of positive numbers constitutes an infinite set. Now, the set of all even positive numbers is clearly a subset of these. However, the sequence 2,4,6... has no end: it is always possible to add 2 to the most recent number in the sequence, without end. Thus a language turns out to be a (possibly infinite) set of strings.

 



Lecture 2

THE NATURE OF COMPUTATION

This module is called Computational Linguistics, and we shall be dealing extensively with computation, both theoretically and in practice via computer programming. In all things it's crucial to know what one is talking about. 'Linguistics' is clear enough: it is, broadly, the study of language. But what is 'computation'? This section defines it.

To understand computation, it is first necessary to understand two fundamental concepts:

1. Functions

Functions are a special case of binary relations. Take two sets S and T. The Cartesian product S x T consists of ordered pairs (s,t) where s ε S and t ε T, and a binary relation is a subset of this product. What differentiates a function from a binary relation is the following restriction: each of the members of S included in the relation appears exactly once as the first component of an ordered pair. This is in sharp contrast with a relation, where any member of S included in the relation can appear as the first component of many ordered pairs. The difference is easily seen pictorially:

Note that, in the case of a function, each element in the first set is associated with one and only one element in the second. Note also that any number of elements in the first set may be associated with a single element in the second. Some examples:

A common mathematical function is 'squared', that is, a number multiplied by itself. Pictorially, the function looks like this:

For any number in the first set, there is exactly one number in the second. This is the same as saying that any number squared has a unique result: 2x2 is always 4, 3x3 is always 9, and so on. Thus, if one were to choose a number from the first set, say 4, the result would always be the same; the relevant pair is (4,16). The squared function as a whole goes: {(1,1), (2,4), (3,9), (4,16)...}.

The formal definition of function is rather abstract, and for most people takes a while to sink in. But functions are extremely important in mathematics, including language and automata theory, and the aim was to show that the concept is not plucked out of thin air, but rather is derivable from set theory via Cartesian product and relations. In practice, picture a function as a device for solving some particular problem. On the input side of the device, insert the information that needs to be processed; the output gives a unique answer:

2. Algorithms

An algorithm is a sequence of instructions which, if followed, is guaranteed to result in some desired state of affairs. Say, for example, the desired state of affairs is to get to the Student Union from the third floor of the Percy Building. An algorithm to achieve that might be:

i. Go through the red doors on the landing

ii. Walk down to the ground floor

iii. Go through the front door

iv. Walk down the quad

v. Walk under the arches

vi. Cross the road

vii. Walk 10 metres

viii. Turn right

Note two things: the instructions are in a prescribed sequence, and each instruction is absolutely explicit –there's no possibility of misinterpretation. If this sequence of instructions, or algorithm, is followed exactly, you are guaranteed to end up at the front door of the Union. Another example of an algorithm is a cooking recipe.

3. Computation

What do functions and algorithms have to do with computation? The answer is this: algorithms compute functions. Or, in other words, computation is algorithmic solution of functions. To understand what this means, consider an example. 'Subtraction' is an important and well known function in mathematics, and everyone uses it as a matter of course to subtract one number from another. But what if you were required to say how to do subtraction, that is, to specify an algorithm for subtracting one number from another? There are various possibilities; here are two:

1. Count backwards. Say 34 is to be subtracted from 256. The algorithm is to count backwards 34 from 256, and the result is the answer. This method may seem moronic, but it always works.

2. Write one number on a piece of paper, then the other one below it to that the digits line up, and then follow some rules; we all know what they are:

256
-34
___
222

So, given some Cartesian product P, a computation is an algorithm for selecting from P the tuples that constitute some desired function. Applying this to the above discussion, we take a 3-tuple Cartesian product of the numbers from, say, 0 to 999; this is a very large set, so only example tuples are given:

P = {(0,0,0), (0,0,1), (0,0,2)...(255, 34, 222), (256, 34, 222)...(998, 999, 999)...}

To subtract 34 from 256, we select the triple (256, 34, 222), where 222 is the answer. If we wanted to subtract 15 from 30, we would select (30, 15, 15), and so on. Note that most of the triples in this set do not belong to the function 'Subtract': 255 – 34 is not, for example, 222.


Lecture 4

FORMAL LANGUAGE AND AUTOMATA THEORY

What, then, is 'computational linguistics'? The answer is now straightforward: computational linguistics is the computation of functions involving language. In other words, computational linguistics does not concern itself with, say, functions involving numbers (which is the preserve of numerical mathematics), but with functions involving symbol strings.

To introduce the notion of functions involving language, we consider the simplest of such functions: language definition. We have seen that a language is just a (possibly infinite) subset of an infinite set of strings. Let us now assume that we have an alphabet, and that we want to define some particular language. Can we invent a function that will specify which strings comprise the appropriate subset? To state the problem a little more formally: given an alphabet A, the set A* (that is, the Cartesian product) consists of all possible strings over A; for some specific language L, we want a function which will allow us to decide whether or not any particular string in A* is a member of L.

For present purposes, there are three ways of functionally defining L:

  1. Make an exhaustive list of all the strings in A* belonging to L, that is, define the function by listing all the tuples which belong to it.

  2. Construct a grammar which generates all and only the strings of L

  3. Construct an automaton that either generates or recognises the strings of L

The straightforwardness of the first approach is appealing: to decide if a given string in A* belongs to L, simply look it up in the list. But what happens when the language to be defined contains a very large number of strings? The larger the number, the less practical the list becomes. True, there is no objection in principle, and very long lists can be constructed. In fact, though, this first approach is not merely potentially impractical but simply inadequate for our purposes. Why? We have seen that languages, though subsets of infinite sets, may themselves be infinite. Any generally effective language defining function has to be able to handle such languages. But, as the discussion of sets showed, there is no way to list an infinite set, since such a list would never end, and if the list never ended, the language would never be defined. The first of the above approaches has therefore to be rejected; the rest of this manual deals with the other two.

 

GRAMMARS

One way of defining a language, whether infinite or not, is to give a complete description of how the strings belonging to that language may be formed. A grammar constitutes such a description in that it is a finite set of rules whose application generates all the strings belonging to the language in question, and no others. Any string which a grammar can generate belongs to the language it describes; the set of strings which such a grammar is capable of generating is the language defined by the grammar. A grammar is therefore a language defining function.

We shall be looking at a class of grammars called phrase structure grammars. They are so called because they generate strings by building them up out of substrings called phrases. The first step in understanding how such grammars work is to see how a string can have phrase structure.

Phrase structure grammars were developed by linguists attempting to describe the syntactic characteristics of natural languages. They are a formalization of intuitions which native speakers have about the structure of their language. We shall begin our discussion with intuitions about English, and then proceed to formalize them.

The alphabet of English is the set of English words, and word sequences are what constitute the strings --or, more colloquially, the sentences-- of the language. Most English speakers agree that sentences have structure; that is the observation on which all of what follows is based. Specifically:

  • The words which comprise the alphabet of English fall into categories. Words like bear, car, and house are universally felt to be very different from ones like run, sleep, think, and different from both these categories are words like in, under, about. These and other categories have long-established names: nouns, verbs, prepositions, and so on.

 

  • When used in a sentence, word categories typically form distinctive groupings called phrases. For example, phrases like on the beach, behind the red car, in sight have a common pattern: preposition + (optional) determiner + (optional) adjective(s) + noun, and the pattern is called a prepositional phrase. Another common pattern is the noun phrase, which typically looks like this: (optional) determiner + (optional) adjectives + noun, ie, the man, the sad tale, a long hot summer, restaurants.

 

  • Word categories and phrases can combine with one another to form new, more complex phrases. For example, the word category verb combines with a noun phrase of a prepositional phrase to form a verb phrase:

...saw the dark horse (verb + noun phrase)

...ran into the street (verb + prepositional phrase)

 

  • The combination and recombination of word categories and phrases as described in (ii) and (iii) eventually produces sentences. The general pattern for English declarative sentences is a noun phrase followed by a verb phrase:

the man in the boat saw an island on the horizon

where the noun phrase the man in the boat itself consists of a less complex noun phrase (the man) and a prepositional phrase (in the boat), and the verb phrase consists of a verb (saw) and a noun phrase (an island on the horizon), which in its turn consists of a noun phrase (an island) and a prepositional phrase (on the horizon).

From this, it emerges that English sentences not only have a structure, but that that structure is hierarchical: a sentence consists of phrases, phrases consist of other phrases and/or word categories, and word categories consist of words. This hierarchical structure is called phrase structure.

A phrase structure grammar is essentially just a description of the phrase structures which can exist in a language. Such a description takes the form of rules. Some rules for English might look like this:

  • Rule 1: A sentence consists of a noun phrase followed by a verb phrase

  • Rule 2: A noun phrase may consist of a noun alone, a determiner followed by a noun, a determiner followed by any number of adjectives followed bya noun, or any of these followed by a prepositional phrase.

  • Rule 3: A verb phrase may consist of a verb alone, or a verb followed by a noun phrase, or a verb followed by a prepositional phrase.

This is cumbersome. A notation has been developed to express such rules more succinctly. First of all, represent word categories, phrase types, and 'sentence' by short labels such as N for 'noun, V for 'verb', NP for 'noun phrase', and so on. The, represent the relations between and among sentence, phrase type, word category, and word by means of an arrow, which is read 'stands for' or simply 'is'. For example, S --> NP NP reads: 'A sentence consists of a noun phrase followed by a verb phrase'; NP --> Det N reads 'A noun phrase consists of a determiner followed by a noun', and so on. Rules written using this notation are called productions.

Once a set of productions has been compiled, the phrase structure grammar can be used to generate sentences. How? The grammar of a language was said to contain a set of productions which give a complete description of how the sentences of that language are formed. As we saw in the foregoing discussion, this means that the productions will specify (i) what phrases a sentence consists of, (ii) what subphrases and/or word categories each sentence-constituent phrase consists of, and (iii) what words the word categories consist of. Generating a sentence involves using the productions to move from the top to the bottom of this structural hierarchy, expanding at each step the description of the sentence until the sequence of words which constitutes the sentence has been produced. For example, consider this set of productions:

S --> NP VP

NP --> N

NP --> Det N

N --> man

N --> dog
N --> cat

Det --> the

VP --> V NP
V --> bites
V --> catches

We want to use this set of productions to generate a sentence. What is a sentence? The first production tells us; write it down:

S --> NP VP

The right-hand side of this expression is called a sentential form. It is not yet an English sentence, however, but merely a description of one, and a very general description at that: it says only that a sentence consists of a noun phrase followed by a verb phrase. But other productions allow this description of the sentence S to be refined, because they specify what NP and VP consist of. If NP in the above sentential form is replaced by Det N, and VP by V NP, as the productions allow us to do, a more detailed description of S results. In so doing, however, two rules need to be observed. The first is that only one such replacement or rewrite can be made at a time, and the second is that the leftmost symbol of the sentential form is rewritten. So, replacing the leftmost symbol, the above sentential form can be rewritten as:

S --> Det N VP

The production Det --> the now allows us to replace Det in this sentential form:

S --> the N VP

The sentential form is then developed on the above pattern, as follows:

S --> the man VP

the man V NP

the man bites NP

the man bites Det N

the man bites the N

the man bites the dog

Here we stop, because there are no more replacements to be made. An English sentence has been generated using the productions of a phrase structure grammar.

The process of generating a sentence using a phrase structure grammar is, then, the successive rewriting of sentential forms using the productions of the grammar, starting with the sentential form S. The sequence of sentential forms required to generate a sentence (as above) constitutes a derivation of the sentence according to the grammar. In other words, a derivation is a finite sequence of sentential forms of which the first is the sentence symbol S, and in which each subsequent sentential form is obtained by copying the current sentential form and rewriting the leftmost nonterminal symbol in accordance with one of the productions of the grammar. The rewriting process terminates when the sentential form contains no more nonterminal symbols, but consists entirely of words.

So far, we have been looking at phrase structure in a natural language. The same notions can, however, also be applied to artificial languages. Consider the string abcdefg. One way of looking at this string is to regard it simply as a sequence of symbols: a followed by b followed by c...and so on. It is, however, also possible to see it as composed of phrases, say ab and cdefg. One or both of these phrases may, moreover, be taken as composed of subphrases: cdefg might consist of cd and efg.Again, efg might consust of e and fg.Various other phrase structures are possible as well: abcd and efg; abc, de, fg, and so on. Indeed, for artificial language strings, there is no intuitively correct phrase structure as there is for natural language sentences. To suggest that the man rode the horse consists of the phrases the and man rode the horse, or the, man rode the, and horse would strike an English speaker as absurd, because he has an English grammar in his mind which he knows intuitively. But for a string like abcdefg we have no such intuitions. The only way one can know its phrase structure is by knowing the grammar that generated it. This last observation is crucial, and underlies everything which is said from here onwards


Lecture 5

We are now in a position to formalize the ideas we have been looking at. This formalization begins the study of formal language theory.

 

Definition: phrase structure grammar

A phrase structure grammar (henceforth PSG) has four components:

  1. A finite set of terminal symbols which represent the alphabet from which the strings of the language generated by the PSG are built. By convention, when dealing with formal languages, these symbols come from the Roman alphabet, and are written lower-case: a, b, c...

  2. A finite set of nonterminal symbols which represent the phrase structure and lexical category symbols. By convention, these come from the Roman alphabet, and are written upper-case: A, B, C...

  3. A string symbol S, which marks the starting point for string derivations. This symbol must be unique in the grammar in the sense that neither the set of terminals nor the set of nonterminals may include it.

  4. A finite set of grammatical rules or productions. Each production is an ordered pair of strings (a, b) such that

a = xNz

b = xyz

where x, y, and z are possibly empty strings consisting of terminals and/or nonterminals and N stands for either S or a nonterminal. A production (a, b) is written a --> b.

Components 1-3 of this definition should be clear enough, but 4, being stilted, requires some comment. It says that all production must have the form a--> b. What do a and b consist of? Well, a must have at least one nonterminal symbol in it; it may also have some arbitrary string of terminals and/or nonterminals in front of it, and may also have some arbitrary string of terminals and/or nonterminals following it, but need not. In other words, a is some string of terminals and/or nonterminals, with the constraint that at least one of the symbols must be a nonterminal. For b there is no constraint at all: it can consist of any finite string of terminals and/or nonterminals, including the empty string --that is, the right side of the production may have nothing at all. 

 

Types of phrase structure grammar

Phrase structure grammars can be subcategorised by stipulating various degrees of restriction on the forms which the productions can take. There are four standard categories; they constitute the Chomsky Hierarchy.

Type 0: unrestricted grammars

This type of PSG is identical to the general type of PSG just discussed. Only one further point about this class of grammars has to be made here. Implicit in the foregoing definition was that the right side of a production can be null in cases where x,y,z are all empty strings. When used to rewrite a sentential form, such a production causes some nonterminal in the current sentential form to be rewritten as the null string, with the effect that the sentential form contracts. The property of allowing the sentential form to contract is unique to type 0 PSGs, and, apparently paradoxically, makes them so powerful in terms of the languages they can generate that some linguists find their use questionable.

 

Type 1: Context sensitive grammars

A context sensitive grammar is a PSG in which all the productions have the form

xNz --> xyz where:

  • N is a nonterminal or S

  • x,y,z are arbitrary strings of terminals and/or nonterminals

  • y may not be null

Such grammars are called 'context sensitive' because N can be rewritten by y only when it is in the context of x and z --that is, when, in the current sentential form, the string x precedes N and the string y follows it.

 

Type 2: Context free grammars

A context free grammar is a PSG in which all the productions have the form

xNz --> xyz where:

  • N is a nonterminal or S

  • x and z are null strings

  • y is a non-null string

Note the difference between this specification and the one for context sensitive grammars. In the latter, the strings preceding and following the terminal on the left side were arbitrary: they could be composed of terminals and/or nonterminals, and one or both of them could be null. In a context free grammar, however, they must be null. As a consequence, the left side of a production in a context free grammar always consists of exactly one nonterminal or S. This is why such grammars are called 'context-free'. The N on the left side of the production can be rewritten by the string y on the right regardless of the context in which N finds itself in the current sentential form; it is independent of its context.

 

Type 3: Regular grammars

A regular grammar is a PSG in which all the productions must conform to the following patterns:

Either: N --> xB

Or: N --> x

where:

  • N is a nonterminal or S

  • B is a nonterminal

  • x is a terminal

Such grammars are the most highly restricted class of PSGs.

In computational linguistics, context free and regular grammars are by far the most important. Unrestricted grammars are occasionally invoked, but it is rare to find any reference to context senstitive ones. The reasons for this are not directly relevant here; suffice it to say that we can afford henceforth to ignore unrestricted and context sensitive grammars.

 

Tree diagrams

One way to pass the time in an airplane is to look at the junk in the seat pocket. There is almost always a map showing airline routes something like this:

Such a diagram is an example of a graph. In this case the graph is a diagrammatic representation of which cities are connected by direct routes, and how to get from place to place using these links. Graphs are much used in mathematics, and have a precise definition:

Definition: graph

A graph consists of three components:

i. A non-empty set N of nodes

ii. A possibly-empty set A of arcs

iii. A rule F which associates with each arc two nodes, that is, a rule for connecting nodes by means of arcs.

In the above graph, the set of nodes is {Chicago, Nashville, Miami, Dallas, St.Louis, Albuquerque, Phoenix, Denver, San Francisco, Los Angeles}. There are twelve arcs, and each arc is directed, that is, it leads from one node to another, as indicated by the arrowheads. Note also that, according to the definition, not all nodes have to be connected, as for example St. Louis.

Some terminology. A path from node n0 to nk is a sequence of nodes and arcs as one travels from n0 to nk. The length of a path is the number of arcs it contains. A graph is connected if there is a path from any node to any other node. A cycle is a path from some node ni back to ni where no arc appears more than once, and ni appears only at the beginning and end of the cycle. A graph with no cycles is acyclic.

Why all this? In formal language theory, and in linguistics generally, tree diagrams are much used, and a tree diagram is an acyclic, connected graph. For example:

One node in a tree is designated the root; in the above diagram it is the topmost node, from which all arcs emanate, either directly or indirectly. Any node from which arcs emanate is called a parent node, and any node with another node above it is called a child node; all nodes beneath a parent are its children. The depth of any node in a tree is the length of the path from the root to the node; the root itself has depth 0. A node with no children is called a leaf. All non-leaves are called internal nodes.

A tree is useful as a pictorial representation of structure. As a device for representing structure, it is applicable to any situation where a hierarchy of choices is made. A derivation is a perfect example of 'a hierarchy of choices', and as such a tree is ideal as a visual representation of a derivation. To construct a derivation tree, we start with a tree containing only the root node S. For each step of the derivation, the tree is correspondingly extended. That is, every time a production is used to replace a nonterminal in the current sentential form by the string on the right hand side of the production, lines are drawn from the corresponding nonterminal in the tree to each symbol in the replacement string. At each stage in the construction of the tree, reading from left to right, the leaf nodes will be the current sentential form.

As an example, let us return to the set of productions for generating English strings presented earlier. These are presented again here for convenience; note, incidentally, that they come from a context-free grammar:

S --> NP VP

NP --> N

NP --> Det N

N --> man

N --> dog
N --> cat

Det --> the

VP --> V NP
V --> bites
V --> catches

We now derive a sentence, and at each step build the corresponding tree:

S

S --> NP VP

S --> Det N VP

S --> the N VP

S --> the dog VP

S --> the dog V NP

S --> the dog bites NP

S --> the dog bites Det N

S --> the dog bites the N

S --> the dog bites the man


Lecture 6

AUTOMATA

Introduction

We have looked at grammars as one of two viable ways of stating functions for language definition. We now look at the other way: automata (singular: 'automaton'). An automaton is a machine. Machines are familiar enough. They are mechanisms consisting of parts that move in some specifiable relation to one another for some purpose. An old fashioned clock consists of gears, a spring, pointer arms, and a numbered face. These parts are interconnected so that the energy stored by the spring drives the gears, which in turn causes the arms to move around the clock face and so indicate the time. Now, before a physical machine like a clock is built, it has to be designed, that is, precisely described in some way --words, blueprints, etc. At this design stage the machine is abstract. It exists only by virtue of its specification, and has no physical reality yet; to construct a physical device in accordance with an abstract specification is to implement it. In what follows, we shall be both designing and implementing language definition machines.

An automaton is a function that can define a language as an acceptor. Picture an automaton as a black box whose internal workings are, for the moment at least, of no interest:

    

An acceptor looks like this:

A string from A* is fed into the machine. The machine processes it, and either accepts or rejects it by putting out a signal on either the accept or reject output. The set of all strings accepted by M is the language which M defines. Note that the result is a function: for every string, there is one and only one output value: 'accept' or 'reject'.

There are four main classes of machine: finite state automata, pushdown automata, linear bounded automata, and Turing machines. We shall confine ourselves to the first of these for the moment. This will mean looking inside the black boxes to see how finite state machines actually work. Finite state automata will be considered as acceptors in what follows. Before going on to consider them, however, some relevant concepts and terminology have to be introduced.

At the start of a chess game, the two players' pieces are arranged in rows on opposing sides of the board; the game ends when the pieces are arranged such that one player's king can no longer avoid being taken. Between beginning and end, the pieces assume a sequence of configurations on the board as each player makes his move in alternation with the other. Each successive configuration of pieces constitutes what automata theory calls a state --literally, the state of play. The initial opposing rows represent the start state. When a player makes the first move, the state of the game changes, that is, the configuration of pieces on the board has altered. When the opposing player then makes his move, the state changes again. And so on. In fact, a chess game can be seen as a sequence of state changes corresponding to the changing configurations of the board as the players take their turns. In the end, a final state is reached --checkmate-- and the game stops.

Further on chess. The pieces cannot move arbitrarily around the board. Each is constrained by:

  • The pattern of movement prescribed for it by the rules of the game. For example, a pawn may only move forward, and only one square at a time; a knight may move two squares in any direction, and from there one square left or right, and so on.

  • The current state of the board. It may not, in fact, be possible for a given piece to make a theoretically legal move because another piece is in the way.

So, at any given stage of the game, the choice of possible next moves is governed by these two constraints. Or, put another way, the transition between successive states is governed by the rules relating to the movement of specific pieces together with the current state of the game. A transition can, therefore, be regarded as the specification of the conditions under which it is possible to move from one state to another. The notions of state and transition will be fundamental to what follows.

 

Finite state automata for language definition

Finite state automata (henceforth FSA) are the simplest class of machines in automata theory. Unlike the other, more complex classes (which may have an infinite number of states), FSAs can only have a finite number of states, as the name indicates; it is this finiteness restriction which limits what FSAs can do in terms of language definition. We shall be looking at why this is so later.

Intuitively, an FSA consists of (a) a set of symbols which make up the strings which it is to process, (b) a set of states, and (c) rules which, given the current input symbol and the current state, say what the next state of the machine will be. The game of chess therefore constitutes a finite state machine. The set of symbols = the chess pieces. The set of states = the various possible configurations of the pieces in the board. The current input symbol = the piece which a player moves when it is his turn. The rules = the rules of chess which, given the current configuration of the board and the piece currently being moved, say what the next configuration of the board will be. What about the input string? It is the sequence of piece-moves in the course of the game, that is, since the pieces are symbols, the piece-move sequence becomes a symbol-sequence --a string. A game of chess can therefore be thought of as a string processing device.

We now proceed to a more formal definition of FSAs, first as acceptors and then as generators.

 

The FSA as acceptor

An FSA acceptor has three components:

1. A finite set of input symbols

2. A finite set of states, one of which is designated as the initial state, and zero or more of which are designated as final states.

3. A state transition function which tells what the next state of the machine will be, given the current state and the current input symbol.

The state transition function is often specified by a state diagram. In such a diagram, each of the machine's states is represented as a circle with the name of the state written inside it. The circles are linked by arrows which represent the transitions. On each arrow is written the symbol or symbols which, when input to the machine, will cause a transition from the state where the arrow originates to the state where the arrow points. Finally, the initial state is labelled with a minus sign, and a final state with a plus sign. Note, incidentally, that a state diagram is none other than a graph.

Here is an example of a very simple finite state acceptor:

A = {x,y,z} (the input alphabet)

S = {A,B,C} (the state set)

T = the diagram below (the state transition function)

To operate this FSA as an acceptor, we present it with an input string, which its reads one symbol at a time. Each time a symbol is read there is a transition from one state to another, starting at the initial state. The state transition function in each case uses the current state of the machine together with the current input symbol to determine the next state of the machine. After the last symbol of the string has been read, processing stops. If, at this point, the machine is in one of the states designated as final, then it has accepted the string, and if not it has rejected the string. The set of all strings accepted by the FSA is the language it defines.

Let us present the above machine with a string xyz,and start it in the initial state A. The machine reads the first symbol x, which causes it to follow the arc to state B. That done, the next symbol is read. It is y, so the machine goes back to state A. Now the final symbol z is read, and the machine goes to state C. Since C is a final state, the string xyz is accepted. What if the string had been yxz? Starting again with the initial state A, there is no arc labelled y out of A, and so the machine is said to crash, and the string is rejected.

Lecture 8

Determinism

It is characteristic of the class of machines discussed so far that a transition from one state to another is completely determined by the current input symbol: for a given input symbol, there is one any only one possible route from the current state to the next state. The sequence of state changes through which such a machine passes is completely determined by the input string. This being so, it is clear that, for a given input string, there is a unique path through the machine. No matter how many times it processes a given string, it will always go through the same sequence of transitions. Such FSAs are said to be deterministic.

It is, however, possible to build nondeterministic FSAs which permit two or more different paths through a machine for the same input string. This implies that, at one or more points in the processing of a string, the transition from the current to the next state is not completely determined by the current input symbol, but rather that, given an input symbol, the machine is free to choose between or among alternative next states. On what grounds does the machine choose? No one knows, and no one cares. It is not determined.

How can one tell a deterministic from a nondeterministic FSA? In a deterministic FSA, no two arrows coming out of any state have the same label. It is this characteristic which ensures that the machine behaves deterministically --that it never has to make a choice. In a nondeterministic FSA, this constraint does not apply. Two or more arcs coming out of any state may have the same label. That is what makes the machine nondeterministic. When it is in such a state, and the current input corresponds to labels on two or more arcs, it must choose one of the paths, and that choice is free. Inspection of a state transition diagram will, therefore, always reveal whether a machine is deterministic or nondeterministic.

In the deterministic machine on the left, an a read when the machine is in the initial state A always causes a transition to B, but in the nondeterministic version on the right, a can lead to B, C, or D.

Two observations have to be made about determinism in FSAs:

i. Both sorts of FSA are equivalent in terms of their language defining capabilities

ii. A deterministic FSA is an efficient language acceptor. It either accepts or rejects a string at one pass through that string. Nondeterminstic FSAs are not as efficient. We have seen that they allow multiple paths through a machine for a given string. For every nondeterministic choice that is made, a separate and distinct state transition sequence comes into being. Now, it might happen that, for some string, one or more of these sequences leads to rejection. This introduces an element of uncertainty. Let us assume that a nondeterministic FSA rejects a string. Can we be sure that it is a 'real' rejection? Might it not be that the machine has made a 'bad' choice at some point, and that, given a 'good' choice at that point, the string would have been accepted? Yes, indeed. But, on the other hand, it might be a real rejection after all. The only solution to such uncertainties is to examine each of the state sequences which the machine is capable of generating for a given string to see if any of them leads to acceptance. In fact, when dealing with nondeterministic FSAs, it is usual to assume that the machine exhaustively explores all possible sequences, and if at least one of them leads to a final state, then the input string is accepted.

In view of all this, the characterization of a deterministic FSA as a language acceptor given earlier has to be modified for nondeterministic FSAs. We say that a string is accepted by a nondeterministic FSA if there is at least one state transition sequence for the string which leads to an accept state. If none of the possible paths through the machine leads to an accept state, then the string is rejected. For example:

Using the alphabet {a,b,c}, we present the deterministic machine with the string abbc. The first transition is from A to B, then two loops in B, and finally a c transition to the final state D; the string is accepted. Presented with the same string, the nondeterministic machine might decide on that same path, in which case it accepts the string, but it might also take the middle arc out of A, in which case it goes immediately to the final state D. But the string has not been completely read, so it crashes, and another possible path is tried. Finally, the a arc from A to C leads to accept.

 

Lecture 9

Limitations of finite state automata

The foregoing discussion has several times referred to the 'power' of language defining devices, be they grammars or automata, and the time has come to say what this means.

There are languages which no FSA can define. The standard example of such a language is anbn, that is, the language consisting of strings where some number of a's is followed by exactly the same number of b's: ab, aabb, aaabbb and so on. For an FSA to accept strings belonging to this language, it would have to remember how many a's it has read from input, and then read exactly the same number of b's; if the a and b sequences match, the string is accepted. The only way to achieve this is to have a separate state for every a and b in the sequence. A machine like this:

doesn't work because, in addition to anbn strings, it also accept strings consisting of any number of a's followed by any number of b's: abbb, aaaaab, etc. Thus, to accept the language a1b1 the machine has to look like this:

For a2b2 it looks like this:

And so on. But an FSA for anbn cannot be built because n is unbounded --that is, of arbitrary size, and not specified in advance-- and an FSA, by definition, has a bounded number of states. No matter how many a and b states are built into an FSA, one can always present a string of greater n --say, n + 1-- and thereby cause the accepter to fail.

To overcome this problem, it is necessary to move to a more powerful type of automaton: the Pushdown Automaton. That is dealt with in what follows.

 

Lecture 10

Pushdown automata

The finite state machine is the simplest kind of computational architecture. But, as we have just seen, there are languages which no FSA can define. The standard example of such a language is anbn, that is, the language consisting of strings where some number of a's is followed by exactly the same number of b's: ab, aabb, aaabbb and so on. This limitation is not merely theoretical for the computational linguist who wants to design an NLP system: natural languages include the anbn structure. An example from English is centre-embedding sentences like

the dog died NP VP
the dog the man walked died NPNP VPVP
the dog the man the girl saw walked died NPNPNP VPVPVP

where the number of noun phrases and the number of verb phrases must be the same. FSA computational architecture cannot, therefore, be used for general NLP work. A more appropriate architecture is proposed in what follows: the pushdown automaton, or PDA.

A pushdown automaton is essentially an FSA with some extra features added to make it more powerful as a language defining mechanism, and in particular to overcome thc anbn problem just mentioned. Since we have already studied FSAs, and since PDAs are just uprated FSAs, it is probably best to preface a formal definition of PDAs by an informal discussion of the extra features which these machines have.

Input tape

In the discussion of FSAs no mention was made of how an input string is actually presented to the machine. For a PDA, an input medium is specified. It is a tape on which the input string is written. The tape can be thought of as being divided into squares or cells, as below. Initially each cell is blank; the $ symbol is used to indicate this:

The string to be input is written on the tape, starting at the left and putting one symbol in each successive cell. For example, the string abcde appears like this:

The tape extends as far as one likes to the right, so that it can accommodate any finite length string. When the machine reads the tape, it does so one symbol at a time, starting at the leftmost cell, and keeps going until it reaches the first blank cell on the right (ie, the first $), which tells it that the string has been completely read. The tape may be read from left to right only, and any given cell may only be read once.

Stack

Unlike an FSA, a PDA has a memory by means of which it can remember what symbols it has read from the input tape, and what order it read them in. It is this memory which crucially distinguishes the class of PDAs from the class of FSAs. The mechanism which gives a PDA its memory is called a stack; understanding how a stack works is essential to understanding PDAs.

Assume that you are washing dishes in a retaurant. Once you have dried each one, you put it on top of the existing pile. Every now and then, a waiter needs a plate and takes one from the top; the pile shrinks and grows as you wash and the waiter serves, depending on supply and demand.

The two important things to note are:

i. At any given time, only the top plate is readily accessible. It is, of course, possible to grab one from elsewhere in the pile, but that's inconvenient and can bring the whole thing crashing down --the normal thing is to remove each plate in succession from the top.

ii. The order of stacking is preserved. The first plate washed is at the very bottom of the pile, and, assuming as above that plates are taken only from the top, it will be the last the waiter takes. In such a pile the rule is: first in, last out, or, equivalently, last in, first out.

The stack in a PDA works just like the pile of plates except that. instead of storing plates, it stores symbols. As the PDA reads symbols, it may decide that it needs to keep some record of having done so, and inserts each successive one on top of the stack: the first symbol goes into an empty stack, the second goes on top of it, the third on top of the second, and so on. Thus, if the machine were to read the string abcd and store each successive symbol on the stack, the following would happen:

If these symbols are needed again later (and they typically are), they can be removed one at a time from the top of the stack. As with the top plate in a restaurant stack, only the top symbol of the PDA stack is accessible.

Why the apparently arbitrary restriction that only the top of the stack should be accessible? To function, a PDA needs to be able to remember what symbols it has read from the input tape, and also the order in which it read them. The first half of this memory requirement could be satisfied by simply throwing the symbols already read onto a heap, and letting the machine root through the heap when it needed to know whether or not it had read some particular symbol. But if the order of reading is also crucial, then this approach is clearly inadequate: this is where the restriction that symbols can only be added to or taken from the top of the stack begins to make sense. The machine knows that the first symbol it read is at the bottom of the stack. the second above it, and so on up to the most recently read symbol at the top. The original order of reading is presented: allowing random access to any part of the stack would destroy that ordering.

Controller

The heart of a PDA is a controller which coordinates (i) reading from the tape, and (ii) putting symbols into and taking them out of the stack. This controller is an FSA whose states are restricted to the following types:

i. START: start the machine

ii. READ: read a single symbol from the input tape

iii. PUSH: insert a single symbol on top of the stack

iv. POP: remove the topmost symbol from the stack

v. ACCEPT: accept the input string and stop

vi. REJECT: reject the input string and stop

What such a controller actually looks like will emerge shortly. 

 

We are now in a position to give a formal definition of a PDA. A pushdown automaton consists of six components:

1. a finite set of input symbols

2. a finite set of stack symbols

3. an input tape

4. a stack

5. a set of control states: {START, READ, PUSH. POP, ACCEPT, REJECT}

6. a state transition function which given the current state and the current input symbol, tells what the next state of the machine will be.

Note that items (1), (5), and (6) of this definition are exactly the same as those for an FSA: this is because a PDA is, at heart, an FSA with tape and stack added. Nothing has yet been said about the stack alphabet. This is irrelevant for the moment, but will be important when the issue of parsing or syntax analysis is addressed later in this module.

 

Lecture 11

The PDA as acceptor

Let's go back to the very simple FSA we looked earlier:

This FSA accepts all strings consisting of c or any number of ab alternations (abababab. . . ) including none, and ending with a c. The equivalent PDA looks like this:

Let us follow the operation of this machine. The transition to the START state can be made any time one likes. Thereafter, follow (1)-(2) below until the machine enters an ACCEPT or a REJECT state:

1. In the READl state do the following according to the symbol which is read from the tape:

i. If a, make a transition to READ2

ii. If b, reject the string because

  • if this b is the first symbol in the string, then the string is illegal since it does not begin with a

  • if this b is not the first symbol in the string, then it must follow another b, which is illegal.

iii. If c, accept the string.

iv. If $, the end-of-string marker, reject the string because

  • If no other symbols have previously been read, then there are no symbols on the input tape, and the null string is not part of the language.

  • If other symbols have already been read, then the string does not end in c and is therefore illegal.

2. In the READ2 state, do the following according to the symbol which is read from the tape:

i. If a, reject the string because the aa combination is illegal

ii. If b, return to READl

iiii. If c, reject the string because the ac combination is illegal

iv. If $, reject the string because no string can end in a.

Note that this PDA does not use its stack --there are no PUSH or POP states in the diagram. It is no less a PDA for that. The point about PDAs is that they can use a stack if they need to, and this one didn't. A PDA without a stack is pretty pointless, though. After all, the whole point of going to PDAs was to gain more power in terms of language defining capability, and the stack is what gives it that extra power; without a stack, it's nothing more than an FSA, and nothing is gained. What recommends the above PDA is its simplicity. It is easy to understand, both intrinsically and in relation to the equivalent FSA, and as such was useful as a first encounter with PDAs. We now go on to consider a 'real' PDA --one with a stack-- and see how it deals with the problemmatical anbn structure:

To illustrate the operation of this machine. Iet us assume that the stack is initially empty, and that the input tape contains the string aaabbb. A read head (analogous to the read head on a tape deck) is also shown. so as to make clear which cell is being read at any one time. So:

The machine STARTs and goes to the first READ state. which reads the first cell of the tape and moves the read head one cell to the right. That cell contains a, so there is a transition from READ to PUSH. The PUSH inserts a into the stack; the stack and tape now look like this:

After the PUSH there is a loop back to the beginning. The machine enters the READ state again, reads another a, and PUSHes that a onto the stack:

The same thing happens a third time, and we are back to READ again:

This READ, however, gets the symbol b, which causes a transition to a POP state, which in turn takes the top symbol off the stack. Depending on whether this top symbol is S or b on the one hand, or a on the other, the machine will either REJECT or else READ again. In this case an a is POPped, so there is a transition to READ. The stack and tape now look like this:

Note that there is now one fewer a on the stack after the POP. The READ now returns a b, which leads back to POP. This POPs another a, which takes us one again to READ. The situation now is:

This READ produces a third b, which initiates the same cycle once more, and leaves the tape and stack looking like this:

The stack is now empty, and the tape read head is over a blank square (that is, the input string has been completely read). Since it is now in a READ state, the machine gets $, which causes a transition to the rightmost POP. If that POP produces $, then the machine goes to ACCEPT. Otherwise --that is, if there are any symbols left in the stack-- it goes to REJECT. In the present case it gets a $, and all is well.

The discussion of automata ends at this point. We now have enough knowledge of them to go on to one of the main applications of automata in computational linguistics: natural language processing, and syntax analysis (also known as parsing) in particular.


Lecture 12

NATURAL LANGUAGE PROCESSING

 

Introduction

Natural language processing (NLP) aims to develop computer-based systems for applications in speech and text forms of NL. Various types of application have been or are being developed:

a) Word processing

b) Text analysis: extraction of syntactic regularities from text for purposes of stylistic analysis, author attribution, and so on.

c) Data search: search of large bodies of text for particular words or combinations of words for information retrieval, as in library catalogues or Web browsers.

d) Data mining: summarization of large text corpuses, such as case law.

e) NL front ends for computational systems, such as computerized telephone answering services or database query systems

f) Artificially intelligent systems, the aim of which is to build systems that understand NL as humans do, and respond to queries and commands appropriately.

The above list is in order of difficulty: (a) – (d) are now well established, (e) is coming along, but (f) has proven much more difficult than anyone imagined, and there is currently a huge research effort devoted to it.

In what follows, we consider a central issue in a variety of NLP applications: syntax analysis or parsing. The notion that NL strings have a complex constituent structure is fundamental to linguistic theory; parsing is a function that, given an NL string as input, outputs a description of its constituent structure. A parser for any language, natural or otherwise, is a transducer. A transducer can, as a first approximation, be seen as a black box which takes strings as input, and delivers corresponding structural descriptions as output:

What follows is devoted to specifying the contents of the black box.

 

Natural language parsing

This section is in three main parts. The first looks at two of the main problems that have to be faced in NL parsing, the second looks at what the output of the parser is like, and the third develops the actual transducer.

a) Central problems in natural language parsing: determinism, ambiguity

The languages we have dealt with so far have been formal languages. The key characteristic of formal languages is that humans invent them for some purpose: as computer programming languages, to send messages over wires, or simply to get elements of formal language theory across to students. We can endow them with whatever characteristics we like by writing an appropriate grammar and then using it to generate the strings. Natural languages, however, simply exist. We have no control over their characteristics, and, in processing them, we have to deal with them as they are, not as we would like them to be. For computational linguistics, they have certain features that cause problems. This section deals with two of the main ones.

Before going on to look at these features there is a general observation to be made. This discussion assumes that natural languages as a class are context free, that is, capable of being generated by context free grammars. This is not strictly true --there are a very few isolated instances of NL constructions which cannot be so generated-- but it is certainly true of the proverbial 99.999...% of NL syntactic structures, and that's good enough for us. What this means for practical purposes is that any grammars we work with will be context free grammars (CFG).

Any adequate NLP parser must be able to deal with the following two features of natural languages. One is easy to deal with, and the other is the most difficult problem of them all. At this stage I will simply present the features. Precisely why they are problematical and possible solutions will be offered in the section dealing with parser design. The English language is used throughout as the basis for discussion.

i. Grammatical agreement

This is the easy problem. Certain words categories in English have morphological variants: Pronouns (I, you, he, she, it...), Nouns (man, men; cat, cats), Verbs (play, plays, played; give, gives, gave, given) and so on. These variants cannot simply be used randomly, since they have particular functions in a sentence: to mark singular/plural, past/present and so on. There are, moreover, rules for determining how variants may be combined in a sentence, or, put another way, they must 'agree'. An important form of 'agreement' in English, and the one with which we shall be concerned, is subject-verb number agreement: a singular subject requires a singular verb, and a plural subject a plural one. Any NLP device will have to ensure that it accepts as grammatical only agreement-legal strings, and rejects those in which agreement fails (ie, 'The men sees').

ii. Ambiguity

This is the hard one. It's well known that some English sentences are ambiguous. Intuitively, this means that that these sentences can be taken in more than one way. A famous example is: 'I saw the man with a telescope', which can mean either that I was looking through a telescope and saw the man, or that that I was looking at the man, and he was carrying a telescope. Slightly more formally, as we know from the foregoing discussion, a sentence is ambiguous if it has more than one leftmost derivation, that is, if it has more than one structure associated with it. Thus, in the case of the famous 'They are flying planes', there are two possible structures:

A parser will have to be able to distinguish between alternative structures.

 

b) Structural descriptions

The output of our parser will be some representation of the structure of an input string. The usual representation is the familiar phrase marker or tree, as above. A computer program can output such trees, but this requires a considerable competence in graphics programming which relative beginners cannot be expected to have. The proposal, therefore, is to use an alternative structure representation: bracketed strings.

Consider the following tree:

Another way to represent the structure which this tree represents is:

B (C,D)

For this tree

the alternative representation is:

B ( C ( D (E,F)))

The leftmost tree for the string 'They are flying planes' in the previous section using this method would look like this:

S ( NP ( PRO (they)) VP (V (are) NP (ADJ (flying) N (planes))))

The advantage to this representation is that it is simply a character string, and character string output is easy to program. The result is not as easy to read for a human as a tree is, but that is irrelevant for computational linguistics purposes. The important thing is that the two are equivalent.

 

Lecture 13

c) The transducer

The mechanism inside the transducer has to take the strings of a language as input and deliver the corresponding structural descriptions as output: it implements a parsing algorithm. This section develops such an algorithm. It is in two parts. The first introduces some general issues in the construction of parsing algorithms, and the second then develops one particular algorithm in detail.

 

i. General approaches to parsing

Given a string generated by a grammar, the task of finding its derivation according to that grammar is called parsing, as we have seen. Note the implication: to build a parser for a language, the grammar which generates the language has to be known. There are two general strategies for parsing:

  • Top down: Given a grammar and a string generated by that grammar, we start with the sentence symbol S and try to find some sequence of productions whose application in that sequence generates the string.

 

  • Bottom up: Given a grammar and a string generated by that grammar, we start with the string itself, trying to infer the sequence which generated it by working backwards toward the sentence symbol S.

We will not be going into bottom-up parsing; the rest of this section explains what the above rather general account of top-down parsing means.

 

ii. Top-down parsing

Earlier, we looked at derivation of a string using a grammar. It was noted that the process of generating a string using a phrase structure grammar is the successive rewriting of sentential forms using the productions of the grammar, starting with the sentential form S. The sequence of sentential forms required to generate a string constitutes the derivation of the string according to the grammar. In other words, a derivation is a finite sequence of sentential forms of which the first is the string symbol S, and in which each subsequent sentential form is obtained by copying the current sentential form and rewriting the leftmost nonterminal symbol in accordance with one of the productions of the grammar. The rewriting process terminates when the sentential form contains no more nonterminal symbols, but consists entirely of terminals (that is, symbols of the alphabet of the language).

Now, it should be clear that the choice of which production to apply at any given step of the derivation, and the order of such choices, is what gives the string being generated its structure. This structure is usually represented as a derivation tree, otherwise known as a phrase marker. This tree is built as the derivation proceeds. We start with a tree containing only one node, the root node S. For each step in the derivation, the tree is correspondingly extended. That is, every time a production is used to replace a nonterminal in the current sentential form by the symbol(s) on the right-hand side of the production, lines are drawn from the corresponding nonterminal in the existing tree to each symbol in the replacement string. Take as an example a fragment of a CFG:

 

S --> NP VP

NP --> DET N

VP --> V NP

...........and so on.

A derivation starts with S, so the initial tree is:

S

At the next step S has to be rewritten. The grammar offers only one possibility: NP VP. The tree now looks like this:

At the next step NP has to be rewritten: the sentential form now is DET N VP. After several more steps expanding the NP, the tree looks like this:

The derivation, and the simultaneous construction of the tree, proceeds in this way until a string is generated, at which point the structure tree is complete as well.

What I'm leading up to is the basic insight that every string generated by a grammar has a structure associated with it. Parsing tries to rediscover that structure. The question is: how does one go about it?

We begin with a very general, but also very inefficient, method. It is presented here because it is a good introduction to parsing algorithms. This very general method simply tries out all the possibilities. Let's say we are given a CFG and a target string, that is, a string to be parsed. The object is to discover the derivation of the string --the sequence of production applications which started at S and ended up with the string. In the light of the foregoing discussion, this amounts to finding the derivation tree for the string. That being so, one builds a tree of all possible derivations, and picks out the one that leads to the target string. The algorithm is:

1. Start at S

2. At each step:

  • extend the tree with all possible production choices

  • eliminate those possibilities which cannot lead to the target string

3. Repeat 2 until the target string has been generated, at which point its structure will have been discovered.

 

An example should clarify this. Consider the following CFG for a small fragment of English:

S --> NP VP

NP --> NOM

NP --> DETNOM

DETNOM --> DET NOM

NOM --> N

NOM --> NPP

NPP --> N PP

PP --> PREP NP

VP --> V

DET --> the

N --> man, woman, dog, cat, house

PREP --> with, under, in

V --> ran, cried, ate

The target string is: the man in the house cried. We want to discover the derivation of this target string from the CFG.

Begin construction of the parse tree:

S

There is only one possible derivation from S. Insert the subtrees concerned into the parse tree:

Examine the possibilities NP and VP  to see if either of them can be eliminated. Neither can. There are two possible derivations from NP. Insert them into the tree

Examine the possibilities to see if any of them can be eliminated. The leftmost NOM can because, according to the grammar, NOM can ultimately generate a noun or a noun followed by a prepositional phrase, which means that the string would have to begin with a noun. But it doesn't. It begins with a determiner. Therefore discontinue the leftmost NOM, that is, don't develop its possibilities any further.

There is only one possible derivation from DETNOM Insert it into the tree:

There is only one possible derivation from DET. Insert it into the tree:

There are two possible derivations from NOM. Insert them:

Examine the possibilities to see if any can be eliminated. The leftmost N can: it generates a noun, and the next word in the target string is a noun, but it cannot generate a following prepositional phrase, which the target string has. Do not develop the leftmost N any further.

And so on; the algorithm will eventually generate the target string, and will thereby have discovered the target string's structure.

This very general algorithm is reliable, in the sense that it will either eventually discover the correct structure, or fail to and thus pronounce the string illegal relative the the grammar, that is, exclude the string from the language which the grammar generates. It is, however, also inefficient. The above grammar is very small. Any useful grammar for a natural language will be very much larger. For such a large grammar, exhaustive checking of possibilities becomes a major undertaking. This is because the number of possibilities to be checked increases very rapidly relative to increase in grammar size. As a result, this algorithm is useless for real-time natural language processing systems on account of the demands which they make on computational resources, and their consequent slowness. Something more efficient is needed.

That 'something more efficient' is the deterministic top-down NL parsing algorithm. In the general algorithm described above, the idea was to work down the proposed parse tree until it emerged either that a branch led to part of the string being parsed, or that that branch could never produce the relevant part of the parse string. In the latter case, development of that branch was abandoned. Now, ask yourself how we knew when to abandon a branch. With the first NOM we just looked to see whether the first word of the target string could be generated by the path being followed. That word was the, and reference to the grammar showed that NOM by itself could never generate a string beginning with the, so we abandoned that branch. But now consider the second NOM. We abandoned it because we knew that it couldn't generate the string we wanted, but how did we know? The answer is that we looked ahead in the string to see what was coming up, and decided on that basis. That is, we knew that the noun had to be followed by a prepositional phrase, and thus that the NOM --> N production could never generate the required PP, so we only kept the NOM --> N PP branch. Why not use this idea of looking ahead in the string to make the right choice in the first place? That is, instead of running down blind alleys, keep looking ahead in the string and, armed with knowledge of what's coming up, choose the correct branch at each step in the parse. This is the key to design of deterministic parsers; the next section designs such a parser, and thereby exemplifies the idea of looking ahead as a way of achieving determinism.

 

Lecture 14

iii. A top-down deterministic PDA parser

The PDA as parser

A PDA parser for the strings generated by the grammar G looks like this:

This machine differs from the PDAs seen so far in that it has output, since it implements a transducer, and therefore uses WRITE states in addition to the READ states already introduced. By means of this WRITE state, the PDA writes a bracketed structural description of the string in the input tape to an output tape.

Let us follow the operation of this parser on the only string the grammar generates: xyz. Initially, the input tape, stack, and output tape look like this look like this:

The tape has the input string written on it, one symbol per square, and the read head is positioned over the first symbol. The stack is has only one symbol in it: A. In both cases the symbol $ is not part of the language, but stands for 'empty'. Since the machine has not yet begun operation, the output tape is blank. Now begin the parser. Pop A, and follow the arc labelled A. This contains a sequence of states which, when traversed, yield the following machine configuration:

Go back to POP, and pop the opening bracket (. Follow the relevant arc, write the bracket, and go back to POP. The machine configuration now looks like this:

Now pop B, and follow the B-arc, returning to POP:

POP the opening bracket, write it, and return to POP:

Pop D, and follow the D-arc. Write D, and for the first time read the tape. If what is read corresponds to what is expected, write what is read and return to POP, otherwise reject the string.

Processing continues in this way until the string is either rejected at some stage, or the end of the tape has been reached and the stack is empty, at which point the string is accepted and a full structural description is written on the output tape.

 

Lectures 15,16

We are now in a position to design a deterministic PDA natural language parser. Consider these two context free grammars:

Gl   G2
S--> NP VP S--> NP VP
NP --> DET N  NP--> DET N 
  NP --> DET NOM
  NOM --> ADJ N
VP --> V  VP --> V
DET --> the   DET --> the   
N --> man  N --> man
  ADJ --> tall
V--> runs  V --> runs

Gl generates only one string (the man runs), and G2 generates two (the man runs, the tall man runs). Gl can be parsed with the PDAs developed so far, whereas G2 cannot. The problem with G2 has to do with the extra rules it contains, and in particular with the fact that NP can be rewritten in two ways. Why is this a problem? Let us attempt to use a PDA of the sort developed so far to parse the tall man runs:

 

The initial state is:

Pop S and follow the S-arc. The state now is

Pop the opening bracket. The state now is

Pop NP, and follow the arc. But which arc? There are two for NP. This is the crux of the problem. If the PDA follows the rightmost NP arc, the parse will proceed successfully, but if it follows the leftmost one the parse will fail. How can it choose the correct one? We have no answer yet. The rest of this section develops one.

Let's go back to the parse sequence we began immediately above. The initial state was:

Pop S and follow the arc. The state now is

Pop ( and follow the arc

Pop NP and follow the arc. This is where our problems arose. Let's say the machine (metaphorically) flips a coin and follows the left branch

Pop (

Pop DET

Pop N

Note now the relevant state sequence: Write N (which we have done), then Read, and if the Read was man keep going, otherwise Reject. But the read head is over tall: what would we have had to know to choose the correct path in the first place? Answer: that tall was coming up on the input string, not man. This is the basis for the solution to our immediate problem, and to the general problem of designing deterministic parsers: lookahead. At the moment, all the PDA can see is what's directly under the read head. but it is easy to add a lookahead facility. This we will now do.

The only change is addition of a type of state not seen before, SCAN, which is added to the until-now standard PDA set {START, ACCEPT, REJECT, PUSH, POP, READ, WRITE}. SCAN reads the symbol immediately next to the one currently under the read head. Let's see how this solves our problem. Consider again the above parse sequence, but this time with a SCAN facility.

The initial state was:

Pop S and follow the arc. The state now is

Pop ( and follow the arc

Pop NP, and follow the arc. First write NP to the output tape, and then Scan. The Scan reveals tall, so follow the rightmost arc:

The parse should now proceed successfully. Pop (

Pop DET

Pop NOM

Pop (

Pop ADJ

This time, following the Adj arc, the PDA reads tall and proceeds successfully to the rest of the parse.

Lookahead is the key to deterministic parsing. Is lookahead-l enough? No. there are NL constructions which require lookahead-2 or even 3, but the principle is always the same. We are concerned only with constructions that require lookahead-1.