A Simple Prolog Tree Parser
Chapter 2, Section 1


From: Beginners Workbook in Computational Linguistics

Prof. Ray C. Dougherty
New York University | Linguistics Department


If you have any comments about these workbook pages, please let us know. Should we place more of these pages on line? Do you benefit from these pages? If you are a student at NYU, you may help in developing these pages, see the HTML Geselschaft. Your ideas and comments will lead to modifications and improvements.

1. How to use Prolog to parse a simple sentence.

One of the Prolog programs in the program file work001.txt (work001.zip) is a parser, label_bracket(X), that will convert the following (b) strings to the (c) labeled bracketings. The (a) string is the sentence in English orthography. The (b) string is a Prolog data structure. The (c) data structure is a labeled bracketing indicating the organization of the words into phrases, clauses, and sentences. In (d) is the Prolog statement that will yield (c) as output. Another program in the file work001 is tree(X), which produces a phrase marker for the input X.

A labeled bracketing of constituents can be difficult to read. A common notation in linguistics is a phrase marker that spreads constituent structure information into two dimensions. The Prolog relation tree(X) contains a pretty printer that can produce a phrase marker of the above sentences.

The following diagram shows the screen contents (scriptfile, or screen capture) of what happens when you are using tree(X) in SWI-PROLOG on an IBM PC under Windows to parse a simple string (the woman sees in the house) to obtain a labeled bracketing. The program also runs in Quintus Prolog.

2. Computer requirements to run this parser.


The program tree has a tiny lexicon and a simple set of combinatory principles. It is intended as a pedigogical tool for introductory classes intending to show how to encode Chomsky's ideas about generative grammar (particularly the merge operation) into Prolog.

3. Graphic: Phrase marker of "The woman sees in the house".

The two dimensional printout in the following is equivalent to this phrase marker.

4. How to Increase and Modify the Lexicon

The sample grammar encoded into work001.pl is about the simplest grammar imaginable. It has no selection restrictions, information about complement structure, and so on. It contains enough structure however to illustrate the basic ideas involved in runnning, editing, and modifying grammars encoded into Prolog. You should become familiar with the edit-Prolog-edit loop.

The existing lexicon contains these entries:

  /*   Program Name:  lexicon   This is part of work001.pl */
det([the]).    noun([house]).  prep([on]).    verb([looks]).
det([a]).      noun([table]).  prep([in]).    verb([sits]).
det([this]).   noun([woman]).  prep([at]).    verb([sees]).
det([that]).   noun([man]).    prep([under]). verb([persuades]).
                               prep([to]).    verb([thinks]).

4.1. To cure undergeneration add lexical items.

The grammar, work001.pl, will parse these sentences:

But the grammar undergenerates since it will not assign a phrase marker to these sentences:

Suppose we use a text editor to add these entries to the lexicon of work001.pl to get work0011.pl. These files are compress in work001.zip.

/*     File Name:  work0011.pl	*/
 /*supplementary words added to work001.pl */
/* work0011.pl  is work001.pl with these words added.*/

det([some]).  noun([birds]).  prep([before]).  verb([arise]).
det([any]).   noun([worm]).                    verb([dined]).

We can now load work0011.pl as in this figure:

4.2. To cure overgeneration delete lexical items.

The grammar overgenerates since it assigns a phrase marker to these ill-formed strings:

Some problems of overgeneration, like these, can be cured by eliminating items from the lexicon. Open the file work001.pl or work0011.pl in a text editor and remove the lexical entries for 'put' and 'persuade'. Delete these two lines:

Now the grammar in work001.pl will not assign a string to the above deviant strings.

4.3. Document your work with comment lines.

The Prolog interpreter ignores any material between /* and */ on a line. So you can use a text editor to modify files and to add documentation. Instead of deleting lexical items, you could edit the work001.pl to surround the lexical item by /* */ to yield these lines:

/*   modified on 11-17-98  */
/*   by Tracy              */
/*    verb([put]).         */
/*    verb([persuades]).   */

The Prolog interpreter will ignore these lines when loading the program, so they are effectively deleted from the file.

5. Assignment

5.1. Add items to the lexicon of work001.pl

Make a file, work0012.pl, containing the necessary lexical entries to parse these sentences:

To parse these last sentences you may assume that the determiner can be any string of letters or the empty set, that is []. Remember that all words in Prolog must be in lowercase: susan not Susan.

5.2. Remove items from the lexicon of work001.pl

Open work001.pl in a text editor and eliminate the entries for 'put' and 'persuade'. Remember that the file must be saved as an ascii file and not in some word processor format.

5.3. Add comment lines to work001.pl

Add comment lines to work001.pl to indicate the modifications you made, your name, and the date.

5.4. Turn in a printout of your modified program.