G61.1830-001 Introduction to Programming for Linguists
Prof. Ray C. Dougherty
All examples and discussion in this document relate to English time expressions. A student may elect to write a grammar of time expressions in any language they choose, as long as they maintain the same notational schema described here for the phonetic form and logical (numeric) form. If we have projects in different languages, it should be a simple task to develop a program that can translate time expressions from one language to another.
Lectures, readings and computer exercises focus on this question: How can we program a generative grammar as a Prolog parser to analyze time-date expressions like those listed below and assign them an interpretation in terms of a 24 hour clock and a 12 month solar calendar?
at six forty five on August seventh nineteen seventeen
at quarter of seven on August the seventh nineteen hundred and seventeen
on the seventeenth of August at fifteen till seven
at six thirty in the evening on July the fourth
at half past six PM on Independence Day
on a Sunday afternoon in nineteen fourteen, June tenth at two PM, to be exact
At the Technical Level, in this course you will learn to use the basic computational linguistic tools required to implement generative grammars in Prolog. When you complete this course you will be able to:
At a Basic Conceptual Level, you will gain insight into the framework of assumptions and definitions underlying the minimalist program. In particular, we will present a grammar based on the assumptions of Chomsky's minimalist program. The grammar (of time expressions) will be encoded into Prolog as a function timedate(PF,NF), which defines sound-meaning pairs. One element of the pair, an orthographic string, or phonetic form (PF), is a string of elements (words) relevant for time expressions. The other element of the pair, the numeric form (NF), is a string of numbers in the order [Hour,Minute,Month,Date,Year]. The following represent valid pairings: (a) is in normal orthography, with spaces between the words and with capitals; (b) is a typical Prolog statement, with commas between the words and no capital letters. The (b) expressions are all Prolog statements that the computer will judge as true. On the 24 hour clock, AM runs from midnight to noon (hours 0-11) and PM runs from noon to midnight (hours 12-23).
(1) (a) on April fifteenth nineteen ninety at six fifteen in the morning [6,15,4,15,1990]
(b) timedate([on,april,fifteenth,nineteen,ninety,at,six,fifteen,in,the,morning],[6,15,4,15,1990]).
(2) (a) at ten to seven in the evening on July the fourth in nineteen forty [18,50,7,4,1940]
(b) timedate([at,ten,to,seven,in,the,evening,on,july,the,fourth,in,nineteen,forty],[18,50,7,4,1940]).
(3) (a) in nineteen hundred and ten at a quarter of four in the afternoon on Independence Day [15,45,7,4,1910]
(b) timedate([in,nineteen,hundred,and,ten,at,a,quarter,of,four,in,the,afternoon,on,independence,day],[15,45,7,4,1910]).
The Prolog function timedate(PF,NF) will be bidirectional. It will be possible to enter a fully specified PF and obtain the NF, as in these examples. Here timedate(PF,NF) is used as a recognition model.
(4) (a) PF = on October the twelfth at noon in nineteen forty three NF = [?,?,?,?,?]
(b) timedate([on,october,the,twelfth,at,noon,in,nineteen,forty,three],X).
(c) X = [12,0,10,12,1943]
(5) (a) PF = at ten minutes after six P.M. on the evening of the third of August in the Fall of nineteen hundred and sixty six NF = [?,?,?,?,?]
(b) timedate([at,ten,minutes,after,six,pm,on,the,evening,of,the,third,of,august,in,the,fall,of,nineteen,hundred,and,sixty,six],X).
(c) X = [18,10,8,3,1966]
It will be possible to enter the Numeric Form, and the function timedate(PF,NF) will then generate a correct string of words, as in these examples. Here timedate(PF,NF) is used to generate English strings, i.e., as a production model.
(6) (a) PF = ??? NF = [13,13,9,2,1976]
(b) timedate(X,[13,13,9,2,1976]).
(c) X = [at,one,thirteen,in,the,afternoon,of,september,second,nineteen,hundred,and,seventy,six]
(d) at one thirteen in the afternoon of September second nineteen hundred and seventy six
(7) (a) PF = ??? NF = [14,30,11,17,1989]
(b) timedate(X,[14,30,11,17,1989]).
(c) X = [in,nineteen,eighty,nine,at,half,past,two,in,the,afternoon,of,november,seventeenth]
(d) in nineteen eighty nine at half past two in the afternoon of November seventeenth
Notice three facts about this presentation and our notations:
(8) at six ten AM August thirteenth nineteen sixty six
(9) at ten minutes after six on the morning of the thirteenth of August in the year nineteen hundred and sixty six
One might try to relate the PF (8) directly to the NF, or the NF directly to the PF, since the order of elements in the PF corresponds almost directly to the order of elements in the NF. As discussed below, there is no particular order to the elements in the NF. We have arbitrarily assumed that the order in the NF corresponds to the order of elements in the PF of the 'zero expression,' i.e., the time expression that contains the minimum of grammatical formatives such as prepositions, conjunctions, determiners, and so on. It would be interesting to see the order of elements in the zero expressions of German, French, and other languages using a solar calendar and a twenty four hour clock. It would also be interesting to see the order of elements in the NF of languages using solar/lunar calendars, such as Hebrew, and languages using lunar calendars, such as some American Indian and oriental languages.
It would be difficult, however, to map the information in the PF (9) directly to the NF. And it would appear almost impossible to directly generate the PF given the NF since the PF contains words (after, on, the, of, in, and and) that reflect English syntactic structures more than information about times and dates. When these time expressions appear in sentences, they can be spread out throughout the sentence or placed in one position.
(10) In nineteen ninety nine Sean will on July fourth wake up at six thirty with an empty stomach.
Some of the words seem to be in groups (constituents) and tend to appear together:
(11) At ten after six in the morning of July second he woke up.
(12) On the morning of July second at ten after six he woke up.
(13) In the morning, at ten after six, on July second, he woke up.
It is possible that in order to pair the PF and the NF, the grammar might offer some structural configuration (a syntactic structure) to hold the words in a tree in order to calculate upon them. That is, it might be the case that there is not enough structure in a string or a list to enable a computational device to link a sound and a meaning. Perhaps there must be another data structure, a phrase marker (or a list of lists), that is calculated in order to link the PF and NF. Notice, however, that the assignment of any structure beyond that of a simple ordered list to the PF and NF can only be justified by an argument showing that there are some interesting phenomena in the operation that defines the pairs (PF,NF), and that the structure plays some crucial role in that operation. All structure (syntactic structure) injected by the calculation operations that define the pairing is virtual structure in that it 'disappears' at the two observable levels.
We will discuss two alternative syntactic approaches to dealing with time expressions, of which (8) and (9) represent opposite extremes.
The COoccurrence and PARaphrase (COPAR) hypothesis considers any time expression to be a member of a group of related time expressions. COPAR recognizes that among time expressions, some contain almost no grammatical formatives, where a grammatical formative is a preposition, conjunction, article, and so on, such as: in, of, on, at, the, a, etc. The time expression with the fewest grammatical formatives will be called the zero form, a somewhat misleading name since it may contain more than zero grammatical formatives. Following ideas of Zellig Harris, COPAR would develop a Prolog program that can link a sound and a meaning for a time expression like (8), but not for (9). Another Prolog program would link expressions like (8) and (9): if there is an expression like (8), then there is an expression like (9), and vice versa. In this way, if we want to assign (9) a Numeric Form, we would find that (8) and (9) contain the same elements, give or take one or more grammatical formatives (cooccurrence), and that they are pragmatically synonymous (paraphrase). Hence, the Numeric Form would only have to be paired with the zero form. All other forms could be linked by COPAR with the zero form, and be assigned the same Numeric Form as the zero form. For the zero time form, the structure (syntactic structure) assigned must (a) indicate the order of elements in the PF, (b) provide sufficient structure to relate it to the NF, and (c) provide sufficient structure to relate the form to other relevant forms, where relevant form is defined in terms of cooccurrence and paraphrase. For any non-zero form, the structure assigned (syntactic structure) must (a) indicate the order of elements in the PF and (b) provide sufficient information to relate the form to some zero form. These considerations could be formulated as derivational constraints or as cross-derivational constraints.
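As an illustration of the first COPAR step, pairing a zero form like (8) with its Numeric Form, here is a minimal sketch written as a definite clause grammar. It assumes SWI-Prolog. Only timedate/2 comes from this description; zero_form, to24, the other helper names, and the very small lexicon are hypothetical and cover just a handful of words.

```prolog
% A minimal sketch, assuming SWI-Prolog. Only timedate/2 comes from the
% course description; every other predicate name and the tiny lexicon
% below are hypothetical.

timedate(PF, [Hour, Minute, Month, Date, Year]) :-
    phrase(zero_form(Hour, Minute, Month, Date, Year), PF).

% Zero form: at HOUR MINUTE am/pm MONTH DATE YEAR
zero_form(Hour, Minute, Month, Date, Year) -->
    [at, HourW, MinuteW, Period],
    { hour_word(H12, HourW),
      minute_word(Minute, MinuteW),
      to24(H12, Period, Hour) },
    [MonthW, DateW],
    { month_word(Month, MonthW),
      date_word(Date, DateW) },
    year(Year).

year(Year) --> { year_words(Year, Words) }, words(Words).

words([])     --> [].
words([W|Ws]) --> [W], words(Ws).

% 12 hour clock plus am/pm to 24 hour clock. The first argument is
% always bound here because hour_word/2 is called before to24/3.
to24(12, am, 0).
to24(12, pm, 12).
to24(H,  am, H)   :- H < 12.
to24(H,  pm, H24) :- H < 12, H24 is H + 12.

% Toy lexicon (far from complete).
hour_word(1, one).   hour_word(2, two).     hour_word(3, three).
hour_word(4, four).  hour_word(5, five).    hour_word(6, six).
hour_word(7, seven). hour_word(8, eight).   hour_word(9, nine).
hour_word(10, ten).  hour_word(11, eleven). hour_word(12, twelve).

minute_word(10, ten). minute_word(15, fifteen). minute_word(30, thirty).
month_word(7, july).  month_word(8, august).
date_word(3, third).  date_word(4, fourth).     date_word(13, thirteenth).

year_words(1917, [nineteen, seventeen]).
year_words(1966, [nineteen, sixty, six]).
```

Because the lexicon is a set of facts and the clock conversion only tests or computes the 24 hour value, the same clauses should serve both directions. With the list form of (8), timedate([at,six,ten,am,august,thirteenth,nineteen,sixty,six],X) gives X = [6,10,8,13,1966], and timedate(X,[18,10,8,3,1966]) gives X = [at,six,ten,pm,august,third,nineteen,sixty,six]. A form like (9) is deliberately not covered; under COPAR it would be related to the zero form by a separate program.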
The COMPositional SEMANTICS (COMPSEM) hypothesis considers each time expression on its own, independent of any other time expression in the language. COMPSEM would assign a time expression, like (8) or (9), a structure of some sort (phrase marker, relational diagram, feature structure, and so on) and then attempt to read off of the structure, or to calculate from the structure, the relevant Numeric Form. The structure (syntactic structure) assigned to any time expression must (a) indicate the order of elements in the PF and (b) provide sufficient structure to calculate the NF. It may or may not show the relation of the time expression to any other time expression(s) in the language. COMPSEM would make extensive use of feature percolation mechanisms, and in general, would prefer a feature structure analysis.
The numeric form is an unordered list of numbers. We arbitrarily assign the numbers an order [Hour,Minute,Month,Date,Year], but we could just as easily have selected [Year,Minute,Month,Hour,Date] or some other order.
The orthographic/phonetic form is an ordered list of words. Some linear orders of elements are acceptable and others are not: at six ten PM, *six ten at PM, *at six PM ten. We will discuss several properties of the orthographic string, among them: distribution (the fact that elements like on, in, at, and so on do not occur freely but have constraints on their appearance, as in 14-17), intonation (the fact that some orders of elements require an intonation break, pause, or appositive stress, as in 18-19), and idiomatic usage (for instance, half only occurs as half past, not *half to, as in 20-22).
Distribution:
(14) on/?in the afternoon of July thirteenth
(15) on July thirteenth in/*on the afternoon
(16) on July the thirteenth, on July thirteenth
(17) on the thirteenth of July, *on thirteenth of July (the is required)
Intonation:
(18) on July the thirteenth, a Thursday, on July thirteenth, Thursday,
(19) on a Thursday, July thirteenth, on Thursday, July thirteenth,
Idioms:
(20) at quarter past six, at half past six
(21) at quarter to six, *at half to six
(22) *at three quarters to six, *at three quarters past six
The numeric form is the linguistic level that interfaces with the semantics of clocks (24 hour, sundials, windup, electric, and so on) and the semantics of calendars (solar, lunar, Gregorian, Jewish, Chinese, and so on). Given a clock and a numeric form, if the direction of information is from the semantic system (clock) to the numeric form, we call this reading the clock. If the direction of information is from the NF to the clock, we call this setting the clock.
Our grammar will impose basic constraints on the numeric form: Month ranges from 1 to 12, Date ranges from 1 to 31, Hour ranges from 0 to 23, Minute ranges from 0 to 59, and Year is any integer. Our grammar might mark (23) as deviant since there is no Minute that is eighty eight and no Date that is forty third. On the other hand, the expression is not grammatically ill-formed; it is simply false because it is logically impossible. There would be no trouble translating (23) into French or German, where it preserves the same type of deviance.
(23) *at six eighty eight on December forty third
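The range constraints on the numeric form can be written down almost word for word. The following is a minimal sketch, assuming SWI-Prolog (for between/3); the predicate name valid_nf/1 is hypothetical.

```prolog
% Sketch of the basic range constraints on a numeric form.
% valid_nf/1 is a hypothetical name; between/3 is an SWI-Prolog built-in.
valid_nf([Hour, Minute, Month, Date, Year]) :-
    integer(Year),            % Year is any integer
    between(0, 23, Hour),     % Hour ranges from 0 to 23
    between(0, 59, Minute),   % Minute ranges from 0 to 59
    between(1, 12, Month),    % Month ranges from 1 to 12
    between(1, 31, Date).     % Date ranges from 1 to 31
```

The numbers corresponding to (23) would be Hour 6, Minute 88, Month 12, Date 43, so a query such as valid_nf([6,88,12,43,1990]) fails (the year here is made up, since (23) names none). The check is deliberately coarse: it accepts a Date of 31 for every month, which leaves month lengths and leap years to the calendar semantics.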
Our grammar will not mark (24-31) as deviant. We assume that the NF could be incorporated into a semantic system that contained a fully specified calendar that would mark special days. The lexicon of the semantic system could include the information that Christmas is always December 25, but other holidays can vary the date within a specified range. Easter is always a Sunday, but the date can vary. There is a simple algorithm (see the entry for Calendar in the Encyclopedia Britannica) to calculate the day of the week (Monday, and so on) for any given date in the Julian or Gregorian calendar. Questions about the correlation of the movements of Mars, Venus, and the sun with the calendar could be incorporated into a semantic database of celestial information.
(24) *on Christmas Day July fourteenth nineteen seventy four
(25) *on Easter Day, Thursday,
(26) *on September second during the winter solstice
(27) *on Wednesday July fourth nineteen ninety six
(28) *during the leap year nineteen ninety four
(29) *on February twenty ninth of nineteen ninety four
(30) on October twelfth nineteen forty three during the full eclipse of the sun
(31) On November seventeenth of nineteen eighty nine at the conjunction of Mars and Venus
Following Chomsky, we will assume that the items at a level are defined by a lexicon and by principles that define possible combinations of lexical items (nouns, adjectives, and so on) and grammatical formatives (prepositions, conjunctions, and so on). One might formulate a first attempt at the grammar of time expressions as a set of principles that define appositive (PP (PP...)(PP...)) and complement (PP (P..)(NP (DET..)(N..)(PP..))) constructions. The expression at six PM on October twelfth could be either (32b) or (32c).
(32) (a) at six PM on October twelfth
(b) (PP (PP (P at)(NP six PM)) (PP (P on)(NP October twelfth)))
(c) (PP (P at) (NP (N six PM) (PP (P on)(NP October twelfth))))
We will propose grammatical principles of combination and a lexicon, encoded into Prolog, that will assign a labeled bracketing to the orthographic strings. The two textbooks contain many illustrations of the type of programs required.
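To make the two analyses of (32a) concrete, here is a minimal definite clause grammar sketch that returns a labeled bracketing as a Prolog term. It assumes SWI-Prolog. The nonterminal names (pp, basic_pp, np, prep, time_words) and the two-phrase lexicon are hypothetical, and the word groups six PM and October twelfth are treated as unanalyzed units.

```prolog
% Sketch: a toy grammar that assigns both bracketings in (32) to the
% string "at six PM on October twelfth". All names are hypothetical.

% Appositive analysis, as in (32b): two sister PPs inside a PP.
pp(pp(PP1, PP2)) --> basic_pp(PP1), basic_pp(PP2).
% Head preposition plus NP; with a complement NP this yields (32c).
pp(pp(p(P), NP)) --> prep(P), np(NP).

basic_pp(pp(p(P), np(Ws))) --> prep(P), time_words(Ws).

np(np(Ws))        --> time_words(Ws).
np(np(n(Ws), PP)) --> time_words(Ws), pp(PP).   % NP with a PP complement

prep(at) --> [at].
prep(on) --> [on].

% Toy lexicon of unanalyzed word groups.
time_words([six, pm])          --> [six, pm].
time_words([october, twelfth]) --> [october, twelfth].
```

Collecting all parses with findall(T, phrase(pp(T), [at,six,pm,on,october,twelfth]), Ts) should give exactly two trees: pp(pp(p(at),np([six,pm])),pp(p(on),np([october,twelfth]))), corresponding to (32b), and pp(p(at),np(n([six,pm]),pp(p(on),np([october,twelfth])))), corresponding to (32c). The grammar itself does not choose between them; that choice would have to be argued from the PF and NF pairing.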
Most of the time expressions encountered in actual data, for instance, the New York Times, The Wall Street Journal, court transcripts, travel agent transcripts, and so on, do not have the time expression as a single string of words in a sentence. That is, in a corpus, one is more likely to encounter the (a) expression than the (b) expression.
(33) (a) In nineteen forty there was at six fifteen AM an earthquake on July fourteenth.
(b) In nineteen forty at six fifteen AM on July fourteenth there was an earthquake.
(34) (a) On October fifth, when he came to work, he did not expect a fire at ten fifteen AM.
(b) At ten fifteen AM on October fifth, he did not expect a fire when he came to work.
We will see how one can use some of the existing computational linguistic tools at NYU to examine on-line corpora of millions of sentences. We will see how one might take a transcript of a court case, like the O.J. Simpson trial, and first filter the transcript for time expressions, and next order the sentences of the transcript according to the time expressions they contain. We will see how a conversation with a travel agent could be filtered for time expressions and then arranged temporally to give a travel plan. It could be possible to convert a PERT or Gantt chart into English sentences, or to convert from English sentences to a PERT or Gantt chart.
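Once each sentence has been paired with a numeric form, the ordering step is a sort on a chronological key. The following is a minimal sketch, assuming SWI-Prolog, fully specified numeric forms, and items written as NF-Sentence pairs; order_by_time/2 and the key/5 term are hypothetical names.

```prolog
% Sketch: order NF-Sentence pairs chronologically.
% NF is [Hour,Minute,Month,Date,Year]; the sort key reorders it as
% Year, Month, Date, Hour, Minute so that standard term order is
% chronological order. order_by_time/2 is a hypothetical name.
order_by_time(Tagged, Ordered) :-
    findall(key(Y, Mo, D, H, Mi)-Item,
            ( member(Item, Tagged),
              Item = [H, Mi, Mo, D, Y]-_ ),
            Keyed),
    keysort(Keyed, Sorted),          % stable sort on the key
    findall(Item, member(_-Item, Sorted), Ordered).
```

For example, with two made-up items, order_by_time([[10,15,10,5,1994]-s2, [6,15,7,14,1940]-s1], L) gives L = [[6,15,7,14,1940]-s1, [10,15,10,5,1994]-s2]. Partially specified numeric forms (an expression that gives a date but no hour, say) would need a policy for the missing slots before they could be sorted this way.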
A statement like on Thursday, June eleventh nineteen sixteen is overdetermined, in that it is a logical question whether [_,_,6,11,1916] is actually a Thursday or not. If a student is a good programmer, it would be possible to program the algorithm for correlating the day of the week with the date. This would not be an acceptable project in this graduate class, since the focus is on encoding the orthographic/phonetic level into Prolog. It would be acceptable as a project for a directed reading class.
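For readers curious what such a program looks like, here is a minimal sketch using Zeller's congruence, a standard formula for the Gregorian calendar. It is not necessarily the algorithm given in the encyclopedia entry, it does not handle Julian dates, and the predicate name day_of_week/4 is hypothetical. SWI-Prolog is assumed.

```prolog
:- use_module(library(lists)).   % nth0/3

% Sketch: day of the week for a Gregorian date via Zeller's congruence.
% day_of_week/4 is a hypothetical name. January and February are
% counted as months 13 and 14 of the previous year.
day_of_week(Year, Month, Date, DayName) :-
    (   Month < 3
    ->  M is Month + 12, Y is Year - 1
    ;   M = Month, Y = Year
    ),
    K is Y mod 100,              % year within the century
    J is Y // 100,               % zero-based century
    H is (Date + (13 * (M + 1)) // 5 + K + K // 4 + J // 4 + 5 * J) mod 7,
    nth0(H, [saturday, sunday, monday, tuesday,
             wednesday, thursday, friday], DayName).
```

Under this sketch, day_of_week(1916, 6, 11, D) gives D = sunday, so the example statement above would come out false as a matter of calendar logic even though it is grammatical. Likewise day_of_week(1996, 7, 4, D) gives D = thursday.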
A Prolog program that would translate between English and German or French time expressions would be simple in principle to execute. One would simply have an English grammar that linked an English PF to a numeric form [Hour,Minute,Month,Date,Year], and a French or German grammar that linked a French/German PF to the numeric form. At the level of NF, English and French would be identical. They differ mainly at the lexical level, and would have entries like: month_english(1,january), month_french(1,janvier), day_english(1,monday), day_french(1,lundi).
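A minimal sketch of this idea, restricted to month and date: two small grammars share the numeric pair [Month,Date], and translation is the composition of one grammar with the other. It assumes SWI-Prolog. The month_english/2 and month_french/2 facts follow the entry format given above; ordinal_english/2, number_french/2, date_english/2, date_french/2, and translate_date/2 are hypothetical names, and the lexicon is deliberately tiny.

```prolog
% Sketch: English <-> French date phrases linked through a shared
% numeric form [Month,Date]. Only the month_* entry format comes from
% the text; all other names are hypothetical.
month_english(1, january).   month_french(1, janvier).
month_english(7, july).      month_french(7, juillet).

ordinal_english(4, fourth).       number_french(4, quatre).
ordinal_english(14, fourteenth).  number_french(14, quatorze).

% English: MONTH ORDINAL, e.g. [july, fourteenth]
date_english([MonthW, OrdinalW], [Month, Date]) :-
    month_english(Month, MonthW),
    ordinal_english(Date, OrdinalW).

% French: le NUMBER MONTH, e.g. [le, quatorze, juillet]
date_french([le, NumberW, MonthW], [Month, Date]) :-
    month_french(Month, MonthW),
    number_french(Date, NumberW).

% Translation goes through the numeric form and works in both directions.
translate_date(English, French) :-
    date_english(English, NF),
    date_french(French, NF).
```

Here translate_date([july,fourteenth],F) gives F = [le,quatorze,juillet], and translate_date(E,[le,quatre,janvier]) gives E = [january,fourth]. The two languages differ in word order and in the choice of ordinal versus cardinal, but neither difference is visible at the level of the numeric form.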
A translation program that linked an English (solar calendar) date with a Hebrew (lunisolar calendar) date would require some calculations at the level of Numeric Form. Interested students may work on a Prolog program that will translate English time expressions using the Julian/Gregorian calendar into Hebrew time expressions using the Jewish calendar. The following passage from the Encyclopedia Britannica indicates some of the problems involved in this project.
The Jewish Calendar
Lunisolar structure. The Jewish calendar is lunisolar, i.e., regulated by the positions of both the moon and the sun. It consists usually of 12 alternating lunar months of 29 and 30 days each (except for Heshvan and Kislev, which sometimes have either 29 or 30 days), and totals 353, 354, or 355 days per year. The average lunar year (354 days) is adjusted to the solar year (365 1/4 days) by the periodic introduction of leap years in order to assure that the major festivals fall in their appropriate season. The leap year consists of an additional 30-day month called First Adar, which always precedes the month of (Second) Adar. A leap year consists of either 383, 384, or 385 days and occurs seven times during every 19-year period (the so-called Metonic cycle). Among the consequences of the lunisolar structure are these: (1) the number of days in a year may vary considerably, from 353 to 385 days, (2) the first day of a month can fall on any day of the week, that day varying from year to year. Consequently, the days of the week upon which an annual Jewish festival falls vary from year to year despite the festival's fixed position in the Jewish month.
Months and notable days. The months of the Jewish religious year, their approximate equivalent in the Western Gregorian calendar, and their notable days, are as follows:
Tishri (September-October)
Heshvan, or Marhesvan (October-November)
Kislev (November-December)
Tevet (December-January)
Shevat (January-February)
Adar (February-March)
Nisan (March-April)
Iyyar (April-May)
Sivan (May-June)
Tammuz (June-July)
Av (July-August)
Elul (August-September)
During leap years, the Adar holidays are postponed to Second Adar.
As you work on the project it may be useful to try to classify the types of information that will (1) be considered by the generative grammar (PF and NF) and (2) be considered as semantics, or real world knowledge. The possible deviance or non-deviance of (35) is, in our formulation, a decision made by the Prolog program that represents our grammar. This is not the type of mistake made by any adult speaker, or by any adult learning English. (36) is considered grammatical in English, and could be paired with a numeric form, but it is a logical question as to whether or not the sentence is true or false. (37) may be true or false, but it is not a logical question. This sentence can only be judged true or false by making some inquiries about empirical observations. It would require our grammar to be linked to a database of celestial data.
(35) on the fifty second of August at thirty five o'clock
(36) on Thursday, August fifteenth in eighteen hundred and seven
(37) on September ninth in nineteen seventeen at the conjunction of Mars and Venus