Large coverage dictionaries and grammars for text processing:
the INTEX system
1. Session Summary
INTEX is a corpus processing system that is also used as a linguistic development environment in over 60 research centers in Europe (Silberztein and Courtois 1990; Silberztein 1993; Gross 1993 and 1997; Fairon 1999). It has been developed at the LADL (University of Paris 7). Over 11 languages are currently being described with INTEX by the members of the RELEX network of laboratories (French, English, Spanish, Portuguese, Italian and Bulgarian are the most advanced resources). IBM has recently decided to use INTEX as its main platform for linguistic resource management and Information Retrieval systems; Educational Testing Service (ETS, Princeton) is currently using INTEX to build an item generation system (for producing "reasoning items" on the fly).
This session will present a text analysis methodology based on the use of lexical resources and grammars:
- The first speaker will present advanced methods for information extraction based on the use of large-coverage dictionaries and finite state grammars. He will present the general methodology and describe various uses of finite state graphs for text parsing and the use of transducers for text encoding;
- The second speaker will present a linguistic study carried out with INTEX. Using concordances (based on dictionary lookup), the linguist constructs local grammars that are used to disambiguate verbal structures;
- The third speaker will present GlossaNet, an application based on INTEX that performs some linguistic parsing over Web sites. He will focus on a special version of the software that can be used on-line on the Internet. This version allows users to search for linguistic patterns in daily updated corpora.
Adopting an interdisciplinary point of view, the session will allow us to show how the INTEX system is used by researchers in linguistics, literature, information retrieval, text encoding and for teaching.
2. INTEX distribution policy
INTEX is distributed by the Association pour le traitement informatique des langues (ASSTRIL, Paris). There is a special rate for university research and teaching.
INTEX works on Windows 95, 98, NT and 2000 (Win NT or 2000 is recommended). By default, the INTEX distribution (on cd-rom) includes linguistic resources for two languages (other languages can be ordered separately: French, English, Italian, Spanish, Portuguese). For further information, see the INTEX web page or contact:
3. The authors:
Max Silberztein is the author of INTEX. He now works as a researcher at the IBM T.J. Watson Research Center (Hawthorne, NY) and is coordinating IBM research projects involving INTEX. He will present INTEX and illustrate the combined use of finite state transducers and lexical resources for information extraction and text parsing.
Ray C. Dougherty is Professor at the NYU Department of Linguistics. He uses INTEX to teach Computational Linguistics and for his research in linguistics. He will present a practical example of a linguistic study that highlights some specific features of INTEX: by means of local grammars, he will examine cases of ambiguous semantic readings in verb versus verb-particle constructions (with a particular focus on verbs like pan/pan out and wash/wash out). His conclusions will also illustrate how local grammars and INTEX concordances can be used in the field of language teaching.
Cédrick Fairon is a visiting scholar at the NYU Department of Linguistics and at Educational Testing Service (ETS, Princeton). He is the author of GlossaNet, a software application based on INTEX that allows users to parse Web sites using linguistic resources. He will present the on-line version of this software (freely available on the Internet), which allows users to locate precise patterns in 25 daily updated newspapers, in 5 languages (English, French, Italian, Portuguese, Spanish).
4. Common References
Courtois, Blandine and Max Silberztein (eds.). 1990. Dictionnaire électronique du français, Langue française 87, Paris, Larousse, pp. 11-22.
Fairon, Cédrick (ed.). 1998-1999. Analyse lexicale et syntaxique: Le système INTEX, Lingvisticae Investigationes Tome XXII (Volume spécial), Amsterdam/Philadelphia: John Benjamins Publishing Co., 450 p.
Fairon, Cédrick. 1999. "A Web-Based System for Automatic Language Skill Assessment: EVALING", In Proceedings of the Workshop Computer-Mediated Language Assessment and Evaluation in Natural Language Processing. ACL 99: College Park, Maryland.
Fairon, Cédrick. 1999. "Parsing a Web site as a Corpus". In C. Fairon (ed.). 1998-1999. Analyse lexicale et syntaxique: Le système INTEX, Lingvisticae Investigationes Tome XXII (Volume spécial), Amsterdam/Philadelphia: John Benjamins Publishing Co., 450 p.
Garrigues, Mylène. 1992. "Information linguistique et didactique des langues", in Moyens technologiques de l'information et de la communication au service de l'enseignement. Apprentissage des langues, Actes de l'atelier 7a du Conseil de l'Europe (CIEP Sèvres, 15-21 déc. 1991), Strasbourg, Conseil de l'Europe.
Gross, Maurice. 1993. "Local grammars and their representation by finite automata", in Michael Hoey (ed.), Data, Description, Discourse, Papers on the English Language in honour of John McH Sinclair, London, Harper-Collins, pp. 26-38.
Roche, Emmanuel and Yves Schabes (eds.). 1997. Finite-State Language Processing, Cambridge, Mass./London, MIT Press.
Silberztein, Max. 1993. Dictionnaires électroniques et analyse automatique de textes: le système INTEX, Paris:Masson.
Silberztein, Max. 2000. Manuel d'utilisation INTEX 4.3. LADL, Université Paris 7: Paris. Downloadable from www.ladl.jussieu.fr/ INTEX.
Silberztein, Max. 1999. "Text Indexing with INTEX", in Computers and the Humanities 33:3, Kluwer Academic Publishers.
Silberztein, Max. 2000. "INTEX at IBM", in Proceedings of the Third INTEX workshop, RISSH ed: Université de Liège (Belgium), forthcoming.
Tutin, Agnès; Georges Antoniadis; Catherine Clouzot. 1999. "Annoter des corpus pour le traitement des anaphores". In Actes de la 6e conférence annuelle sur le Traitement Automatique des Langues Naturelles. 12-17 Juillet 1999. Cargèse, Corse.
5. Web pages
INTEX system: http://ladl.univ-mlv.fr/INTEX/index.html
GlossaNet (free on-line concordancer based on INTEX): http://glossa.ladl.jussieu.fr
PAPER 1 TITLE: Information extraction with INTEX
INTEX is a development environment that allows users to rapidly construct, test and maintain descriptions of specific patterns that occur in texts written in natural language. See an overview of the system in [Silberztein 1999]. Each description is represented by a local grammar, usually entered via the INTEX graph editor.
Local grammars can be used to represent:
-- character-based patterns, for the recognition of phone numbers (e.g. "sequence of 3 digits, followed by a space or a hyphen, followed by 4 digits"), email or Internet addresses, hours or dates expressed numerically, reference or serial numbers, sentence endings, etc.
-- orthographical patterns, for the recognition of spelling variants (e.g. "centre" or "center"), company names and their variants ("International Business Machines Corp. ", "Big Blue"), etc.
-- morphological patterns, for the recognition of families of derived words (e.g. "France, French, Frenchmen, frenchify") and inflected forms (conjugation of verbs, inflection of nouns);
-- families of lexical entries, for the recognition and indexing of related terms and concepts (e.g. "credit card, debit card, MasterCard, visa card...");
-- morphosyntactic patterns, for the recognition of frozen or semi-frozen expressions, such as complements of dates and times (e.g. "on Monday the 15th at 3PM", "two days ago in the early afternoon"), of locations, addresses, etc.
-- other morphosyntactic patterns for the recognition and co-indexing of transformed syntactic constructions (e.g. "N0's trip to N1 = N0 went, traveled to N1"). These techniques involve the use of transducers and can therefore be applied to text encoding. A. Tutin, for example, has used INTEX for XML encoding and for semi-automatic tagging of anaphora.
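As a rough illustration of the first kind of pattern in the list above, the phone-number description ("3 digits, a space or a hyphen, 4 digits") can be sketched as an ordinary regular expression. This uses Python's `re` module rather than an INTEX graph; the function name and the word-boundary anchors are assumptions added for the example.

```python
import re

# "3 digits, then a space or a hyphen, then 4 digits", as described in the text.
# The \b anchors are an added assumption so that longer digit runs are rejected.
PHONE = re.compile(r"\b\d{3}[ -]\d{4}\b")

def find_phone_numbers(text):
    """Return every substring of `text` that matches the phone pattern."""
    return PHONE.findall(text)

print(find_phone_numbers("Call 555-1234 or 555 9876, not 12345-67890."))
```

A character-based pattern of this kind needs no dictionary lookup, which is what separates it from the morphological and morphosyntactic patterns listed after it.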
One important characteristic of INTEX is that each local grammar can be easily re-used in other local grammars. Developers typically construct simple, elementary graphs that are equivalent to finite-state transducers (FSTs), and re-use these elementary graphs to construct more complex graphs.
This process is similar to the method by which engineers build "black boxes" with Computer Aided Design systems to design for instance simple logical operators (AND, XOR) that are subsequently reused in elementary arithmetic operations (ADD), reused in large numbers in more complex arithmetic operations (ADD64), in ALUs, processors, etc. INTEX provides tools to help design, test, debug, refine and maintain large numbers of local grammars in libraries.
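The reuse of elementary graphs inside larger ones can be approximated, for illustration only, with named regular-expression fragments in Python. The fragment names and the small date grammar below are hypothetical and do not reproduce INTEX graph syntax; they only show the building-block style of development.

```python
import re

# Elementary "graphs", written as regex fragments (a stand-in for INTEX graphs).
DAY_NAME = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
DAY_NUMBER = r"(?:[12]?\d|3[01])(?:st|nd|rd|th)"
HOUR = r"(?:1[0-2]|[1-9]) ?(?:AM|PM)"

# More complex "graphs" that reuse the elementary ones by name.
DATE = rf"on {DAY_NAME}(?: the {DAY_NUMBER})?"
DATE_TIME = rf"{DATE}(?: at {HOUR})?"

def find_date_expressions(text):
    """Return all date/time complements matched by the composed pattern."""
    return re.findall(DATE_TIME, text)

print(find_date_expressions("We meet on Monday the 15th at 3PM and again on Friday."))
```

Changing `DAY_NAME` (for instance, to add abbreviations) automatically updates every pattern built on top of it, which is the maintenance benefit the paragraph above describes.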
Another characteristic of INTEX is that all the objects processed (grammars, dictionaries and texts) are internally represented by FSTs. Therefore, all the functionalities provided by the system are expressed as a limited number of operations on FSTs. For instance, applying a grammar to a text is performed by computing the union of the grammar FSTs, and then the intersection of the resulting FST and the text FST. This architecture allows for very efficient algorithms (e.g. when applying a deterministic FST to indexed texts) and gives INTEX the power of a Turing machine (thanks to the ability to cascade FSTs).
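The union-then-intersection idea can be sketched with a toy implementation. The code below is a simplified assumption, not the INTEX algorithm: it handles acceptors only (no transducer output), treats the text as a linear automaton over word tokens, and all class and function names are hypothetical.

```python
from collections import defaultdict

class Automaton:
    """A small nondeterministic finite-state acceptor over word tokens."""
    def __init__(self, starts, finals, transitions):
        self.starts = set(starts)
        self.finals = set(finals)
        self.delta = defaultdict(set)
        for (source, symbol), targets in transitions.items():
            self.delta[(source, symbol)] |= set(targets)

def union(a, b):
    """Union of two acceptors; states are tagged so they cannot collide."""
    def tagged(auto, tag):
        return {((tag, s), sym): {(tag, t) for t in ts}
                for (s, sym), ts in auto.delta.items()}
    transitions = {**tagged(a, "a"), **tagged(b, "b")}
    starts = {("a", s) for s in a.starts} | {("b", s) for s in b.starts}
    finals = {("a", s) for s in a.finals} | {("b", s) for s in b.finals}
    return Automaton(starts, finals, transitions)

def apply_to_text(grammar, tokens):
    """Intersect the grammar with the linear text automaton given by `tokens`.

    Returns the (start, end) token spans that the grammar accepts.
    """
    matches = []
    for start in range(len(tokens)):
        states = set(grammar.starts)
        for end in range(start, len(tokens)):
            states = {t for s in states
                      for t in grammar.delta.get((s, tokens[end]), ())}
            if not states:
                break
            if states & grammar.finals:
                matches.append((start, end + 1))
    return matches

# Two tiny local grammars, merged and applied in one pass over the text.
cards = Automaton({0}, {2}, {(0, "credit"): {1}, (0, "debit"): {1}, (1, "card"): {2}})
agency = Automaton({0}, {2}, {(0, "the"): {1}, (1, "FBI"): {2}, (1, "Bureau"): {2}})
tokens = "the FBI traced a debit card".split()
print(apply_to_text(union(cards, agency), tokens))
```

Merging the grammars first means the text is scanned once, however many local grammars are loaded.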
I will describe the implementation of a large-coverage description of French determiners, based on the descriptions available in Goosse & Grevisse (1986), Gross (1986) and Salkoff (1999). The grammar is organized as a hundred local grammars represented by Finite State Automata.
References:
Goosse, André; Grevisse Maurice. 1986. Le Bon Usage. Duculot : Paris-Gembloux.
Gross, Maurice. 1986. Grammaire transformationnelle du français : syntaxe du nom. Cantilène : Malakoff.
Salkoff, Morris. 1999. A French English grammar. John Benjamins Ed. Amsterdam, Philadelphia.
Silberztein, Max. 1999. "Text Indexation with INTEX". In Computers and the Humanities vol. 33. Kluwer Academic Publishers: Amsterdam.
PAPER 2 TITLE: Parsing words and sentences ambiguous between opposite meanings using INTEX finite state grammars: You can’t take ambiguity too seriously
Ambiguity poses a serious challenge to any computational system that attempts to extract a meaning or sense from an input sentence. Some ambiguities can be clarified by reference to the surrounding text. G. Miller has pointed out that the word line is at least six ways ambiguous considered only as a noun. The word right is ambiguous in meaning and can be at least four different parts of speech (noun, adj, adverb, verb) according to WordNet.
For a parser that analyzes a sentence one word at a time from left to right, an ambiguous word can sometimes lead to ‘garden path’ phenomena, widely discussed in the sentence processing literature, most recently in Fodor and Inoue (eds.). A sentence like The horse raced past the barn died may ‘fool’ a parser into marking raced as the past tense of the verb race and assigning the string The horse raced past the barn a sentence structure. When the word died is encountered, this indicates that the analysis of the earlier string must be changed to yield a relative clause: The horse (that was) raced past the barn died. A garden path sentence contains an ambiguous element that (a) can be assigned two structures or meanings when the parser encounters it and (b) must be assigned only one of those meanings when the parser encounters a later element in the sentence, called the ‘disambiguator’.
My study focuses on garden path sentences that are ambiguous between opposite readings. In all cases, the ambiguous element is a verb (put, wash) such that a later element (a particle: out) changes the sentence to an opposite reading. I will discuss Quine’s ideas about the semantic analysis of sentences like You cannot take the newspapers too seriously, which can mean take them more seriously or pay them little heed. These ambiguities in a computerized natural language interface to a database are particularly pernicious. A simple ambiguity (line: telephone line, a new line of clothes, a line in the sand, she gave him a line, I have a line on it) may return information about Gucci’s latest line rather than AT&T’s lines. But an intelligent reader can sift the wheat from the chaff. The particle out is like the element not. It reverses the sense of the sentence, but the sentence is usually well-formed with or without the out.
I discuss the INTEX lexical entries required to deal with cases of ‘ambiguous’ semantic readings in ‘verb’ versus ‘verb particle’ constructions that occur with examples like pan and pan out. My theory of cognition panned out. This implies that my theory had some success and is a positive statement. The critic panned my theory of cognition. This implies that my theory had a setback and is a negative statement. For some, The play panned in London, means that it failed, while The play panned out in London means that it was successful. I present the INTEX finite state analysis for examples involving …pan… which would have a ‘negative’ interpretation until an …out… is encountered, whereupon …pan…out… would be given a ‘positive’ interpretation.
If pan is negative and pan out is positive, then out has the opposite effect with wash. The noun washout usually means ‘failed’ as in: Our complicated financial plans were a washout. But wash can mean ‘succeed’ or ‘come out even’ as in: Our complicated financial plans were a wash. As a verb, wash is ‘positive’ implying success, as in: Your financial plans will wash. Wash out is ‘negative’ implying failure, as in: Your financial plans will wash out. A left-right parser assigning semantic interpretations must ‘change’ its semantic interpretation for pan and wash. It worked and it worked out are both positive, implying success. It flunked (out) and it punked (out) are all negative, suggesting failure. I present the INTEX English grammar and lexicon of common examples found in English newspapers and machine readable journals by GlossaNet.
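The left-to-right revision described above can be illustrated with a toy scanner. This is a minimal sketch, not the INTEX grammar or lexicon presented in the talk: the polarity table is taken from the examples in the text, while the hand-listed inflected forms, the function name, and the assumption that out immediately follows the verb are simplifications made for the example.

```python
import re

# Polarity of the bare verb and of verb + "out", following the examples above.
LEXICON = {
    "pan":   {"bare": "negative", "out": "positive"},
    "wash":  {"bare": "positive", "out": "negative"},
    "work":  {"bare": "positive", "out": "positive"},
    "flunk": {"bare": "negative", "out": "negative"},
}

# Hand-listed inflected forms (an assumption; a real system would use
# dictionary lookup and would also separate verbs from homographic nouns).
FORMS = {
    "pan": "pan", "pans": "pan", "panned": "pan",
    "wash": "wash", "washes": "wash", "washed": "wash",
    "work": "work", "works": "work", "worked": "work",
    "flunk": "flunk", "flunks": "flunk", "flunked": "flunk",
}

def read_polarity(sentence):
    """Scan left to right and return each reading in the order it is assigned.

    The last item is the final interpretation; an earlier item for the same
    verb is the reading that had to be revised when "out" was encountered.
    """
    readings = []
    verb = None  # surface form of a verb seen on the previous token, if any
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in FORMS:
            verb = token
            readings.append((verb, LEXICON[FORMS[verb]]["bare"]))
        elif token == "out" and verb is not None:
            readings.append((f"{verb} out", LEXICON[FORMS[verb]]["out"]))
            verb = None
        else:
            verb = None
    return readings

print(read_polarity("My theory of cognition panned out."))
print(read_polarity("Your financial plans will wash out."))
```

For pan and wash the two readings in the returned list disagree, which is the revision the parser must make; for work and flunk they agree.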
I am using INTEX to examine the occurrences of pan, pan out, wash, washout, and wash out in the New York Times, Wall Street Journal, and several other papers to study the possible ambiguities and specific interpretations in actual samples. Since there is a federal tax law popularly known as the ‘wash law,’ there are many examples with wash and wash out both in professional journals and in newspapers. INTEX has parsed the past two years of these newspapers. Research having native and non-native English speakers semantically tag texts suggests that non-native speakers of English – who are unaware of the meaning reversal in these verbs – misunderstand passages in leading newspapers. One must speak excellent native English to understand financial and political discussions in newspapers. Non-native speakers sometimes think a financial failure (pan, washout, wash out) is a success (pan out, wash). If this research pans out, it will wash, but if it pans, it will wash out and be panned as a washout.
Words that are ambiguous between opposite readings can cause havoc in two situations that I discuss. Search engines may return cases that are the opposite of what you want with no indication that they are the opposite; for instance, a search engine may list projects that collapsed (panned, washed out) when you want projects that succeeded (panned out, washed). And if the out is carelessly handled in machine translation, the system may take an input sentence that claims some project succeeded and translate it to indicate that the project failed. My grammar and lexicon are INTEX finite state graphs. All my example sentences were located in machine readable text by GlossaNet.
References
Gross, Maurice. 1993. "Local grammars and their representation by finite automata", in Michael Hoey (ed.), Data, Description, Discourse, Papers on the English Language in honour of John McH Sinclair, London, Harper-Collins, pp. 26-38.
Gross, Maurice. 1997. "The Construction of Local Grammars", in E. Roche and Y. Schabes (eds.), Finite State Language Processing, Cambridge, Mass., The MIT Press, pp. 329-352.
Fodor, J. D. and Inoue, A. 1998. "Attach anyway". In Fodor, J. D. and Inoue, A. (Eds.), pp. 101-142.
PAPER 3 TITLE: Parsing a Web site with linguistic resources : GlossaNet
GlossaNet is an automated system that monitors Web sites. On dates and at intervals selected by the user, GlossaNet downloads the Web site, converts it to an electronic corpus and uses the INTEX programs (M. Silberztein 1993) and the linguistic resources of the LADL (electronic dictionaries and libraries of local grammars) to parse it (B. Courtois and M. Silberztein 1990). We present the on-line version of GlossaNet. This version is accessible on the Internet and offers an automatic service for making concordances (http://glossa.ladl.jussieu.fr). It is mainly designed for use by linguists, but it is also used by some for information retrieval purposes since the corpora available in GlossaNet are the daily updated on-line editions of 25 newspapers in French, English, Italian, Portuguese and Spanish.
Dynamic corpora
We borrow the term dynamic corpus from A. Renouf (1992, 1994) to characterize the way corpora are treated in GlossaNet. In linguistic studies, the term corpus is generally used to refer to a static and finite collection of texts gathered on the basis of criteria chosen according to the planned applications. Once the corpus has been set up, it does not change. But, as A. Renouf showed through the AVIATOR project (Birmingham University), it is possible to imagine another approach to corpus design, where the corpus is viewed as a flow of electronic textual data. The technical difference between AVIATOR and GlossaNet is the full automation of GlossaNet: a module of GlossaNet called CorpusWeb downloads and converts a Web site into a corpus that feeds the flow of electronic data (see Figure "GlossaNet Process"). In our system, Web sites are treated and parsed as corpora, or more precisely as dynamic corpora, since their content changes over time (C. Fairon 1999). D. Walker (1999) has also used a Web crawler to create corpora.
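The conversion step (turning a downloaded page into corpus text) might look roughly like the sketch below. It is not the CorpusWeb module: it uses Python's standard `html.parser`, works on an in-memory string instead of downloading anything, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script> or <style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the page text as one string, ready for linguistic processing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ("<html><head><style>p{}</style></head><body><h1>News</h1>"
        "<p>The plan panned out.</p><script>x=1</script></body></html>")
print(html_to_text(page))
```

Running the same conversion on each new edition of a site, and appending the result to the stored text, is one simple way to picture the "flow" of a dynamic corpus.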

[Figure: GlossaNet Process. The process that GlossaNet relaunches automatically each time the Web site is updated.]
On-line service
Users must register on the GlossaNet server to access the system. Once this one-time registration is completed, they choose a working language so that GlossaNet displays the list of available corpora for this language (e.g. Chicago Tribune, Los Angeles Times, Philadelphia Inquirer, New York Post, The Guardian, The Times, The Herald Tribune, etc. for English). The user chooses a corpus and composes his/her request in the form of a regular expression or a graph (Finite State Automaton). Here are examples of valid regular expressions that can be applied to an English corpus:
Example of regular expressions | Matched patterns
((<be><V:G> to)+<will>)<V:W> | am going to rent, will check, etc.
<be>in(<DET>+<E>)<N> | was in a hurry, are in a sweat, etc.
the (FBI+Federal Bureau of Investigation+Bureau) | the FBI, the Federal Bureau of Investigation, the Bureau
<be>a good <N+hum> | is a good man, was a good teacher
In theory, graphs are equivalent to regular expressions, but in practice they offer a more convenient interface for representing complex structures. For instance, the following graph is equivalent to the first regular expression in the table above:

Each path of the graph defines a "valid" pattern that will be found if the graph is applied to a text.
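To make the idea of paths concrete, the first regular expression, ((<be><V:G> to)+<will>)<V:W>, can be unfolded by hand into its two paths and matched against tokens with a toy dictionary. This sketch assumes the usual reading of the notation (a lemma in angle brackets matches any of its inflected forms, a code such as V:G matches words carrying that code, and + separates alternatives); the dictionary entries and function names are hypothetical, and the code "V" is a placeholder for forms whose exact code does not matter here.

```python
# Toy dictionary: word form -> list of (lemma, grammatical code) analyses.
DICTIONARY = {
    "am": [("be", "V")], "is": [("be", "V")], "are": [("be", "V")],
    "will": [("will", "V")],
    "going": [("go", "V:G")], "planning": [("plan", "V:G")],
    "rent": [("rent", "V:W")], "check": [("check", "V:W")],
}

# The two paths of the graph, unfolded by hand from the regular expression.
PATHS = [
    [("lemma", "be"), ("code", "V:G"), ("word", "to"), ("code", "V:W")],
    [("lemma", "will"), ("code", "V:W")],
]

def symbol_matches(symbol, token):
    """Check one graph symbol against one token of the text."""
    kind, value = symbol
    analyses = DICTIONARY.get(token.lower(), [])
    if kind == "word":
        return token.lower() == value
    if kind == "lemma":
        return any(lemma == value for lemma, _ in analyses)
    return any(code == value for _, code in analyses)  # kind == "code"

def find_matches(tokens):
    """Return every token sequence that follows one complete path."""
    found = []
    for start in range(len(tokens)):
        for path in PATHS:
            window = tokens[start:start + len(path)]
            if len(window) == len(path) and all(
                    symbol_matches(s, t) for s, t in zip(path, window)):
                found.append(" ".join(window))
    return found

print(find_matches("I am going to rent a car and she will check the price".split()))
```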
Results are sent by e-mail to the user in the form of a concordance. If the user has opted for an HTML concordance, each matched pattern shown in the concordance is a hyperlink that takes the user to the original Web page where the occurrence was found. The occurrence is automatically highlighted in the original Web page.
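A concordance in this sense lists each match with a window of left and right context. The following sketch shows the general shape of such output using a plain Python regular expression; it is an illustration under that assumption, not GlossaNet's output format, and the function name and `width` parameter are hypothetical.

```python
import re

def concordance(text, pattern, width=30):
    """Return one line per match: left context, [match], right context."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("If this research pans out, it will wash. "
          "Most ideas never pan out in practice.")
for line in concordance(sample, r"\bpans? out\b", width=20):
    print(line)
```

In an HTML version, the bracketed match would be the natural place for the hyperlink back to the source page.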
Applications
The on-line version is mainly used by linguists for locating examples of lexical/syntactic structures but also by people who have to survey the press for professional reasons. This second category of users does not look for lexical or syntactic structures, but uses keywords instead.
For each language, GlossaNet includes several newspapers from various parts of the world, so GlossaNet can also be used for comparative studies (for example, in French, there are corpora from France, Belgium, Quebec and Switzerland).
Lately, the system has been used at the LADL (Laboratoire d’Automatique Documentaire et Linguistique, Université Paris 7) to update the DELA electronic dictionaries of English. Maintaining and extending these dictionaries is a considerable task, and an automated system that simplifies it is very useful. GlossaNet was used to automatically retrieve unknown common words in newspapers. Methodology and results are discussed in C. Fairon and B. Courtois (2000).
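The retrieval of unknown common words amounts to a dictionary lookup over the tokens of each new text. The sketch below is a simplified assumption, not the method reported in Fairon and Courtois (2000): the dictionary is a plain set of lowercase forms, and capitalized tokens are skipped as a crude proxy for proper names (which also skips sentence-initial words).

```python
import re
from collections import Counter

def unknown_words(text, dictionary):
    """Count lowercase tokens of `text` that are absent from `dictionary`."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Capitalized tokens are ignored: an assumption made to approximate
    # "common words" without any proper-name recognition.
    return Counter(t for t in tokens if t.islower() and t not in dictionary)

known = {"the", "panned", "flopped"}
article = "The blogosphere panned the webcast. The webcast flopped."
print(unknown_words(article, known).most_common())
```

Ranking candidates by frequency gives a lexicographer a short list to review instead of the full text.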
Because GlossaNet on-line requires no installation or special configuration on the user’s machine, it can be easily used for teaching.
Conclusion
GlossaNet combines several pre-existing technologies (a Web grabber, a corpora parser and linguistic resources) in order to parse Web sites as corpora.
The on-line system offers linguists a simple way of finding attestations of lexical and syntactic patterns in press corpora. It is no longer necessary to manipulate corpora and software to find new attestations: once the request is recorded, the system repeats the task automatically and sends a new concordance by e-mail every day or week.
During the first test period, GlossaNet on-line was used by more than 450 people and sent more than 600 concordances a day.
References
Fairon, Cédrick; Blandine Courtois. 2000. "Corpus dynamique et GlossaNet : Extension de la couverture lexicale des dictionnaires électroniques du LADL à l'aide de GlossaNet", in Actes du Colloque JADT 2000 : 5es Journées Internationales d'Analyse Statistique des Données Textuelles, Lausanne.
Fairon, Cédrick. 1999. "Parsing a Web site as a Corpus". In C. Fairon (ed.). 1998-1999. Analyse lexicale et syntaxique: Le système INTEX, Lingvisticae Investigationes Tome XXII (Volume spécial), Amsterdam/Philadelphia: John Benjamins Publishing Co., 450 p.
Renouf, Antoinette. 1992. "A Word in Time: first findings from the investigation of dynamic text", ICAME Conference, Nijmegen.
Renouf, Antoinette. 1994. "Corpora and Historical Dictionaries", in I. Lancashire and T. Russon Wooldridge (eds.), Early Dictionary Databases, Center for Computing in the Humanities, University of Toronto, pp. 219-235.
Silberztein, Max. 1999. "Transducteurs pour le traitement automatique des textes". In B. Lamiroy (ed.), Le Lexique-grammaire. Travaux de Linguistique 37, pp. 127-142. Bruxelles: Duculot.
Walker, Derek. 1999. "Taking Snapshots of the Web with a TEI Camera". In Computers and the Humanities 33(1/2), pp. 185-192.