AUTHOR :
Jean-Frédéric de Pasquale
AFFILIATION :
Laboratoire d'ANalyse Cognitive de l'Information (LANCI)
Université du Québec à Montréal (UQAM)
E-MAIL :
jfdepasquale@yahoo.com
AUTHOR :
Jean-Guy Meunier
AFFILIATION :
Laboratoire d'ANalyse Cognitive de l'Information (LANCI)
Université du Québec à Montréal (UQAM)
E-MAIL :
meunier.jean-guy@uqam.ca
CONTACT ADDRESS : Laboratoire
d'ANalyse Cognitive de l'Information (LANCI)
Université du Québec à Montréal (UQAM)
C.P. 8888, Succ. Centre-Ville
Montréal (Québec) Canada
H3C 3P8
FAX NUMBER :
(514) 987.6721
PHONE NUMBER :
(514) 987.3000 ext. 0339
1.- Mathematical classification and categorization strategies
There are two important recurring strategies in computer-assisted reading
and analysis of text (CARAT). The first is classification, which, through
various clustering techniques, must discover classes of segments on the
basis of some similarity criterion.
This is typical in lexical, semantic, narrative, thematic or stylistic
analysis. The second strategy is categorization, understood
in the information retrieval sense (not the cognitive one): the attribution
of tags from a finite set of tags to each segment, sentence, or word of
the whole text. These tags are used as descriptors for some aspect of the
content. They may be morphological (e.g. Masc., Fem.) or syntactic (e.g. noun,
verb), but they may also be semantic. For instance, semantic tags may
define the individual senses of words (Paliouras, Karkaletsis and Spyropoulos
1999, Rastier 1994) by relating them to some conceptual, notional or ontological
category such as "HUMAN", "MATERIAL OBJECT", "ETHICAL SUBJECT", etc.
These two often interrelated operations have been regularly recognized
as essential components of text analysis (Beaugrande 1980, Landow &
Delany 1993, Jansen 1992, Hearst 1994, Hayes 1979, Barrett 1989, Rastier
1994, Robert & Bouillaguet 1997). It is through these two main operations
that content analysis and interpretation of texts are usually performed.
Although some of these operations can be computer assisted when they belong
to the basic grammatical level (lemmatization, morphological tagging, syntactic
tagging), computer assistance is seldom found at the more complex semantic
and logical levels. This is why systems such as NUDIST*, ATLAS,... are so
welcome (Alexa & Zuell 1999a, 1999b). These systems assist and manage the
manual classification
and categorization process. But even so, these two operations are highly
time-consuming. For relatively small corpora, such manual operations may
be possible, but for large and complex philosophical or literary text corpora,
or even a large corpus of psychological interviews, the process is too
laborious to be practically realizable. A possible solution to
this problem calls upon more inductive or bottom-up strategies that are
numerical and statistical. These classification and categorization techniques
are used in the information retrieval field and in what is increasingly
called text mining (Hearst 1994, 1999). By comparison, these
techniques are fast, easy to use and entirely or almost entirely automatic.
The classification techniques are usually realized through various clustering
strategies such as factor analysis, k-means, principal component analysis,
etc. (Bouroche and Saporta 1980). The categorization techniques are realized
through neural nets (Wermter, Panchev and Arevian 1999), k-NN, linear regression
(Yang and Liu 1999), decision trees (Lewis and Ringuette 1994), genetic
algorithms (Tauritz, Kok & Sprinkhuizen-Kuyper 2000), etc. Both types
of strategies may be combined, and both are known to have achieved
considerable success. The categorization algorithms in recent research
(Sebastiani 1999) may even score more than 80% on the breakeven point
measure. But applications of these techniques to humanities
texts have not been frequent. Most of the time, the categorization algorithms
are used with simple and easy-to-process corpora (such as the standard test
corpora, e.g. the different Reuters corpora); humanities texts, and even more
so philosophical or literary texts or psychological interviews, need finer
discriminations.
Our research aims to find answers to the following question: Can these
text classification and categorization techniques be applied successfully
to the reading and analysis of texts in the humanities and social sciences?
A positive answer would allow important methodological innovations in
computer text analysis as practiced in these fields, because machine
learning algorithms allow readers to build their own categories without
an explicit theory of the necessary and sufficient conditions for belonging
to the categories. Some researchers (Hearst 1999) think that these text
mining tools should be used as new scientific tools, just as microscopes
or telescopes were. For the moment, we think, more modestly, that these methods
have to be explored more systematically on large and complex corpora before
we can pronounce on their strengths and weaknesses.
In our own research, we are exploring a few of these techniques and
their combinations. We now know, through our own past research and others'
work, that classification methods allow a good empirical thematic
exploration of a corpus (Meunier, Remaki, Forest, 1999; Meunier, Memmi,
Gabi, 1998) and may be used in the hypertextualization of a corpus (Nault,
Rialle, Meunier, 1999). More specifically, in this paper we shall concentrate
mainly on the problem of assisting the automatic categorization of small
segments of a philosophical text into a set of thematic categories. The main
goal of this experiment is to provide a "proof of concept": is the idea of using
these information retrieval tools in content analysis viable? More
work must be done before we can have a definitive answer, but this experiment
can give a general idea of the possibilities and limits of the current
tools, the perceptron being one of the best.
2.- Methodology
Because of the particularly complex nature of humanities texts, the design of our methodology contains six main steps. In the first step, the text is filtered: all functional and subjectively non-pertinent words may be eliminated from the text, either manually or automatically. In the present experiment, for simplicity of evaluation, we have skipped this step. In the second step, a set of categories or tags is chosen. This set of tags is the working hypothesis of the expert reader; the tags are usually drawn from the a priori knowledge that the expert has about the corpus. In the third step, the original text is automatically transformed into a matrix, using the Vector Space Model (Salton and McGill 1983; Manning and Schütze 1999). Here, each segment is seen as a binary vector, and each element of the vector represents the absence or presence of a specific word. The fourth step is training. Here, as usual with these algorithms, the expert reader manually tags a sample set of segments. Then a neural net "learns" what "counts" as a typical exemplar of a particular tag. Technically, this learning is realized by defining a partition of the vector space by a hyperplane, using linear regression. In the fifth step, the neural net takes on the whole text: it assigns the remaining segments of the text to each of the categories. This is realized by applying the categorization technique to the matrix built in the third step. In the sixth step, the various segments of the text are presented to the expert for analysis and evaluation. Here the expert may accept or reject the classification according to some template or other (e.g. experts in the field, his own working hypothesis, etc.). Further development will explore the possibility of using some kind of dynamic relevance feedback technique (Salton and Buckley 1990), e.g. genetic algorithms (Nault, Rialle and Meunier 1999).
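As an illustration only, the third to fifth steps (binary vectors, training on tagged segments, tagging other segments) can be sketched as follows. This is not the CONTERM implementation: the function names, the toy segments and their tags are invented for the example, and the sketch assumes the classical error-correction rule of a one-layer perceptron for fitting the separating hyperplane.

```python
from typing import Dict, List, Sequence, Tuple


def build_vocabulary(segments: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct word to a column index (sorted for reproducibility)."""
    words = sorted({w for seg in segments for w in seg})
    return {w: i for i, w in enumerate(words)}


def to_binary_vector(segment: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """1 if the word occurs in the segment, 0 otherwise; unknown words are ignored."""
    vec = [0] * len(vocab)
    for w in segment:
        if w in vocab:
            vec[vocab[w]] = 1
    return vec


def train_perceptron(
    X: Sequence[Sequence[int]], y: Sequence[int], epochs: int = 20, lr: float = 1.0
) -> Tuple[List[float], float]:
    """One-layer perceptron: adjust weights only when a segment is misclassified."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation > 0 else 0
            delta = target - predicted
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # every training segment is on the right side of the hyperplane
            break
    return weights, bias


def predict(x: Sequence[int], weights: Sequence[float], bias: float) -> int:
    """Return 1 if the segment falls on the positive side of the hyperplane."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical manually tagged sample (1 = belongs to the category, 0 = does not).
tagged = [
    ("we have knowledge of truths by acquaintance".split(), 1),
    ("we know things by acquaintance and by reason".split(), 1),
    ("york is between london and edinburgh".split(), 0),
    ("some relations demand three terms".split(), 0),
]
segments = [seg for seg, _ in tagged]
labels = [tag for _, tag in tagged]

vocab = build_vocabulary(segments)
X = [to_binary_vector(seg, vocab) for seg in segments]
weights, bias = train_perceptron(X, labels)
predictions = [predict(x, weights, bias) for x in X]
print(predictions)
```

In a real run the trained weights would be applied to the untagged remainder of the text rather than to the training segments themselves.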
3.- The experiment
The preceding methodology has been applied to a philosophical text by Bertrand Russell (about 43 000 words). The text is segmented into 50-word segments. The set of categories chosen pertains to the various philosophical dimensions that a Russellian discourse can present. The ones chosen here were: "PERCEPTION", "KNOWLEDGE", "MIND". The categories are not exclusive and do not form a structured ontology. The computer processing was carried out on an in-house system called CONTERM, in which perceptron neural net modules have been included and specially programmed for this experiment. The one-layered perceptron algorithm is a classical but robust neural network. Current research seems to show that the multilayered perceptron is not better than the one-layered one in the text categorization task.
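The segmentation into fixed-size segments can be sketched as follows. The function name is hypothetical, and two details are assumptions that the description above does not settle: words are taken to be whitespace-delimited tokens, and the final segment is simply allowed to be shorter. With about 43 000 words and 50 words per segment, such a procedure yields roughly 860 segments.

```python
from typing import List


def segment_text(text: str, size: int = 50) -> List[List[str]]:
    """Split a text into consecutive segments of `size` whitespace-delimited words.

    The last segment may contain fewer than `size` words.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# Toy text of 120 artificial "words" (w0 ... w119), not the Russell corpus.
sample = " ".join(f"w{i}" for i in range(120))
segments = segment_text(sample, size=50)
print([len(s) for s in segments])  # [50, 50, 20]
```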
4.- Results
After training our system on some initial segments, the system then had
to categorize the rest of the text. The results were positive. As an example,
it correctly categorised the following segment into the category "KNOWLEDGE":
"In this respect our theory of belief must differ from our theory of
acquaintance, since in the case of acquaintance it was not necessary to
take account of any opposite. (2) It seems fairly evident that if there
were no beliefs there could be..."
But the segment:
"Some relations demand three terms, some four, and so on. Take,
for instance, the relation 'between'. So long as only two terms come in,
the relation 'between' is impossible: three terms are the smallest number
that render it possible. York is between London."
is rightly rejected as not belonging to the category. As we can see,
the machine learning tool manages to categorize the first segment as belonging
to the category, although the word "knowledge" does not appear in it. This
illustrates the basic reason for using such tools: the definition of a category
learned by the algorithm may not be evident a priori to the user. And it
may heuristically deliver to the user segments that would not appear in
a classical concordance or in a keyword retrieval.
Moreover, the system directly finds segments that can be considered
prototypical of a category because of the high synaptic weight it attributes
to certain words in them. For instance, words such as "acquaintance" (7.5),
"knowledge" (5.5), "about" (4.0), "could" (4.0), "nature", "truths", "know",
"should" (3.0), "reason" (2.5) are among those found to have the
highest weights. This is common in neural network technologies (McLeod,
Plunkett, Rolls 1998).
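The way such weights point to prototypical words and segments can be sketched as follows. The weights are the ones reported above for "KNOWLEDGE" (only the words whose value is stated unambiguously are included); the function names are hypothetical, and scoring a segment by the sum of the weights of the distinct words it contains is an assumption that follows from the binary representation, with the bias term left out.

```python
from typing import Dict, List, Sequence, Tuple

# Weights reported in the text for the category "KNOWLEDGE".
reported_weights: Dict[str, float] = {
    "acquaintance": 7.5,
    "knowledge": 5.5,
    "about": 4.0,
    "could": 4.0,
    "should": 3.0,
    "reason": 2.5,
}


def top_words(weights: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n highest-weighted words (ties broken alphabetically)."""
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def segment_score(segment: Sequence[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the distinct words present in the segment."""
    return sum(weights.get(w, 0.0) for w in set(segment))


segment_a = "in the case of acquaintance it was not necessary".split()
segment_b = "york is between london".split()

print(top_words(reported_weights))
print(segment_score(segment_a, reported_weights))  # 7.5
print(segment_score(segment_b, reported_weights))  # 0.0
```

Sorting all segments by this score would place the most prototypical candidates for the category first.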
By using this algorithm with the Russell corpus, we cannot hope to
reproduce the 80% results obtained by others with this kind of algorithm.
But the results are encouraging: without any pre-filtering (lemmatization,
complex names, elimination of hapax, etc.), we
obtained a recall of 0.658 and a precision of 0.531 in the test phase with the
category "Knowledge". But for "Mind" and "Perception", the perceptron results
are near random, probably due to the low cardinality of the positive training
set.
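For reference, recall and precision for one category can be computed from the expert's tags and the system's tags as in the sketch below. The label lists are invented toy data, not the figures of this experiment, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for one category (1 = in category, 0 = not)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of assigned tags that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of expert tags that were found
    return precision, recall


# Toy example: 8 test segments tagged by the expert (gold) and by the system.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]
precision, recall = precision_recall(gold, predicted)
print(precision, recall)  # 0.6 0.75
```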
5.- Discussion
We can see that categorizing a philosophical text is not like
categorizing sports or business news. We think that, because of the particular
nature of philosophical texts, some specific modifications should be made
to the process before the perceptron or another similar algorithm can be used
with more precision in content and thematic analysis. Although the results
of this experiment were positive, much more work has to be done in order
to discover the various pertinent factors that come into play in the application
of these numerical classification and categorization strategies to humanities
texts, and to increase the success of the categorization. Among these we
can cite: 1) more complex pre-filtering (lemmatization, elimination of
functional and subjectively non-pertinent words, use of a compound-word
detector); 2) a better understanding of the nature of the category set and
the training set for training purposes; 3) better parameters for correct
segmentation for categorization purposes (by words, by sentences or according
to some predefined criterion); 4) a better design of the categorizing
algorithm, especially for dynamic corpora; 5) a specific evaluation strategy
for benchmarking text categorization against text interpretations by experts
in the domain.
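As a rough indication of what the first point could involve, the sketch below removes function words and hapax (words occurring only once in the corpus) before vectorization. The stop list, the function name and the toy segments are all hypothetical; lemmatization and compound-word detection are not covered.

```python
from collections import Counter
from typing import FrozenSet, List, Sequence

# Tiny illustrative stop list; a real one would be far longer.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "of", "a", "in", "is", "it", "to", "and"})


def prefilter(
    segments: Sequence[Sequence[str]],
    stop_words: FrozenSet[str] = STOP_WORDS,
    drop_hapax: bool = True,
) -> List[List[str]]:
    """Lowercase every word, then drop stop words and (optionally) hapax."""
    counts = Counter(w.lower() for seg in segments for w in seg)
    filtered = []
    for seg in segments:
        kept = []
        for w in seg:
            lw = w.lower()
            if lw in stop_words:
                continue
            if drop_hapax and counts[lw] == 1:
                continue
            kept.append(lw)
        filtered.append(kept)
    return filtered


segments = [
    "the theory of knowledge is a theory".split(),
    "knowledge of truths and knowledge of things".split(),
    "the nature of truths".split(),
]
filtered = prefilter(segments)
print(filtered)
```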
BIBLIOGRAPHY
ALEXA, M. & C. ZUELL (1999) A review of software for text analysis.
ZUMA: Mannheim.
BARRETT, E. (1989). The Society of Text. Hypertext, Hypermedia,
and the Social Construction of Information. Cambridge, Mass.: MIT Press
BEAUGRANDE, R. (1980) Text Discourse and Process. Longman.
BOUROCHE, J.M., SAPORTA, G. (1980), L'analyse des données,
Paris, Presses Universitaires de France.
CARPENTER, G.A. & GROSSBERG, S. (1988) The ART of Adaptive
Pattern Recognition by a Self-Organizing Neural Network, IEEE Computer
12(3): 77-88.
HAYES, P. J. (1980). "The Logic of Frames". In D. Metzing (Ed.), Frame
Conceptions and Text Understanding. New York: Walter de Gruyter.
HEARST, M. (1994) Context and Structure in Automated Full-Text Information
Access. PhD thesis, University of California at Berkeley.
HEARST, M. (1999) Untangling Text Data Mining, in Proceedings of ACL'99:
the 37th Annual Meeting of the Association for Computational Linguistics,
University of Maryland, June 20-26.
JANSEN, S., OLESEN, J., PREBENSEN, H., & THARNE, T. (1992). Computational
Approaches to Text Understanding. Copenhagen: Museum Tusculanum Press.
LACHARITÉ, N. (1989), Introduction à la méthodologie
de la pensée écrite, Presses de l'Université du
Québec, Québec.
LANDOW, G.P. & DELANY, P. (1993). The Digital Word: Text-Based
Computing in the Humanities. Cambridge: MIT Press.
LEWIS, D.D., and M. RINGUETTE (1994), A comparison of two learning
algorithms for text categorization, Proceedings of SDAIR-94, 3rd Annual
Symposium on Document Analysis and Information Retrieval, pp. 81-93.
MANNING, C.D., SCHÜTZE, H. (1999), Foundations of statistical
natural language processing, Cambridge, Mass. : MIT Press.
MCLEOD, P., PLUNKETT, K., ROLLS, E. T. (1998), Introduction to Connectionist
Modelling of Cognitive Processes, Oxford University Press.
MEMMI, D. (2000), Le modèle vectoriel pour le traitement
de documents, Les cahiers du laboratoire Leibniz, Leibniz-Imag, Grenoble.
MEUNIER, J.G., MEMMI, D., GABI, K. (1998) Dynamical Knowledge extraction
from texts by Art Networks. Proceedings of Neurap, Marseille, pp. 205-210.
MEUNIER, J.G., REMAKI, L., FOREST, D. (1999), "Use of classifiers in
Computer assisted reading and analysis of text", Proceedings of the
1999 International Conference on Imaging Science, Systems, and Technology
(CISST'99), pp. 437-443.
NAULT, G., V. RIALLE and J.G. MEUNIER (1999), PROGEN: a Genetic-Based
Semi-automatic Hypertext Construction Tool - first steps and experiment.
In Smith, R. E. (eds.), GECCO-99: Proceedings of the Genetic and Evolutionary
Computation Conference, July 13-17, Orlando, Florida USA. San Francisco, CA:
Morgan Kaufmann.
PALIOURAS, G., and KARKALETSIS, V. and C. D. SPYROPOULOS, Learning
rules for large vocabulary word sense disambiguation, Proceedings of
IJCAI-99, 16th International Joint Conference on Artificial Intelligence,
pp. 674-679, Morgan Kaufmann Publishers, San Francisco, US, 1999.
RASTIER, F. et al. (1994), Sémantique pour l'analyse. De
la linguistique à l'informatique. Paris : Masson.
ROBERT, A. D., BOUILLAGUET, A., L'analyse de contenu, PUF, 1997.
RUSSELL, B. (1959), Problems of philosophy, London, Oxford University
Press.
SALTON, G. & MCGILL, M. (1983). Introduction to Modern Information
Retrieval, New York: McGraw-Hill.
SALTON, G., BUCKLEY, C. (1990) Improving retrieval performance by relevance
feedback. Journal of the American Society for Information Science, 41(4),
288-297.
SEBASTIANI, F., Machine learning in automated text categorisation:
a survey, Technical Report, Istituto di Elaborazione dell'Informazione,
Consiglio Nazionale delle Ricerche, Number IEI-B4-31-1999, 1999.
TAURITZ, D.R., and KOK, J.N., and I.G. SPRINKHUIZEN-KUYPER, Adaptive
information filtering using evolutionary computation, Information Sciences,
Vol. 122, Number 2-4, pp. 121-140, 2000.
WERMTER, S., PANCHEV, C. and G. AREVIAN, Hybrid Neural Plausibility
Networks for News Agents, Proceedings of AAAI-99, 16th Conference of
the American Association for Artificial Intelligence, pp. 93-98, AAAI Press,
Menlo Park, US, 1999.
YANG, Y., and X. LIU, A re-examination of text categorization methods,
Proceedings of SIGIR-99, 22nd ACM International Conference on Research
and Development in Information Retrieval, pp. 42-49, ACM Press, New York,
US, 1999.