TYPE OF PROPOSAL: paper

TITLE: Tracking Culture on the Web; An Experiment

KEYWORDS: Culture, World Wide Web, Text-analysis

AUTHORS:

Dr. Geoffrey Rockwell
School of the Arts
McMaster University
1280 Main St. W.
Hamilton, ON
Canada, L8S 4M2
Phone: (905) 525-9140 x 24072
Fax: (905) 577-6930
grockwel@mcmaster.ca

Dr. W.F.S. Poehlman
skip@cas.mcmaster.ca
Computing and Software
McMaster University

Michael Picheca
mikep@ritchie.cas.mcmaster.ca
Computing and Software
McMaster University

ABSTRACT:

A couple of years ago a friend suggested that a way to get people interested in trends in popular culture would be to create an online cultural stock market or "horse race" where people could bet on cultural items and then watch their stock go up or down as the objects rose or fell in popularity. The problem was how to measure the popularity of a cultural item like "Star Wars" or "XML". It wasn't until later that it occurred to us that we could ask WWW search engines for the number of indexed pages that included the word or phrase in question as a way of tracking the relative popularity of the item. The problem was whether and how one could systematically gather such information from search engines and whether such information would provide a reliable guide to cultural shifts.

This paper reports on a two-stage experiment that we ran, first to design a system capable of tracking items, and second to gather a significant amount of data so as to see if the system did in fact reflect known events in popular culture and contemporary ideas. In effect, we wanted to see if we could treat the WWW as an enormous text corpus with the search engines as our text-analysis tools for the purpose of cultural and intellectual study.

In the presentation we will do three things:

1. We will discuss the case for using the WWW to track ideas and culture.
2. We will report on the initial tests of a system for tracking selected items and the resulting design.
3.
We will report on the results of a four-month study during which we gathered data daily on selected items and compared it to known events.

In a paper that Rockwell and Bradley gave at the ALLC-ACH conference in Paris in 1994, "A Growing Fascination With Dialogue: Bibliographic Databases and the Recent History of Ideas", they reported on a technique of using bibliographic databases for tracking the recent history of ideas.(1) In that paper they argued that databases provide evidence of changes in the symptoms of intellectual culture comparable to the types of evidence that epidemiologists use to track epidemics.(2) The problem with the technique they used is that bibliographic databases reflect academic work, not popular and commercial subculture. Bibliometrics is useful for tracking bibliographic trends, less so for cultural trends.

The WWW, on the other hand, can be argued to be a better reflection of popular and commercial subculture. The WWW has the additional advantage that it is the work of the millions who write WWW pages and is therefore not keyworded by experts who might, as cataloguers do in the case of bibliographic databases, impose their organizational categories on the evidence. The WWW is a significant expression of North American culture and therefore better represents the relative complexity of the whole blooming buzzing confusion. Further, the WWW is already digitized and in a relatively standardized format so that it can be searched and indexed as a growing whole (though with great difficulty). It is the accessibility of the WWW to quantitative methods that makes it ideal for tracking the movement of ideas and popular culture.

The problem with the WWW as evidence is that it is not conveniently organized into a database that one can search diachronically. For this reason we turned to the popular search engines that index WWW pages and provide statistics on demand as a reasonable source of evidence for the WWW as a whole.
This is not an original idea; in the presentation we will show some "voyeur" pages that allow you to see the popular terms others are searching for.(3) Unfortunately, when we contacted the owners of such pages to see if they would collaborate with us, we were rebuffed. Such statistics are a closely guarded secret with commercial value.

The tactic we settled on was to test the feasibility of a system that would gather statistics from the search engines for terms we chose. We ran a series of tests to see if we could gather data regularly from the search engines. We also wanted to see what the resource implications of such a system were, given that ultimately we might want to track thousands of items. One of the results of our tests was that statistics on news articles gave us greater variation and more detail over short periods. In effect, WWW page statistics seem to be useful for tracking long-term change while news articles are more responsive to short-term changes. In the presentation we will review the tests and resulting statistics from this first phase.

Once we had demonstrated that this could be done, we built a system that gathered data from three search engines (Excite, Yahoo, and Thunderstone) on both WWW page statistics and news article statistics for a selection of words and phrases in three areas: guitars and popular music, popular movies and characters, and text-analysis and markup languages. The system gathers these statistics every night and writes them to a database, for which we built a front end that can plot items over time. (In the presentation we will demonstrate the WWW front end where viewers can view the data by item, by search engine, and by date range.)

The system was run for four months (September 1999 to January 2000) and the data was exported to Excel for further analysis. We found in a number of cases a clear correlation between spikes in activity and known events.
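What counts as a "spike" in a nightly series can be made concrete with a minimal sketch. This is an illustration only, not the code of the system described here: it assumes the nightly hit counts for one item and one search engine are already available as a list, and the function name find_spikes, the seven-night window, and the 1.5 factor are hypothetical choices. The sample counts are invented.

```python
from statistics import mean


def find_spikes(counts, window=7, factor=1.5):
    """Return the indices of nights whose hit count exceeds `factor`
    times the mean of the preceding `window` nights.

    `counts` is a list of nightly hit counts for a single item.
    The window and factor are illustrative defaults, not tuned values.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = mean(counts[i - window:i])
        # Skip nights with no usable baseline to avoid flagging noise from zero.
        if baseline > 0 and counts[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Invented nightly counts for one phrase (not data from the study).
example_counts = [100, 102, 98, 101, 99, 100, 103, 105, 240, 230, 110]
spike_nights = find_spikes(example_counts)  # nights 8 and 9 are flagged
```

Comparing each night against a trailing baseline, rather than against a fixed number, keeps the focus on change over time instead of on absolute counts. Nights flagged in this way would still have to be checked by hand against known events.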
For example, the release of the latest James Bond movie was reflected in the data we gathered by a dramatic surge in news-article hits for the phrase. In the presentation we will present the statistics for selected items over the period and comment on what events we believe these statistics reflected. We believe these correspondences empirically demonstrate the usefulness of this technique.

There are a number of theoretical problems with this approach to tracking ideas and cultural items that will be discussed in the presentation. We summarize them here with questions and tentative answers:

1. Does the WWW reflect popular culture or does it reflect only the culture of the community of its authors?

While the WWW undoubtedly reflects only the interests of a geographically and economically limited set of people, it is still an enormous body of evidence, and there are few alternatives if one wishes to avoid impressionistic studies or the use of "top ten" lists generated by the media on selected topics. Whatever the degree to which the WWW reflects popular culture, there are enough people authoring WWW pages to argue that what they are writing about is interesting, if one can accurately measure it.

2. Do counts of words or phrases accurately reflect interest in culture?

This is a problem common to any form of text-analysis. Certain cultural topoi are no doubt not going to be tracked by a system that searches only for words and phrases; that is what cultural historians are for. Disambiguation is another problem. That said, we believe that information about key words and phrases over time could provide useful evidence for more sophisticated interpretation and aggregation if it can be shown to be something that can be gathered and if it can be shown to reflect real events. Further, searching for key words and phrases is the way many people access information on the WWW, so there is some justification for using this approach.
More importantly, our system is designed to track "hits" over time. We believe that what is important is not the number of "hits" for an item, but changes in the number and comparisons between items. It is hard to say what it means if there are over 4 million WWW pages devoted to the "Spice Girls", but if that number changes dramatically that may indicate a change in interest in the subject.

3. Can we trust data gathered from search engines that are not open to scrutiny?

Certainly not. This is why we can't use the "voyeur" pages and why we gathered statistics from more than one engine and for both WWW pages and news articles. That said, the search engines are the best source of statistics (without building our own spiders) and their statistics do seem to match known events. Further, as mentioned above, the search engines are used to find information - they are part of the culture of the Internet and consequently would have to be taken into account anyway.

4. Is it ethical to gather such statistics from search engines?

Given the general climate of concern about the gathering and aggregation of data on the Internet, it is worth asking whether this system or ones like it pose ethical problems. As we do not gather information about individual WWW pages or authors, it is unlikely that such a system could be used to predict more than general trends, but the fact remains that systems like ours could be designed to track the interests of an individual author over time. A more pressing issue arose during our initial tests of the correlation between the number of items searched for and the time it took to conclude the searches: our server was denied access by one particular search engine, Northern Light. In an e-mail exchange they explained that their service was available for humans, not robots, and that such robot searches depress click-through advertising rates, which in turn could cause financial harm to their investors.
Respecting their wish, we dropped them from the list of engines to search.

Pragmatically, the best way to test the value of such statistics was to implement the system and see if the results correlated with events, which they did in the cases we could confirm. We believe our experiment shows that such statistics can be a useful monitor of changes in culture, with certain reservations, and we will conclude by discussing how we plan to implement the system for targeted research and as a teaching tool. The system was built so that it can now become a module in a larger system where students could play at investing in culture or researchers could provide lists of terms related to a field to track over a given period.

Notes

(1) Rockwell, Geoffrey and John Bradley, "A Growing Fascination With Dialogue: Bibliographic Databases and the Recent History of Ideas", presented at the ALLC-ACH '94 conference in Paris in April 1994.

(2) For more on the epidemiology of culture and ideas see Dan Sperber, _Explaining Culture: A Naturalistic Approach_. Oxford: Blackwell, 1996. Sperber's project is more ambitious than ours and differs in interesting ways. For him the epidemiology of culture should help us generate natural explanations of how ideas are transmitted by linking cognitive psychology and sociology/anthropology. We are skeptical that a technique such as ours could do this. At best it can accurately track changes in the symptoms of culture, not explain what is happening in the minds of people.

(3) See www.searchterms.com. The idea that search engines are gathering information about trends that could be used is not original. Eric Knight, the owner of Searchterms.com, describes the Internet as a "cultural barometer". The companies that run the search engines no doubt gather and sell information about market trends; unfortunately they don't make it available for academic study.