[LITA Logo]

Knowbot Explorations in Similarity Space

Martin Halbert
Fondren Library, Rice University



Carmen: Tamara, I thought you were going to the library to work on the literature review for your thesis research, not to play video games.

Tamara: Oh! Hi, Dr. Rodriguez, you startled me. But, I AM working on....

Carmen: I'm just your thesis advisor, not your den mother. I'll see you later.

Tamara: No, wait! It really isn't a video game, I'm doing an online search of Philosopher's Index.

Carmen: Looks more like you're flying over Mars.

Tamara: No, that "planet" is Philosopher's Index. I'm not making this up! That's just how the database looks on this workstation....

David: How's everything going with the search, Tamara?

Tamara: David, I'm glad you came back! I'd like you to meet my thesis advisor, Dr. Carmen Rodriguez. I was just trying to explain to her that I'm doing an online literature search.

David: That's right, Dr. Rodriguez, she's searching a similarity space visualization of a philosophy database.

Carmen: Hmm. I don't remember online searches looking like video games when I did my doctoral research.

David: These holographic simulations are very new. Would you like a quick demonstration of how it works?

Carmen: Sure, you've piqued my curiosity. Could you tell me first of all what we're looking at? The holographic monitor Tamara is using seems to show the surface of some kind of electronically simulated alien world. And what are the other spherical things that look like psychedelic planets hanging in the black sky?

David: All of them are graphical representations of literature databases. The program Tamara is using can present data in a variety of formats, including this "planetscape" type. The program is called a similarity space visualizer.

Carmen: Hold on, back up and explain a few things. I haven't come to the library for an online literature search in several years, but what I remember getting were reams and reams of paper printouts of citations that matched keywords I gave the librarians. I had to sift through all the irrelevant citations looking for the good ones.

David: If you haven't had a search done for you in the last few years, I think you'll be pleasantly surprised. Online searching has advanced quite a bit lately. One of the main things the new systems are good at is prioritizing the output retrieved in terms of relevance to your query.

Carmen: How do they do that?

David: The new similarity comparison systems can statistically compare the words in the database documents to both your initial query and sample documents that you identify as relevant. Then they produce an output list ranked in descending order from most relevant to least relevant.
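[An aside for readers who do program: the ranking David describes can be sketched in a few lines. This is a minimal illustration, not the knowbot's actual method, which the dialogue leaves unspecified. It assumes plain word counts as document vectors and cosine similarity as the statistical comparison, with no stop-word handling. The function names and the three toy records are hypothetical.]

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase word occurrences in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, doc_id) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy records, invented for illustration only.
documents = {
    "doc1": "human rights and the doctrine of heterotelism",
    "doc2": "a history of medieval logic",
    "doc3": "the historical development of human rights",
}
ranking = rank_documents("historical development of human rights", documents)
for score, doc_id in ranking:
    print(f"{doc_id}: {score:.2f}")
```

[The record that shares the most words with the query comes out on top, and the one that shares only the word "of" comes last. A real system would weight rare terms more heavily than common ones.]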

Carmen: That sounds pretty complicated. I'm not a computer programmer.

David: You don't have to be. The software handles all the complicated and tedious parts. At most, all you have to do is answer simple yes or no questions the system prompts you with. Here, let me show you a practical example, the search I helped Tamara with today. And let me also introduce you to the search tool we used, which is called a knowbot.

Carmen: A knowbot? What's that?

David: Knowbots are programs which collect information from network databases and organize it for you. Tamara and I created a knowbot this morning to do the literature review for her thesis.

Carmen: I gather that a knowbot isn't a physical robot. But what is it, exactly? How do you make one?

David: The same way you set up a spreadsheet or an electronic mail message, on the computer. A knowbot, like a spreadsheet, is basically a collection of data that you manipulate on the computer to produce new information. A knowbot is a bit more sophisticated, in that it can also make decisions and recommendations for you, like an expert system. In fact a knowbot is really just a kind of specialized expert system that deals with databases accessible through computer networks.

Carmen: Are knowbots intelligent, then?

David: No, certainly not intelligent in the sense that a human is intelligent. However, they can transmit, transform, and store immense amounts of data. Knowbots developed inevitably from the point when computers were networked together and used to store large distributed databases. You need tools similar to knowbots in order to gather and organize information from computer networks. Knowbots grew out of a combination of ideas from network bulletin boards, electronic mail systems, and expert systems.

Carmen: Show me the one you made for Tamara.

David: Do you see this cluster of text windows in the holographic display? That's the knowbot. This window displays all the information on Tamara's query request, starting with the sentence she typed in which framed her information need: "I am writing a doctoral thesis on the historical development of the concept of human rights, with a focus on the philosophical doctrine of heterotelism, and I want to review the literature on the subject."

Carmen: It can understand sentences?

David: The knowbot has a natural language parser, a program that breaks sentences down into components it can analyze. It prompts you with yes or no questions to make sure it has analyzed your query correctly. As I said before, it's important to remember that it really is not intelligent. It can only apply its rules of programming to your query. We refined the query in various ways as we went along, but we started out with Tamara's sentence.

Carmen: I think I understand. What happened then?

David: Then we went to the window of the knowbot which governs network operations, and typed in where we wanted the knowbot to search. You can either type in the identifying names for specific databases, or pick them from a pop up menu. Based on topical focus and journal coverage, Tamara and I picked out five databases that we thought would give us the best results. If we had wanted to, we could have picked clusters of databases or entire regions of the network. You usually don't want to tell the knowbot to search entire network regions because of the cost involved in searching through so much data. That brings us to another very important piece of information that goes in the network window, the knowbot's operating budget.

Carmen: What? I have to put it on a budget, like my husband?

David: You sure do. When the knowbot is activated, it will connect to the network systems that you have chosen, translate your query into executable searches on those systems according to their protocols, gather information, and collect it on your workstation. Most of the network databases that you will typically want to search are commercial repositories of information and charge fees for providing your knowbot with information. Your knowbot has to know how much money you want to spend on the search!

Carmen: But how on earth would I know how much money would be needed or reasonable to spend?

David: The knowbot has current information about the fee schedules of different database systems and can estimate the cost of a given query. In other words, it can tell you about how much the search will cost.

Carmen: Hmm. But since we're searching so many different databases at once, won't the costs of retrieving all those articles and citations be astronomical?

David: Not at all, because at this point we aren't retrieving the actual information yet, just information about the information!

Carmen: That sounds like it would cost even more money.

Tamara: No, it's like asking a store for a catalog of their merchandise. You haven't bought anything yet, you're just trying to find out if they have anything you want.

David: Right. They may still charge you for the catalog, but not a great deal. After all, they want you to see all the great things they have to offer.

Carmen: Okay. So then what happens?

David: The knowbot "looks through the catalogs." It compares the query you gave it to the documents in the various databases. It finds the documents which are the most similar to your query in statistical terms. It ranks them in descending order and, depending on how much budget it has left, retrieves as many of the top candidate documents as it can afford. These will be the items that are most likely to be relevant to your needs, not just any citations that happen to contain a few keywords you entered.
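[An aside: the budget rule David describes might look something like the sketch below. It assumes each document carries a flat retrieval fee, counted in cents, and that the knowbot stops at the first top-ranked document it cannot afford. Both are assumptions made for illustration, as are the function name, the fees, and the scores.]

```python
def retrieve_within_budget(ranked, fees_cents, budget_cents):
    """Walk a ranked list from the top and 'buy' documents until the
    next one would exceed the remaining budget.

    ranked       -- list of (score, doc_id), best first
    fees_cents   -- hypothetical per-document retrieval fee, in cents
    budget_cents -- total the searcher is willing to spend
    """
    retrieved = []
    remaining = budget_cents
    for score, doc_id in ranked:
        fee = fees_cents[doc_id]
        if fee > remaining:
            break  # assumption: stop at the first document we cannot afford
        retrieved.append(doc_id)
        remaining -= fee
    return retrieved, remaining


# Hypothetical scores and fees, invented for illustration only.
ranked = [(0.91, "doc3"), (0.51, "doc1"), (0.20, "doc2")]
fees_cents = {"doc1": 150, "doc2": 100, "doc3": 200}
retrieved, remaining = retrieve_within_budget(ranked, fees_cents, budget_cents=400)
print(retrieved, remaining)
```

[Stopping at the first unaffordable item keeps the output strictly "top of the list." A different design could skip that item and keep looking for cheaper ones further down.]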

Tamara: It produces great results. The first time we did a search we retrieved and printed out the top 50 hits. Almost all of them were exactly on my topic. We told the knowbot which ones I liked best. We also told the knowbot about some that weren't useful to me.

David: That's right. An important part of using knowbots for searching is that you can refine the knowbot's understanding of your information needs. After we picked the top five and the worst five items in the first list it retrieved for Tamara, it used the information to statistically modify its profile of her query. After that it didn't make any mistakes that we noticed in ranking its output.
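[An aside: the dialogue does not say how the knowbot modifies its profile. One standard approach from vector space retrieval is Rocchio-style relevance feedback, which nudges the query toward the documents the searcher liked and away from the ones she rejected. The sketch below swaps in that technique as an assumption. The weights alpha, beta, and gamma are illustrative values, and the term vectors are hypothetical.]

```python
from collections import Counter


def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Move a query vector toward relevant documents and away from
    non-relevant ones. All vectors are dicts of term -> weight."""
    updated = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(nonrelevant)
    # Drop terms whose weight fell to zero or below.
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical vectors, invented for illustration only.
query = {"rights": 1.0, "history": 1.0}
liked = [{"rights": 1.0, "heterotelism": 2.0}]
disliked = [{"history": 1.0, "logic": 4.0}]
new_query = rocchio_update(query, liked, disliked)
print(new_query)
```

[After one round of feedback the query has picked up "heterotelism" from the liked document, while "logic", which appeared only in the rejected one, never enters it.]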

Carmen: That's quite impressive. Now I have a better idea of what you're talking about. But you still haven't explained what a "similarity space" is.

David: One of the things that the knowbot can do is feed data into another specialized program that can graphically represent the characteristics of the databases the knowbot has explored. Graphical representations can show you patterns in the literature that are useful for the researcher to know about. A graphical representation of the similarity relationships of documents in a database is called an SSV for similarity space visualization, or simspace for short. It's a "space" in the sense that the statistical measurements of similarity between documents are represented by spatial relationships. An SSV typically looks like zillions of multicolored dots clustered in clouds or surface shapes. Each dot represents a document in the database. For any given document, the neighboring dots will be other documents that are statistically similar to it. The colors, heights from the surface plane, and other aspects of the graphics can represent other data, but the main idea is that documents that are similar in concepts will be close spatially. Let me reactivate the simspace associated with Tamara's knowbot so you can see an example.

Carmen: It certainly is colorful.

Tamara: I think these simspace graphics are beautiful.

David: I find the simspace visualizer fun to work with, because it can transform an enormous amount of data into gorgeous patterns that you can grasp intuitively.

Carmen: So these "planets" are databases indexing different kinds of literature. What a strange concept.

David: The visualizer can represent the similarity data in lots of ways. Those spheres are representations of the five databases that Tamara and I decided to explore. Each representation is a simspace normalized or "mapped" onto a sphere. Some of the spheres are bigger than others because they contain more documents. Color and "altitude" on the surfaces of the simspace spheres indicate relevance to Tamara's query. Let's take a closer look at one of these simspaces, the little one there that sort of looks like Mars.

Carmen: Hmph. Those other spheres that look like gas giants must contain some of my colleagues' work. You use a joystick to move the view around? I thought joysticks were just for video games.

David: Video games usually involve a lot of three dimensional movement, and most people are accustomed to using joysticks in that context. We use joysticks when working with simspaces because they are an effective control device. It is sort of fun though, too....

Tamara: Yes, the effect is like flying a spacecraft down to the simspace planets.

David: Would you like to "fly" it, Dr. Rodriguez?

Carmen: Well, now that you've offered, yes! It looks intriguing. How do I make it work?

David: That slider controls your "speed". The joystick handles just like a video game aircraft. See if you can maneuver the view down to the surface of the red "planet".

Carmen: This is fun. Hmm. The surface detail is incredible. What are all the red continents with spiky mountains? This "planet" is actually a database, right?

David: Right, you're currently cruising over the "surface" of Philosopher's Index, a database published in Bowling Green, Ohio. The "mountainous regions" are clusters of documents that are more akin to Tamara's research interests than the surrounding "plains". Fly in closer to that region of peaks.

Carmen: The one with the blinking lights? What is that, anyway, a colony?

David: Sort of. Quit laughing, Tamara. Those blinking lights at the peaks are the documents that Tamara retrieved during her first search. The visualizer marks them for future reference.

Carmen: Why are the mountains red, and the flatlands green and blue?

David: When I asked Tamara how to color the landscape, she picked a spectrum in which red means "most relevant" on one end, and blue means "least relevant" on the other end. The color is just another way of viewing the relationships in the database. Red mountains are what she wanted to look for. You could just as easily reverse the graphical representation so that the most relevant regions of the database showed up as purple valleys.
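[An aside: a spectrum like Tamara's is easy to sketch. The example below assumes relevance is a score between 0 and 1 and maps it onto hue, from blue at the least relevant end through green to red at the most relevant end. The function name and the 0-to-1 scale are assumptions for illustration, not a description of the visualizer.]

```python
import colorsys


def relevance_to_rgb(relevance):
    """Map a relevance score in [0, 1] to an (R, G, B) triple, 0-255 each."""
    relevance = min(1.0, max(0.0, relevance))   # clamp out-of-range scores
    # Hue 0 is red, hue 2/3 is blue; high relevance gets the low hue.
    hue = (1.0 - relevance) * (2.0 / 3.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


for score in (0.0, 0.5, 1.0):
    print(score, relevance_to_rgb(score))
```

[Reversing the representation, as David suggests, only means changing the hue line so that high relevance lands at the other end of the spectrum.]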

Carmen: Hmm, let me see if I can "land" this spaceship of the mind. Hey! Now that we're near the surface I can see all kinds of ... objects? What are those things, trees or what?

David: The visualizer is programmed to represent some kinds of documents in special ways to distinguish them. The "trees", as you call them, in this simspace represent review articles. Critiques, overview articles, and other identifiable classes of documents are represented by other distinctive shapes.

Carmen: I see. So how do I actually retrieve one of these documents if I want to read it?

David: Hold down this button on the stick and cross hairs will appear. Center the cross hairs on a surface object, pull the trigger and the system will retrieve the text of that item from the database.

Carmen: Fascinating. So I could investigate why this entire ridge of spiked shrubs is so reddish and therefore, presumably, relevant to Tamara's needs?

David: Exactly.

Carmen: Let me try the trigger. I wish I could shoot down the arguments of some philosophers this easily. Interesting! This window that's popped up contains ... what, a citation and abstract?

David: Right. It's from a colloquium on heterotelism, Tamara's main topic of interest. The "shrubs" as you call them must all be citations to articles from this colloquium.

Tamara: And now that spot shows a blinking light, like the others I've looked at. I really think this search will help my thesis a lot.

Carmen: It certainly livens up the process of scholarly research. It also poses an ethical question.

David: What's that?

Carmen: This holographic workstation looks pretty expensive. Businesses, research centers, and rich universities like ours may be able to afford this kind of gadget, but what about the average people on the street? When do they get to explore similarity space?

David: You have a very good point there. It's the old question of the information haves and have nots. Information technology is only liberating for those who have access to it.

Tamara: I never thought of that. Now I feel kind of guilty sitting here using this computer.

David: There are a lot of other ethical questions raised by new systems like this. It's easier than ever to pirate information. The database vendors lose a lot of money to software pirates who use knowbots to illegally copy similarity spaces, falsify account information, and carry out other shady activities.

Carmen: I suppose Pandora's technology box never gives you uncomplicated gifts.

Tamara: Well, I still like the searching I can do on this workstation.

Carmen: I have to admit, I find the possibilities fascinating. Has anyone used similarity spaces to study patterns of scholarly research and activity? It strikes me that you could use these visualizations to study all kinds of patterns in the literature. For instance, does a paradigm shift look like a cresting wave on a simspace beach?

David: Now you're beyond my expertise. Why don't you log on and find out?

Notes

The idea of similarity spaces was inspired by a presentation by Scott Deerwester at the 1991 annual conference of the American Library Association. His presentation involved the use of a NeXT workstation to graphically show clustering properties of citations in terms of similarity. Although he did not use the term "similarity space", and his graphical representations were not much like what I have described in this dialog, his presentation nevertheless inspired in me the strong belief that graphical representations of database properties are a wave of the future.

For an excellent discussion of the general concept of similarity space (the more traditional term in information retrieval research is "Vector Space") see the classic textbook Automatic Text Processing by Gerard Salton (Reading, Mass.: Addison-Wesley, 1988), chapter 10.

For a more abstracted and advanced discussion of the problems of similarity analysis, see the book Multidimensional Similarity Structure Analysis by Ingwer Borg and James Lingoes (New York: Springer-Verlag, 1987).


Martin Halbert is Automation and Reference Librarian at the Fondren Library of Rice University in Houston, Texas.

halbert@ricevm1.rice.edu


© Copyright 1992 by the American Library Association.
All rights reserved except those which may be granted by
Sections 107 and 108 of the Copyright Revision Act of 1976.