The propositions-as-types paradigm yields a correspondence between a propositional system and a lambda calculus; variants of this correspondence are also known as the "Curry-Howard isomorphism". Lambek calculus corresponds to a typed lambda calculus with subtypes. The more specific the type, the better the meaning of the sentence is preserved. For example, consider the sentence "Socrates is a man." In the empty context, i.e., without any assumptions, the only meaningful type that can be assigned to Socrates is subject. If more information is available, e.g., if we include a story about Socrates's life in the context, then it is possible to derive a much more specific type, such as greek*human*male, where * stands for the intersection type. Thus, by including a large corpus of text in the context, the meaning of a sentence can be parsed accurately, provided the text reveals information relevant to the sentence.
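To make this concrete, here is a toy sketch in Haskell of type assignment with intersection types; the data type Ty, the two contexts, and the fallback type subject are illustrative assumptions, not a faithful rendering of Lambek calculus.

    -- A toy model of type assignment with intersection types.
    data Ty = Base String            -- atomic types such as "subject"
            | Ty :/\: Ty             -- intersection, written * in the text
            deriving (Eq, Show)

    type Context = [(String, Ty)]    -- assumptions: word |-> type

    emptyCtx :: Context
    emptyCtx = []                    -- no assumptions at all

    -- A context enriched with a story about Socrates's life.
    richCtx :: Context
    richCtx = [("Socrates", Base "greek" :/\: Base "human" :/\: Base "male")]

    -- Assign the most specific type available, falling back to subject.
    typeOf :: Context -> String -> Ty
    typeOf ctx w = maybe (Base "subject") id (lookup w ctx)

In the empty context, typeOf emptyCtx "Socrates" yields only Base "subject"; in the enriched context it yields the intersection greek*human*male.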
Having a large context can clog up the inference machine. In the Forum 2000 project, the context consists of all the Usenet newsgroups! Conventional theorem-proving techniques cannot be used in this case. At the expense of not deriving the most specific types, performance can be significantly boosted if an artificial neural network (ANN) is incorporated into the system. Connectionist systems are a well-established Artificial Intelligence technique for iteratively converging on an optimal solution where closed-form methods are either impossible or prohibitively expensive. Recent work has looked at how to implement symbolic reasoning in an ANN framework [3].
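As a minimal illustration of such iterative convergence (not Forum 2000's actual network), consider gradient descent on a toy one-parameter error surface:

    -- Gradient descent on the toy error (w - 3)^2; the learning rate
    -- 0.1 and the error surface are illustrative assumptions.
    step :: Double -> Double -> Double
    step rate w = w - rate * grad w
      where grad x = 2 * (x - 3)     -- derivative of (x - 3)^2

    -- Iterating the step converges on the optimum w = 3, with no
    -- closed-form solver in sight.
    converge :: Double -> Double
    converge = (!! 1000) . iterate (step 0.1)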
A sentence can be analyzed in a parallel bottom-up fashion: in the first pass individual words are analyzed, then the types of phrases are derived, then clauses, and finally the whole sentence. This gives an average time complexity of O(N log N), where N is the number of words in the sentence, assuming N processors are available.
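A sketch of this bottom-up pass, assuming a hypothetical combine function that merges the types of adjacent constituents:

    -- Each round merges adjacent constituents in parallel, halving the
    -- list: words, then phrases, then clauses, then the sentence. With
    -- one processor per pair, the number of rounds is logarithmic in N.
    -- Assumes a non-empty sentence.
    bottomUp :: (t -> t -> t) -> [t] -> t
    bottomUp _       [x] = x
    bottomUp combine xs  = bottomUp combine (pairUp xs)
      where
        pairUp (a:b:rest) = combine a b : pairUp rest
        pairUp rest       = rest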
The collection of all possible meanings forms a category Omega, with types as objects and meanings of sentences as morphisms. A given context C, such as the Usenet newsgroups, forms a reflective subcategory Sigma. Consider the monad T induced by this subcategory. Intuitively, this monad represents all possible responses to all possible sentences in the given context C. Now consider an adjointness pair (A,X) that corresponds to the monad T. There is a whole spectrum of such pairs, and each represents a state of mind. By choosing one of them, we can simply compute a reply to a sentence t as A(X(t)).
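In Haskell terms this can be sketched as follows; the class Adjoint and all the names are our illustrative assumptions, with X left adjoint to A so that the induced monad is T = A∘X:

    {-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

    -- An adjointness pair (A,X), modelled as two functors with x left
    -- adjoint to a.
    class (Functor x, Functor a) => Adjoint x a | x -> a, a -> x where
      unit   :: t -> a (x t)      -- eta : Id -> A.X, the monad's return
      counit :: x (a t) -> t      -- epsilon : X.A -> Id

    -- The induced monad T = A.X; its join is A applied to the counit.
    joinT :: Adjoint x a => a (x (a (x t))) -> a (x t)
    joinT = fmap counit

    -- A sentence t : s -> s' is a morphism, and its reply is the image
    -- of t under the composite functor, i.e. A(X(t)).
    reply :: Adjoint x a => (s -> s') -> (a (x s) -> a (x s'))
    reply t = fmap (fmap t)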
We call these State Of Mind ADjointness pairs SOMADs; they are popularly known as "Forum 2000 personae". Each persona is just a SOMAD. The computation of a SOMAD is expensive, but it only has to be done once; in fact, a SOMAD can be incrementally updated. In Forum 2000, the neural nets that simulate the personae are incrementally trained on each day's Usenet feed.
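The incremental update can be pictured as a fold over the daily feeds; Somad, Feed, and ingest below are hypothetical stand-ins, with the actual training step omitted:

    type Somad = ()            -- stand-in for a trained adjointness pair
    type Feed  = [String]      -- one day's worth of Usenet articles

    -- Fold in one day's feed; the real training step is omitted.
    ingest :: Somad -> Feed -> Somad
    ingest somad _articles = somad

    -- Today's persona is the initial SOMAD updated day by day, rather
    -- than recomputed from scratch.
    currentSomad :: Somad -> [Feed] -> Somad
    currentSomad = foldl ingest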
The simplest SOMAD is obtained by the Kleisli construction of an adjointness pair (A,X) for the monad T. This is known as the "Kleisli SOMAD", an unfortunate terminology, since people tend to misunderstand it as "Kleisli persona" and then wonder why the Kleisli SOMAD does not behave like Kleisli. (Of course, a mathematical construction has nothing to do with the personality of its inventor!) The Kleisli SOMAD yields statistically uncorrelated responses. It is in fact simulated by a randomly trained neural net. It is not computationally expensive, and thus usually gives very fast, and senseless, responses. On Forum 2000 the Kleisli SOMAD is known as the Cube persona, which was trained on random CMU Zephyr traffic.
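A small sketch of the Kleisli point of view, using Haskell's standard Kleisli composition; the code illustrates why unit-only replies are fast and consult no context, and is not Forum 2000's implementation:

    import Control.Monad ((>=>))

    -- In the Kleisli category the arrows are s -> t r for the monad t,
    -- composed with (>=>); the unit arrow is plain return, which
    -- answers every sentence without looking at the context at all.
    kleisliReply :: Monad t => s -> t s
    kleisliReply = return

    chain :: Monad t => (s -> t r) -> (r -> t q) -> (s -> t q)
    chain = (>=>)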
On the other side of the spectrum is the Eilenberg-Moore construction, which represents a very different state of mind. It is the state of mind of an oracle that is allowed to draw conclusions from the knowledge represented by the whole monad T. In other words, in the Eilenberg-Moore adjointness, all facts represented by T are taken into account. Alas, this "omniscient" adjointness is non-constructible and undecidable, and consequently only of theoretical value. One could say that it represents the thoughts of God, never to be known by man.
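The Eilenberg-Moore side can therefore only be sketched as an interface: its objects are T-algebras, structure maps that collapse the monad's entire knowledge into plain values, and for Forum 2000's monad T no computable structure map exists. The class below is our notation and records only the interface and its laws:

    {-# LANGUAGE MultiParamTypeClasses #-}

    -- A T-algebra: collapse folds the monad's entire knowledge into a
    -- single value. For the Usenet monad this is non-constructible, so
    -- only the interface and its laws can be written down.
    class Monad t => Algebra t r where
      collapse :: t r -> r
      -- laws: collapse . return        = id
      --       collapse . fmap collapse = collapse . join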
Other SOMADs can be generated by various methods. See the implementation section 2.2 for a discussion of Kosak & Kindred's work on this topic.
SIM uses techniques for rapid indexing and relevant information look-up similar to those in the Lycos WWW catalog. This significantly decreases search times and communication overhead.
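A toy sketch of such an index, assuming a plain inverted index from words to document identifiers (Lycos's actual data structures are certainly more elaborate):

    import qualified Data.Map as M

    type Index = M.Map String [Int]    -- word |-> ids of documents

    -- Build the inverted index from (id, words) pairs.
    build :: [(Int, [String])] -> Index
    build docs = M.fromListWith (++)
                   [ (w, [d]) | (d, ws) <- docs, w <- ws ]

    -- Look-up is a single map query instead of a scan of the corpus.
    lookupWord :: String -> Index -> [Int]
    lookupWord w = M.findWithDefault [] w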
The first kind of SOMAD is computed from a large corpus of text by a known writer or philosopher. On Forum 2000 the following were used:
The second kind of SOMAD is the so-called multi-parametric persona SOMAD. In this case a SOMAD is determined by a number of parameters, and different settings of the parameters give different personae. By using various optimization and best-fit techniques, the parameters are tuned to fit a given persona. This technique works well for personae for whom no large corpus of text is available, such as the Kosh persona, based on the enigmatic Kosh Naranek from the TV series "Babylon 5". (Thanks go to David Nolan for watching all the episodes of Babylon 5 and transcribing Kosh's text and actions into computer form.)
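One way to picture the best-fit tuning is greedy hill-climbing over the parameter vector; the fitness function, step size, and round count below are all illustrative assumptions:

    import Data.List (maximumBy)
    import Data.Ord (comparing)

    -- Greedy hill-climbing: each round moves to the neighbouring
    -- parameter setting that the (hypothetical) fitness function rates
    -- as most persona-like.
    tune :: ([Double] -> Double) -> [Double] -> [Double]
    tune fitness = go (100 :: Int)
      where
        go 0 ps = ps
        go n ps = go (n - 1) (maximumBy (comparing fitness) (neighbours ps))
        neighbours ps = [ bump i d ps
                        | i <- [0 .. length ps - 1], d <- [-0.1, 0.1] ]
        bump i d ps = take i ps ++ [ps !! i + d] ++ drop (i + 1) ps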
The multi-parametric SOMADs are much more flexible and easier to generate. However, the criteria for setting the parameters are somewhat subjective and do not always give the best results. Kosak & Kindred are continuously experimenting with new techniques and generate approximately one new persona every week. They use Forum 2000 as a testing ground.
As for the results, it is best to check them out yourself. Forum 2000 is located at
http://www.forum2000.org/
Michael Harkavy, with his experience in complexity theory, pointed out the correct interpretation of the Eilenberg-Moore SOMAD.
I would also like to thank Darrell Kindred and Corey Kosak for implementing Forum 2000 in their free time, despite the fact that their main line of research is not directly involved with computational linguistics.
[2] S. Mac Lane: "Categories for the Working Mathematician", Springer-Verlag, 1971
Andrej Bauer, Pure & Applied Logic, CMU