The propositions-as-types paradigm yields a correspondence between a propositional system and a lambda calculus; variants of this correspondence are also known as the "Curry-Howard isomorphism". Lambek calculus corresponds to a typed lambda calculus with subtypes. The more specific the type, the better the meaning of the sentence is preserved. For example, consider the sentence "Socrates is a man." In the empty context, i.e., without any assumptions, the only meaningful type that can be assigned to Socrates is subject. If more information is available, e.g., if we include a story about Socrates's life in the context, then it is possible to derive a much more specific type, such as greek*human*male, where * stands for the intersection type. Thus, by including a large corpus of text in the context, the meaning of a sentence can be parsed accurately, provided the text reveals information relevant to the sentence.
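To make this concrete, here is a toy sketch in Haskell of type assignment with intersection types; the data type Ty, the two contexts, and the fallback type subject are illustrative assumptions, not a faithful rendering of Lambek calculus.

    -- A toy model of type assignment with intersection types.
    data Ty = Base String            -- atomic types such as "subject"
            | Ty :/\: Ty             -- intersection, written * in the text
            deriving (Eq, Show)

    type Context = [(String, Ty)]    -- assumptions: word |-> type

    emptyCtx :: Context
    emptyCtx = []                    -- no assumptions at all

    -- A context enriched with a story about Socrates's life.
    richCtx :: Context
    richCtx = [("Socrates", Base "greek" :/\: Base "human" :/\: Base "male")]

    -- Assign the most specific type available, falling back to subject.
    typeOf :: Context -> String -> Ty
    typeOf ctx w = maybe (Base "subject") id (lookup w ctx)

In the empty context, typeOf emptyCtx "Socrates" yields only Base "subject"; in the enriched context it yields the intersection greek*human*male.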
Having a large context can clog up the inference machine. In the Forum 2000 project, the context consists of all the Usenet newsgroups! Conventional theorem-proving techniques cannot be used in this case. At the expense of not deriving the most specific types, performance can be significantly boosted if an artificial neural network (ANN) is incorporated into the system. Connectionist systems are a well-established Artificial Intelligence technique for iteratively converging on an optimal solution where closed-form methods are either impossible or prohibitively expensive. Recent work has looked at how to implement symbolic reasoning in an ANN framework [3].
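As a minimal illustration of such iterative convergence (not Forum 2000's actual network), consider gradient descent on a toy one-parameter error surface:

    -- Gradient descent on the toy error (w - 3)^2; the learning rate
    -- 0.1 and the error surface are illustrative assumptions.
    step :: Double -> Double -> Double
    step rate w = w - rate * grad w
      where grad x = 2 * (x - 3)     -- derivative of (x - 3)^2

    -- Iterating the step converges on the optimum w = 3, with no
    -- closed-form solver in sight.
    converge :: Double -> Double
    converge = (!! 1000) . iterate (step 0.1)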
A sentence can be analyzed in a parallel bottom-up fashion: in the first pass individual words are analyzed, then the types of phrases are derived, then clauses, and finally the whole sentence. This gives an average time complexity of O(N log N), where N is the number of words in the sentence, assuming N processors are available.
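A sketch of this bottom-up pass, assuming a hypothetical combine function that merges the types of adjacent constituents:

    -- Each round merges adjacent constituents in parallel, halving the
    -- list: words, then phrases, then clauses, then the sentence. With
    -- one processor per pair, the number of rounds is logarithmic in N.
    -- Assumes a non-empty sentence.
    bottomUp :: (t -> t -> t) -> [t] -> t
    bottomUp _       [x] = x
    bottomUp combine xs  = bottomUp combine (pairUp xs)
      where
        pairUp (a:b:rest) = combine a b : pairUp rest
        pairUp rest       = rest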
The collection of all possible meanings forms a category Omega, with types as objects and meanings of sentences as morphisms. A given context C, such as the Usenet newsgroups, forms a reflective subcategory Sigma. Consider the monad T induced by this subcategory. Intuitively, this monad represents all possible responses to all possible sentences in the given context C. Now consider an adjointness pair (A,X) that corresponds to the monad T. There is a whole spectrum of such pairs, and each represents a state of mind. By choosing one of them, we can simply compute a reply to a sentence t as A(X(t)).
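In Haskell terms this can be sketched as follows; the class Adjoint and all the names are our illustrative assumptions, with X left adjoint to A so that the induced monad is T = A∘X:

    {-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

    -- An adjointness pair (A,X), modelled as two functors with x left
    -- adjoint to a.
    class (Functor x, Functor a) => Adjoint x a | x -> a, a -> x where
      unit   :: t -> a (x t)      -- eta : Id -> A.X, the monad's return
      counit :: x (a t) -> t      -- epsilon : X.A -> Id

    -- The induced monad T = A.X; its join is A applied to the counit.
    joinT :: Adjoint x a => a (x (a (x t))) -> a (x t)
    joinT = fmap counit

    -- A sentence t : s -> s' is a morphism, and its reply is the image
    -- of t under the composite functor, i.e. A(X(t)).
    reply :: Adjoint x a => (s -> s') -> (a (x s) -> a (x s'))
    reply t = fmap (fmap t)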
We call these State Of Mind ADjointness pairs SOMADs; they are popularly known as "Forum 2000 personae". Each persona is just a SOMAD. The computation of a SOMAD is expensive, but it only has to be done once; in fact, a SOMAD can be incrementally updated. In Forum 2000, the neural nets that simulate the personae are incrementally trained on each day's Usenet feed.
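The incremental update can be pictured as a fold over the daily feeds; Somad, Feed, and ingest below are hypothetical stand-ins, with the actual training step omitted:

    type Somad = ()            -- stand-in for a trained adjointness pair
    type Feed  = [String]      -- one day's worth of Usenet articles

    -- Fold in one day's feed; the real training step is omitted.
    ingest :: Somad -> Feed -> Somad
    ingest somad _articles = somad

    -- Today's persona is the initial SOMAD updated day by day, rather
    -- than recomputed from scratch.
    currentSomad :: Somad -> [Feed] -> Somad
    currentSomad = foldl ingest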
The simplest SOMAD is obtained by the Kleisli construction of an adjointness pair (A,X) for the monad T. This is known as the "Kleisli SOMAD", an unfortunate terminology, since people tend to misunderstand it as "Kleisli persona" and then wonder why the Kleisli SOMAD does not behave like Kleisli. (Of course, a mathematical construction has nothing to do with the personality of its inventor!) The Kleisli SOMAD yields statistically uncorrelated responses. It is in fact simulated by a randomly trained neural net. It is not computationally expensive, and thus usually gives very fast, and senseless, responses. On Forum 2000 the Kleisli SOMAD is known as the Cube persona, which was trained on random CMU Zephyr traffic.
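A small sketch of the Kleisli point of view, using Haskell's standard Kleisli composition; the code illustrates why unit-only replies are fast and consult no context, and is not Forum 2000's implementation:

    import Control.Monad ((>=>))

    -- In the Kleisli category the arrows are s -> t r for the monad t,
    -- composed with (>=>); the unit arrow is plain return, which
    -- answers every sentence without looking at the context at all.
    kleisliReply :: Monad t => s -> t s
    kleisliReply = return

    chain :: Monad t => (s -> t r) -> (r -> t q) -> (s -> t q)
    chain = (>=>)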
On the other side of the spectrum is the Eilenberg-Moore construction, which represents a very different state of mind. It is the state of mind of an oracle that is allowed to draw conclusions from the knowledge represented by the whole monad T. In other words, in the Eilenberg-Moore adjointness, all facts represented by T are taken into account. Alas, this "omniscient" adjointness is non-constructible and undecidable, and consequently only of theoretical value. One could say that it represents the thoughts of God, never to be known by man.
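The Eilenberg-Moore side can therefore only be sketched as an interface: its objects are T-algebras, structure maps that collapse the monad's entire knowledge into plain values, and for Forum 2000's monad T no computable structure map exists. The class below is our notation and records only the interface and its laws:

    {-# LANGUAGE MultiParamTypeClasses #-}

    -- A T-algebra: collapse folds the monad's entire knowledge into a
    -- single value. For the Usenet monad this is non-constructible, so
    -- only the interface and its laws can be written down.
    class Monad t => Algebra t r where
      collapse :: t r -> r
      -- laws: collapse . return        = id
      --       collapse . fmap collapse = collapse . join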
Other SOMADs can be generated by various methods. See the implementation section 2.2 for a discussion of Kosak & Kindred's work on this topic.
SIM uses techniques for rapid indexing and relevant information look-up similar to those in the Lycos WWW catalog. This significantly decreases search times and communication overhead.
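A toy sketch of such an index, assuming a plain inverted index from words to document identifiers (Lycos's actual data structures are certainly more elaborate):

    import qualified Data.Map as M

    type Index = M.Map String [Int]    -- word |-> ids of documents

    -- Build the inverted index from (id, words) pairs.
    build :: [(Int, [String])] -> Index
    build docs = M.fromListWith (++)
                   [ (w, [d]) | (d, ws) <- docs, w <- ws ]

    -- Look-up is a single map query instead of a scan of the corpus.
    lookupWord :: String -> Index -> [Int]
    lookupWord w = M.findWithDefault [] w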
The first kind of SOMAD is computed from a large corpus of text by a known writer or philosopher. On Forum 2000 the following were used:
The second kind of SOMAD is the so-called multi-parametric persona SOMAD. In this case a SOMAD is determined by a number of parameters, and different settings of the parameters give different personae. By using various optimization and best-fit techniques, the parameters are tuned to fit a given persona. This technique works well for personae for whom no large corpus of text is available, such as the Kosh persona, based on the enigmatic Kosh Naranek from the TV series "Babylon 5". (Thanks go to David Nolan for watching all the episodes of Babylon 5 and transcribing Kosh's text and actions into computer form.)
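One way to picture the best-fit tuning is greedy hill-climbing over the parameter vector; the fitness function, step size, and round count below are all illustrative assumptions:

    import Data.List (maximumBy)
    import Data.Ord (comparing)

    -- Greedy hill-climbing: each round moves to the neighbouring
    -- parameter setting that the (hypothetical) fitness function rates
    -- as most persona-like.
    tune :: ([Double] -> Double) -> [Double] -> [Double]
    tune fitness = go (100 :: Int)
      where
        go 0 ps = ps
        go n ps = go (n - 1) (maximumBy (comparing fitness) (neighbours ps))
        neighbours ps = [ bump i d ps
                        | i <- [0 .. length ps - 1], d <- [-0.1, 0.1] ]
        bump i d ps = take i ps ++ [ps !! i + d] ++ drop (i + 1) ps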
The multi-parametric SOMADs are much more flexible and easier to generate. However, the criteria for setting the parameters are somewhat subjective and do not always give the best results. Kosak & Kindred are continuously experimenting with new techniques and generate approximately one new persona every week. They use Forum 2000 as a testing ground.
As for the results, it is best to check them out yourself. Forum 2000 is located at
http://www.forum2000.org/
Michael Harkavy, with his experience in complexity theory, pointed out the correct interpretation of the Eilenberg-Moore SOMAD.
I would also like to thank Darrell Kindred and Corey Kosak for implementing Forum 2000 in their free time, despite the fact that their main line of research is not directly involved with computational linguistics.
[2] S. Mac Lane: "Categories for the Working Mathematician", Springer-Verlag, 1971
Andrej Bauer, Pure & Applied Logic, CMU