Saturday, May 14, 2011

Artificial intelligence, the Bayes way

A general review describes progress in modelling general human intelligence.

This will be an unusual review, since I am reviewing a review, which itself is rather "meta" and amorphous. Plus, I am no expert, especially in statistics, so my treatment will be brutally naive and uninformed. The paper is titled "How to Grow a Mind: Statistics, Structure, and Abstraction". Caveats aside, (and anchors aweigh!), the issue is deeply interesting for both fundamental and practical reasons: how do we think, and can such thought (such as it is) be replicated by artificial means?

The history of AI is a sorry tale of lofty predictions and low achievement. Practitioners have persistently underestimated the immense complexity of their quarry. The reason is, as usual, deep narcissism and introspective ignorance. We are misled by the magical ease with which we do things that require simple common sense. The same ignorance led to ideas like free will, souls, ESP, voices from the gods, and countless other religio-magical notions to account for the wonderful usefulness, convenience and immediacy of what is invisible- our unconscious mental processes.

At least religious thinkers had some respect for the rather awesome phenomenon of the mind. The early AI scientists (and especially the behaviorists) chose instead to ignore it and blithely assume that the computers they happened to have available were capable of matching the handiwork of a billion years of evolution.

This paper describes, in part, a theological debt of another kind, to Thomas Bayes, who carried on a double life as a Presbyterian minister in England and as a mathematician member of the Royal Society (those were the days!). Evidently an admirer of Newton, his only scientific work published in his lifetime was a fuller treatment of Newton's theory of calculus.
"I have long ago thought that the first principles and rules of the method of Fluxions stood in need of more full and distinct explanation and proof, than what they had received either from their first incomparable author, or any of his followers; and therefore was not at all displeased to find the method itself opposed with so much warmth by the ingenious author of the Analyst; ..."

However, after he died, his friend Richard Price found and submitted to the Royal Society the material that today makes Bayes a household name, at least to statisticians, data analysts and modellers the world over. Price wrote:
"In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."

And here we come to the connection to artificial intelligence, since our minds, insofar as they are intelligent, can be thought of in the simplest terms as persistent modellers of reality, building up rules of thumb, habits, theories, maps, categorizations, etc. that help us succeed in survival and all the other Darwinian tasks. Bayes's theorem is simple enough: P(A|B) = P(B|A) P(A) / P(B).

From the wiki page:
"The key idea is that the probability of an event A given an event B (e.g., the probability that one has breast cancer given that one has tested positive in a mammogram) depends not only on the relationship between events A and B (i.e., the accuracy of mammograms) but also on the marginal probability (or 'simple probability') of occurrence of each event."

So to judge the probability in this case, one uses knowledge of past events, like the probability of breast cancer overall, the probability of positive mammograms overall, and the past conjunction between the two- how often cancer is detected by positive mammograms, to estimate the reverse- whether a positive mammogram indicates cancer.
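To make the arithmetic concrete, here is a minimal sketch of that calculation. The numbers (a 1% base rate, an 80% detection rate, a 9.6% false positive rate) are invented for illustration and come from neither the paper nor the wiki page; the function name is likewise my own.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(cancer | positive test) using Bayes's theorem."""
    # Marginal probability of a positive test: true positives plus false positives.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # P(A|B) = P(B|A) * P(A) / P(B)
    return sensitivity * prior / p_positive

# Hypothetical inputs, chosen only to illustrate the calculation.
p = posterior(prior=0.01, sensitivity=0.8, false_positive_rate=0.096)
print(f"P(cancer | positive mammogram) = {p:.3f}")
```

With these made-up inputs the answer comes out a little under 8%, far lower than the 80% detection rate might suggest, because the low base rate drags the estimate down.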

The authors provide their own example:
"To illustrate Bayes’s rule in action, suppose we observe John coughing (d), and we consider three hypotheses as explanations: John has h1, a cold; h2, lung disease; or h3, heartburn. Intuitively only h1 seems compelling. Bayes’s rule explains why. The likelihood favors h1 and h2 over h3: only colds and lung disease cause coughing and thus elevate the probability of the data above baseline. The prior, in contrast, favors h1 and h3 over h2: Colds and heartburn are much more common than lung disease. Bayes’s rule weighs hypotheses according to the product of priors and likelihoods and so yields only explanations like h1 that score highly on both terms."

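The coughing example can be sketched in a few lines. The paper gives no numbers, so the priors and likelihoods below are my own guesses, picked only to respect the ordering the authors describe: colds and heartburn common, lung disease rare; colds and lung disease likely to cause coughing, heartburn not.

```python
# Hypothetical relative priors: how common each condition is.
priors = {"cold": 0.5, "lung disease": 0.01, "heartburn": 0.49}
# Hypothetical likelihoods: P(coughing | condition).
likelihoods = {"cold": 0.8, "lung disease": 0.8, "heartburn": 0.05}

def posterior_over(priors, likelihoods):
    """Weigh each hypothesis by prior * likelihood, then normalize to sum to 1."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: score / total for h, score in unnormalized.items()}

post = posterior_over(priors, likelihoods)
for hypothesis, probability in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis:>12}: {probability:.3f}")
```

Only the cold scores well on both terms, so it takes nearly all of the posterior; lung disease is sunk by its prior and heartburn by its likelihood.
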
So far, it is just common sense, though putting common sense in explicit and mathematical form has important virtues, and indeed is the key problem of AI. The beauty of Bayes's theorem is its flexibility. As new data come in, the constituent probabilities can be adjusted, and the resulting estimates become more accurate. Missing data is typically handled with aplomb, simply allowing wider estimates. Thus Bayes's system is a very natural, flexible system for expressing model probabilities based on messy data.
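
One standard way to illustrate this updating is the problem Price describes: an event has happened a certain number of times and failed a certain number of times, and we want its probability. The sketch below uses a Beta distribution as the running belief, which is a textbook choice and not anything taken from the paper; the counts are invented.

```python
import math

def beta_update(alpha, beta, successes, failures):
    """Conjugate update: add the observed counts to the Beta parameters."""
    return alpha + successes, beta + failures

def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    mean = alpha / n
    variance = alpha * beta / (n * n * (n + 1))
    return mean, math.sqrt(variance)

belief = (1.0, 1.0)  # Beta(1, 1): a flat prior, i.e. we know nothing yet
batches = [(3, 1), (30, 10), (300, 100)]  # hypothetical (happened, failed) counts
history = []
for happened, failed in batches:
    belief = beta_update(*belief, happened, failed)
    mean, sd = beta_mean_sd(*belief)
    history.append((mean, sd))
    print(f"after {happened + failed:>3} more trials: estimate {mean:.3f} +/- {sd:.3f}")
```

The estimate settles near three in four while its spread shrinks with each batch, which is the "wider estimates with less data" behavior in miniature.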

Language is a classic example, where children rapidly figure out the meanings of words, not from explicit explanations and grammatical diagrams, (heaven forbid!), but from very few instances of hearing them used in a clear context. Just think of all those song lyrics that you mistook for years, just because they sounded "like" the singer wanted ... a bathroom on the right. We work from astonishingly sparse data to conclusions and knowledge that are usually quite good. The scientific method is also precisely this, (method of induction), more or less gussied-up and conscious, entertaining how various hypotheses might achieve viable probability in light of their relations to known prior probabilities, otherwise known (hopefully) as knowledge.

In their review, the authors add Bayes's method of calculating and updating probabilities to the other important element of intelligence- the database, which they model as a freely ramifying hierarchical tree of knowledge and abstraction. The union of the two themes is something they term hierarchical Bayesian models (HBMs). Trees come naturally to us as frameworks to categorize information, whether it is the species of Linnaeus, a system of stamp collecting, or an organizational chart. We are always grouping things mentally, filing them away in multiple dimensions- as interesting or boring, political, personal, technical, ... the classifications are endless.

One instance of this was the ancient memory device of building rooms in one's head, furnishing prodigious recall to trained adepts. For our purposes, the authors concentrate on the property of arbitrary abstraction and hierarchy formation, where such trees can extend from the most abstract distinctions (color/sound, large/small, Protestant/Catholic) to the most granular (8/9, tulip/daffodil), and all can be connected in a flexible tree extending between levels of abstraction.

The authors frame their thoughts, and the field of AI generally, as a quest for three answers:
"1. How does abstract knowledge guide learning and inference from sparse data?
2. What forms does abstract knowledge take, across different domains and tasks?
3. How is abstract knowledge itself acquired?"

We have already seen how the first answer comes about- through iterative updating of probabilistic models following Bayes's theorem. We see a beginning of the second answer in a flexible hierarchical system of categorization that seems to come naturally. The nature and quality of such structures are partly dictated by the wiring established through genetics and development. Facial recognition is an example of an inborn module that classifies with exquisite sensitivity to fine differences. However, the more interesting systems are those that are not inborn / hard-wired, but that allow us to learn through more conscious engagement, as when we learn to classify species, or cars, or sources of alternative energy- whatever interests us at the moment.

Figure from the paper, diagramming hierarchical classification as done by human subjects.

Causality is, naturally, an important form of abstract knowledge, and also takes the form of abstract trees, with time the natural dimension, through which events affect each other in a directed fashion, more or less complex. Probability and induction are concerned with detecting hidden variables and causes within this causal tree, such as forces, physical principles, or deities, that can constitute hypotheses that are then validated probabilistically by evidence in the style of Bayes.

A key problem of AI has been a lack of comprehensive databases that provide the putative AI system the kind of common-sense, all-around knowledge that we have of the world. Such a database allows the proper classification of details using contextual information- that a band means a music group rather than a wedding ring or a criminal conspiracy, for instance. The recent "Watson" game show contestant simulated such knowledge, but actually was just a rapid text mining algorithm, apparently without the kind of organized abstract knowledge that would truly represent intelligence.

The authors characterize human learning as strongly top-down organized, with critical hypothetical abstractions at higher levels coming first, before details can usefully be filled in. They cite Mendeleev's periodic table proposal as an exemplary paradigm hypothesis that then proved itself by "fitting" details at lower levels, thereby raising its own probability as an organizing structure.
"Getting the big picture first- discovering that diseases cause symptoms before pinning down any specific disease-symptom links- and then using that framework to fill in the gaps of specific knowledge is a distinctively human mode of learning. It figures prominently in children's development and scientific progress, but has not previously fit into the landscape of rational or statistical learning models."

Which leads to the last question- how to build up the required highly general database in a way that is continuously alterable, classifies data flexibly in multiple dimensions, and generates hypotheses (including top-level hypotheses and re-framings) in response to missing values and poor probability distributions, as a person would? Here is where the authors wheel in the HBMs and their relatives, the Chinese Restaurant and Indian Buffet processes, all of which are mathematical learning algorithms that allow relevant parameters or organizing principles to develop out of the data, rather than imposing them a priori.
"An automatic Occam's razor embodied in Bayesian inference trades off model complexity and fit to ensure that new structure (in this case a new class of variables) is introduced only when the data truly require it."
...
"Across several case studies of learning abstract knowledge ... it has been found that abstractions in HBMs can be learned remarkably fast from relatively little data compared with what is needed for learning at lower levels. This is because each degree of freedom at a higher level of the HBM influences and pools evidence from many variables at levels below. We call this property of HBM's 'the blessing of abstraction.' It offers a top-down route to the origins of knowledge that contrasts sharply with the two classic approaches: nativism, in which abstract concepts are assumed to be present from birth, and empiricism or associationism, in which abstractions are constructed but only approximately, and slowly in a bottom-up fashion, by layering many experiences on top of each other and filtering their common elements."

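For the curious, the Chinese Restaurant process mentioned above can be sketched very briefly. This is the generic textbook version, not the authors' models: each new item joins an existing group with probability proportional to the group's size, or starts a new group with probability proportional to a parameter I call alpha here. The number of groups is never fixed in advance; it grows as the data arrive.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one at a time; return their table indices and the table sizes."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table chosen by customer i
    for _ in range(n_customers):
        # Existing tables are weighted by size, a brand-new table by alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing one
        assignments.append(table)
    return assignments, table_sizes

assignments, table_sizes = chinese_restaurant_process(n_customers=50, alpha=1.0)
print(f"{len(table_sizes)} groups formed, sizes: {table_sizes}")
```

A larger alpha makes new groups more likely; a small one yields a few big groups. In a full model this prior would be combined with the data's likelihood, so that a new class appears only when the data call for it.
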
Wow- sounds great! Vague as this all admittedly is, (the authors haven't actually accomplished much, only citing some proof of principle exercises), it sure seems promising as an improved path towards software that learns in the generalized, unbounded, and high-level way that is needed for true AI. The crucial transition, of course, is when the program starts doing the heavy lifting of learning by asking the questions, rather than having data force-fed into it, as all so-called expert systems and databases have to date.

The next question is whether such systems require emotions. I think they do, if they are to have the motivation to frame questions and solve problems on their own. So deciding how far to take this process is a very tricky problem indeed, though I am hopeful that we can retain control. If I may, indeed, give in all over again to typical AI hubris ... creating true intelligence by such a path, not tethered to biological brains, could lead to a historic inflection point, where practical and philosophical benefits rain down upon us, Google becomes god, and we live happily ever after, plugged into a robot-run world!

An image from the now-defunct magazine, Business 2.0. Credit to Don Dixon, 2006. 

"Greece should definitely leave the Eurozone."