Saturday, April 22, 2017

How Speech Gets Computed, Maybe

Speech is just sounds, so why does it not sound like noise?

Ever wonder how we learn a language? How we make the transition from hearing a totally foreign language as gibberish to fluent understanding of the ideas flowing in, without even noticing their sound characteristics- or at best appreciating them as poetry or song, a curious mixture of meaning and sound-play? Something happens in our brains, but what? A recent paper discusses the computational activities going on under the hood.

Speech arrives as a continuous stream of sounds, which we split into discrete units (phonemes), which we then have to reassemble into meaning-units and relate, over longer ranges, to other units such as the word, sentence, and paragraph- that is, to their grammar and higher-level meanings. Judging by how hard it has been to get machines to do this well, it is a challenging task. And while we have lots of neurons to throw at the problem, each one is very slow- a firing cycle takes roughly 20 milliseconds at best, compared to well under a nanosecond per step for current computers, making silicon roughly 100 million times faster. It is not a very promising situation.
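The speed gap is worth a back-of-the-envelope check. The figures below are rough assumptions (a ~20 ms neural firing cycle, a ~0.2 ns CPU clock tick), not measurements from the paper:

```python
# Rough comparison of neural vs. silicon switching times.
# Both numbers are ballpark assumptions for illustration only.
neuron_time_s = 20e-3     # ~20 milliseconds per neural "operation"
cpu_cycle_s = 0.2e-9      # well under a nanosecond per clock cycle (~5 GHz)

ratio = neuron_time_s / cpu_cycle_s
print(f"Silicon is roughly {ratio:.0e} times faster per step")  # ~1e+08
```

With these assumed numbers the ratio comes out to about 10^8, consistent with the "100 million times faster" figure above.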

An elementary neural network, used by the authors. Higher levels operate more slowly, capturing broader meanings.

The authors focus on how pieces of the problem are separated, recombined, and solved in the context of a stimulus stream facing a neural network like the brain. They suggest that analogical thinking- where we routinely schematize concepts and remember them relationally, linked to sub-concepts, super-concepts, similar concepts, coordinated experiences, etc.- may be a deep precursor from non-language thought that enabled the rapid evolution of language decoding and perception.

One aspect of their solution is that the information in the incoming stream is used but not rapidly discarded, as it would be in some computational approaches. Recognizing a phrase within the stream of speech is a big accomplishment, but the next word may alter its interpretation fundamentally and require access to its component parts for that reinterpretation. So bits of the meaning hierarchy (words, phrases) are identified as they come in, but must also be kept, piece-wise, in memory for further Bayesian consideration and reconstruction. This is easy enough to say, but how would it be implemented?
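A toy sketch can make the "Bayesian reconsideration" idea concrete. This is not the paper's model: the ambiguous word, the candidate interpretations, and all the probabilities below are invented for illustration. The point is only that stored interpretations get re-weighted when later words arrive:

```python
# Toy illustration (not the paper's model): an ambiguous word's stored
# interpretations are re-scored as later words arrive. All numbers invented.

priors = {"bank=money": 0.7, "bank=river": 0.3}

# P(next word | interpretation), made-up values
likelihood = {
    "loan": {"bank=money": 0.9, "bank=river": 0.1},
    "fish": {"bank=money": 0.1, "bank=river": 0.9},
}

def update(posterior, next_word):
    """Bayes' rule: re-weight each stored interpretation by how well it
    predicts the newly heard word, then renormalize."""
    scored = {h: p * likelihood[next_word][h] for h, p in posterior.items()}
    total = sum(scored.values())
    return {h: s / total for h, s in scored.items()}

beliefs = update(dict(priors), "fish")  # a later word forces reinterpretation
print(beliefs)  # "bank=river" now dominates despite the weaker prior
```

The earlier word is never thrown away: its full set of candidate readings stays in memory, which is exactly why the later word can flip the interpretation.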

From the paper, it is hard to tell, actually. The details are hidden in the program they use and in prior work. They use a neural network of several layers, originally devised to detect relational logic from unstructured inputs. The idea was to use rapid classifications and hierarchies from incoming data (specific propositional statements or facts) to set up analogies, which enable drawing very general relations among concepts and ideas. They argue that this is a good model for how our brains work. The kicker is that the same network is quite effective in understanding speech, binding meaning units that are nearby in time even while it holds them distinct and generates higher-level logic from them. It even shows oscillations that closely match those seen in the active auditory cortex, which is known to entrain its oscillations to speech patterns. High activity in the 2 and 4 Hz bands seems to track the pace of speech.

"The basic idea is to encode the elements that are bound in lower layers of a hierarchy directly from the sequential input and then use slower dynamics to accumulate evidence for relations at higher levels of the hierarchy. This necessarily entails a memory of the ordinal relationships that, computationally, requires higher-level representations to integrate or bind lower-level representations over time—with more protracted activity. This temporal binding mandates an asynchrony of representation between hierarchical levels of representation in order to maintain distinct, separable representations despite binding."
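The quoted idea- slower dynamics higher up the hierarchy- can be sketched with a toy stack of leaky integrators. This is an assumed stand-in for the paper's network, with invented time constants; it shows only the core effect, that layers with longer time constants bind input over longer windows and therefore change more slowly:

```python
import numpy as np

# Toy sketch of a temporal hierarchy: each layer leakily integrates the layer
# below it, with longer time constants (taus) higher up. Values are invented.
rng = np.random.default_rng(0)

dt = 0.001                  # 1 ms simulation steps
taus = [0.01, 0.1, 0.5]     # time constants, fast (low layer) -> slow (high)
layers = [0.0 for _ in taus]

trace = []
for _ in range(2000):
    inp = rng.standard_normal()   # stand-in for the raw input stream
    for i, tau in enumerate(taus):
        # leaky integration: larger tau -> evidence accumulated over more input
        layers[i] += (dt / tau) * (inp - layers[i])
        inp = layers[i]           # each layer feeds the one above it
    trace.append(list(layers))

trace = np.array(trace)
# Mean step-to-step change per layer: higher layers vary more slowly,
# i.e. they maintain more protracted activity, as the quote describes.
variability = np.abs(np.diff(trace, axis=0)).mean(axis=0)
print(variability)  # strictly decreasing from fast layer to slow layer
```

The asynchrony the quote mentions falls out naturally here: because each level runs on its own timescale, its representation stays separable from the levels it binds.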

This result and the surrounding work, cloudy though they are, also form an evolutionary argument: speech recognition, being computationally very similar to other forms of analogical / relational / hierarchical thinking, may have arisen rather easily from pre-existing capabilities. Neural networks are all the rage now, with Google among others drawing on them for phenomenal advances in speech and image recognition. So there seems to be a convergence from the technology and research sides suggesting that this principle of computation, so different from the silicon-based sequential and procedural processing paradigm, holds tremendous promise for understanding our brains as well as exceeding them.