Saturday, June 15, 2019

Can Machines Read Yet?

Sort of, and not very well.

Reading: such a pleasure, but never enough time to read all that one would like, especially in technical fields. Scholars, even scientists, still write out their findings in prose, which is the richest form of communication, but only if someone else has the time and interest to read it. The medical literature stands, at the flagship NCBI Pubmed resource, at about 30 million articles in abstract and lightly annotated form. Its partner, PMC, holds 5.5 million articles in full text. This represents a vast trove of data that no one can read through, yet which tantalizes with its potential to generate novel insights, connections, and comprehensive and useful models, were we only able to harvest it in some computable form.
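As a taste of what "some computable form" looks like at the retrieval end, here is a minimal Python sketch that pulls a few abstracts through NCBI's public E-utilities interface. The query term is only an example, and a real harvester would add an API key, rate limiting, and error handling.

  import json
  import urllib.parse
  import urllib.request

  BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

  def search_pubmed(term, retmax=5):
      # esearch returns the Pubmed IDs matching a query.
      params = urllib.parse.urlencode(
          {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"})
      with urllib.request.urlopen(BASE + "/esearch.fcgi?" + params) as resp:
          return json.load(resp)["esearchresult"]["idlist"]

  def fetch_abstracts(pmids):
      # efetch returns the records themselves, here as plain-text abstracts.
      params = urllib.parse.urlencode(
          {"db": "pubmed", "id": ",".join(pmids),
           "rettype": "abstract", "retmode": "text"})
      with urllib.request.urlopen(BASE + "/efetch.fcgi?" + params) as resp:
          return resp.read().decode("utf-8")

  print(fetch_abstracts(search_pubmed("BRCA1 breast cancer")))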

That is one of the motivations for natural language processing, or NLP, one of many subfields of artificial intelligence. What we learn with minimal effort as young children, machines have so far been unable to truly master, despite decades of effort and vast computational power. Recent advances in "deep learning" have made great progress in pattern parsing and in learning from large sets of known texts, resulting in the ability to translate one language into another. But does Google Translate understand what it is saying? Not at all. Understanding has taken strides in constrained areas, such as phone menu interactions and Siri-like services. As long as the structure is simple and has key words that tip off meaning, machines have started to get the hang of verbal communication.
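To see how shallow that keyword-tipped kind of "understanding" is, consider a toy intent matcher of the phone-menu variety; the intents and trigger words below are invented for illustration.

  # Toy keyword-driven intent matching; intents and keywords are invented.
  INTENTS = {
      "check_balance": {"balance", "account", "owe"},
      "make_payment": {"pay", "payment", "bill"},
      "speak_to_agent": {"agent", "human", "representative"},
  }

  def classify(utterance):
      # Pick the intent whose keyword set overlaps the utterance most.
      words = set(utterance.lower().split())
      scores = {intent: len(keys & words) for intent, keys in INTENTS.items()}
      best = max(scores, key=scores.get)
      return best if scores[best] > 0 else "unknown"

  print(classify("I want to pay my bill"))      # make_payment
  print(classify("BRCA1 binds RAD51 in vivo"))  # unknown - nothing tips it off

It works only because the structure is simple; the moment meaning stops riding on a few trigger words, the approach has nothing to fall back on.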

But dealing with extremely complex texts is another matter entirely. NLP projects aimed at the medical literature have been going on for decades, with relatively little to show, since the complexity of the corpus far outstrips the heuristics used to analyze it. These papers are, indeed, often very difficult for humans to read. They are frequently written by non-native English speakers, or just bad writers. And the ideas being communicated are complex as well, not just the language. The machines need to have a conceptual apparatus ready to accommodate such material, or better yet, to learn within such a space. Recall how perception likewise needs an ever-expanding database / model of reality; language processing is obviously a subfield of such perception. These issues raise a core question of AI: is general intelligence needed to fully achieve NLP?


I think the answer is yes: the ability to read human text with full understanding assumes a knowledge of human metaphors, general world conditions, and specific facts and relations from all areas of life, which together amounts to general intelligence. The whole point of NLP, as portrayed above, is not to spew audio books from written texts (which is already accomplished, in a quite advanced way), but to understand what it is reading fully enough to elaborate conceptual models of the meaning of those texts. And to do so in a way that can be communicated back to us humans in some form, perhaps diagrams, maps, and formulas, if not language.

The intensive study of NLP over the Pubmed corpus reached a fever pitch in the late 2000s, but has been quiescent for the last few years, generally for this reason. The techniques being used (language models, grammar, semantics, stemming, vocabulary databases, etc.) had fully exploited the technology of the day, but still hit a roadblock. Precision could be pushed to ~80% levels for specific tasks, like picking out the interactions of known molecules, or linking diseases with genes mentioned in the texts. But general understanding was and remains well out of reach of these rather mechanical techniques. This is not to suggest any kind of vitalism in cognition, but only that we have another technical plateau to reach, characterized by the unification of learning, rich ontologies (world models), and language processing.
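A caricature of those mechanical techniques, in Python: link genes to diseases by nothing more than sentence-level co-occurrence against small lexicons. The lexicons and the abstract below are invented; real systems layered grammar, stemming, and curated vocabularies on top, but shared this same dictionary-matching core, which is why precision stalled where it did.

  import re

  # Tiny invented lexicons; real systems used curated vocabularies
  # (gene symbol lists, disease ontologies) that are vastly larger.
  GENES = {"BRCA1", "TP53", "EGFR"}
  DISEASES = {"breast cancer", "lung cancer"}

  abstract = ("Mutations in BRCA1 are associated with breast cancer. "
              "TP53 is a well-known tumor suppressor. "
              "EGFR amplification is frequent in lung cancer.")

  def cooccurring_pairs(text):
      # Emit (gene, disease) pairs that share a sentence: a shallow
      # stand-in for relation extraction, false positives included.
      pairs = set()
      for sentence in re.split(r"(?<=[.!?])\s+", text):
          genes = {g for g in GENES if g in sentence}
          diseases = {d for d in DISEASES if d in sentence.lower()}
          pairs.update((g, d) for g in genes for d in diseases)
      return pairs

  print(cooccurring_pairs(abstract))
  # {('BRCA1', 'breast cancer'), ('EGFR', 'lung cancer')}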

The new neural network methods (TensorFlow, etc.) promise to provide the last part of that equation, sensitive language parsing. But from what I can see, the kind of model we have of the world, with infinite learnability, depth, spontaneous classification capability, and relatedness, remains foreign to these methods, despite the several decades of work lavished on databases in all their fascinating iterations. That seems to be where more work is needed to get to machine-based language understanding.
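For a sense of what that neural parsing looks like, here is a minimal sketch using TensorFlow's Keras API: a toy classifier that learns, from a handful of invented sentences, whether a sentence asserts a molecular interaction. Everything here, the data and the layer sizes alike, is illustrative; the point is that the model learns surface patterns without holding any model of the world.

  import tensorflow as tf

  # Invented toy data: 1 = asserts an interaction, 0 = does not.
  sentences = tf.constant([
      "BRCA1 binds RAD51",
      "TP53 activates p21 transcription",
      "EGFR inhibits apoptosis in tumor cells",
      "the samples were stored at minus eighty degrees",
      "patients were recruited from three hospitals",
      "the assay was repeated in triplicate",
  ])
  labels = tf.constant([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

  # Map words to integer ids, then to learned embedding vectors.
  vectorize = tf.keras.layers.TextVectorization(output_sequence_length=8)
  vectorize.adapt(sentences)

  model = tf.keras.Sequential([
      vectorize,
      tf.keras.layers.Embedding(vectorize.vocabulary_size(), 16),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])
  model.fit(sentences, labels, epochs=50, verbose=0)

  # The model scores surface similarity; it has no idea what binding is.
  print(model.predict(tf.constant(["RAD51 binds BRCA2"]), verbose=0))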



2 comments:

Burk said...

There is an ongoing benchmark/competition for NLP processing and automated categorization.