Machine learning is great and does some amazing things, but even though we call these techniques "neural networks," the way these systems learn is very different from the way people learn. The biggest difference is that these algorithms have an insatiable appetite for clean data. You have to present one of these systems with huge numbers of pictures of kittens before it has any hope of labeling kittens reliably. A child, by contrast, can be shown three pictures of kittens and at that point would probably perform as well as the exhaustively trained neural net.
In all fairness, if we examine what these deep neural nets are learning, we can see that the contest is not really fair. The neural net is expected to first learn "how to see" from basic principles: it is given a large bag of pixels from which it must discover fundamental visual primitives. As processing moves up the deep layers of the network, it discovers more complex (yet still abstract) features. Eventually, those layers of features consolidate toward the ultimate label "kitten." Looked at this way, the algorithm is required to learn (evolve?) what it means to "see," along with all the intermediate steps needed to "see kittens." It's almost as if the neural net is starting at some point far back on the evolutionary tree, when eyes first appeared.
It's pretty obvious that organisms with vision (including humans) do not learn to see novel objects by first creating a visual processing foundation. First, we must be prewired to detect and process the lower-level visual primitives. Second, we must be able to generalize and then refine our higher-level processing with auxiliary, nonvisual input. The child might say "puppy" and the parent might say "yes, like a puppy, but furrier."
Is this a Mickey Mouse problem? Maybe. As it turns out, scientists at Disney Research are helping computers recognize novel objects: objects they have never seen before. They do it using the vocabulary that has been used to describe objects similar to the novel object being presented. The technique is called "Semi-supervised Vocabulary-informed Learning."
We humans do something like this every time we read a novel or a poem. Objects and scenes are described, and in our "mind's eye" we add, intersect, and extend the semantics of the text, combine them with images from our personal experience, and synthesize a plausible expectation for the unseen object. It won't be exactly the same as the next reader's expectation, but it will be a good guess from which to refine our expectation. And in the process, we avoid a great deal of resource-consuming "training."
The research was done by Yanwei Fu and Leonid Sigal at Disney Research and was presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), June 26 in Las Vegas (the full paper is available online).
Leonid Sigal, who is a senior research scientist at Disney Research, said, "...a computer that already has been taught to recognize certain objects — apples, for instance — can analyze word use to get hints about the existence of fruits such as pears and peaches, and how they might differ from apples... The knowledge that other fruits exist also is helpful in teaching the computer about important characteristics of apples themselves."
Jessica Hodgins, vice president at Disney Research said, "We've seen unprecedented advances in object recognition and object categorization in recent years, thanks to the development of convolutional neural networks... but the need to train vision software with thousands of labeled examples for each object has created a bottleneck and limited the number of object classes that can be recognized. Vocabulary-informed learning promises to break that bottleneck and make computer vision more useful and reliable."
The vocabulary data set used in conjunction with the visual training was extracted from Wikipedia articles and the UMBC WebBase corpus, which contains about 3 billion English words. After processing all that text, the researchers could lexically discriminate between 300,000 categories of objects and quantify statistical relationships between those categories. The descriptive language surrounding a known object could then be compared with the descriptive language surrounding an unknown object, and that similarity used to infer properties of the unknown object. Language attributed to a known object (e.g. an apple) might be similar (fruity, just picked, fruit stand, etc.) to language attributed to a yet unseen object (e.g. a pear), so the system could make reasonable inferences about a pear:
- is about the same size as an apple
- has a stem
- grows on a tree
- has a core with seeds
There will also be dissimilarities in the language attributed to the unseen object. So, while a pear is similar to an apple, the vocabulary used with a pear may include concepts like "small end," "pointy," or "oblong." By combining that information with what it already knows about an apple's appearance, a system trained on apples that has read about pears could plausibly identify one in a picture, even though it never saw one in training.
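The similarity side of this reasoning can be sketched as a nearest-neighbor lookup in a word-embedding space. The vectors and words below are toy values invented purely for illustration; the actual system learns its embeddings from billions of words of Wikipedia and UMBC WebBase text and couples them with a trained visual model:

```python
import math

# Toy word-embedding vectors (hypothetical numbers for illustration only;
# the real system derives embeddings from a ~3-billion-word text corpus).
embeddings = {
    "apple":  [0.9, 0.8, 0.1],   # fruit-like, tree-grown, not mechanical
    "banana": [0.8, 0.3, 0.1],
    "car":    [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_known(novel_vec, known):
    """Return the known category whose embedding is closest to a novel word."""
    return max(known, key=lambda w: cosine(known[w], novel_vec))

# "pear" never appeared in visual training, but its text-derived embedding
# (again, toy numbers) lands near "apple" in the vocabulary space, so the
# system can borrow apple's visual knowledge as a starting point.
pear_vec = [0.85, 0.7, 0.05]
print(nearest_known(pear_vec, embeddings))  # -> apple
```

This only captures the "find the closest known category" step; the paper's contribution is also using the geometry of the vocabulary space during training, so the visual model is shaped by words it has no images for.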
Sigal stated that the goal wasn't to mimic humans exactly but he did confirm that "... making the learning approach more human-like was a motivating factor... It is a different form of learning and so will motivate researchers to develop different types of algorithms."
I always look forward to new algorithms.
Clearly this is a new way to think about the vision-learning problem, and it shows promise for extending the number of object classes a system can recognize by a factor of hundreds or even thousands. Yanwei Fu explained further: "I've never been to Africa, but I read books so I know what to expect ... We use our brains to organize information and contextualize how unknown things might look. Compared with previous semi-supervised learning, our vocabulary-informed paradigm is perhaps more similar to how humans reason."
Maybe computer vision has a Mickey Mouse solution after all!