I've written recently about some interesting things being done combining machine learning with image processing. We can now easily and automatically colorize a black-and-white image. A couple of months ago I wrote about Automatic Colorization of Grayscale Images highlighting some interesting work being done at the Toyota Technological Institute at Chicago and the University of Chicago. (In fact, an algorithm that does this is already available to the public on the Algorithmia site.) And around the same time I wrote an article about Automatic 2-D to 3-D Conversion for Your Selfies based on work done at the University of Washington. Both of these systems relied on machine learning and were able to cleverly leverage the "perfect" training data that was already available.
In the case of colorization, the researchers used color photos which they converted to grayscale. Then they used the grayscale versions as input and the original color photos as the desired output. Kind of clever. For the 2-D to 3-D conversion, the researchers used readily available images from 3-D movies. The stereoscopic images provided the parallax information needed for primitive 3-D modeling, so they could use one of the stereoscopic views as a conventional flat 2-D input and the derived 3-D "truth" as the desired output. Clever again.
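The colorization version of this trick can be sketched in a few lines. This is only an illustration of the data-preparation idea, not the researchers' actual pipeline: any color image yields a free (input, target) training pair by converting it to grayscale. The tiny pixel list and the BT.601 luma formula here are my own stand-ins; a real system would run this over millions of photos with an image library.

```python
# The "free training data" trick for colorization: convert a color
# image to grayscale, then train with grayscale as input and the
# original color as the desired output.

def to_grayscale(pixels):
    """Convert (R, G, B) pixels to luma values (ITU-R BT.601 weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]

def make_training_pair(color_pixels):
    """Grayscale version is the input; the original color is the target."""
    return to_grayscale(color_pixels), color_pixels

# Toy 4-pixel "image" standing in for a real photo.
color_image = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]
x, y = make_training_pair(color_image)
print(x)  # grayscale input the model sees: [76, 150, 29, 128]
print(y)  # original colors the model must learn to reconstruct
```

The same pattern applies to the 2-D to 3-D case: one stereoscopic view plays the role of the grayscale input, and the depth recovered from the stereo pair plays the role of the color target.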
Not to be outdone, some researchers at MIT decided to look at still images and project them into the time domain: make moving pictures! While the work presented here is early and decidedly a bit primitive, it shows a great deal of promise. You can access the actual paper, Generating Videos with Scene Dynamics, if you're interested in getting into the nitty-gritty of how it's done.
While it is interesting that these images can be made to move, it is more interesting that the technique actually learns something about the dynamics of the image. In a very real sense, the algorithm has to predict what will happen in the next few frames. In some sense, it's a bit like a psychoanalyst showing a patient a picture and asking them "What happens next?" Because the system can predict dynamic changes in a static image, it promises to figure prominently in the field of computer vision. For example, just predicting what parts of the image will change in the following fraction of a second will allow computer vision programs to "focus their attention" with much more computational efficiency.
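The "focus their attention" idea can be sketched concretely. This is a hedged illustration, not anything from the paper: if a model can predict the next frame, then the regions where its prediction differs from the current frame are exactly the regions worth spending computation on. The toy grids and threshold below are my own assumptions; in practice the predicted frame would come from a learned model.

```python
# Attention-from-prediction sketch: compare the current frame with a
# predicted next frame and mark pixels expected to change, so that
# downstream processing can concentrate on just those regions.

def attention_mask(current, predicted, threshold=10):
    """Return a 0/1 mask marking pixels where noticeable change is predicted."""
    return [
        [1 if abs(p - c) > threshold else 0 for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(current, predicted)
    ]

current   = [[100, 100, 100],
             [100, 100, 100]]
predicted = [[100, 100, 180],   # the model expects motion in the top-right
             [100, 100, 100]]

mask = attention_mask(current, predicted)
print(mask)  # [[0, 0, 1], [0, 0, 0]] -- only the changing region needs attention
```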
Another interesting facet of this scheme is that an image capture system could continuously predict what future images will be, and after waiting a few frames it will have captured the "correct and actual" image corresponding to the prediction. Feeding back the "error signal" between the prediction and the actual outcome allows the system to generate an effectively unlimited supply of new training data. A system like this could learn organically and continuously in ways that are not too dissimilar from the way we humans learn. (Gedankenexperiment: Think about how you learned to catch a ball.)
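The continuous self-supervision loop described above can be sketched as follows. Everything here is a stand-in of my own devising: the "model" is a deliberately naive predictor that assumes nothing moves, and the frames are toy 1-D arrays. The point is the loop structure itself: predict, wait for the real frame, and harvest the mismatch as free training signal.

```python
# Self-supervision loop: predict a future frame, wait for the real
# one to arrive, and turn the (prediction, actual) mismatch into
# new training data -- no human labeling required.

def predict_next(frame):
    """Placeholder model: predict that the scene stays static."""
    return list(frame)

def error_signal(predicted, actual):
    """Mean absolute pixel error between prediction and reality."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def run_stream(frames):
    """Walk a frame stream, harvesting (input, target, error) triples."""
    training_data = []
    for now, future in zip(frames, frames[1:]):
        guess = predict_next(now)
        err = error_signal(guess, future)   # known once `future` arrives
        training_data.append((now, future, err))
    return training_data

# Toy 1-D "frames": a bright pixel drifting to the right.
stream = [[9, 0, 0], [0, 9, 0], [0, 0, 9]]
for inp, target, err in run_stream(stream):
    print(inp, "->", target, "error:", err)
```

A real system would use the error to update the model's weights each pass, so its predictions improve for as long as the camera keeps running.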
Of course, this is not a fortune-telling machine. Currently, it generates about one second of full-frame-rate video, and these predictions are only plausible futures based on the training data it was exposed to. Think of it as the video equivalent of "word completion" when you're typing. Sometimes the completed word is wrong, but usually it's plausible. (Interesting thought: Since its predictions are all based on normal expectations, it would probably be just as confounded by magicians and sleight-of-hand as humans are!)
Here are a few examples of the input and generated output:
FYI, there is a more readable digested version of the paper available on the MIT site here. It is intelligent and cogent (not a gloss for dummies), and it's worth the five-minute investment if you don't have the 30 minutes for the actual paper.
Development in the area of image processing is moving fast, and the pace keeps quickening. I see exciting times ahead. That's my prediction.