Most of us know the story about how Philo Farnsworth invented the first electronic video image detector while watching a farmer plow a field. He called the sensor an image dissector because it dissected the image into many small patches of gray (today we call these pixels) in a serial array much like beads on a string. Each level of gray was represented by a level of voltage, and the changing levels of voltage created a waveform. That waveform could be transmitted by wire or radio waves.
The vacuum tube version of this methodology was the television studio standard until late in the 20th century, when it was replaced by solid-state charge-coupled devices (CCDs). But even with the advent of the CCD, video images were still decomposed into pixels, which were assembled into lines and transmitted as frames. You still had to wade through full frames of video data just to get the new value of a pixel at a particular point in the image. No one really thought about it very much: that's just the way things are, that's the way we've always done it. To make things worse, if you wanted to examine even a tiny part of the image at a higher time resolution, you had to sample all of the image data at a high data rate, even if most of the image was not changing. That means a lot of needless expense to generate a lot of unused video bandwidth.
As you might expect, evolution didn't follow the path of rasterization for processing moving images. Biological vision systems do not process the entire field of view in a frame-by-frame fashion. While I won't delve into all of the research that supports this, you can try a couple of thought experiments to convince yourself:
We've all noticed at the movies or on television that rotating wagon wheels can appear to rotate in improbable ways (too slowly or even backwards). Most of us have learned to ignore the "strobe effect" caused by the interaction of the frame rate of the camera with the cyclic repetition of the spokes on the wheel. But we never see this when watching wagon wheels in the real world.
Our eyes are actually image difference detectors as opposed to full frame image collectors. How many times have you looked at a patch of grass and all of the chaotically arranged blades of grass, but what you saw was a small insect climbing one of the blades? In fact, when we look at a completely static image we unconsciously move our eyes in a motion called a saccade partly in order to create differences. Also, we notice the movement of the insect on the grass instantly and in a way that feels fundamentally different from the visual analysis we would do to find a specific word in a list or when searching for Waldo.
So what if a video image sensor could operate more like our visual system? Instead of repeatedly sending a rasterized full frame of data, suppose the sensor emitted pixels in the temporal order in which they changed. Instead of a pixel's location being determined by its position in the full-frame output stream, imagine that each pixel determined whether it had changed and, if so, transmitted its location and new value. Some interesting properties emerge with this type of imaging system:
If the interesting part of the image is a small percentage of the full frame (e.g. the insect, or a projectile), then the effective frame rate can become very large. Because only the insect pixels are being updated, the effective frame rate for the full scene could be thousands of times faster than with the conventional full-frame method.
Image segmentation for moving objects becomes much simpler, and more like the way our visual system does it. When an extended object moves in our visual field, most of the changes happen at the same time. An image sensor that worked like this would report, at the instant of the movement, all of the pixels belonging to the extended object that moved. And this would work for scenes that would confound today's object boundary detection algorithms. Imagine rolling a soccer ball across a very large photograph of nothing but soccer balls. This sort of sensor would "auto-magically" report the location and extent of the rolling ball, no GPUs needed.
An even more interesting property emerges when we use two of the sensors for stereoscopic vision. When the insect moves on the grass, we immediately get its extent and location on both sensors at the same time. One simple trigonometric computation then locates the target in three dimensions. (Frogs and chameleons notice and target small insects in a fraction of a second, capturing them with the tip of their projectile tongue.)
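To make the three properties above a little more concrete, here is a minimal sketch in Python. It is not how any real sensor or vendor API works: it fakes a change-driven sensor by comparing two ordinary frames in software. The names `Event`, `emit_events`, `extent`, `depth_from_disparity` and `make_scene`, the change threshold, and the toy scene are all my own assumptions for illustration. The depth function also assumes an idealized pair of parallel, rectified pinhole cameras, where the "simple trigonometric computation" reduces to depth = focal length x baseline / disparity.

```python
from typing import List, NamedTuple, Optional, Tuple

Frame = List[List[int]]  # rows of gray levels, 0-255


class Event(NamedTuple):
    """One changed pixel (hypothetical format, not a real sensor protocol)."""
    t: float    # time at which the change was observed
    x: int      # pixel column
    y: int      # pixel row
    value: int  # new gray level


def emit_events(prev: Frame, curr: Frame, t: float,
                threshold: int = 10) -> List[Event]:
    """Emit an event for every pixel whose gray level moved by more than
    `threshold`; unchanged pixels produce no output at all."""
    events = []
    for y, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for x, (old, new) in enumerate(zip(prev_row, curr_row)):
            if abs(new - old) > threshold:
                events.append(Event(t, x, y, new))
    return events


def extent(events: List[Event]) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (x_min, y_min, x_max, y_max) of pixels that changed
    together, or None if nothing changed."""
    if not events:
        return None
    xs = [e.x for e in events]
    ys = [e.y for e in events]
    return min(xs), min(ys), max(xs), max(ys)


def depth_from_disparity(x_left: float, x_right: float,
                         focal_px: float, baseline_m: float) -> float:
    """Depth in meters for rectified, parallel cameras: Z = f * B / d."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


def make_scene(insect_x: int, width: int = 8, height: int = 6) -> Frame:
    """Toy scene: a busy but static background ("grass") with a bright
    2x2 "insect" whose left edge sits at column `insect_x`, rows 1-2."""
    frame = [[(x * 7 + y * 13) % 100 for x in range(width)]
             for y in range(height)]
    for y in (1, 2):
        for x in (insect_x, insect_x + 1):
            frame[y][x] = 255
    return frame


# The insect moves one pixel to the right between two instants.
events = emit_events(make_scene(2), make_scene(3), t=0.001)
for event in events:
    print(event)
print("moving region:", extent(events))
```

In this toy scene only 4 of the 48 pixels produce any output, and the bounding box of those events covers the region the insect moved through even though the background is cluttered. Running the same thing on a second, horizontally offset sensor and passing the two column positions to `depth_from_disparity` gives the range to the target under the assumptions stated above.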
Before you dismiss this as very interesting but unlikely technology, you should note that there are companies building and evaluating these sorts of temporal imaging devices. Some are in stealth mode in their R&D labs, but at least two companies have announced products and prototypes.
One of them is Chronocam Technology.
These new sensors will no doubt bring a lot of interesting features to the world of video image processing. But the most interesting thing is that they completely change the way we think about the input data. So many of the solutions in the domain of video image understanding become far simpler and more elegant when we think about the problem from this fresh perspective.
How many projects do you work on that would benefit from a fresh perspective? Examining how you can break out of an old frame of reference is a good thing. It's always more fun to work smarter than it is to work harder!