Training AI for AV: How Are Unannotated 2D Images Turned Into 3D Cuboids?
Let's take a look at how unannotated 2D images are turned into 3D cuboids and why this process matters.
For autonomous vehicles to successfully navigate myriad road obstacles, AI must be constantly trained to accurately perceive real-world 3D objects for what they are: traffic cones, pedestrians, electric scooters, etc. To do so, the 2D images and video collected by sensor cameras must be refined and annotated into 3D cuboid training data, which autonomous vehicle AI systems can leverage to become more intelligent. (This same method of creating 3D cuboid training data is also useful for teaching perception to AI in the field of robotics.) With cuboid annotation, boxes are first drawn manually and then calibrated for greater precision through a mathematical adjustment process that yields full 3D data for each cuboid. It's an interesting process, and here's a look under the hood at how it works.
Manual Cuboid Annotation
Manually annotating 2D images is relatively simple: an annotator draws boxes representing two visible faces of a cuboid around an object, like so:
The result is most often an approximation of a true cuboid. This is due to perspective: when an object does not directly face the camera, the angles at the corners of its front face are not a crisp 90 degrees, and the top and bottom edges of the side face converge toward a vanishing point rather than running parallel to one another.
The work of labeling and annotating this image data then includes detailing the coordinates that describe the basic parameters of the cuboid as drawn:
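As a concrete sketch, a manual two-face annotation might be recorded as a small structure of pixel coordinates. The field names below (`label`, `front_face`, `side_face`) are illustrative assumptions, not a fixed schema from any particular tool:

```python
# A minimal sketch of a manual cuboid annotation record.
# Field names and values are hypothetical examples.
annotation = {
    "label": "car",
    # Pixel coordinates of the rectangle drawn over the object's front face:
    # (x, y) of the top-left corner, plus width and height.
    "front_face": {"x": 412, "y": 305, "width": 180, "height": 130},
    # The second rectangle covering the visible side of the object.
    "side_face": {"x": 592, "y": 310, "width": 95, "height": 120},
}

def face_corners(face):
    """Return the four pixel-space corners of a face rectangle,
    ordered top-left, top-right, bottom-right, bottom-left."""
    x, y, w, h = face["x"], face["y"], face["width"], face["height"]
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
```

From a record like this, the basic parameters of the drawn cuboid (corner positions and face extents) can be read off directly.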
Adjusted Cuboid Annotation
By starting from this manual cuboid annotation and adding further data, including the camera's parameters and orientation, systems can then produce cuboid annotations with meaningfully greater accuracy.
Via these adjustments, the relative orientation of the car to the camera is better represented by a trapezoidal front face, with the left edge just smaller than the right. Also, the right side’s top and bottom edges now converge. This adjustment process also adds two new fields to the annotation: points_3d and points_2d.
points_3d represents the 3D spatial coordinates of the cuboid's vertices relative to the camera's position (measured in meters). points_2d represents the pixel coordinates of the 2D projection of the cuboid.
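The relationship between the two fields can be sketched with a standard pinhole camera model: each 3D vertex is projected to pixels using the camera's focal lengths and principal point. The intrinsic values below are placeholder assumptions, not calibration data from the article:

```python
import numpy as np

def project_to_pixels(points_3d, fx, fy, cx, cy):
    """Project camera-frame 3D points (meters) into pixel coordinates
    with a simple pinhole model: u = fx * X/Z + cx, v = fy * Y/Z + cy."""
    pts = np.asarray(points_3d, dtype=float)
    u = fx * pts[:, 0] / pts[:, 2] + cx
    v = fy * pts[:, 1] / pts[:, 2] + cy
    return np.stack([u, v], axis=1)

# Hypothetical intrinsics: 1000 px focal length, 1280x720 image center.
points_2d = project_to_pixels([[1.0, 0.5, 10.0]], fx=1000, fy=1000, cx=640, cy=360)
```

In this sketch, a vertex one meter right, half a meter down, and ten meters ahead of the camera lands at pixel (740, 410).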
In the absence of additional information, the 3D spatial coordinates can only be calculated up to a scaling factor. This is because an object cannot be distinguished from another object that matches its scale (for example, one that’s twice the size and twice the distance away) from an image alone. By adding depth data — perhaps available from LIDAR or stereo imaging — the precise distance to the cuboid’s front face is known and the points_3d can then be scaled to match that information. In the same way, knowing the height of the camera makes it possible to scale cuboids to ensure the bottom face of each stands upon the ground.
With the eight 3D spatial points known, it becomes simple to find the position, dimensions, and orientation of the cuboid they define. With that information in hand, it's then possible to train models, like Magic Leap's Deep Cuboid Detection, that predict the 3D coordinates of cuboid objects from an image. Through this method, camera images and annotations allow vehicle locations to be accurately identified and recognized within the surrounding environment.
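Recovering those parameters from the eight vertices can be sketched as below. The vertex ordering convention here (indices 0-3 for the front face, clockwise from top-left; 4-7 for the corresponding rear vertices) is an assumption for illustration:

```python
import numpy as np

def cuboid_parameters(points_3d):
    """Recover center, dimensions, and heading from 8 cuboid vertices.
    Assumes vertices 0-3 form the front face (clockwise from top-left)
    and vertices 4-7 are the matching rear-face vertices."""
    pts = np.asarray(points_3d, dtype=float)
    center = pts.mean(axis=0)                 # centroid of the 8 vertices
    width = np.linalg.norm(pts[1] - pts[0])   # front top edge
    height = np.linalg.norm(pts[3] - pts[0])  # front left edge
    length = np.linalg.norm(pts[4] - pts[0])  # front-to-rear edge
    # Heading: yaw of the front-to-rear axis in the camera's X-Z plane.
    axis = pts[4] - pts[0]
    yaw = np.arctan2(axis[0], axis[2])
    return center, (width, height, length), yaw
```

For an axis-aligned 2 m x 2 m x 3 m cuboid ten meters ahead of the camera, this returns a center at Z = 11.5 m and zero yaw.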
Figure captions: Initial Annotation #1, Initial Annotation #2, Initial Annotation #3.
Using these much more refined results, AI systems have the training data they need to accelerate their learning processes and usher in the autonomous vehicle future that much more quickly (and safely).
Calvin Huang is a Software Engineer at Scale, a company that accelerates the development of AI by democratizing access to intelligent data.
Opinions expressed by DZone contributors are their own.