Training AI for AV: How Are Unannotated 2D Images Turned Into 3D Cuboids?

Let's take a look at how unannotated 2D images are turned into 3D cuboids and why this is significant and important.


For autonomous vehicles to successfully navigate myriad road obstacles, AI must be continually trained to accurately perceive real-world 3D objects for what they are: traffic cones, pedestrians, electric scooters, and so on. To do so, the 2D images and video collected by sensor cameras must be refined and then annotated into 3D cuboid training data, which autonomous vehicle AI systems can leverage to become more intelligent. (The same method of creating 3D cuboid training data is also useful for teaching perception to AI in robotics.) With cuboid annotation, drawings are first done manually and then calibrated for greater precision through a mathematical process that yields full 3D data for each cuboid. Here's a look under the hood at how it works.

Manual Cuboid Annotation

Manually annotating 2D images requires, rather simply, drawing boxes representing two sides of a cuboid around an object, like so:

The result is most often an approximation of a true cuboid. When an object does not directly face the camera, perspective means the corners of the front face are not crisp 90-degree angles, and the top and bottom edges of the side panes converge toward a vanishing point rather than running parallel to one another.

The work of labeling and annotating this image data then includes detailing the coordinates that describe the basic parameters of the cuboid as drawn:
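As a rough illustration, such an annotation might be stored as two pixel-space boxes, one per visible face. The field names below are purely illustrative, not any specific tool's schema:

```python
# A hypothetical manual cuboid annotation: two faces drawn as boxes,
# each described by a top-left corner plus width and height in pixels.
annotation = {
    "label": "car",
    "front": {"x": 412, "y": 230, "width": 180, "height": 145},
    "rear":  {"x": 472, "y": 215, "width": 150, "height": 120},
}

def corners(face):
    """Expand a face's box parameters into its four pixel-space corners,
    ordered top-left, top-right, bottom-right, bottom-left."""
    x, y, w, h = face["x"], face["y"], face["width"], face["height"]
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]

front_corners = corners(annotation["front"])
rear_corners = corners(annotation["rear"])
```

Note that these eight pixel points describe only what was drawn; the refinement steps below turn them into true 3D geometry.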

By starting from this manual cuboid annotation and then adding further data including the parameters of the camera and its orientation, systems then produce cuboid annotation with meaningfully greater accuracy.

Via these adjustments, the relative orientation of the car to the camera is better represented by a trapezoidal front face, with the left edge just smaller than the right. Also, the right side’s top and bottom edges now converge. This adjustment process also adds two new fields to the annotation: points_3d and points_2d.

points_3d represents the 3D spatial coordinates of the cuboid's vertices relative to the camera's position (measured in meters). points_2d represents the pixel coordinates of the cuboid's 2D projection.
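The relationship between the two fields is a standard pinhole-camera projection. Here is a minimal sketch, where the focal lengths and principal point are assumed example values rather than a real calibration:

```python
import numpy as np

# Assumed camera intrinsics for illustration only.
fx, fy = 1000.0, 1000.0   # focal lengths, in pixels
cx, cy = 960.0, 540.0     # principal point (image center of a 1920x1080 frame)

def project(points_3d):
    """Project Nx3 camera-frame points (meters) to Nx2 pixel coordinates."""
    pts = np.asarray(points_3d, dtype=float)
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# A vertex straight ahead of the camera lands on the principal point;
# one offset in x and y lands proportionally off-center.
points_2d = project([[0.0, 0.0, 10.0], [1.0, 0.5, 10.0]])
```

Dividing by depth z is what produces the perspective effects described above: farther vertices project closer to the vanishing point.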

In the absence of additional information, the 3D spatial coordinates can only be calculated up to a scaling factor. This is because an object cannot be distinguished from another object that matches its scale (for example, one that’s twice the size and twice the distance away) from an image alone. By adding depth data — perhaps available from LIDAR or stereo imaging — the precise distance to the cuboid’s front face is known and the points_3d can then be scaled to match that information. In the same way, knowing the height of the camera makes it possible to scale cuboids to ensure the bottom face of each stands upon the ground.
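Resolving that scale ambiguity can be sketched as follows, assuming a depth measurement (from LIDAR or stereo) gives the true distance to the front face:

```python
import numpy as np

def rescale_to_depth(points_3d, measured_depth):
    """Scale up-to-scale cuboid vertices so the nearest face sits at the
    measured depth (meters). Assumes z is the camera's forward axis."""
    pts = np.asarray(points_3d, dtype=float)
    current_depth = pts[:, 2].min()   # depth of the front face as recovered
    return pts * (measured_depth / current_depth)

# Vertices recovered only up to scale, front face at 1.0 unit...
unscaled = np.array([[0.1, 0.0, 1.0], [0.1, 0.0, 1.5]])
# ...rescaled so the front face is 12.0 m away, as the depth sensor measured.
scaled = rescale_to_depth(unscaled, 12.0)
```

Because the ambiguity is a single multiplicative factor, one reliable distance measurement is enough to fix all eight vertices at once.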

With the eight 3D spatial points known, it becomes simple to find the position, dimensions, and orientation of the cuboid they define. With that information in hand, it's then possible to train models, such as Magic Leap's Deep Cuboid Detection, that predict the 3D coordinates of cuboid objects from an image. Through this method, camera images and annotations allow vehicle locations to be accurately identified and recognized within the surrounding environment.
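The final extraction of position, dimensions, and orientation from the eight vertices can be sketched like this. The vertex ordering here (vertices 1, 2, and 4 as the three edge-neighbors of vertex 0) is an assumed convention, not a universal standard:

```python
import numpy as np

def cuboid_pose(vertices):
    """Return (center, dimensions, rotation) for an 8x3 array of cuboid
    vertices, assuming v[1], v[2], v[4] are the neighbors of v[0]."""
    v = np.asarray(vertices, dtype=float)
    center = v.mean(axis=0)                    # position: centroid of vertices
    edges = [v[1] - v[0], v[2] - v[0], v[4] - v[0]]
    dims = np.array([np.linalg.norm(e) for e in edges])      # edge lengths
    # Unit edge directions as columns give the rotation matrix.
    rotation = np.stack([e / n for e, n in zip(edges, dims)], axis=1)
    return center, dims, rotation

# An axis-aligned 2 x 1 x 3 box centered at the origin, for illustration.
box = np.array([[x, y, z] for z in (-1.5, 1.5)
                          for y in (-0.5, 0.5)
                          for x in (-1.0, 1.0)])
center, dims, rot = cuboid_pose(box)
```

For this axis-aligned example the recovered rotation is the identity matrix; a rotated box would instead yield the rotation carrying the camera axes onto the cuboid's own axes.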