Object detection in 3D spaces is a key component of many computer vision applications, such as image sensing for autonomous driving or robot navigation. However, sophisticated, high-performance systems such as Lidar can be expensive and computationally demanding.
Now, a team of researchers at North Carolina State University has developed a new system, dubbed MonoCon, that improves the ability of artificial intelligence (AI) programs to detect 3D objects using 2D images.
We live in a 3D world, but when you take a picture, it records that world in a 2D image.
Tianfu Wu, Corresponding Author and Assistant Professor of Electrical and Computer Engineering at North Carolina State University
One of the main questions the team had to address was whether it is possible to recover 3D information about objects and their surrounding environment from 2D images, which inherently lose much of the relevant depth information.
AI programs that receive visual input from standard cameras must convert 2D images into information that allows them to navigate 3D spaces, placing objects such as vehicles, people, and traffic structures within their surroundings.
While the researchers were particularly interested in applying the MonoCon system to autonomous driving, it could also prove useful in robotics and manufacturing systems.
Building in Redundancy
While the majority of today’s self-driving or autonomous systems employ Lidar to navigate 3D environments, these systems are expensive. Lidar, an acronym for light detection and ranging, uses a series of eye-safe lasers to map an area and create a 3D representation of an environment.
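As an aside (this calculation is not from the article), Lidar ranging rests on simple time-of-flight arithmetic: a laser pulse travels to a surface and back, so the distance is half the round-trip time multiplied by the speed of light. A minimal illustrative sketch:

```python
# Illustrative time-of-flight calculation, the principle behind Lidar ranging:
# a pulse travels out and back, so distance = c * t / 2.
C = 299_792_458.0  # speed of light, m/s

def range_from_round_trip(t_seconds: float) -> float:
    """Distance to a surface given the pulse's round-trip time."""
    return C * t_seconds / 2.0

print(range_from_round_trip(2.0e-7))  # a 200 ns round trip is roughly 30 m
```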
However, Lidar's cost leaves little room for redundancy, as it is not economical to mount multiple Lidar scanners on a single vehicle.
But if an autonomous vehicle could use visual inputs to navigate through space, you could build in redundancy.
Tianfu Wu, Corresponding Author and Assistant Professor of Electrical and Computer Engineering at North Carolina State University
This is because, unlike Lidar, cameras are much cheaper, so it would be economically viable to incorporate multiple cameras, allowing redundancy to be built into the system and making it safer and more robust.
Extracting 3D data from 2D input also offers another exciting opportunity and illustrates one of MonoCon's key capabilities: effectively telling an AI system where the exterior edges of a relevant object lie by enclosing the object in a 'bounding box'.
Training MonoCon
The AI is trained on a series of 2D images in which bounding boxes have been placed around certain objects. Each of these boxes is a cuboid defined by eight corner points. The AI is then fed the relevant information for each of those points, allowing it to understand the box's height, length, and width, as well as the distances between its corners.
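To make the geometry concrete, here is a minimal sketch (not from the article; the parameterization and names are assumptions) of how a cuboid's eight corners can be computed from its center, dimensions, and heading angle:

```python
import numpy as np

def box_corners(center, dims, yaw):
    """Hypothetical illustration: the 8 corners of a 3D bounding box.

    center: (x, y, z) box center; dims: (h, w, l) height, width, length;
    yaw: rotation about the vertical (z) axis, in radians.
    """
    h, w, l = dims
    # Corner offsets relative to the box center, before rotation.
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    z = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    # Rotate the offsets about the vertical axis by the yaw angle.
    c, s = np.cos(yaw), np.sin(yaw)
    rx = c * x - s * y
    ry = s * x + c * y
    return np.stack([rx, ry, z], axis=1) + np.asarray(center)  # shape (8, 3)

print(box_corners((0.0, 0.0, 1.0), (1.5, 1.8, 4.2), 0.3))
```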
The AI can then make predictions about the objects in relation to one another and their distance from the camera. Once the AI has made its calculations, the researchers can fine-tune it by correcting any mistakes, which improves its performance over time.
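In practice, that cycle of correcting mistakes is typically realized as a supervised training loop, where the error between predictions and annotated ground truth is minimized. A hypothetical sketch (the model, optimizer, and loss function here are assumptions, not the authors' code):

```python
import torch

def train_step(model, optimizer, images, targets, loss_fn):
    """One supervised update: predict, measure error, adjust the model."""
    optimizer.zero_grad()
    preds = model(images)           # predicted 3D boxes from 2D images
    loss = loss_fn(preds, targets)  # "correcting mistakes" = minimizing error
    loss.backward()
    optimizer.step()
    return loss.item()
```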
“What sets our work apart is how we train the AI, which builds on previous training techniques,” Wu says.
The proposed method is motivated by a well-known theorem in measure theory, the Cramér–Wold theorem. It is also potentially applicable to other structured-output prediction tasks in computer vision.
Tianfu Wu, Corresponding Author and Assistant Professor of Electrical and Computer Engineering at North Carolina State University
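For context (this statement is not from the article), the Cramér–Wold theorem says that a probability distribution on a higher-dimensional space is uniquely determined by its one-dimensional projections, which loosely parallels the idea of recovering 3D structure from 2D projections:

```latex
% Cramér–Wold device: convergence (and identity) of distributions on R^k
% reduces to that of all one-dimensional linear projections.
\[
X_n \xrightarrow{\,d\,} X
\quad\Longleftrightarrow\quad
t^{\top} X_n \xrightarrow{\,d\,} t^{\top} X
\quad \text{for all } t \in \mathbb{R}^k .
\]
```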
As well as asking the AI to predict the distances from the camera to various objects and the proportions of the bounding boxes, the researchers also gave the AI the objective of predicting the position of each of the eight corner points on a box and each point's distance from the center of the bounding box.
These additional training targets are known as 'auxiliary contexts,' and they allow the AI to detect and predict the location of 3D objects more accurately from 2D images.
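One common way such auxiliary targets enter training is as extra terms added to the main loss. The sketch below is illustrative only (the tensor keys and weighting are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def loss_with_auxiliary_contexts(pred, target, aux_weight=1.0):
    """Combine a primary 3D-box loss with auxiliary corner/offset losses."""
    # Primary task: the 3D box parameters themselves.
    primary = F.l1_loss(pred["box3d"], target["box3d"])
    # Auxiliary contexts: the eight corner positions and their offsets
    # from the box center, supervised only during training.
    aux_corners = F.l1_loss(pred["corners"], target["corners"])
    aux_offsets = F.l1_loss(pred["center_offsets"], target["center_offsets"])
    return primary + aux_weight * (aux_corners + aux_offsets)
```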
The MonoCon system was then tested on KITTI, an extensive benchmark dataset, where it outperformed many of the other AI programs developed to extract 3D data from 2D images. While MonoCon also performed well when asked to identify bicycles and pedestrians, other programs still fare better on those categories.
The next step is fine-tuning MonoCon on larger datasets in order to make it scalable for use in autonomous driving applications.
“We also want to explore applications in manufacturing, to see if we can improve the performance of tasks such as the use of robotic arms,” Wu concludes.
References and Further Reading
Shipman, M., (2022) Technique Improves AI Ability to Understand 3D Space Using 2D Images. [online] NC State News. Available at: https://news.ncsu.edu/2022/01/monocon-ai-3d/