Machine learning has reached the state of the art in many computer vision applications, be it object detection, image classification, or object tracking and re-identification. A major reason behind this success has been the availability of huge open-source image datasets like ImageNet, COCO, and PASCAL VOC, which contain millions of annotated images. Models pretrained on these benchmark computer vision datasets have achieved state-of-the-art accuracy on difficult computer vision tasks. Building on these advances with 2D objects, we now need to set foot in the world of 3D objects, and that is where this blog comes into the picture: we will be discussing Google open-sourcing its Objectron dataset. We will also include the codebase and Jupyter notebooks that will come in handy while working with this dataset. So stay tuned, and let's get started.
The Objectron Dataset
Earlier this year, Google released 3D object detection models for mobile devices, trained on real-world 3D data and able to predict 3D bounding boxes. These were released as part of MediaPipe Objectron.
But we still lag behind in 3D object detection, because understanding the semantics of 3D space remains a challenge. This is mainly due to the absence of large open-source 3D datasets like the ones we have for 2D. To address this, Google has built a dataset of videos of 3D objects, which reveal the 3D structure of each object in far more detail. They have also made sure that the data formats for this dataset follow the standards used for other computer vision datasets.
On Nov 9, 2020, Google open-sourced their new Objectron dataset, a 3D dataset containing a collection of object-centric videos of commonly used objects captured from different angles. Each video is accompanied by AR metadata, which includes camera poses and sparse point clouds. The dataset contains 15,000 annotated videos, supplemented with over 4 million annotated images, and was collected from geo-diverse locations covering 10 countries across 5 continents.
MediaPipe has also released 3D detection models for four classes: shoes, cameras, mugs, and chairs. They have also changed the model architecture: previously they used a single-stage Objectron model, but now they use a two-stage architecture. The first stage computes a 2D bounding box using the TensorFlow Object Detection pipeline; the second stage predicts the 3D bounding box coordinates from that crop, while the first stage moves on to computing the 2D bounding box for the next frame. This way, the 2D object detector does not have to run on each and every frame. The second-stage 3D bounding box predictor runs at 83 FPS on an Adreno 650 mobile GPU.
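To make the two-stage split concrete, here is a minimal Python sketch of the pipeline logic. The detector and 3D lifter below are hypothetical placeholders standing in for the real MediaPipe models; only the control flow (re-detecting 2D boxes periodically instead of on every frame) reflects the idea described above:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Box2D:
    x: float  # normalized top-left x
    y: float  # normalized top-left y
    w: float  # normalized width
    h: float  # normalized height

def detect_2d(frame) -> Box2D:
    # Placeholder for the first-stage 2D detector (in the real pipeline,
    # a TensorFlow Object Detection model). Returns a fixed crop here.
    return Box2D(0.25, 0.25, 0.5, 0.5)

def lift_to_3d(frame, box: Box2D) -> List[Tuple[float, float, float]]:
    # Placeholder for the second stage, which regresses the 3D bounding
    # box keypoints (8 corners + center = 9 points) from the 2D crop.
    cx, cy = box.x + box.w / 2, box.y + box.h / 2
    return [(cx, cy, 1.0)] * 9

def run_pipeline(frames, redetect_every: int = 5):
    """Two-stage loop: the 2D detector runs only periodically, while the
    3D lifting stage runs on every frame using the latest 2D box."""
    results, box = [], None
    for i, frame in enumerate(frames):
        if box is None or i % redetect_every == 0:
            box = detect_2d(frame)  # expensive stage, run sparsely
        results.append(lift_to_3d(frame, box))  # cheap stage, every frame
    return results
```

In the real system the two stages run concurrently on device, which is how the 3D stage sustains high frame rates; this sequential sketch only illustrates the data flow.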
The format of the Objectron dataset is given below -
3D bounding box coordinates
AR metadata (such as camera poses, and point clouds)
Processed dataset - a shuffled version of the annotated frames in tf.Example format.
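As a quick illustration of the first item above, here is a small Python sketch that expands a 3D bounding box into its 8 corner points, assuming the common center-plus-size parameterization. The actual Objectron annotations also carry a box rotation, which is omitted here for simplicity:

```python
import itertools

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D bounding box.

    center: (x, y, z) of the box center.
    size:   (width, height, depth) edge lengths.
    Rotation is omitted; Objectron boxes are oriented in practice.
    """
    half = [s / 2.0 for s in size]
    # Every (+/-) sign combination over the three axes gives one corner.
    return [
        tuple(c + sign * h for c, sign, h in zip(center, signs, half))
        for signs in itertools.product([-1, 1], repeat=3)
    ]

corners = box_corners(center=(0.0, 0.0, 2.0), size=(1.0, 1.0, 1.0))
```

For a unit box centered at (0, 0, 2), the corners span x and y in [-0.5, 0.5] and z in [1.5, 2.5].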
They have also provided an evaluation metric for the Objectron dataset, 3D IoU; one can find more details in their blog. (Link)
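For intuition about the metric, here is a simplified 3D IoU in plain Python. Note that this handles only axis-aligned boxes; Objectron's official metric computes IoU for oriented boxes (using polygon clipping), so treat this as a sketch of the underlying idea:

```python
def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU for axis-aligned boxes.

    Each box is a pair (min_xyz, max_xyz) of 3-tuples.
    Simplification: the official Objectron metric supports oriented
    boxes; this version assumes no rotation.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b

    # Intersection volume: product of per-axis overlaps.
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap

    def volume(mn, mx):
        return (mx[0] - mn[0]) * (mx[1] - mn[1]) * (mx[2] - mn[2])

    union = volume(a_min, a_max) + volume(b_min, b_max) - inter
    return inter / union
```

For example, two 2x2x2 boxes offset by 1 along the x axis intersect in a 1x2x2 region (volume 4) with union 12, giving an IoU of 1/3.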
One can check out the dataset, code, and samples on how to get started with the Objectron dataset at their GitHub page - Link
Hope you liked my article. If you have any questions or doubts related to this topic, or any topic in AI and machine learning, do let me know in the comment section and I will be more than happy to help you out. Do hit like on this article and share it with your friends who are in AI. Follow us on Instagram and Twitter - @theaibuddy. Let's democratize AI.