In autonomous driving, convolutional neural networks are the go-to tool for various perception tasks. Although CNNs are great at distilling information from camera images (or a sequence of them in the form of a video clip), I constantly bump into all kinds of metadata that do not lend themselves to convolutional neural networks.
Metadata, by the traditional definition, is a set of data used to describe other data. In this post, by metadata we mean:
- Heterogeneous, unstructured, or unordered data that accompanies camera image data as auxiliary information. In the sense of the traditional definition, these data “describe” the camera data.
- The size of the metadata is usually much smaller than that of the camera image data, ranging from a few to at most a few hundred numbers per image.
- Unlike image data, metadata cannot be represented on a regular grid, and its length per image may not be constant.
All these properties make it hard for a CNN to consume metadata directly, as CNNs assume a data representation on a regularly spaced grid in which neighboring entries have a closer spatial or semantic relationship.
The types of metadata I have encountered can be categorized into the following groups:
- Sensor parameters that could impact sensor observation: camera intrinsics/extrinsics
- Different types of sensor data: radar pins or lidar point cloud
- Correspondence/association between two groups of data
One special case is lidar point cloud data. A typical frame of lidar point cloud usually has hundreds of thousands of points, accompanying one or a few frames of camera images. Lidar point clouds are so information-rich that they can form the basis of a standalone perception pipeline in parallel with camera perception. It is thus uncommon to consider them auxiliary information to camera data, and they are not the typical type of metadata considered here. For point cloud data, people have developed dedicated neural network architectures, such as PointNet or graph neural networks (GNNs), to consume the points directly; these architectures are beyond the scope of this post.
Below we review different ways presented in recent literature to consume metadata with convolutional neural networks.
Camera parameters
Deep learning has made significant progress in many aspects of SLAM, one of which is monocular depth estimation. Monocular depth estimation is inherently ill-posed, and a model trained on one dataset typically does not generalize well to other datasets, due to the lack of scale in monocular images. This is in stark contrast with general object detection, where the performance of object detectors does not depend on specific camera models (it would be a nightmare to need to know which camera models took the hundreds of thousands of images in the COCO dataset).
Camera intrinsics, in particular the focal length of the lens, determine the scale factor that is lacking in monocular images. Generally speaking, it is impossible to tell whether an image was taken with a longer focal length from the same position, or with the same camera from a location closer to the object.
For this reason, depth estimation training and inference are usually done on one dataset collected with the same camera (or at least with cameras of the same sensor and lens specifications). If you change the camera model, you have to collect an entirely new dataset and annotate the distance to train your model again.
Luckily, in autonomous driving and other industrial applications, intrinsics are easy to obtain from the camera manufacturer, and they remain relatively fixed throughout the life of the camera. Can we work the intrinsics into the monocular depth prediction network?
Camera intrinsics have four degrees of freedom (barring lens distortion; see this concise OpenCV documentation): the focal lengths fx and fy in the row and column directions, normalized by pixel size, and cx and cy, the pixel location of the principal point. One naive solution I can think of is to feed these four numbers into the depth decoder on top of a feature map, perhaps adding a fully connected layer to fuse them into the depth prediction. CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth (CVPR 2019) proposes a much more elegant solution by working the intrinsics into a pseudo-image.
Cam-Conv drew much of its inspiration from CoordConv from Uber (An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, NeurIPS 2018). CoordConv concatenates two mesh-grid channels to the original image and/or intermediate feature maps to encode location information. Building on top of CoordConv, Cam-Conv first shifts the origin of the CoordConv channels from the top-left corner to the principal point, creating two Centered Coordinate (cc) maps. These two channels encode the principal point information. Then the Field of View (fov) maps are calculated by dividing the cc channels by the focal length f and taking the arctan, which essentially computes the azimuth and elevation angle of each pixel. These two channels encode the focal length information. Finally, Normalized Coordinate (nc) maps are concatenated to the feature maps as well (essentially a normalized CoordConv).
CoordConv itself can be seen as a special case of encoding coordinate information into a convolutional neural network. It gives the network the option to break translation invariance, which is useful for learning position-sensitive data distributions.
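To make this concrete, here is a minimal NumPy sketch of how the CoordConv, centered-coordinate, and field-of-view channels could be built from the intrinsics fx, fy, cx, cy. The function name and channel ordering are my own; the Cam-Conv paper additionally adds these maps at multiple feature scales.

```python
import numpy as np

def camera_aware_channels(h, w, fx, fy, cx, cy):
    """Build CoordConv-style and Cam-Conv-style channels for an h x w image.

    Returns an array of shape (6, h, w):
      0-1: normalized coordinate (nc) maps in [-1, 1]  (CoordConv-like)
      2-3: centered coordinate (cc) maps, origin at the principal point
      4-5: field-of-view (fov) maps, per-pixel azimuth/elevation angles
    """
    us, vs = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns, rows

    # Normalized coordinates: plain CoordConv channels scaled to [-1, 1]
    nc_u = 2.0 * us / (w - 1) - 1.0
    nc_v = 2.0 * vs / (h - 1) - 1.0

    # Centered coordinates: shift the origin to the principal point (cx, cy)
    cc_u = us - cx
    cc_v = vs - cy

    # Field-of-view maps: arctan of centered coordinates over focal length
    fov_u = np.arctan(cc_u / fx)   # azimuth angle of each pixel
    fov_v = np.arctan(cc_v / fy)   # elevation angle of each pixel

    return np.stack([nc_u, nc_v, cc_u, cc_v, fov_u, fov_v]).astype(np.float32)

# Example: channels for a 1280x720 image; resize and concatenate them to the
# feature maps before the depth decoder.
extra = camera_aware_channels(720, 1280, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
```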
The alternative approach: Normalized Focal Length and Normalized Image Plane
One more point from the Cam-Conv paper is worth mentioning. As discussed above, it is impossible to tell whether an image was taken with a longer focal length or simply from a closer distance. From another perspective, the same object imaged from the same camera position by two cameras with different focal lengths will appear different, even though it is at the same 3D distance.
One alternative to Cam-Conv is to use a nominal focal length. All groundtruth distances are scaled according to the nominal focal length and used to train the model. During inference, the predicted distance is scaled back to the real distance by taking into account the real focal length of the camera. Of course, this discussion assumes the same image sensor. If the sensor’s physical pixel size also changes, we can apply the same idea with a nominal pixel size, assuming a narrow field of view (image size << focal length). In comparison, Cam-Conv is a more principled way to accommodate various camera models.
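As a rough sketch of how such a nominal-focal-length scheme could work (the nominal value and the linear scaling rule below are my own assumptions, not taken from a specific paper):

```python
F_NOMINAL = 1000.0  # assumed nominal focal length in pixels

def to_nominal_depth(depth_gt, f_real):
    """Scale groundtruth depth into the nominal-focal-length convention for training."""
    return depth_gt * F_NOMINAL / f_real

def to_real_depth(depth_pred, f_real):
    """Scale a prediction made under the nominal convention back to metric depth."""
    return depth_pred * f_real / F_NOMINAL
```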
This is closely related to the approach that MonoLoco (ICCV 2019) uses for pedestrian distance estimation. After finding keypoints on the image and before feeding them into the MLP, the image coordinates are projected onto the normalized image plane at unit depth Z=1. This helps prevent the model from overfitting to any particular camera, and it essentially accounts for the effect of both focal length and sensor pixel size on apparent object size.
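The projection onto the normalized image plane is simply the inverse intrinsic matrix applied to homogeneous pixel coordinates; a minimal sketch (function name mine):

```python
import numpy as np

def to_normalized_image_plane(keypoints_uv, K):
    """Project pixel keypoints (N, 2) onto the normalized image plane at Z = 1.

    K is the 3x3 intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    n = keypoints_uv.shape[0]
    homogeneous = np.hstack([keypoints_uv, np.ones((n, 1))])   # (N, 3)
    normalized = (np.linalg.inv(K) @ homogeneous.T).T          # (N, 3), last column is 1
    return normalized[:, :2]
```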
Non-camera Sensor Data
In autonomous driving, sensor data other than camera images are often available to increase sensor redundancy and system robustness. One sensor found in nearly every ADAS sensor suite today (besides the ubiquitous cameras) is radar.
Most commercial radars today pump out extremely sparse radar points (a varying number per frame, with a maximum of 32 to 128 points per frame depending on the radar model). This is three to four orders of magnitude smaller than the hundreds of thousands of points per scan from lidar sensors. It is therefore natural to view radar data (or radar pins) as a type of metadata supplementing and describing camera images. Below is an intuitive comparison of the typical density of radar and lidar data in the same scene, which is representative of autonomous driving.
Note: There are more advanced radar systems that output many hundreds or thousands of points per frame, but these so-called high-resolution “imaging radars” (such as this one by Astyx) have limited commercial availability and cost much more than conventional radars.
There is abundant literature on performing 3D object detection on lidar data alone or on fused lidar and camera data (such as Frustum PointNet, AVOD, MV3D, etc.), but little literature on early fusion of sparse radar pins and camera images. This is partially due to the lack of public datasets with radar data, and partially due to the noisy nature of radar data and its lack of elevation information. I therefore hope the release of the nuScenes dataset will bring more attention to this critical yet understudied field.
The mainstream method to fuse radar and image data is to find ways to “densify” the radar data into an image. In Distant Detection: Distant Vehicle Detection Using Radar and Vision (ICRA 2019), a varying number of radar pins per frame are encoded into a 2-channel image with the same spatial size as the camera image, one channel encoding the range (distance measurement) and the other the range rate (radial velocity). Each radar pin is marked as a circle instead of a single pixel to increase the influence of each point in the training process and to reflect the noisy nature of the radar measurements in both bearing and height. The radar pins are projected onto the camera image using the radar-to-camera extrinsic calibration and the camera’s intrinsic calibration. The fusion network is relatively straightforward and I will skip it here, as our focus is on radar data representation for CNNs.
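A minimal sketch of this kind of densification, assuming OpenCV for drawing; the circle radius, channel ordering, and function name are my own choices rather than the paper’s exact settings:

```python
import cv2
import numpy as np

def radar_to_image(radar_pins, K, T_cam_radar, image_shape, radius=8):
    """Paint radar pins into a 2-channel pseudo-image (range, range rate).

    radar_pins: (N, 4) array of [x, y, z, range_rate] in the radar frame.
    K: 3x3 camera intrinsic matrix. T_cam_radar: 4x4 radar-to-camera extrinsics.
    """
    h, w = image_shape
    out = np.zeros((2, h, w), dtype=np.float32)

    for x, y, z, range_rate in radar_pins:
        # Transform the pin into the camera frame and project it onto the image.
        p_cam = T_cam_radar @ np.array([x, y, z, 1.0])
        if p_cam[2] <= 0:          # behind the camera
            continue
        uvw = K @ p_cam[:3]
        u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
        if not (0 <= u < w and 0 <= v < h):
            continue
        rng = float(np.linalg.norm(p_cam[:3]))
        # Draw a filled circle rather than a single pixel to reflect the
        # angular uncertainty of the radar measurement.
        cv2.circle(out[0], (u, v), radius, rng, thickness=-1)
        cv2.circle(out[1], (u, v), radius, float(range_rate), thickness=-1)
    return out
```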
In RVNet: Deep Sensor Fusion of Monocular Camera and Radar for Image-based Obstacle Detection in Challenging Environments (PSIVT 2019), the radar pins are also projected onto the camera image plane to form a sparse radar image. This time it has three channels: depth, lateral velocity, and longitudinal velocity. Note that the velocities here are compensated for the ego vehicle’s motion and thus cannot be represented by a single range-rate channel. (The authors also presented a dense radar image encoding, which does not make sense to me and is omitted here.)
In the above two methods, radar pins are projected onto the camera image, and the projection point is either used as a single pixel or given a constant spatial extent. One possible improvement is to use a disk whose size varies with distance, as done in RRPN (Radar Region Proposal Network, ICIP 2019). This better reflects the spatial uncertainty of radar pins, as, in theory, the projection of a nearby radar pin has more lateral spatial uncertainty than that of a distant one.
CRF-Net: A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection (SDF 2019) paints each radar point as a vertical line. The lines start from the ground and extend 3 meters upward in 3D, so their painted lengths in the image are not uniform. Parse Geometry from a Line (ICRA 2017) used a similar technique to densify a single-line lidar measurement into a dense reference depth frame.
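A sketch of the vertical-line painting idea, assuming the radar pins lie on the ground plane and the radar frame’s z axis points up; the helper names and the single range-valued channel are my own simplifications:

```python
import cv2
import numpy as np

def paint_radar_as_lines(radar_pins, K, T_cam_radar, image_shape, height=3.0):
    """Paint each radar pin as a vertical line from the ground up to `height` meters.

    radar_pins: (N, 3) points in the radar frame, assumed to lie on the ground plane.
    Returns a single-channel image whose painted pixel value is the pin's distance.
    """
    h, w = image_shape
    canvas = np.zeros((h, w), dtype=np.float32)

    for pin in radar_pins:
        # Bottom of the line is the pin itself; the top is `height` meters above it,
        # assuming the radar frame's z axis points up.
        bottom = T_cam_radar @ np.append(pin, 1.0)
        top = T_cam_radar @ np.append(pin + np.array([0.0, 0.0, height]), 1.0)
        if bottom[2] <= 0 or top[2] <= 0:      # behind the camera
            continue
        (u0, v0), (u1, v1) = [
            (int(p[0] / p[2]), int(p[1] / p[2])) for p in (K @ bottom[:3], K @ top[:3])
        ]
        rng = float(np.linalg.norm(pin))
        cv2.line(canvas, (u0, v0), (u1, v1), rng, thickness=2)
    return canvas
```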
In addition, the above-mentioned RRPN (Radar Region Proposal Network) also presents an interesting way to use radar to generate region proposals. This is based on the observation that almost every object in the nuScenes dataset has corresponding radar pins, so radar data can be used as a robust region proposal mechanism. To accommodate the spatial uncertainty of the radar measurements, the anchors are not always centered on the projected radar pin.
In summary, all the above methods (except RRPN) convert radar pins to a pseudo-image and use CNN to extract higher-level information.
Lidar Point Cloud
As mentioned above, due to the dense nature of point clouds, it is possible to perform object detection directly on lidar data, so it may be improper to view lidar data as metadata for camera images. Yet in the sense that a point cloud contains a varying number of unordered points that do not lie on a regular grid, lidar data is unstructured just like radar data.
There have been numerous efforts to perform early fusion of lidar and image data before feeding them into the neural network. MV3D: Multi-View 3D Object Detection Network for Autonomous Driving (CVPR 2017) converts lidar points into two types of pseudo-image: bird’s eye view (BEV) and front view (FV). The BEV maps are discretized grids with 0.1 m resolution, consisting of multiple height maps, one density map, and one intensity map. The FV follows the convention of VeloFCN: Vehicle Detection from 3D Lidar Using Fully Convolutional Network (RSS 2016); note that this is different from projecting the lidar points onto the camera image. Three different networks then extract features from the BEV images, FV images, and RGB images, and the features are concatenated for fusion.
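A minimal sketch of the BEV discretization; the grid extent, resolution, and number of height slices below are placeholders, and MV3D additionally normalizes the density and intensity channels:

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 1.0), resolution=0.1, n_height_slices=3):
    """Discretize a lidar point cloud (N, 4: x, y, z, intensity) into BEV maps.

    Returns (n_height_slices + 2, H, W): per-slice max-height maps,
    a point-density map, and a max-intensity map.
    """
    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((n_height_slices + 2, h, w), dtype=np.float32)

    # Keep points inside the grid and compute their cell indices.
    x, y, z, intensity = points.T
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z, intensity = x[keep], y[keep], z[keep], intensity[keep]
    rows = ((x - x_range[0]) / resolution).astype(int)
    cols = ((y - y_range[0]) / resolution).astype(int)
    slices = (z - z_range[0]) / (z_range[1] - z_range[0]) * n_height_slices
    slices = np.clip(slices.astype(int), 0, n_height_slices - 1)

    for r, c, s, zz, ii in zip(rows, cols, slices, z, intensity):
        bev[s, r, c] = max(bev[s, r, c], zz - z_range[0])   # height map per slice
        bev[n_height_slices, r, c] += 1.0                   # density map
        bev[n_height_slices + 1, r, c] = max(bev[n_height_slices + 1, r, c], ii)
    return bev
```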
LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving (arXiv 2019) proposed a different method to encode lidar points. The range view (RV) is generated by directly mapping the laser ID to rows and discretizing the azimuth angle into columns. One advantage of this representation is that it is naturally compact. It has five channels: range (distance), height, azimuth angle, intensity, and a flag indicating whether the cell contains a point.
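A sketch of the range-view packing; the number of lasers and azimuth bins are placeholders that depend on the sensor:

```python
import numpy as np

def lidar_to_range_view(points, laser_ids, n_lasers=64, n_azimuth_bins=512):
    """Pack lidar returns into a range-view image of shape (5, n_lasers, n_azimuth_bins).

    points: (N, 4) array of x, y, z, intensity; laser_ids: (N,) ring index per point.
    Channels: range, height, azimuth angle, intensity, occupancy flag.
    """
    rv = np.zeros((5, n_lasers, n_azimuth_bins), dtype=np.float32)
    x, y, z, intensity = points.T
    rng = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    azimuth = np.arctan2(y, x)                                     # (-pi, pi]
    cols = ((azimuth + np.pi) / (2 * np.pi) * n_azimuth_bins).astype(int)
    cols = np.clip(cols, 0, n_azimuth_bins - 1)

    for row, col, r, zz, az, ii in zip(laser_ids, cols, rng, z, azimuth, intensity):
        rv[0, row, col] = r        # range
        rv[1, row, col] = zz       # height
        rv[2, row, col] = az       # azimuth angle
        rv[3, row, col] = ii       # intensity
        rv[4, row, col] = 1.0      # flag: cell contains a point
    return rv
```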
In summary, despite higher density than radar pins, lidar points can also be packed into pseudo-images for CNNs to consume. Similar to the alternative approach of using fully connected layers to consume sparse metadata, we can also use PointNet (CVPR 2017) to consume unordered point cloud data directly.
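For reference, the core of PointNet’s permutation invariance is a shared per-point MLP followed by a symmetric (max) pooling operation; a minimal PyTorch-style sketch (the layer sizes are mine, not the paper’s exact architecture):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: shared per-point MLP + max pooling."""

    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across all points.
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, points):               # points: (B, in_dim, N)
        per_point = self.shared_mlp(points)  # (B, feat_dim, N)
        # Max pooling over the point dimension makes the feature
        # invariant to the order of the input points.
        return per_point.max(dim=2).values   # (B, feat_dim)

# A global feature for 1,000 unordered 3D points per sample:
# feat = TinyPointNet()(torch.randn(2, 3, 1000))  # -> (2, 256)
```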
Correspondence/Association Data
Another type of metadata is association data, for example, traffic-light-to-lane association. Metadata Fusion: Deep Metadata Fusion for Traffic Light to Lane Assignment (IEEE RA-L 2019) proposed a method to fuse heterogeneous metadata (the results of traffic light, lane arrow, and lane marking detection) with camera features. The metadata are encoded in the form of Metadata Feature Maps (MFMs), which are essentially binary attention maps, and are fused with intermediate feature maps from the camera images. The association groundtruth and prediction are also encoded as one-dimensional vectors representing lateral spatial locations.
In this work, the Metadata Feature Maps (MFMs) are element-wise multiplied with the first F=12 channels of an intermediate feature map. This proves to be slightly better than directly concatenating the MFMs with the image feature maps.
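A sketch of this attention-style fusion; the tensor shapes and the split into the first F channels follow my reading of the description above rather than the paper’s exact code:

```python
import torch

def fuse_metadata_feature_maps(feature_map, mfm, num_modulated=12):
    """Element-wise multiply binary metadata maps into the first channels of a feature map.

    feature_map: (B, C, H, W) intermediate CNN features.
    mfm: (B, num_modulated, H, W) binary metadata feature maps (0/1 attention).
    """
    modulated = feature_map[:, :num_modulated] * mfm
    untouched = feature_map[:, num_modulated:]
    return torch.cat([modulated, untouched], dim=1)
```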
Prior object detection results
Sometimes it is useful to feed object detection bounding boxes into other learning pipelines. However, the number of bounding boxes is not constant, so from this perspective they can also be viewed as metadata. One way to handle this is to convert the bounding boxes into heatmaps. In ROLO: Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking, the object detection results are converted to a heatmap to guide the learning of features that are both spatially and temporally consistent for video object detection and tracking.
In Pixels to Graphs by Associative Embedding (NIPS 2017), prior detections are incorporated by formatting object detections as a two-channel input: one channel contains one-hot activations at the centers of the bounding boxes, and the other provides a binary mask of the boxes. Multiple boxes can be displayed on these two channels, with the second channel indicating the union of their masks. If there are too many bounding boxes and the mask channel gets too crowded, the masks can be separated by bounding box anchor and placed into different channels.
To reduce computation cost, these additional inputs are not integrated in the input layer but rather incorporated after several layers of convolution and pooling.
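A minimal sketch of the two-channel encoding described above (function name mine):

```python
import numpy as np

def encode_prior_boxes(boxes, image_shape):
    """Encode prior detections as a 2-channel map: box-center one-hots and a mask union.

    boxes: list of (x1, y1, x2, y2) in pixel coordinates.
    """
    h, w = image_shape
    centers = np.zeros((h, w), dtype=np.float32)
    masks = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cu, cv = int((x1 + x2) / 2), int((y1 + y2) / 2)
        centers[np.clip(cv, 0, h - 1), np.clip(cu, 0, w - 1)] = 1.0  # one-hot at the box center
        masks[int(y1):int(y2), int(x1):int(x2)] = 1.0                # union of box masks
    return np.stack([centers, masks])
```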
Takeaway
- Metadata is usually unordered and does not lie on a regular grid. The amount of metadata per image is usually not constant, making it hard to use a fixed neural network structure with a fixed input dimension.
- If the metadata has a fixed length per camera image, it may be possible to use fully connected layers to fuse the metadata with camera feature maps.
- If the metadata is unordered, such as radar pins or lidar point clouds, an alternative is a PointNet-style structure that is invariant to the permutation of the input order.
- The most generic way to consume metadata with a CNN is to convert the metadata into some form of pseudo-image with regular grid spacing. Ideally, the pseudo-image should live in, or be transformable to, the same spatial domain as the image data.
References
- CoordConv: An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, NeurIPS 2018
- CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth, CVPR 2019
- MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation, ICCV 2019
- Distant Detection: Distant Vehicle Detection Using Radar and Vision, ICRA 2019
- RVNet: Deep Sensor Fusion of Monocular Camera and Radar for Image-based Obstacle Detection in Challenging Environments, PSIVT 2019
- CRF-Net: A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection, SDF 2019
- RRPN: Radar Region Proposal Network, ICIP 2019
- Parse Geometry from a Line: Monocular Depth Estimation with Partial Laser Observation, ICRA 2017
- LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving, arXiv 2019
- VeloFCN: Vehicle Detection from 3D Lidar Using Fully Convolutional Network, RSS 2016
- MV3D: Multi-View 3D Object Detection Network for Autonomous Driving, CVPR 2017
- Metadata Fusion: Deep Metadata Fusion for Traffic Light to Lane Assignment, IEEE RA-L 2019
- ROLO: Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking, ISCAS 2016
- Pixels to Graphs by Associative Embedding, NIPS 2017
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.