A Unified Framework for Depth Prediction from a Single Image and Binocular Stereo Matching

Abstract: Depth information has long been an important issue in computer vision. Methods for obtaining it can be categorized into (1) depth prediction from a single image and (2) binocular stereo matching. However, these two methods are generally regarded as separate tasks, which are accomplished with different network architectures when using deep-learning-based methods. This study argues that the two tasks can be achieved using a single network with the same weights. We modify existing stereo matching networks to perform both tasks. We first enable the network to accept both a single image and an image pair by duplicating the left image when the right image is absent. Then, we introduce a training procedure that alternately selects training samples from single-image depth prediction and binocular stereo matching. In this manner, the trained network can perform both tasks, and single-image depth prediction even benefits from stereo matching to achieve better performance. Experimental results on the KITTI raw dataset show that our model achieves state-of-the-art performance for accomplishing depth prediction from a single image and binocular stereo matching in the same architecture, and that the stereo matching results of our method and the original models are similar. This demonstrates that our method makes stereo matching networks capable of depth prediction from a single image while keeping their performance for stereo matching.


Introduction
Effective methods that can extract or estimate accurate depth information from camera images have long been pursued in various disciplines. Depth information is the distance from the camera to objects in the scene. It plays important roles in modeling 3D environments. Therefore, accurate depth information is critical in many vision-related applications, e.g., autonomous driving [1], augmented reality [2], and mixed reality [3].
Some methods use sensors, e.g., laser radar [4] or structured light cameras [5], to measure depth directly. These methods are expensive and are highly dependent on the environment. Direct methods are limited to specific scenes, and most of them obtain sparse depth maps [6], which need further completion before being usable. Depth information can also be estimated from camera images [7,8]. Traditional methods extract various features from camera images and aggregate these features to map from images to depth information [9,10]. However, these methods often obtain blurry and inaccurate results.
Deep-learning-based methods have been widely used to estimate depth from images. Most of these methods are categorized into two classes, namely (1) depth prediction from a single image [11][12][13][14] and (2) binocular stereo matching [15][16][17][18]. Depth prediction from a single image generates a depth map for a single-view image, whereas binocular stereo matching takes as input a rectified image pair and outputs its disparity map. Note that depth z and disparity d can be transformed mutually according to Equation (1), z = f B / d, where f is the focal length of the camera and B is the distance between the camera centers (the baseline).
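The conversion in Equation (1) can be sketched as a pair of one-line helpers. The calibration values below are illustrative assumptions, not values from this study:

```python
def disparity_to_depth(d, f, b):
    """Depth from disparity via Equation (1): z = f * B / d,
    with focal length f in pixels and baseline B in metres."""
    return f * b / d

def depth_to_disparity(z, f, b):
    """Inverse mapping: d = f * B / z."""
    return f * b / z

# Round trip with illustrative KITTI-like calibration values
# (f ~ 721 px, B ~ 0.54 m; both are assumptions for this example).
z = disparity_to_depth(30.0, 721.0, 0.54)
d = depth_to_disparity(z, 721.0, 0.54)  # recovers 30.0
```

Because the mapping is a simple reciprocal scaling, a network that regresses disparity can serve both tasks once f and B are known.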
Recently, fundamental innovations have been made to benefit one task from the other. In this context, stereo matching methods are used to enhance monocular methods [19]. Stereo cues help monocular methods to mitigate the scarcity of dense ground-truth depth data. Considering that depth z and disparity d can be transformed mutually, we can estimate disparity for both tasks uniformly, with f and B known. However, two research challenges remain before binocular stereo matching and depth prediction from a single image can be accomplished in a unified framework:
• Most deep learning frameworks take a fixed number of inputs. However, the number of input images differs between the two tasks.
• A single framework can perform well on one specific task, but it is hard to guarantee its performance on the other. In other words, it is nontrivial to ensure that the framework is optimized towards both tasks [20,21].
To tackle these challenges, an appropriate solution should be able (1) to effectively handle a different number of input images and, at the same time, (2) to unify stereo matching and depth prediction from a single image in the same architecture to ensure that the framework is trained towards both tasks. This study develops a unified framework for both depth prediction from a single image and binocular stereo matching, namely DoubleNet. The framework incorporates a module to handle different types of inputs and a unified architecture for both tasks. This study also introduces a novel training procedure to optimize the unified architecture. The proposed framework can accomplish stereo matching and depth prediction from a single image in the same architecture with the same parameters.
To evaluate the performance of DoubleNet, this study carried out a number of experiments on the challenging KITTI Raw dataset [22]. Experimental results demonstrate that the proposed method can perform both depth prediction from a single image and binocular stereo matching tasks simultaneously and achieves state-of-the-art performance. Moreover, single-image depth estimation benefits from stereo matching by achieving better performance than when treating these two tasks separately.
The main contributions of this study are three-fold:
• A solution has been developed to accomplish depth prediction from a single image and binocular stereo matching simultaneously.
• This study explores the interaction between monocular and stereo methods. It proposes to make single-image depth estimation benefit from stereo matching.
• A number of experiments have been conducted to prove the effectiveness of the proposed unified framework. Experimental results have demonstrated its performance.

Related Work
Successful attempts have been made to improve the performance of depth estimation from camera images, which has long been a notoriously ill-posed task. Studies undertaken for this purpose mainly focus on (1) estimating the disparity for a pair of stereo images or (2) inferring the depth map for a single image.

Binocular Stereo Matching
Recently, deep learning methods have gained popularity in stereo matching. Early work employed deep convolutional neural networks (CNNs) to calculate the matching cost between left and right images [23]. Subsequent methods tried to extract better features from images [24], to aggregate different features [25], and to use various techniques to refine the estimation [17]. With the emergence of large-scale synthetic dense data, a great number of end-to-end methods have been proposed for disparity estimation from camera images. These methods can roughly be categorized into two classes:
• In the first class of methods, the correlation between left and right feature maps is used to form a cost volume. The formed cost volume is processed by a series of convolution and transposed-convolution layers, i.e., an encoder-decoder-like structure. DispNetC [15] was the first to suggest this paradigm. Built on top of DispNetC, more variants were proposed, e.g., CRL [26], iResNet [17], and DispNet3 [27]. Additionally, some similar networks used cues from edges or segmentation to enhance the estimation, e.g., EdgeStereo [28] and SegStereo [29].
• In the second class of methods, the concatenation or difference of left and right features is used to form a 3D cost volume. From this cost volume, 3D CNN layers are employed to extract disparity information. Typical architectures along this direction include GC-Net [16] and PSMNet [18].
Regardless of the different methods used to calculate cost volumes, both classes of methods encode joint information from the left and right images. This information is crucial for inferring disparity, which is usually accomplished using one or more encoder-decoder-like structures.
Besides, unsupervised methods have also been explored in stereo matching. Joung et al. [30] proposed to compute the matching cost with CNNs in an unsupervised manner. They combined image-domain learning with stereo epipolar constraints to obtain state-of-the-art performance.

Depth Prediction from a Single Image
Deep learning methods have helped reduce the heavy work of feature engineering [31,32]. In the context of deep learning methods, depth prediction from a single image often uses an encoder-like network to extract a feature map for a single image. Then, this feature map is upsampled and used to regress the depth map.
Eigen et al. [33] first applied a multi-scale CNN architecture to predict depth maps from monocular images, which helps capture image details. Following this, some other CNN-based methods [34] were proposed to estimate monocular depth. Xu et al. [35] combined CNN and conditional random field to improve the smoothness of estimated depth maps.
In Reference [36], the authors introduced a novel term to jointly constrain depth and occluding contour predictions. They used synthetic datasets for training and real datasets for fine-tuning. Their method brought better accuracy along occluding contours.
To extract high-level structure from single images, Osuna-Coutiño et al. [37] proposed to utilize region-wise analysis for better depth estimation. They segmented the depth in semantic orientations to extract high-level structures. Their work contributes to depth estimation from a single image, which has few parallax constraints.
Considering the difficulty of obtaining dense depth maps, some studies used unsupervised methods to estimate monocular depth maps. Garg et al. [38] and Clement et al. [39] proposed to use the estimated inverse depth map and the right image to reconstruct the left image. The reconstruction error was used as the loss function to train the network. Luo et al. [19] first synthesized a right image from the input left image using a view-synthesis network. Then, they used another network to perform stereo matching on this image pair to generate the inverse depth map. Kuznietsov et al. [40] used deep networks to produce photo-consistent dense depth maps in a stereo setup using a direct image alignment loss. Their semi-supervised method estimates reliable depth maps in realistic dynamic outdoor environments. Godard et al. [41] proposed a minimum reprojection loss to handle occluded areas for better depth estimation. To reduce visual artifacts, they also designed a novel multi-scale sampling method. Their improvements produced better depth maps, both quantitatively and qualitatively.
Goldman et al. [42] proposed a novel self-supervised method for depth estimation. During training, they used two twin networks to predict depth maps for both left and right images. During testing, only one of the two networks is used. Their method provided better self-supervised estimation results, and it also performed well on unseen datasets.

Similarity
The input and output of the two tasks are similar. Both estimate depth maps from camera images. This accounts for their similarity of functionality.
From the perspective of network structure, the two tasks are also similar. Both use encoder-decoder-like structures. They extract feature maps from input images and upsample the feature maps to regress depth maps. This accounts for their similarity of implementation.
The similarity of functionality and implementation indicates that the two tasks can be accomplished in the same architecture.

Differences
Although they have similar functionality and implementation, the binocular and monocular methods are different. Their differences need to be considered to accomplish them in only one architecture.
Binocular stereo matching networks take as input left and right images, while monocular networks take only a single left image. This means that the encoders of the two architectures are different: binocular networks encode information from both left and right images, whereas monocular networks encode only left-image information.
The way to regress depth or disparity maps is also different. In binocular stereo matching networks, disparity maps are regressed from the cost volume. The cost volume contains the relationship between pairs of pixels, which contributes to the regression of disparity maps. However, monocular networks regress a depth map directly from a feature map, which makes the estimation ill-posed.
The two methods are different in the number of inputs and the way to regress the final results. These differences need to be tackled before the two tasks can be accomplished in the same architecture.

Method
Depth maps are estimated by either binocular stereo matching or from a single image. These two tasks are similar but have their differences. This section describes how the two tasks are accomplished in the same architecture using the same parameters.

Overview of the Proposed Unified Framework
Although different from each other, the two kinds of networks enjoy similarities in functionality and implementation. Figure 1 illustrates an overview of the proposed framework.
The framework consists of three functional modules, i.e., F1, F2, and F3. F1 is the module that handles different numbers of input camera images, F2 is the module that forms the cost volume, and F3 is the module that performs depth or disparity regression.
To accomplish them simultaneously, an architecture must be capable of dealing with different types of input, i.e., one single image or a pair of images. Moreover, it is necessary to design a regression sub-architecture. This sub-architecture can regress the results from either a feature map for one image or a cost volume for an image pair.
The F1 module detects whether the input is a monocular image or a rectified pair of stereo images. Then, it separates them into left and (fake) right images. These two images are taken as input by the F2 module.
Two approaches are available to accomplish both tasks in one architecture, corresponding to the two kinds of networks. On the one hand, binocular networks take as input a pair of images. This study regards monocular methods as binocular ones without right images; to accomplish the monocular task, the absent input needs to be filled. On the other hand, monocular networks are fed with only single images; to accomplish the binocular task, an extra branch is needed for the right images. Considering that the former is easier, this study uses binocular stereo matching networks as the basic framework, i.e., DispNetC and PSMNet. Figures 2 and 3 illustrate the detailed architectures used in this study. The details of DispNetC and PSMNet can be found in their papers and code. Two policies are available for the F1 module to fill the absent right image: (1) duplicating the left image as the right one and (2) filling the right image with a certain value, e.g., 0. With either policy, the F1 module outputs an (H, W, 6) tensor regardless of the input.
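The input-handling behaviour can be sketched in a few lines. The function name `f1_module` and the use of numpy arrays are illustrative assumptions; the study implements this inside a PyTorch network:

```python
import numpy as np

def f1_module(left, right=None, fill="cross"):
    """Sketch of the F1 input module: always emit an (H, W, 6) tensor.
    If the right image is absent, either duplicate the left image
    ("cross") or fill the right slot with zeros ("zero")."""
    if right is None:
        if fill == "cross":
            right = left.copy()          # duplicate the left image
        else:
            right = np.zeros_like(left)  # constant-value placeholder
    return np.concatenate([left, right], axis=-1)

left = np.random.rand(4, 8, 3)
pair = f1_module(left)                              # monocular -> (4, 8, 6)
stereo = f1_module(left, np.random.rand(4, 8, 3))   # binocular -> (4, 8, 6)
```

Because both policies produce the same tensor shape, the downstream modules never need to know whether one or two images were supplied.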

Module to Form a Cost Volume
The proposed framework regards the (H, W, 6) input as the combination of two images. For stereo matching, these are the left and right images. For depth prediction from a single image, they are the left image and its corresponding learning cues. The F2 module focuses on how to form a cost volume, i.e., a shared feature representation for both tasks. Available schemes include (1) stacking left and right feature maps, e.g., PSMNet, and (2) calculating the correlation between the two feature maps, e.g., DispNetC.
PSMNet forms a 4D cost volume by concatenating left feature maps with their corresponding right feature maps across each disparity level. Then, it uses 3D convolution layers to learn matching cost estimation. For stereo matching, this approach performs well. However, for depth prediction from a single image, the sources used to form the cost volume are the left image and either its duplicate or a constant-value tensor. This makes it hard to learn a meaningful cost volume by only rearranging the order of pixels.
An alternative approach is to explicitly calculate the correlation between the two feature maps, as in DispNetC. The explicit calculation is effective for both types of inputs. When fed with a pair of stereo images, the F2 module calculates the correlation between them and outputs their matching cost as the cost volume. For a single left image and its duplicate, the F2 module calculates their self-similarity to form the cost volume.
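The correlation scheme can be sketched as follows, again assuming C x H x W numpy feature maps (DispNetC's correlation layer additionally windows the search range and runs on GPU tensors):

```python
import numpy as np

def correlation_volume(fl, fr, max_disp):
    """DispNetC-style 1D correlation sketch: each disparity plane is the
    channel-wise mean of the product of the left features and the right
    features shifted by d pixels along the epipolar line."""
    C, H, W = fl.shape
    corr = np.zeros((max_disp, H, W), dtype=fl.dtype)
    for d in range(max_disp):
        corr[d, :, d:] = (fl[:, :, d:] * fr[:, :, : W - d]).mean(axis=0)
    return corr
```

For a duplicated left image, the d = 0 plane reduces to the mean of squared activations at each pixel, so the volume still carries image-dependent self-similarity information rather than a constant pattern.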

Module to Regress a Depth Map
The F1 module enables a stereo matching network to accept both types of inputs. The F2 module forms a cost volume as the final feature from which to regress the result. To regress depth maps, a third module, F3, needs to upsample the cost volume and to recover the lost details. In fact, the decoder parts are almost the same for binocular and monocular networks. Therefore, this study does not adjust the network structure for regressing depth maps. Instead, the difference between regressing depth maps from a cost volume and from a single feature map is handled by the training procedure.

Procedure to Train the Network
This study aims to accomplish both tasks using the same architecture and the same parameters. Stereo matching networks are used as the backbone architecture. The different inputs are handled by the F1 module. The training procedure makes the F3 module capable of learning parameters that extract features and regress depth maps fairly towards both tasks.
With the rapid development of big data [43,44], and considering that learning-based methods are highly dependent on the data themselves, this study dynamically changes the data used to train a unified framework for both tasks. Specifically, the architecture alternately selects training examples from the two tasks during each iteration. This ensures that the model is optimized towards both tasks. From the aspect of functionality, F3 can be regarded as a many-to-one mapping function that takes the cost volume as input and produces depth as output. In F3, our method adds additional regularization when optimizing the model compared with the original model, which is optimized towards only one task.
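The alternating selection can be sketched as a simple interleaved generator; the loader names and the strict one-to-one interleaving are illustrative assumptions about the schedule:

```python
def alternating_batches(mono_loader, stereo_loader):
    """Sketch of the alternating sample selection: each iteration yields
    a monocular batch and then a binocular one, so a single set of
    weights receives gradients from both tasks."""
    for mono, stereo in zip(mono_loader, stereo_loader):
        yield "mono", mono
        yield "stereo", stereo

# Illustrative schedule over dummy "batches" 0, 1, 2.
schedule = list(alternating_batches(range(3), range(3)))
# -> [('mono', 0), ('stereo', 0), ('mono', 1), ('stereo', 1), ...]
```

In a real training loop, each yielded batch would pass through the F1 module (which duplicates the left image for the monocular batches) before a shared forward and backward pass.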

Loss Function
The framework was trained with both supervised and unsupervised loss functions, as in Equation (2), L = L_s + λ L_u, where λ is the hyper-parameter that balances the supervised loss L_s and the unsupervised loss L_u.
The supervised loss is defined in Equation (3), where N is the number of valid pixels in the ground-truth depth map, d is the ground-truth depth map, and d̂ is the predicted one.
The real ground-truth data are often sparse. This motivates the use of unsupervised methods. To improve the results, this study adopts an unsupervised loss to train the framework. The unsupervised loss is defined in Equation (4), where I_left and I_right represent the image pair and Warp(I, d) denotes the image warped from image I with disparity d.
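A minimal numpy sketch of the combined objective. The L1 penalty is one common choice, not necessarily the exact form of Equations (3) and (4), and the warping operator itself (typically a bilinear grid sample) is assumed to be computed elsewhere:

```python
import numpy as np

def supervised_loss(d_gt, d_pred, valid):
    """Mean L1 error over the N valid ground-truth pixels (a common
    choice for the supervised term in Equation (3))."""
    return np.abs(d_gt[valid] - d_pred[valid]).mean()

def total_loss(d_gt, d_pred, valid, left, warped_right, lam=0.5):
    """Combined objective of Equation (2): supervised term plus a
    lambda-weighted photometric term comparing the left image with the
    right image warped by the predicted disparity."""
    unsup = np.abs(left - warped_right).mean()
    return supervised_loss(d_gt, d_pred, valid) + lam * unsup
```

The `valid` mask is what lets sparse LiDAR-style ground truth supervise only the pixels where depth is known, while the photometric term covers the rest of the image.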

Results
Experimental studies were conducted (1) to select the best input scheme for the proposed framework, (2) to compare different training procedures of the proposed framework and to evaluate the depth estimation performance of the proposed framework, and (3) to validate that the proposed framework can accomplish both depth prediction from a single image and binocular stereo matching in the same architecture using the same parameters. First, we give the implementation details of the proposed framework. Then, the evaluation metrics are introduced. Finally, the quantitative and qualitative results are given to illustrate the effectiveness of the proposed framework.

Implementation Details
We implemented the proposed framework on the PyTorch platform. The experiments were conducted on a single NVIDIA Titan XP GPU with 12 GB of memory. The model was first pretrained on the large synthetic Scene Flow dataset, which contains about 30,000 training image pairs and 4370 test image pairs. Then, we fine-tuned the network on the training split of the KITTI dataset (used by Eigen et al.) to obtain the final model. The model was optimized using the Adam method with β1 = 0.9 and β2 = 0.999. It was trained using the loss functions defined in Section 3.3, with λ = 0.5.
We chose DispNetC and PSMNet as backbone architectures. We used the proposed training procedure to train DispNetC on the Scene Flow dataset for 100 epochs and fine-tuned it on the KITTI raw dataset for 80 epochs. The learning rate was set to 10^-4 for the first 50 epochs and to 10^-5 for the remaining epochs. PSMNet was trained on the Scene Flow dataset for 20 epochs and fine-tuned on the KITTI raw dataset for 40 epochs. The learning rate was set to 10^-3 for the first 10 epochs and to 10^-4 for the remaining epochs.

Evaluation Metrics
To align with the method in Reference [33], we evaluated our framework with its standard error metrics, where d and d̂ denote the ground-truth and predicted depth maps, respectively; N denotes the total number of pixels; and i denotes the pixel position. We also measured the accuracy, for which higher is better. Taking t_i as the threshold value, three accuracy metrics (δ1, δ2, and δ3) were defined as in Equation (5), i.e., the fraction of pixels satisfying max(d_i/d̂_i, d̂_i/d_i) < t_i, with t_i = 1.25, 1.25², and 1.25³.
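For concreteness, a sketch of the threshold accuracies together with two of the conventional error metrics from Eigen et al. [33] (absolute relative error and RMSE); the exact set of error metrics in Equation (5)'s companion definitions is assumed here:

```python
import numpy as np

def eigen_metrics(d_gt, d_pred):
    """Single-image depth metrics in the style of Eigen et al. [33]:
    absolute relative error, RMSE, and threshold accuracies
    delta_i = fraction of pixels with max(d/d_hat, d_hat/d) < 1.25**i."""
    abs_rel = np.mean(np.abs(d_gt - d_pred) / d_gt)
    rmse = np.sqrt(np.mean((d_gt - d_pred) ** 2))
    ratio = np.maximum(d_gt / d_pred, d_pred / d_gt)
    deltas = [float(np.mean(ratio < 1.25 ** i)) for i in (1, 2, 3)]
    return abs_rel, rmse, deltas
```

A perfect prediction yields zero errors and threshold accuracies of 1.0; looser thresholds (δ2, δ3) are always at least as high as δ1.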

Comparison between Different Input Schemes
As mentioned in Section 3.1.1, there are two schemes for the proposed framework to handle the missing right input image, i.e., (1) duplicating the left one and (2) filling the right one with a certain value, e.g., 0.
The performance of depth prediction from a single image with different input schemes is shown in Table 1. Note that an architecture with the suffix Cross duplicates the left image as the right one, whereas Zero fills the right image input with zeros. It can be observed that, for both DispNetC and PSMNet, duplicating the left image as the right input achieves a much better performance than setting it to zero in terms of all the metrics. Besides, DispNetC is better than PSMNet. This is because F2 in DispNetC takes as input the concatenation of the cost volume (calculated using correlation) and the left features, so the network can learn depth information from those left features. However, in PSMNet, F2 takes as input only the cost volume (calculated by concatenating left and right features) and applies 3D convolutions to it, which is intuitively and experimentally not suitable for learning depth information.
Table 2 compares our method with some typical methods for depth prediction from a single image. Note that the LRC method uses a similar U-net architecture to DispNetC, but LRC uses more convolution layers and outputs disparity maps at half resolution. Therefore, LRC outperforms DispNetC(Mono), which is trained only with left images. However, when using our proposed training procedure, the performance of DispNetC(Cross) is better than that of LRC. This result indicates that our training procedure encodes knowledge from binocular stereo matching, which is helpful for single-image depth estimation. On the contrary, PSMNet(Mono) performs better than PSMNet(Cross). This means that the stereo knowledge learned by PSMNet could not be well exploited for single-image depth estimation. The proposed method did not perform better than DORN [45], which can be explained by the more complicated structure employed in DORN. However, the main purpose of this study is to offer a unified framework for both single-image depth estimation and stereo matching, so it remains possible to design a more flexible and well-designed stereo matching network to obtain better performance.

Accomplishing Both Tasks in the Same Architecture
As shown in Table 3, the (Stereo) results were obtained using the original trained models provided by the authors, whereas the (Cross) results were obtained using models trained with our method. The performances of our method and the original models are similar. This demonstrates that our method makes stereo matching networks capable of depth prediction from a single image while keeping their performance for stereo matching.
Figure 4 shows qualitative results of the proposed method on the KITTI raw dataset. The ground-truth disparity maps are interpolated for better visualization. It can be observed that, with the proposed method, the model can estimate depth from both a single image and a pair of images. Using a pair of images generates better results, since it utilizes more information than using a single image.

Conclusions
This study developed a unified framework to estimate depth from either single images or binocular image pairs, namely DoubleNet. The DoubleNet framework can accomplish both depth prediction from a single image and binocular stereo matching using the same architecture with the same parameters.
The DoubleNet framework employed typical stereo matching architectures as its backbone. These architectures were modified to accept different types of inputs via the proposed input-handling module. The modified architectures were trained using a novel training procedure, i.e., alternately selecting monocular and binocular inputs during training iterations. This ensured that the architecture was optimized towards both tasks.
Experimental results indicated that the DoubleNet could reach state-of-the-art performance for the unified task of both single-image depth estimation and binocular stereo matching. In other words, the trained model could perform well in depth prediction from a single image and it could still reach similar accuracy for binocular stereo matching without extra training or adaptation.
Furthermore, DoubleNet is designed to explore the similarity between single-image depth estimation and binocular stereo matching. Experimental results also demonstrate that single-image depth estimation could benefit from stereo matching. This indicates that one of these tasks can be promoted by the other using the proposed training procedure.
Overall, this work paves the way towards unified frameworks for different but similar tasks by designing an architecture that accepts training samples from both tasks. The unified framework was trained by mixing the different training samples. The work also explores the potential to benefit one task from another similar task.

Conflicts of Interest:
The authors declare no conflict of interest.