Learning Single-Image Depth from Videos using
Quality Assessment Networks

Weifeng Chen, Shengyi Qian, Jia Deng

Figure 1: An overview of our data collection method. Given an arbitrary video, we follow the standard steps of structure-from-motion (SfM): extracting feature points and matching them across frames, estimating the camera parameters, and performing triangulation to obtain a reconstruction. A Quality Assessment Network (QANet) examines the operation of the SfM pipeline and assigns a score to the reconstruction. If the score is above a certain threshold, the reconstruction is deemed high quality and we use it as single-view depth training data. Otherwise, the reconstruction is discarded.


Depth estimation from a single image in the wild remains a challenging problem. One main obstacle is the lack of high-quality training data for images in the wild. In this paper, we propose a method to automatically generate such data through Structure-from-Motion (SfM) on Internet videos. The core of this method is a Quality Assessment Network that identifies high-quality reconstructions obtained from SfM. Using this method, we collect single-view depth training data from a large number of YouTube videos and construct a new dataset called YouTube3D. Experiments show that YouTube3D is useful in training depth estimation networks and advances the state of the art in single-view depth estimation in the wild.
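
The filtering step summarized in Figure 1 (score each reconstruction, keep it only if the score is above a threshold) can be illustrated with a minimal Python sketch. This is not the released code: the `Reconstruction` class, its `features` field, the `mean_score` stand-in scorer, and the threshold value are all hypothetical names chosen for illustration. In the actual method the score comes from the trained Quality Assessment Network, whose inputs and architecture are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Reconstruction:
    """Hypothetical container for one SfM reconstruction."""
    video_id: str
    # Placeholder numeric summary of the SfM run (an assumption for this
    # sketch; the real network's inputs are not specified here).
    features: List[float]


def mean_score(rec: Reconstruction) -> float:
    """Stand-in scorer: NOT the QANet, just the mean of the placeholder features."""
    return sum(rec.features) / len(rec.features)


def filter_reconstructions(
    reconstructions: Iterable[Reconstruction],
    score_fn: Callable[[Reconstruction], float],
    threshold: float,
) -> List[Reconstruction]:
    """Keep reconstructions whose score is strictly above the threshold."""
    kept = []
    for rec in reconstructions:
        if score_fn(rec) > threshold:
            kept.append(rec)  # deemed high quality: usable as training data
        # otherwise the reconstruction is discarded
    return kept


if __name__ == "__main__":
    candidates = [
        Reconstruction("video_a", [0.9, 0.8]),
        Reconstruction("video_b", [0.1, 0.2]),
    ]
    kept = filter_reconstructions(candidates, mean_score, threshold=0.5)
    print([rec.video_id for rec in kept])
```

Passing the scorer in as a function keeps the filtering logic independent of how the score is produced, so a trained network could replace the stand-in without changing the loop.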


Learning Single-Image Depth from Videos using Quality Assessment Networks,
Weifeng Chen, Shengyi Qian, Jia Deng
Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Qualitative Results

Figure 2: Qualitative results on the DIW [1] test set, produced by the state-of-the-art network EncDecResNet [2] trained on ImageNet + ReDWeb [2] + DIW [1] + YouTube3D.


Download [Records][Image Data (64 GB)]


Code for training and evaluation. [link]


[1] Chen, Weifeng, Zhao Fu, Dawei Yang, and Jia Deng. "Single-image depth perception in the wild." In Advances in Neural Information Processing Systems, pp. 730-738. 2016.
[2] Xian, Ke, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. "Monocular Relative Depth Perception With Web Stereo Data Supervision." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 311-320. 2018.


Please send any questions or comments to Weifeng Chen at wfchen@umich.edu.