Abstract
We present a novel method for reconstructing clothed humans from a sparse set of RGB images (e.g., 1–6 views). Although recent works employing deep implicit representations have produced impressive results, we revisit the volumetric approach and demonstrate that better performance can be achieved with proper system design. The volumetric representation offers significant advantages in leveraging 3D spatial context through 3D convolutions, and the notorious quantization error is largely negligible at a reasonably large yet affordable volume resolution, e.g., 512. To handle memory and computation costs, we propose a coarse-to-fine strategy with voxel culling and subspace sparse convolution. Our method starts with a discretized visual hull to compute a coarse shape and then focuses on a narrow band near the coarse shape for refinement. Once the shape is reconstructed, we adopt an image-based rendering approach, which computes the colors of surface points by blending input images with learned weights. Extensive experimental results show that our method reduces the mean point-to-surface (P2S) error of state-of-the-art methods by more than 50%, achieving approximately 2 mm accuracy at a volume resolution of 512. Additionally, images rendered from our textured model achieve a higher peak signal-to-noise ratio (PSNR) than those of state-of-the-art methods.
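To make the coarse stage described above concrete, the following is a minimal NumPy sketch of a discretized visual hull followed by narrow-band selection. It is an illustration under stated assumptions, not the authors' implementation: the function names (`visual_hull`, `dilate`, `narrow_band`), the 3×4 projection-matrix convention, the orthographic toy cameras, the 6-neighbor morphology, and the small resolution of 32 are all hypothetical choices made for the example. The learned refinement and the sparse convolutions mentioned in the abstract are not shown.

```python
import numpy as np


def visual_hull(masks, projections, resolution=32, bound=1.0):
    """Keep a voxel only if its center projects inside every silhouette mask."""
    # Voxel centers of a regular grid covering [-bound, bound]^3.
    axis = (np.arange(resolution) + 0.5) / resolution * 2.0 * bound - bound
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    points = np.stack([x, y, z, np.ones_like(x)], axis=-1).reshape(-1, 4)

    inside = np.ones(points.shape[0], dtype=bool)
    for mask, proj in zip(masks, projections):
        uvw = points @ proj.T                      # (N, 3) homogeneous pixels
        depth = uvw[:, 2]
        in_front = depth > 1e-8
        safe_depth = np.where(in_front, depth, 1.0)
        col = np.round(uvw[:, 0] / safe_depth).astype(int)
        row = np.round(uvw[:, 1] / safe_depth).astype(int)
        height, width = mask.shape
        visible = in_front & (col >= 0) & (col < width) & (row >= 0) & (row < height)
        hit = np.zeros_like(inside)
        hit[visible] = mask[row[visible], col[visible]] > 0
        inside &= hit                              # cull voxels outside any view
    return inside.reshape(resolution, resolution, resolution)


def dilate(occ):
    """Binary dilation with a 6-neighborhood, using padded shifts."""
    p = np.pad(occ, 1, mode="constant", constant_values=False)
    out = p[1:-1, 1:-1, 1:-1].copy()
    out |= p[:-2, 1:-1, 1:-1] | p[2:, 1:-1, 1:-1]
    out |= p[1:-1, :-2, 1:-1] | p[1:-1, 2:, 1:-1]
    out |= p[1:-1, 1:-1, :-2] | p[1:-1, 1:-1, 2:]
    return out


def narrow_band(occ, width=1):
    """Voxels within `width` steps of the coarse surface, on either side."""
    grown, shrunk = occ, occ
    for _ in range(width):
        grown = dilate(grown)
        # Erosion as the complement of dilating the complement; space outside
        # the grid is treated as occupied, so keep the shape off the border.
        shrunk = ~dilate(~shrunk)
    return grown & ~shrunk


# Toy setup (assumption): three orthographic views of a disc silhouette.
size = 64
rows, cols = np.mgrid[0:size, 0:size]
disc = (cols - size / 2) ** 2 + (rows - size / 2) ** 2 <= (size / 4) ** 2
s = size / 2
cam_z = np.array([[s, 0, 0, s], [0, s, 0, s], [0, 0, 0, 1.0]])  # looks along z
cam_x = np.array([[0, s, 0, s], [0, 0, s, s], [0, 0, 0, 1.0]])  # looks along x
cam_y = np.array([[s, 0, 0, s], [0, 0, s, s], [0, 0, 0, 1.0]])  # looks along y

hull = visual_hull([disc, disc, disc], [cam_z, cam_x, cam_y], resolution=32)
band = narrow_band(hull, width=2)
print(f"hull voxels: {hull.sum()}, band voxels: {band.sum()} of {hull.size}")
```

In this sketch, only the voxels flagged in `band` would be passed to a refinement stage. Restricting work to that band is what keeps memory and computation manageable as the resolution grows.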
Index Terms
- High-Resolution Volumetric Reconstruction for Clothed Humans