
High-Resolution Volumetric Reconstruction for Clothed Humans

Authors:
Sicong Tang (Simon Fraser University, Canada; ORCID 0000-0001-6943-0074),
Guangyuan Wang (Alibaba Group, China; ORCID 0009-0007-2388-0716),
Qing Ran (Alibaba Group, China; ORCID 0000-0003-2376-1833),
Lingzhi Li (Alibaba Group, China; ORCID 0000-0002-0552-9566),
Li Shen (Alibaba Group, China; ORCID 0000-0002-2283-4976),
Ping Tan (Simon Fraser University, Canada; ORCID 0000-0002-4506-6973)

ACM Transactions on Graphics, Volume 42, Issue 5, Article No. 170, pp. 1–15. https://doi.org/10.1145/3606032

Published: 21 August 2023

Abstract

We present a novel method for reconstructing clothed humans from a sparse set of RGB images (e.g., 1–6). Despite impressive results from recent works employing deep implicit representations, we revisit the volumetric approach and demonstrate that better performance can be achieved with proper system design. The volumetric representation offers significant advantages in leveraging 3D spatial context through 3D convolutions, and the notorious quantization error is largely negligible with a reasonably large yet affordable volume resolution, e.g., 512. To handle memory and computation costs, we propose a sophisticated coarse-to-fine strategy with voxel culling and subspace sparse convolution. Our method starts with a discretized visual hull to compute a coarse shape and then focuses on a narrow band near the coarse shape for refinement. Once the shape is reconstructed, we adopt an image-based rendering approach, which computes the colors of surface points by blending input images with learned weights. Extensive experimental results show that our method reduces the mean point-to-surface (P2S) error of state-of-the-art methods by more than 50%, achieving approximately 2 mm accuracy with a 512 volume resolution. Additionally, images rendered from our textured model achieve a higher peak signal-to-noise ratio (PSNR) than those of state-of-the-art methods.
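
The coarse stage described in the abstract (a discretized visual hull followed by a narrow band around the coarse surface) can be illustrated with a small sketch. This is not the authors' implementation. It assumes three orthographic views along the x, y, and z axes instead of calibrated perspective cameras, uses a disk as a stand-in silhouette, works on a tiny 16-voxel grid, and takes the band to be one voxel wide. The names `disk_mask`, `visual_hull`, and `narrow_band` are hypothetical.

```python
from itertools import product

# Offsets to the six face-adjacent neighbors of a voxel.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def disk_mask(n, radius):
    """Binary n x n silhouette of a centered disk (stand-in for a segmented image)."""
    c = (n - 1) / 2.0
    return [[(u - c) ** 2 + (v - c) ** 2 <= radius ** 2 for v in range(n)]
            for u in range(n)]


def visual_hull(n, mask_x, mask_y, mask_z):
    """Keep a voxel only if it projects inside every silhouette.

    Assumption: orthographic views along the x, y, and z axes, so projecting
    a voxel simply drops one of its three coordinates.
    """
    return {(i, j, k)
            for i, j, k in product(range(n), repeat=3)
            if mask_x[j][k] and mask_y[i][k] and mask_z[i][j]}


def narrow_band(n, hull):
    """Voxels, inside or outside, that touch a voxel of opposite occupancy."""
    band = set()
    for v in product(range(n), repeat=3):
        inside = v in hull
        for d in NEIGHBORS:
            nb = (v[0] + d[0], v[1] + d[1], v[2] + d[2])
            # Neighbors outside the grid are never in `hull`, so they count as empty.
            if (nb in hull) != inside:
                band.add(v)
                break
    return band


n = 16
mask = disk_mask(n, radius=5.0)
hull = visual_hull(n, mask, mask, mask)
band = narrow_band(n, hull)
print(f"hull voxels: {len(hull)}, band voxels: {len(band)}, grid voxels: {n ** 3}")
```

In this sketch the band excludes both the deep interior and the far exterior, which is the property that makes refining only near the coarse surface cheaper than processing the full grid. The hull of a sphere's silhouettes is also larger than the sphere itself, a reminder that a visual hull only bounds the true shape and leaves the detail to the refinement stage.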

Supplemental Material

video_demo.mp4 (mp4, 256.1 MB)
3606032.supp.pdf (pdf, 4.6 MB): supplementary material

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Reconstruction
  2. Computer graphics
    1. Shape modeling
      1. Mesh models
      2. Volumetric models

Published in
ACM Transactions on Graphics, Volume 42, Issue 5 (October 2023), 195 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3607124
Editor: Carol O'Sullivan, Trinity College Dublin, Ireland
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 August 2023
- Online AM: 15 July 2023
- Accepted: 17 June 2023
- Revised: 3 May 2023
- Received: 6 November 2022
Author Tags
Clothed human
3D reconstruction
Qualifiers
- research-article