High-Resolution Volumetric Reconstruction for Clothed Humans

Published: 21 August 2023

Abstract

We present a novel method for reconstructing clothed humans from a sparse set of RGB images, e.g., 1–6. Despite impressive results from recent works employing deep implicit representations, we revisit the volumetric approach and demonstrate that better performance can be achieved with proper system design. The volumetric representation offers significant advantages in leveraging 3D spatial context through 3D convolutions, and the notorious quantization error is largely negligible at a reasonably large yet affordable volume resolution, e.g., 512. To handle memory and computation costs, we propose a sophisticated coarse-to-fine strategy with voxel culling and subspace sparse convolution. Our method starts with a discretized visual hull to compute a coarse shape and then focuses on a narrow band near the coarse shape for refinement. Once the shape is reconstructed, we adopt an image-based rendering approach, which computes the colors of surface points by blending input images with learned weights. Extensive experimental results show that our method reduces the mean point-to-surface (P2S) error of state-of-the-art methods by more than 50%, achieving approximately 2 mm accuracy at a 512 volume resolution. Additionally, images rendered from our textured model achieve a higher peak signal-to-noise ratio (PSNR) than those of state-of-the-art methods.
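The coarse stage described above (a discretized visual hull followed by a narrow band around the coarse shape) can be illustrated with a small sketch. This is not the authors' implementation: it assumes orthographic, axis-aligned views instead of calibrated cameras, stores voxels as a Python set to mimic culling, and uses hypothetical names (`visual_hull`, `narrow_band`, `silhouettes`) chosen only for illustration. The abstract does not specify how the band is defined, so a simple 6-connected distance is assumed here.

```python
from itertools import product

# Hypothetical helper names. Orthographic, axis-aligned views are an assumption
# made for brevity; they stand in for whatever camera model a real system uses.
NEIGHBOR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def visual_hull(silhouettes, res):
    """Keep the voxels whose projection falls inside every silhouette.

    silhouettes maps a viewing axis (0, 1, or 2) to the set of 2D pixels
    covered by the subject when looking along that axis.
    """
    hull = set()
    for voxel in product(range(res), repeat=3):
        inside_all = all(
            # Orthographic projection: drop the coordinate along the viewing axis.
            tuple(c for a, c in enumerate(voxel) if a != axis) in mask
            for axis, mask in silhouettes.items()
        )
        if inside_all:
            hull.add(voxel)
    return hull


def neighbors(voxel):
    x, y, z = voxel
    return [(x + dx, y + dy, z + dz) for dx, dy, dz in NEIGHBOR_OFFSETS]


def narrow_band(occupied, res, width):
    """Voxels within `width` 6-connected steps of the coarse surface."""
    # Surface voxels: occupied voxels with at least one empty neighbor.
    band = {v for v in occupied if any(n not in occupied for n in neighbors(v))}
    frontier = set(band)
    for _ in range(width):
        # Grow the band by one step on both sides of the surface, staying in the grid.
        frontier = {
            n
            for v in frontier
            for n in neighbors(v)
            if all(0 <= c < res for c in n)
        } - band
        band |= frontier
    return band


# Toy input: the same disc-shaped silhouette seen along all three axes.
res = 16
disc = {
    (u, v)
    for u, v in product(range(res), repeat=2)
    if (u - 7.5) ** 2 + (v - 7.5) ** 2 <= 25
}
silhouettes = {0: disc, 1: disc, 2: disc}

hull = visual_hull(silhouettes, res)
band = narrow_band(hull, res, width=1)
print(f"coarse hull: {len(hull)} voxels, refinement band: {len(band)} of {res ** 3}")
```

Only the voxels in `band` would be passed to a refinement stage in this sketch, which is the intuition behind culling: voxels deep inside or far outside the coarse shape cannot contain the surface, so they need not be stored or processed at high resolution.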

Supplemental Material

video_demo.mp4 (MP4, 256.1 MB)

Published in

ACM Transactions on Graphics, Volume 42, Issue 5
October 2023
195 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3607124

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 21 August 2023
• Online AM: 15 July 2023
• Accepted: 17 June 2023
• Revised: 3 May 2023
• Received: 6 November 2022
