
Neural Wavelet-domain Diffusion for 3D Shape Generation, Inversion, and Manipulation

Published: 03 January 2024


Abstract

This paper presents a new approach for 3D shape generation, inversion, and manipulation, through a direct generative modeling on a continuous implicit representation in wavelet domain. Specifically, we propose a compact wavelet representation with a pair of coarse and detail coefficient volumes to implicitly represent 3D shapes via truncated signed distance functions and multi-scale biorthogonal wavelets. Then, we design a pair of neural networks: a diffusion-based generator to produce diverse shapes in the form of the coarse coefficient volumes and a detail predictor to produce compatible detail coefficient volumes for introducing fine structures and details. Further, we may jointly train an encoder network to learn a latent space for inverting shapes, allowing us to enable a rich variety of whole-shape and region-aware shape manipulations. Both quantitative and qualitative experimental results manifest the compelling shape generation, inversion, and manipulation capabilities of our approach over the state-of-the-art methods.


1 INTRODUCTION

Generative modeling of 3D shapes enables rapid creation of 3D contents, enriching extensive applications across graphics, vision, and VR/AR. With the emerging large-scale 3D datasets [Chang et al. 2015], data-driven shape generation has gained increasing attention. However, it is still very challenging to learn to generate 3D shapes that are diverse, realistic, and novel, while promoting controllability on part- or region-aware shape manipulations with high fidelity.

Existing shape generation models are developed mainly for voxels [Girdhar et al. 2016; Zhu et al. 2017; Yang et al. 2018], point clouds [Fan et al. 2017; Jiang et al. 2018; Achlioptas et al. 2018], and meshes [Wang et al. 2018; Groueix et al. 2018; Smith et al. 2019; Tang et al. 2019]. Typically, these learning-based methods cannot handle high resolutions or irregular topology, and are thus unlikely to produce high-fidelity results. In contrast, implicit functions [Mescheder et al. 2019; Park et al. 2019; Chen and Zhang 2019] show improved performance in surface reconstruction. By representing a 3D shape as a level set of a discrete volume or a continuous field, we can flexibly extract a mesh of arbitrary topology at the desired resolution.

Existing generative models such as GANs and normalizing flows have shown great success in generating point clouds and voxels. Yet, they cannot directly generate high-resolution implicit functions, which demand intractable memory for training and inference. To represent a surface in 3D, a large number of point samples is required, even though many nearby samples are redundant. Taking the occupancy field as an example, only regions near the surface have varying data values, yet we need huge efforts to encode samples in constant and smoothly-varying regions. Such representation non-compactness and redundancy demand a huge computational cost and hinder the learning efficiency on implicit surfaces.

To address these challenges, some methods attempt to sample in a pre-trained latent space built on the reconstruction task [Chen and Zhang 2019; Mescheder et al. 2019] or convert the generated implicits into point clouds or voxels for adversarial learning [Kleineberg et al. 2020; Luo et al. 2021]. However, these regularizations can only be indirectly applied to the generated implicit functions, so they are not able to ensure the generation of realistic objects. Hence, the visual quality of the generated shapes often shows a significant gap, as compared with the 3D reconstruction results, and the diversity of their generated shapes is also quite limited.

Further, promoting controllability in whole-shape or part-wise manipulation of implicit shapes is challenging. Some works encode implicit shapes into latent codes and perform shape manipulation by modifying the latent codes [Park et al. 2019; Chen and Zhang 2019; Mescheder et al. 2019]. To achieve part-wise controllability, some methods [Hertz et al. 2022; Hao et al. 2020] learn to decompose implicit shapes into a set of templates and manipulate the corresponding templates of specific parts. Yet, these methods struggle to invert (implicitly encode then reconstruct) shapes faithfully due to the redundancy in the sample-based implicit representation, which further hinders the shape manipulation quality. Recently, Lin et al. [2022] proposed a method to manipulate implicit shapes by introducing extra part annotations, which enhance the quality of shape inversion and manipulation. Yet, preparing part annotations on 3D shapes is tedious and costly. Also, instead of refining the latent code at test time, their method overfits each unseen shape and the process is time-consuming, i.e., around 25 minutes per shape.

This work introduces a new approach for 3D shape generation, inversion, and manipulation, enabling a direct generative modeling on a continuous implicit representation in the compact wavelet frequency domain. Overall, we have six key contributions: (i) a compact wavelet representation (a pair of coarse and detail coefficient volumes) based on the biorthogonal wavelets and truncated signed distance field to implicitly encode 3D shapes, facilitating effective learning for shape generation and inversion; (ii) a generator network formulated based on the diffusion probabilistic model [Sohl-Dickstein et al. 2015] to produce coarse coefficient volumes from random noise samples, promoting the generation of diverse and novel shapes; (iii) a detail predictor network, formulated to produce compatible detail coefficients to enhance the generation of fine details; (iv) an encoder network, which is first trained in a 3D setting with a diffusion-model-based generator to build a latent space for high-fidelity shape inversion and manipulation; (v) a shape-guided refinement scheme to enhance the shape inversion quality; and (vi) region-aware manipulation for implicitly editing object regions without additional part annotations.

In this work, with the generator and detail predictor, we can flexibly generate diverse and realistic shapes that are not necessarily the same as the training shapes. Further with the encoder network, we can embed a shape, not necessarily in the training set, into our compact latent space for shape inversion and manipulation. As Figure 1 shows, our generated shapes exhibit diverse topology, clean surfaces, sharp boundaries, and fine details, without obvious artifacts. Fine details such as curved/thin beams, diverse drawers and complex cabinets are very challenging for the existing 3D generation approaches to synthesize, e.g., see Figure 1 (left). Besides, our method can faithfully invert and reconstruct randomly-generated shapes, as shown in the middle of Figure 1. Further, we support a rich variety of shape manipulations, e.g., composing/re-generating parts from existing or randomly-generated shapes; see the right of Figure 1.

Fig. 1.

Fig. 1. Our new framework is able to generate diverse and realistic 3D shapes that exhibit complex structures and topology, fine details, and clean surfaces, without obvious artifacts (left). Also, we can encode shapes and invert (reconstruct) them from their codes with high fidelity (middle). Further, we may manipulate shapes in a region-aware manner and generate new shapes by composing/re-generating parts from existing or randomly-generated shapes (right).

This work extends Hui et al. [2022], which was recently presented at a conference. In the following, we highlight the new contributions introduced in this extended version. First, we expand the diffusion-model-based generator in Hui et al. [2022] with an additional encoder network and the shape-guided refinement scheme to build the latent space, enabling us to faithfully invert 3D shapes. Besides, we design a new region-aware manipulation procedure to support applications beyond shape interpolation, e.g., part replacement, part-wise interpolation, and part-wise re-generation. Further, with the compact latent space, we explore the potential of our method for reconstructing implicit shapes from point clouds or images. Also, we perform further experiments to evaluate our framework on shape inversion and shape manipulation, and include more comparisons, including GET3D [Gao et al. 2022], a very recent work on 3D shape generation. Both quantitative and qualitative experimental results manifest the superiority of our method on 3D generation, inversion, and manipulation over the state of the art.


2 RELATED WORK

3D reconstruction via implicit function. Recently, many methods leverage the flexibility of implicit surfaces for 3D reconstruction from voxels [Mescheder et al. 2019; Chen and Zhang 2019], complete/partial point clouds [Park et al. 2019; Liu et al. 2021; Yan et al. 2022], and RGB images [Xu et al. 2019, 2020; Li and Zhang 2021; Tang et al. 2021]. On the other hand, besides ground-truth field values, various supervisions have been explored to train the generation of implicit surfaces, e.g., multi-view images [Liu et al. 2019; Niemeyer et al. 2020] and unoriented point clouds [Atzmon and Lipman 2020; Gropp et al. 2020; Zhao et al. 2021]. Yet, the task of 3D reconstruction focuses mainly on synthesizing a high-quality 3D shape that best matches the input. So, it is fundamentally very different from the task of 3D shape generation, which aims to learn the shape distribution of a given set of shapes for generating diverse, high-quality, and possibly novel shapes.

3D shape generation via implicit function. Unlike the 3D reconstruction task, 3D shape generation has no fixed ground truth to supervise the generation of each shape sample. Exploring efficient guidance for implicit surface generation is still an open problem. Some works attempt to use the reconstruction task to first learn a latent embedding [Mescheder et al. 2019; Chen and Zhang 2019; Hao et al. 2020; Ibing et al. 2021] and then generate new shapes by decoding codes sampled from the learned latent space. Recently, Hertz et al. [2022] learn a latent space with a Gaussian-mixture-based autodecoder for shape generation and manipulation. Though these approaches ensure a simple training process, the generated shapes have limited diversity, restricted by the pre-trained shape space. Some other works attempt to convert implicit surfaces to other representations, e.g., voxels [Kleineberg et al. 2020; Zheng et al. 2022], point clouds [Kleineberg et al. 2020], and meshes [Luo et al. 2021], for applying adversarial training. Yet, the conversion inevitably leads to information loss in the generated implicit surfaces, thus reducing the training efficiency and generation quality.

In this work, we propose a compact wavelet representation for modeling the implicit surface and learn to synthesize it with a diffusion model. By this means, we can effectively learn to produce the implicit function without a representation conversion in the generative task. The results in Section 5.1 also show that our new approach is capable of producing diversified shapes of high visual quality, exceeding the state-of-the-art methods.

3D shape generation via other representations. Smith and Meger [2017] and Wu et al. [2016] explore voxels, a natural grid-based extension of 2D images. Yet, these methods mainly learn coarse structures and fail to produce fine details due to memory restrictions. Some other methods explore point clouds via GANs [Gal et al. 2020; Hui et al. 2020; Li et al. 2021], flow-based models [Kim et al. 2020; Cai et al. 2020], and diffusion models [Zhou et al. 2021]. Due to the discrete nature of point clouds, 3D meshes reconstructed from them often contain artifacts. This work focuses on implicit surface generation, aiming at generating high-quality and diverse meshes with fine details and overcoming the limitations of the existing representations.

3D shape inversion & manipulation via implicit function. The inversion task was first proposed for 2D images, aiming at embedding a given image, not necessarily in the training set, into a trained model’s latent space [Xia et al. 2022; Bau et al. 2019; Zhu et al. 2016; Abdal et al. 2021; Tewari et al. 2020; Tov et al. 2021], so that we can perform semantic manipulations on the image and reconstruct a modified version of it from the manipulated latent code. 3D shape inversion is a relatively new topic. So far, research works have mainly explored 3D representations such as point clouds [Zhang et al. 2021], voxels [Wu et al. 2016], and part-annotated bounding boxes or point clouds [Mo et al. 2019; Li et al. 2017b].

The existing inversion approaches for implicit representations can be divided into two categories: (i) leveraging an auto-encoder framework to map the given shape to the latent space [Mescheder et al. 2019; Chen and Zhang 2019; Genova et al. 2020] and (ii) optimizing the latent code at test time without using an additional encoder [Park et al. 2019; Hao et al. 2020; Hertz et al. 2022; Lin et al. 2022]. We combine their strengths by initializing the latent code through an encoder network and further refining the code in a shape-guided manner; see Section 5.2 for the experimental results, which demonstrate the compelling performance of our method.

More importantly, 3D shape inversion enables user manipulations on existing shapes in the compact latent space. Some recent works [Mo et al. 2019; Gal et al. 2020; Wei et al. 2020; Li et al. 2021; Hui* et al. 2022] explore shape manipulation in the latent space via explicit 3D representations. Yet, manipulation is still challenging for neural implicit representations. First, some existing works [Mescheder et al. 2019; Chen and Zhang 2019; Park et al. 2019] lack part-level awareness in the manipulation. Second, the inversion quality can be severely limited by the shape representative capability of their models [Hertz et al. 2022; Hao et al. 2020]. A very recent work [Lin et al. 2022] enables part-aware manipulation with good quality but it requires extra part annotations, which are costly to prepare. In this work, we propose a rich variety of region-aware manipulations, besides whole-shape interpolation, without requiring part-level annotations; see Section 5.3 for the high-fidelity manipulated shapes that can be generated using our new approach.

Multi-scale neural implicit representation. This work also relates to multi-scale representations, so we discuss some 3D deep learning works in this area. Chen et al. [2021], Chibane et al. [2020], Liu et al. [2020], Martel et al. [2021], and Takikawa et al. [2021] predict multi-scale latent codes in an adaptive octree to improve the reconstruction quality and inference efficiency. Fathony et al. [2020] propose a band-limited network to obtain a multi-scale representation by restricting the frequency magnitude of the basis functions. Recently, Saragadam et al. [2022] adopt the Laplacian pyramid to extract multi-scale coefficients for multiple neural networks; unlike our work, their method overfits each input object with an individual representation for efficient storage and rendering. In contrast to our work on shape generation, the above methods focus on improving 3D reconstruction performance by separately handling features at different levels. In our work, we adopt a multi-scale implicit representation based on wavelets (motivated by Velho et al. [1994]) to build a compact representation for high-quality shape generation.

Diffusion-based models. Ho et al. [2020], Nichol and Dhariwal [2021], Sohl-Dickstein et al. [2015], and Song et al. [2020] recently show top performance in image generation, surpassing GAN-based models [Dhariwal and Nichol 2021]. Recently, Luo and Hu [2021] and Zhou et al. [2021] adopt diffusion models for point cloud generation. Yet, they fail to generate smooth surfaces and complex structures, as point clouds contain only discrete samples. Distinctively, we adopt a compact representation based on wavelets to model the continuous signed distance field in our diffusion model, promoting 3D shape representation with diverse structures and fine details.


3 OVERVIEW

As shown in Figure 2, our approach has the following procedures:

Fig. 2.

Fig. 2. Overview of our approach. (a) Data preparation builds a compact wavelet representation (a pair of coarse and detail coefficient volumes) for each input shape using a truncated signed distance field and a multi-scale wavelet decomposition. (b) Shape learning trains the generator network to produce coarse coefficient volumes from random noise samples and trains the detail predictor network to produce detail coefficient volumes from coarse coefficient volumes. Also, we train an encoder jointly with the generator to build a compact latent space for shape inversion and manipulation. (c) Shape generation employs the trained generator to produce a coarse coefficient volume and then the trained detail predictor to further predict a compatible detail coefficient volume, followed by an inverse wavelet transform and marching cubes, to produce the output shape. (d) Shape inversion & manipulation employ the trained encoder to map a given shape to our compact latent space and refine the latent code for faithful reconstruction or shape manipulation, given the manipulation inputs.

(i)

Data preparation is a one-time process for preparing a compact wavelet representation from each input shape; see Figure 2(a). For each shape, we sample a signed distance field (SDF) and truncate its distance values to avoid redundant information. Then, we transform the truncated SDF to the wavelet domain to produce a series of multi-scale coefficient volumes. Importantly, we take a pair of coarse and detail coefficient volumes at the same scale as our compact wavelet representation for implicitly encoding the input shape.

(ii)

Shape learning aims to train a pair of neural networks to learn the 3D shape distribution from the coarse and detail coefficient volumes; see Figure 2(b). First, we adopt the denoising diffusion probabilistic model [Sohl-Dickstein et al. 2015] to formulate and train the generator network to learn to iteratively refine a random noise sample for generating diverse 3D shapes in the form of the coarse coefficient volume. Second, we design and train the detail predictor network to learn to produce the detail coefficient volume from the coarse coefficient volume for introducing further details in our generated shapes. Besides, we jointly train an additional encoder with the generator for mapping the coarse coefficient volume to a compact latent code. By doing so, the latent code can serve as a controllable condition in the generation, enabling applications such as shape inversion and manipulation; see (iv) below.

(iii)

Shape generation employs the two trained networks to generate 3D shapes; see Figure 2(c). Starting from a random Gaussian noise sample, we first use the trained generator to produce the coarse coefficient volume and then the detail predictor to produce an associated detail coefficient volume. After that, we perform an inverse wavelet transform, followed by the marching cubes algorithm [Lorensen and Cline 1987], to generate the output 3D shape.

(iv)

Shape inversion & manipulation aim to embed a given shape in the generator’s latent space, such that we may manipulate the shape and reconstruct it; see Figure 2(d). To invert a shape, we follow procedures (i) & (ii) to sample a truncated signed distance field (TSDF) from the shape, transform it to the wavelet domain, and derive its latent code using the encoder jointly trained in procedure (ii). Then, we enhance the correspondence between the shape and the latent code by refining the latent code via backpropagation with the frozen generator and encoder. Further, we perform manipulation and reconstruction by feeding the derived (or manipulated) latent code as a condition to the generator to reconstruct the final shape. Also, by utilizing the trained latent space, we can perform various region-aware manipulations on the given shape.


4 METHOD

4.1 Compact Wavelet Representation

Preparing a compact wavelet representation of a given 3D shape (see Figure 2(a)) involves the following two steps: (i) implicitly represent the shape using a signed distance field (SDF); and (ii) decompose the implicit representation via wavelet transform into coefficient volumes, each encoding a specific scale of the shape.

In the first step, we scale each shape to fit \([-0.9,+0.9]^3\) and sample an SDF of resolution \(256^3\) to implicitly represent the shape. Importantly, we truncate the distance values in the SDF to \([-0.1,+0.1]\), so regions not close to the object surface are clipped to a constant. We denote the truncated signed distance field (TSDF) of the i-th shape in the training set as \(S_i\). Using \(S_i\), we can significantly reduce the shape representation redundancy, such that the shape learning process can focus better on the shape structures and fine details.

The second step is a multi-scale wavelet decomposition [Mallat 1989; Daubechies 1990; Velho et al. 1994] on the TSDF. Here, we decompose \(S_i\) into a high-frequency detail coefficient volume and a low-frequency coarse coefficient volume, which is roughly a compressed version of \(S_i\). We repeat this process J times on the coarse coefficient volume of each scale, decomposing \(S_i\) into a series of multi-level coefficient volumes. We denote the coarse and detail coefficient volumes at the j-th step (scale) as \(C^j_i\) and \(D^j_i\), respectively, where \(j \in \lbrace 1,\ldots ,J\rbrace\). The representation is lossless, meaning that the extracted coefficient volumes together can faithfully reconstruct the original TSDF via a series of inverse wavelet transforms.

There are three important considerations in the data preparation. First, the multi-scale decomposition can effectively separate rich structures, fine details, and noise in the TSDF. Empirically, we evaluate the reconstruction error on the TSDF by masking out all higher-scale detail coefficients and reconstructing \(S_i\) only from the coefficients at scale \(J=3\), i.e., \(C^3_i\) and \(D^3_i\). We found that the reconstructed TSDF values deviate only slightly from the originals (by 2.8% in magnitude), even without 97% of the coefficients, for the Chair category in ShapeNet [Chang et al. 2015]. Comparing Figure 3(a) with (b), we can see that reconstructing only from the coarse scale \(J=3\) already well retains the chair’s structure. Motivated by this observation, we propose to construct the compact wavelet representation at a coarse scale (\(J=3\)) and drop the other detail coefficient volumes, i.e., \(D^1_i\) and \(D^2_i\), for efficient shape learning. See supplementary material Section I for more details on the wavelet decomposition.

Fig. 3.

Fig. 3. Reconstructions with different wavelet filters. (a) An input shape from ShapeNet. (b,c) Reconstructions from the \(J=3\) coefficient volumes with biorthogonal wavelets; the two numbers denote the vanishing moments of the synthesis and analysis wavelets, respectively. (d) Reconstruction with the Haar wavelet.

Second, we need a suitable wavelet filter. While the Haar wavelet is a popular choice due to its simplicity, using it to encode smooth and continuous signals such as the SDF may introduce serious voxelization artifacts; see, e.g., Figure 3(d). In this work, we propose to adopt biorthogonal wavelets [Cohen 1992], since they enable a smoother decomposition of the TSDF. Specifically, we tried different settings of the biorthogonal wavelets and chose to use high vanishing moments: six for the synthesis filter and eight for the analysis filter; see Figures 3(b) vs. (c). Also, instead of storing the detail coefficient volumes with seven channels, as in a traditional wavelet decomposition, we follow [Velho et al. 1994] to efficiently compute the detail volume as the difference between the inverse transformed \(C^{j}_i\) and \(C^{j-1}_i\), in a Laplacian pyramid style. Hence, the detail coefficient volume has a higher resolution than the coarse one, but both have much lower resolution than the original TSDF volume (\(256^3\)).
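To make this data-preparation step concrete, below is a minimal sketch of how one might compute such a coefficient pair with PyWavelets; the function name is ours, the library choice differs from the PyTorch-based decomposition mentioned in Section 4.6, and the returned detail coefficients follow the standard wavelet band layout rather than the Laplacian-pyramid-style difference described above.

import numpy as np
import pywt  # PyWavelets

def compact_wavelet_representation(sdf, trunc=0.1, level=3, wavelet="bior6.8"):
    """Sketch: TSDF truncation + 3-level biorthogonal wavelet decomposition.

    sdf: float array of shape (256, 256, 256) holding signed distances.
    Returns the coarse coefficient volume C^3 and the detail bands at scale 3.
    """
    # Clip distances so regions far from the surface become constant (+/- 0.1).
    tsdf = np.clip(sdf, -trunc, trunc)

    # Multi-scale 3D wavelet decomposition; 'bior6.8' has vanishing moments
    # 6 (synthesis) and 8 (analysis), matching the setting discussed above.
    coeffs = pywt.wavedecn(tsdf, wavelet, level=level)

    coarse = coeffs[0]    # low-frequency volume C^3 (roughly 32^3 plus filter padding)
    detail = coeffs[1]    # dict of high-frequency bands at the coarsest scale
    return coarse, detail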

In this work, we adopt ShapeNet [Chang et al. 2015] to prepare the wavelet representation. Note that the shapes in this dataset are canonically aligned, thus potentially simplifying the learning process. Also, it is important to truncate the SDF before constructing the wavelet representation for shape learning. By truncating the SDF, regions not close to the shape surface are cast to a constant, which makes the wavelet decomposition and shape learning efficient. Otherwise, we found that the shape learning process collapses and the training loss cannot be reduced.

4.2 Shape Learning

Next, to learn the 3D shape distribution in a given shape set, we gather coefficient volumes \(\lbrace C_i^J , D_i^J\rbrace\) of the shapes in the set for training (i) the generator network to learn to iteratively remove noise from a random Gaussian noise sample to generate \(C_i^J\); and (ii) the detail predictor network to learn to predict \(D_i^J\) from \(C_i^J\) to enhance the details in the generated shapes. Further, to enable shape inversion, we may additionally train (iii) the encoder network (jointly optimized with the generator network) to learn a latent space for mapping the coarse coefficient volume \(C_i^J\) to a latent code \(z_{i}\).

Network structure. To start, we formulate a simple but efficient neural network structure for both the generator and detail predictor networks. The two networks have the same structure, as both take a 3D volume as input and output a 3D volume of the same resolution. Specifically, we adopt a modified 3D version of the U-Net architecture [Nichol and Dhariwal 2021]. First, we apply three 3D residual blocks to progressively compose and downsample the input into a set of multi-scale features and a bottleneck feature volume. Then, we apply a single self-attention layer to aggregate features in the bottleneck volume, so that we can efficiently incorporate non-local information into the features. Further, we upsample and concatenate features at the same scale and progressively perform an inverse convolution with three residual blocks to produce the output. For all convolution layers in the network structure, we use a filter size of three with a stride of one.
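For illustration, a time-conditioned 3D residual block of the kind used in such a U-Net might look as follows in PyTorch; the group normalization, the activation choice, and the additive way the embedding enters the block are our assumptions, not the exact architecture.

import torch.nn as nn

class ResBlock3D(nn.Module):
    """A 3D residual block whose features are modulated by a time embedding."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)
        self.emb_proj = nn.Linear(emb_dim, channels)  # map the time embedding to a per-channel bias
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.norm1(self.conv1(x)))
        # Inject the time embedding additively (one common conditioning choice).
        h = h + self.emb_proj(t_emb)[:, :, None, None, None]
        h = self.norm2(self.conv2(h))
        return self.act(h + x)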

For the encoder network, we design a five-layer 3D convolutional neural network with kernel size \(k=4\) and stride \(s=1\), each followed by an instance normalization. Also, we adopt a single linear transform to produce the output latent code z.
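A minimal PyTorch sketch of such an encoder is given below; the channel widths, the latent dimension, and the use of global average pooling before the linear layer are our own choices for illustration.

import torch.nn as nn

class ShapeEncoder(nn.Module):
    """Five 3D conv layers (kernel 4, stride 1) with instance norm, then a linear map to z."""
    def __init__(self, z_dim=256, width=32):
        super().__init__()
        chans = [1, width, width * 2, width * 4, width * 4, width * 4]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(cin, cout, kernel_size=4, stride=1),
                       nn.InstanceNorm3d(cout),
                       nn.LeakyReLU(0.2)]
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Linear(chans[-1], z_dim)

    def forward(self, c0):                    # c0: [B, 1, D, H, W] coarse coefficient volume
        feat = self.conv(c0)
        feat = feat.mean(dim=(2, 3, 4))       # global average pooling to a feature vector
        return self.fc(feat)                  # latent code z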

In the following, we will first introduce the modeling of the shape generation process, followed by the adaptation for the shape inversion process. Lastly, we will introduce the detail predictor network for enhancing the details of the generated shapes.

Modeling the shape generation process. We formulate the 3D shape generation process based on the denoising diffusion probabilistic model [Sohl-Dickstein et al. 2015]. For simplicity, we drop the subscript and superscript in \(C_i^J\), and denote \(\lbrace C_{0}, \ldots , C_{T} \rbrace\) as the shape generation sequence, where \(C_0\) is the target, i.e., \(C_i^J\); \(C_T\) is a random noise volume from the Gaussian prior; and T is the total number of time steps. As shown at the top of Figure 2(b), we have (i) a forward process (denoted as \(q(C_{0:T})\)) that progressively adds noise based on a Gaussian distribution to corrupt \(C_0\) into a random noise volume; and (ii) a backward process (denoted as \(p_{\theta }(C_{0:T})\)) that employs the generator network (with network parameters \(\theta\)) to iteratively remove noise from \(C_T\) to generate the target. Note that all elements of \(\lbrace C_{0}, \ldots , C_{T} \rbrace\) are represented as 3D volumes and each voxel value is a wavelet coefficient at its spatial location.

Both the forward and backward processes are modeled as Markov processes. The generator network is optimized to maximize the generation probability of the target, i.e., \(p_{\theta }(C_0)\). Also, as suggested in Ho et al. [2020], this training procedure can be further simplified by using the generator network to predict the noise volume \(\epsilon _{\theta }\). So, we adopt a mean-squared loss to train our framework:
\(\begin{equation} L_2 = E_{t,C_0,\epsilon }\left[\, \lVert \epsilon - \epsilon _{\theta }(C_t, t) \rVert ^2 \,\right], \quad \epsilon \sim \mathcal {N}(0,\mathbf {I}), \tag{1} \end{equation}\)
where t is the time step; \(\epsilon\) is a noise volume; and \(\mathcal {N}(0,\mathbf {I})\) denotes a unit Gaussian distribution. In particular, we first sample a noise volume \(\epsilon\) from the unit Gaussian distribution \(\mathcal {N}(0,\mathbf {I})\) and a time step \(t \in [1,\ldots ,T]\) to corrupt \(C_0\) into \(C_t\). Then, our generator network learns to predict the noise \(\epsilon\) from the corrupted coefficient volume \(C_t\). Further, as the network takes time step t as input, we convert the value t into an embedding via two MLP layers. Using this embedding, we can condition all the convolution modules in the prediction and make the generator more aware of the amount of noise contaminating \(C_t\). For more details on the derivation of the training objectives, please refer to supplementary material Section J.
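The corresponding training step, written against Equation (1), may look like the following PyTorch sketch; the linear noise schedule mirrors the settings reported in Section 4.6, while the generator(ct, t) call signature and the batch layout are our assumptions.

import torch
import torch.nn.functional as F

def diffusion_training_step(generator, c0, T=1000):
    """One training iteration of Eq. (1) on a batch of coarse coefficient
    volumes c0 with shape [B, 1, D, H, W]; generator(ct, t) predicts the noise."""
    device = c0.device
    betas = torch.linspace(1e-4, 0.02, T, device=device)    # linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (c0.shape[0],), device=device)  # random time step per sample
    eps = torch.randn_like(c0)                               # unit Gaussian noise volume

    a = alpha_bar[t].view(-1, 1, 1, 1, 1)
    ct = a.sqrt() * c0 + (1.0 - a).sqrt() * eps              # corrupt C_0 into C_t

    eps_pred = generator(ct, t)                              # predict the added noise
    return F.mse_loss(eps_pred, eps)                         # L2 objective of Eq. (1)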

Modeling the shape inversion process. The shape inversion process aims to embed a given 3D shape as a compact code z, such that we can faithfully reconstruct the coarse wavelet volume \(C_0\) of the shape from z and further reconstruct the shape. Motivated by Preechakul et al. [2022], we introduce an optional encoder \(Enc_{\phi }\) to derive the latent code z of the coarse wavelet volume \(C_0\), i.e., \(z = Enc_{\phi }(C_0)\). Then, we inject z into the generator during the backward process as an additional shape condition; see the bottom part of Figure 2(b). During training, we jointly optimize the encoder and generator networks to maximize the conditional generation probability, i.e., \(p_\theta (C_0|z)\). This can be achieved by modifying the objective in Equation (1) into
\(\begin{equation} L_2 = E_{t,C_0,\epsilon }\left[\, \lVert \epsilon - \epsilon _{\theta }(C_t, z, t) \rVert ^2 \,\right], \quad \epsilon \sim \mathcal {N}(0,\mathbf {I}). \tag{2} \end{equation}\)
Note that since we need the generator to be aware of the current shape condition, we follow Preechakul et al. [2022] and use group normalization layers that jointly take the time embedding and the latent code as inputs to the generator.
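A sketch of the conditional variant in Equation (2) follows; again, the generator(ct, z, t) signature is an assumption, and alpha_bar denotes the precomputed cumulative product of \(1-\beta_t\).

import torch
import torch.nn.functional as F

def conditional_training_step(generator, encoder, c0, alpha_bar):
    """One training iteration of Eq. (2), jointly optimizing encoder and generator."""
    t = torch.randint(0, alpha_bar.shape[0], (c0.shape[0],), device=c0.device)
    eps = torch.randn_like(c0)
    a = alpha_bar[t].view(-1, 1, 1, 1, 1)
    ct = a.sqrt() * c0 + (1.0 - a).sqrt() * eps   # corrupt C_0 into C_t

    z = encoder(c0)                     # latent code of the clean coarse volume
    eps_pred = generator(ct, z, t)      # generator additionally conditioned on z
    return F.mse_loss(eps_pred, eps)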

Detail predictor network. Next, we train the detail predictor network to produce the detail coefficient volume \(D_0\) from coarse coefficient volume \(C_0\) (see the top part of Figure 2(b)), so that we can further enhance the details in our generated (or inverted) shapes.

To train the detail predictor network, we leverage the paired coefficient volumes \(\lbrace C_i^J, D_i^J \rbrace\) from the data preparation. Importantly, each detail coefficient volume \(D_0\) should be highly correlated with its associated coarse coefficient volume \(C_0\). Hence, we pose detail prediction as a conditional regression on the detail coefficient volume, aiming at learning a neural network function \(f: C_0 \rightarrow D_0\), which we optimize via a mean squared error loss. Overall, the detail predictor has the same network structure as the generator, but we include more convolution layers to accommodate the cubic growth in the number of nonzero values in the detail coefficient volume.
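Since detail prediction is a plain conditional regression, its training objective reduces to a few lines; the call signature of detail_predictor is assumed.

import torch.nn.functional as F

def detail_predictor_loss(detail_predictor, c0, d0):
    """Conditional regression f: C_0 -> D_0, supervised with a mean squared error."""
    return F.mse_loss(detail_predictor(c0), d0)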

4.3 Shape Generation

Now, we are ready to generate 3D shapes. Figure 2(c) illustrates the shape generation procedure. First, we randomize a 3D noise volume \(C_T\) from the standard Gaussian distribution. Then, we apply the trained generator for T time steps to produce \(C_0\) from \(C_T\). This process is iterative and sequential: the operations at different time steps cannot be parallelized, leading to a very long computing time. To speed up the inference, we adopt the approach of Song et al. [2020] and sub-sample a set of time steps from \([1,\ldots , T]\) during the inference; in practice, we evenly sample \(1/10\) of the total time steps in all our experiments.

After we obtain the coarse coefficient volume \(C_0\), we can use the detail predictor network to predict the detail coefficient volume \(D_0\) from \(C_0\), followed by a series of inverse wavelet transforms from \(\lbrace C_0 , D_0 \rbrace\) at scale \(J=3\) to reconstruct the original TSDF. By then, we can extract an explicit 3D mesh from the reconstructed TSDF using the marching cubes algorithm [Lorensen and Cline 1987].
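Putting the pieces together, a sketch of the generation pipeline might look as follows; the reverse-diffusion update is a deterministic DDIM-style step over the subsampled time steps, the network call signatures are assumptions, and the zero-detail inverse transforms follow the Laplacian-pyramid convention of Section 4.1 under the assumption that the predicted detail volume matches the once-upsampled coarse volume.

import numpy as np
import torch
import pywt
from skimage import measure

def upsample_coarse(c, wavelet="bior6.8"):
    """One inverse wavelet step with zeroed detail bands (Laplacian-pyramid style)."""
    zero = {k: np.zeros_like(c) for k in
            ("aad", "ada", "add", "daa", "dad", "dda", "ddd")}
    return pywt.waverecn([c, zero], wavelet)

@torch.no_grad()
def generate_shape(generator, detail_predictor, vol_shape, T=1000, stride=10,
                   wavelet="bior6.8", device="cuda"):
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    c = torch.randn(1, 1, *vol_shape, device=device)            # C_T: random noise volume
    for t in reversed(range(0, T, stride)):                     # evenly subsampled time steps
        eps = generator(c, torch.tensor([t], device=device))    # predicted noise
        c0_pred = (c - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        if t == 0:
            c = c0_pred
        else:                                                    # deterministic DDIM-style jump
            tp = t - stride
            c = alpha_bar[tp].sqrt() * c0_pred + (1 - alpha_bar[tp]).sqrt() * eps

    d = detail_predictor(c)                                      # compatible detail volume
    coarse = c[0, 0].cpu().numpy()
    detail = d[0, 0].cpu().numpy()

    # Laplacian-pyramid-style reconstruction back to the TSDF resolution.
    c2 = upsample_coarse(coarse, wavelet) + detail
    tsdf = upsample_coarse(upsample_coarse(c2, wavelet), wavelet)

    verts, faces, _, _ = measure.marching_cubes(tsdf, level=0.0)  # extract the mesh
    return verts, faces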

4.4 Shape Inversion

Thanks to the encoder network trained during shape learning, our approach can leverage it for shape inversion. Specifically, our goal is to invert a given unseen shape into a latent code and then faithfully reconstruct the shape from that code.

Figure 2(d) illustrates the overall shape inversion procedure. First, we produce coarse coefficient volume \(C_0\) from the input shape, following the procedure in Section 4.1. Then, we feed coefficient volume \(C_0\) into the encoder network to obtain the latent code \(z=Enc_{\phi }(C_0)\). After that, we feed latent code z together with a sampled noise volume \(\epsilon\) into the generator network to directly produce a coarse coefficient volume for reproducing the original shape, following the procedure in Section 4.3.

However, directly generating the shape from latent code z would lead to a loss of topological structures and fine geometric details originally present in the shape; compare Figures 4(a) and (b). To enhance the quality of the inverted shape, we further propose a shape-guided refinement scheme to search for a latent code \(z^{\prime }\) around z in the latent space that better fits the input shape \(C_0\). In detail, we initialize latent code \(z^{\prime }\) as z and adapt it by gradient descent using the inversion objective in Equation (2) (see Section 4.2) for 400 iterations on the input. Using the initial latent code z, we can already obtain a plausible shape similar to the input. The shape-guided refinement then recovers fine details and local structures missed by the initial code z. Also, note that during the refinement, both the encoder and generator networks are fixed. As Figure 4(c) shows, our refined latent code encourages more faithful shape inversion with precise topological structures and fine geometric details, more similar to the original inputs. In Section 5.2, we will present a quantitative evaluation of our results, showing that our approach can produce inverted shapes of much higher fidelity, compared with the state-of-the-art methods.
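A sketch of this shape-guided refinement loop is given below, with both networks frozen and only the latent code updated; the optimizer settings mirror those reported in Section 4.6, while the generator(ct, z, t) signature is an assumption.

import torch
import torch.nn.functional as F

def refine_latent_code(generator, encoder, c0, alpha_bar, steps=400, lr=5e-2):
    """Shape-guided refinement: start from z = Enc(C0) and minimize the
    inversion objective of Eq. (2) w.r.t. z, keeping both networks frozen."""
    for p in generator.parameters():
        p.requires_grad_(False)
    for p in encoder.parameters():
        p.requires_grad_(False)

    z = encoder(c0).detach().clone().requires_grad_(True)   # initialize z' = z
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        t = torch.randint(0, alpha_bar.shape[0], (c0.shape[0],), device=c0.device)
        eps = torch.randn_like(c0)
        a = alpha_bar[t].view(-1, 1, 1, 1, 1)
        ct = a.sqrt() * c0 + (1.0 - a).sqrt() * eps
        loss = F.mse_loss(generator(ct, z, t), eps)          # Eq. (2), gradient w.r.t. z only
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()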

Fig. 4.

Fig. 4. Visual comparison of shape inversion with/without shape-guided refinement. Directly using our latent codes without refinement can already produce plausible shapes with overall appearance very similar to the inputs. Further refinement can enhance the reconstruction of the local structures, e.g., see the chair legs on top and the chair armrests on bottom.

4.5 Shape Manipulation

Shape inversion enables us to faithfully encode shapes into the learned latent space, in which our latent codes faithfully represent their associated 3D shapes. Further, we design the shape generation process with the latent code as a condition. Hence, by manipulating the latent code and regenerating the shape from the manipulated code, we can produce new shapes from existing ones, e.g., by simply interpolating the latent codes of different shapes; see our high-quality shape interpolation results in Figure 5(a).

Fig. 5.

Fig. 5. Shape manipulations supported by our method. (a) Shape interpolation. We can smoothly interpolate inverted shapes to others using the learned latent space. (b) Part replacement. We can replace a part in a shape (in blue) with a part from a donor shape (in blue). (c) Part-wise shape interpolation. We can continuously interpolate only the selected part region (in green) in the source shape. (d) Part-wise shape re-generation. We can re-generate a selected part (in green) in a shape while keeping the other parts untouched (in blue). The re-generated parts are diverse, plausible, and consistent with the untouched parts.

Further than that, our method enables a rich variety of region-aware manipulations, in which we can manipulate a particular region of a shape, while leaving other regions untouched:

(i)

Part Replacement. First, we can replace a selected part in the input shape with a part from the donor, where "donor" means an existing shape that provides (donates) parts to be incorporated into the manipulation of the input shape; see Figure 5(b) for an example. The new shape is still plausible and the new part is consistent with the other parts.

(ii)

Part-wise Interpolation. Instead of interpolating the whole shape, our method allows us to select a part in the input and interpolate the part region towards another shape; see Figure 5(c). We can observe that changes in the selected region are smooth and semantically meaningful during the interpolation with good coherency across different parts.

(iii)

Part-wise Re-generation. Further, we can randomly re-generate a selected part in the input; see the upper part of the chair in Figure 5(d). Our re-generated parts are diverse, plausible, and consistent with the untouched parts.

To support these region-aware shape manipulations, we utilize the property that our generated coarse coefficient volume \(C_0\) can maintain a good spatial correspondence with the original TSDF volume. Hence, one can select a region in a shape and locally manipulate the associated portion in the coarse coefficient volume. However, naively changing the coefficient values in the selected region will introduce noise around the connecting boundary, since the new coefficient values may not be consistent with the original coefficients in the other regions of the shape; see, e.g., Figure 7(a).

Fig. 6.

Fig. 6. Overview of our region-aware manipulation procedure, taking T steps to produce the manipulated coarse coefficient volume \(C^{A}_{0}\) from the sampled 3D Gaussian noise volume \(C^{A}_{T}\) . First, we run two inversion processes (left side) for \(\Delta T\) steps in parallel, guided by the two refined latent codes \(z_A\) and \(z_B\) from input shapes A and B to produce two partially-denoised coefficient volumes \(C^A_{T-\Delta T}\) and \(C^B_{T-\Delta T}\) , respectively. We then spatially combine the coefficient values of two coefficient volumes to obtain the mixed coefficient volume \(C^{\text{mix}}_{T-\Delta T}\) and further harmonize values in the boundary regions to produce \(C^{\text{h}}_{T-\Delta T}\) . With the harmonized coefficient volume as guidance, we repeat this combine-and-harmonize process every \(\Delta T\) steps until we produce the manipulated coefficient volume \(C^{A}_{0}\) . Also, we can replace the bottom inversion process with an unconditional shape generation process for achieving part-wise re-generation.

Fig. 7.

Fig. 7. Visual comparisons on part replacement by: (a) a direct replacement of coefficient values; (b) our region-aware manipulation without the harmonizing process; and (c) our full region-aware manipulation procedure with the harmonizing process. Directly replacing the coefficient values leads to noise near the boundaries between the new part from the donor shape and the remaining parts in the input shape; see (a). Without the harmonizing process, inconsistencies in the coefficient volume could introduce artifacts on the manipulated shape; see (b). With our full approach, we can smooth out the connected regions for more consistent part replacement; see (c).

To address this issue, we propose the region-aware manipulation procedure shown in Figure 6 by extending the shape inversion pipeline. Overall, the diffusion process takes T steps to produce the manipulated coarse coefficient volume \(C^{A}_{0}\) from the random noise volume \(C^{A}_T\). As shown in Figure 6 (top left), given input shape A and its refined latent code \(z_A\) (from Section 4.4), we first run the shape inversion process for \(\Delta T\) steps to obtain the partially-denoised coefficient volume \(C^A_{T-\Delta T}\). In parallel (Figure 6, bottom left), we do the same on input shape B (which can be the donor/target shape, depending on the type of manipulation operation) and its refined code \(z_B\) to produce another coefficient volume \(C^{B}_{T-\Delta T}\).

Importantly, after every \(\Delta T\) steps in the diffusion process (where \(T = M \cdot \Delta T\) for some positive integer M), we replace the coefficient values in the selected region of \(C^{A}_{T-\Delta T}\) by those in \(C^{B}_{T-\Delta T}\) to produce the mixed coefficient volume \(C^{\text{mix}}_{T-\Delta T}\). As the coefficient values near the mixing boundary in \(C^{\text{mix}}_{T-\Delta T}\) may not be consistent between \(C^{A}_{T-\Delta T}\) and \(C^{B}_{T-\Delta T}\), the generated shapes may contain some small artifacts; see, e.g., Figure 7(b). To smooth the coefficient mixing, we adopt the harmonizing process in Lugmayr et al. [2022] to produce the harmonized coefficient volume \(C^{\text{h}}_{T-\Delta T}\) (details in the next paragraph). By harmonizing \(C^{\text{mix}}_{T-\Delta T}\) after combining \(C^A_{T-\Delta T}\) and \(C^B_{T-\Delta T}\), we obtain a smooth transition of coefficient values near the boundary; see the improved result in Figure 7(c). After that, the harmonized coefficient volume guides the subsequent steps of the two processes, so we run this combine-and-harmonize process every \(\Delta T\) steps to obtain the final manipulated coefficient volume; see again Figure 6. We empirically set \(\Delta T = 10\) for the region-aware manipulation experiments. Also, we can replace the inversion process guided by the refined latent code \(z_B\) with an unconditional shape generation process to achieve part-wise re-generation. For the details on how we select the manipulation region in the input shape and how we compute the corresponding region in the wavelet domain, please refer to supplementary material Section F.

Details on the harmonizing process. Given the mixed coefficient volume \(C^{\text{mix}}_t\), we follow the forward process of the diffusion model to add noise to it by sampling \(C^{\text{mix}}_{t+1} \sim \mathcal {N}(\sqrt {1 - \beta _t}\, C^{\text{mix}}_{t}, \beta _t \mathbf {I})\). Then, we apply the two inversion processes mentioned above, guided by the refined latent codes \(z_A\) and \(z_B\), respectively, separately on \(C^{\text{mix}}_{t+1}\), for a single step to obtain two denoised coefficient volumes. We then combine the volumes according to the selected region again and repeat the above noise-adding and denoising procedure ten times to obtain the harmonized coefficient volume. By doing so, the generator can better account for the coefficient value changes in the manipulated region and adapt the values near the boundary. In the case of part-wise re-generation, we use the unconditional shape generation process for harmonizing the mixed coefficient volume instead of the inversion process guided by the refined latent code \(z_B\).
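The combine-and-harmonize stage can be sketched as follows; denoise_step is a standard conditional DDPM reverse step written here only for illustration, and the mask layout and network call signatures are our assumptions.

import torch

def denoise_step(generator, c, z, t, betas, alpha_bar):
    """One conditional DDPM reverse step from step t (sketch)."""
    eps = generator(c, z, torch.tensor([t], device=c.device))
    mean = (c - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / (1.0 - betas[t]).sqrt()
    if t == 0:
        return mean
    sigma2 = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]
    return mean + sigma2.sqrt() * torch.randn_like(c)

@torch.no_grad()
def combine_and_harmonize(generator, cA, cB, mask, zA, zB, t, betas, alpha_bar,
                          n_harmonize=10):
    """One combine-and-harmonize stage: mix the two partially-denoised volumes
    by the region mask, then alternately re-noise and denoise to smooth the seam."""
    c_mix = mask * cB + (1.0 - mask) * cA                 # selected region taken from B

    for _ in range(n_harmonize):
        noise = torch.randn_like(c_mix)                   # forward step: add noise back
        c_noisy = (1.0 - betas[t]).sqrt() * c_mix + betas[t].sqrt() * noise

        cA_next = denoise_step(generator, c_noisy, zA, t, betas, alpha_bar)
        cB_next = denoise_step(generator, c_noisy, zB, t, betas, alpha_bar)
        c_mix = mask * cB_next + (1.0 - mask) * cA_next   # recombine by the mask
    return c_mix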

Discussions on shape manipulation. While our approach can create high-fidelity shapes with fine details, there is still no guarantee that the results always meet user expectations. For example, the quality of a manipulated result can be sensitive to hyper-parameters such as \(\Delta T\). Furthermore, the alignment between the input shape and the donor shape can impact the performance of the region-aware manipulations, since shapes with a significant misalignment can hardly be expected to produce a reasonable result.

4.6 Implementation Details

We employed ShapeNet [Chang et al. 2015] to prepare the training dataset used in all our experiments. For shape generation, we follow the data split in Chen and Zhang [2019] and use only the training split to supervise our network training. For shape inversion, we follow the data split in Park et al. [2019] for ease of comparison. Also, similar to Hertz et al. [2022], Li et al. [2021], and Luo and Hu [2021], we train one model for each category in the ShapeNet dataset [Chang et al. 2015] for both shape generation and inversion.

We implement our networks using PyTorch and run all experiments on a GPU cluster with four RTX 3090 GPUs. For both shape generation and inversion, we follow [Ho et al. 2020] to set \(\lbrace \beta _t\rbrace\) to increase linearly from \(10^{-4}\) to 0.02 over 1,000 steps and set \(\sigma _t^2 = \frac{1-\bar{\alpha }_{t-1}}{1 - \bar{\alpha }_t} \beta _t\). We train the generator, optionally with the encoder, for 800,000 iterations and the detail predictor for 60,000 iterations, both using the Adam optimizer [Kingma and Ba 2014] with a learning rate of \(10^{-4}\). Training the generator and detail predictor takes around three days and 12 hours, respectively. For the shape-guided refinement, we also adopt the Adam optimizer [Kingma and Ba 2014] with a learning rate of \(5 \times 10^{-2}\) for 400 iterations. For shape generation, the inference takes around six seconds per shape on an RTX 3090 GPU. For shape inversion, the refinement procedure and inference take around two minutes in total on an RTX 3090 GPU using 1,000 diffusion steps. For shape manipulation, our region-aware manipulation procedure takes four minutes on an RTX 3090 GPU for 1,000 diffusion steps. We adapt [Cotter 2020] to implement the 3D wavelet decomposition.


5 RESULTS AND EXPERIMENTS

5.1 Shape Generation

Galleries of our generated shapes. We present Figure 1 (left) and Figure 8 to showcase the compelling capability of our method for generating shapes of various categories. Our generated shapes exhibit diverse topology, fine details, and also clean surfaces without obvious artifacts, covering a rich variety of small, thin, and complex structures that are typically very challenging for the existing approaches to produce. More 3D shape generation results produced by our method are provided in supplementary material Section A.

Fig. 8.

Fig. 8. Gallery of our generated shapes: Table, Chair, Cabinet, and Airplane (top to bottom). Our shapes exhibit complex structures, fine details, and clean surfaces, without obvious artifacts, compared with those generated by the other approaches; see Figure 9.

Baselines for comparison. We compare the shape generation capability of our method with six state-of-the-art methods: IM-GAN [Chen and Zhang 2019], Voxel-GAN [Kleineberg et al. 2020], Point-Diff [Luo and Hu 2021], SPAGHETTI [Hertz et al. 2022], SDF-StyleGAN [Zheng et al. 2022], and GET3D [Gao et al. 2022]. To the best of our knowledge, our method is the first to generate implicit shape representations in the frequency domain and to consider coarse and detail coefficients to enhance the generation of structures and fine details.

Our experiments follow the same setting as the above works. Specifically, we leverage our trained model on the Chair and Airplane categories in ShapeNet [Chang et al. 2015] to randomly generate 2,000 shapes for each category. Then, we uniformly sample 2,048 points on each generated shape and evaluate the shapes using the same set of metrics as in the previous methods (details to be presented later). For GET3D, we use the official pre-trained model for the Chair category and adopt their code to train a model for the Airplane category, which is not provided in their released repository. For the other comparison methods, we employ publicly-released trained network models to generate shapes.

Evaluation metrics for shape generation. Following [Luo and Hu 2021; Hertz et al. 2022], we evaluate the generation quality using (i) minimum matching distance (MMD), which measures the fidelity of the generated shapes; (ii) coverage (COV), which indicates how well the generated shapes cover the given 3D repository; and (iii) 1-NN classifier accuracy (1-NNA), which measures how well a classifier differentiates the generated shapes from those in the repository. Overall, a low MMD, a high COV, and a 1-NNA close to 50% indicate good generation quality; see supplementary material Section K for the details.
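For reference, the three metrics can be computed from pairwise point-cloud distance matrices roughly as sketched below; exact conventions (e.g., which set is treated as the reference for MMD and COV) vary slightly across papers, so this is only an illustrative formulation.

import torch

def mmd_cov_1nna(d_gen_ref, d_gen_gen, d_ref_ref):
    """Compute MMD, COV, and 1-NNA from pairwise distance matrices (sketch).

    d_gen_ref: [G, R] distances (e.g., CD or EMD) from generated to reference shapes.
    d_gen_gen: [G, G] distances among generated shapes.
    d_ref_ref: [R, R] distances among reference shapes.
    """
    # MMD: for each reference shape, the distance to its closest generated shape.
    mmd = d_gen_ref.min(dim=0).values.mean()

    # COV: fraction of reference shapes that are the nearest neighbour
    # of at least one generated shape.
    cov = d_gen_ref.argmin(dim=1).unique().numel() / d_gen_ref.shape[1]

    # 1-NNA: leave-one-out 1-NN classification accuracy on the merged set;
    # ~50% means generated and reference sets are statistically indistinguishable.
    G, R = d_gen_ref.shape
    big = torch.cat([
        torch.cat([d_gen_gen, d_gen_ref], dim=1),
        torch.cat([d_gen_ref.t(), d_ref_ref], dim=1)], dim=0)
    big.fill_diagonal_(float("inf"))                      # exclude self-matches
    labels = torch.cat([torch.zeros(G), torch.ones(R)]).to(big.device)
    nearest = big.argmin(dim=1)
    acc = (labels[nearest] == labels).float().mean()
    return mmd.item(), cov, acc.item()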

Quantitative evaluation for shape generation. Table 1 reports the quantitative comparison results, showing that our method surpasses all others for almost all evaluation cases over the three metrics for both the Chair and Airplane categories. We employ the Chair category, due to its large variations in structure and topology, and the Airplane category, due to the fine details in its shapes. As discussed in Luo and Hu [2021] and Yang et al. [2019], the COV and MMD metrics have limited capability to account for details, so they are not suitable for measuring the fine quality of the generation results; e.g., the generated shapes sometimes show a better performance on these metrics even when compared with the ground-truth training shapes. In contrast, 1-NNA is more robust and correlates better with the generation quality. On this metric, our approach outperforms all others, with a significant margin in the Airplane category, manifesting the diversity and fidelity of our generated results.

Table 1. Quantitative Comparison between the Generated Shapes Produced by our Method and Six State-of-the-art Methods

Method                               Chair                                       Airplane
                                     COV (↑)      MMD (↓)      1-NNA (~50)       COV (↑)      MMD (↓)       1-NNA (~50)
                                     CD     EMD   CD     EMD   CD     EMD        CD     EMD   CD      EMD    CD     EMD
IM-GAN [Chen and Zhang 2019]         56.49  54.50 11.79  14.52 61.98  63.45      61.55  62.79 3.320   8.371  76.21  76.08
Voxel-GAN [Kleineberg et al. 2020]   43.95  39.45 15.18  17.32 80.27  81.16      38.44  39.18 5.937   11.69  93.14  92.77
Point-Diff [Luo and Hu 2021]         51.47  55.97 12.79  16.12 61.76  63.72      60.19  62.30 3.543   9.519  74.60  72.31
SPAGHETTI [Hertz et al. 2022]        49.19  51.92 14.90  15.90 70.72  68.95      58.34  58.38 4.062   8.887  78.24  77.01
SDF-StyleGAN [Zheng et al. 2022]     51.77  50.30 13.45  15.43 68.88  70.20      57.97  48.33 3.859   9.406  83.37  84.36
GET3D [Gao et al. 2022]              53.47  56.41 14.43  15.63 70.32  69.51      55.62  55.38 4.134   9.421  89.12  86.77
Ours                                 58.19  55.46 11.70  14.31 61.47  61.62      64.78  64.40 3.230   7.756  71.69  66.74

  • We follow the same setting to conduct this experiment as in the state-of-the-art methods. From the table, we can see that our generated shapes have the best quality for almost all cases (largest COV, lowest MMD, and 1-NNA closest to 50) for both the Chair and Airplane categories. The units of CD and EMD are \(10^{-3}\) and \(10^{-2}\), respectively.

Qualitative evaluation for shape generation. Figure 9 shows some visual comparisons. For each random shape generated by our method, we find a similar shape (with similar structures and topology) generated by each of the other methods to make the visual comparison easier. See supplementary material Sections B and D for more visual comparisons. Further, as different methods likely have different statistical modes in the generation distribution, we also take random shapes generated by IM-GAN and find similar shapes generated by our method for comparison; see supplementary material Section C for the results. From all these results, we can see that the 3D shapes generated by our method clearly exhibit finer details, higher fidelity structures, and cleaner surfaces, without obvious artifacts.

Fig. 9.

Fig. 9. Visual comparisons with state-of-the-art methods. Our generated shapes exhibit finer details and cleaner surfaces, without obvious artifacts.

5.2 Shape Inversion

Baselines for comparison. We compare our shape inversion results with those from four state-of-the-art methods: IM-NET [Chen and Zhang 2019], DeepSDF [Park et al. 2019], DualSDF [Hao et al. 2020], and SPAGHETTI [Hertz et al. 2022]. We employ their official code to train their models, following the same train-test split as Park et al. [2019] for a fair comparison. Note that we do not evaluate the results of SPAGHETTI on the Lamp category, as its official pre-trained model is unavailable and the training code had not been officially released at the time of submission. Further, we notice a very recent work, NeuForm [Lin et al. 2022], which overfits the given shape to achieve shape inversion. Even though the code and pre-trained models of this work have not been released, we provide a visual comparison on the inversion results in supplementary material Section E using the example results given in their paper.

Evaluation metrics for shape inversion. Following prior works, we evaluate the inversion quality by measuring the similarity between the inverted shapes and the original inputs using the Chamfer Distance (CD), Earth Mover’s Distance (EMD), and Light Field Distance (LFD) [Chen et al. 2003]. CD and EMD evaluate point-wise distances between two point clouds sampled on the shape surfaces. Here, we uniformly sample 2,048 points on each shape (inverted and original) and evaluate the metrics on the sampled point clouds. LFD measures the difference between two shapes in the rendered image domain. First, we uniformly sample 20 viewpoints to render images for each of the inverted and input shapes. Then, we compute a 45-dimension feature vector for each rendered image and obtain the final metric by summing up all pairwise L1 distances between the image feature vectors from the same viewpoint. Note that for all three metrics, a lower value indicates better performance. Also, since there could be noise in the sampled TSDF grid, we post-process the inverted shapes. For details on the metrics and post-processing, please see supplementary material Section L.
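As a concrete example of the point-based metrics, a symmetric Chamfer distance on two sampled point sets can be written as below; papers differ on whether distances are squared and how the two directions are combined, so the exact convention here is an assumption.

import torch

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p: [N, 3] and q: [M, 3]
    (squared-distance convention)."""
    d = torch.cdist(p, q)   # [N, M] pairwise Euclidean distances
    return d.min(dim=1).values.pow(2).mean() + d.min(dim=0).values.pow(2).mean()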

Quantitative evaluation for shape inversion. Table 2 shows the quantitative comparison results. Without refinement, our method already performs better in all metrics than IM-NET [Chen and Zhang 2019], which also does not require additional refinement. By further refining the latent code, our method can significantly outperform all the other methods in all metrics. Particularly, for the LFD metric, our method manifests more than \(30\%\) performance gain over the second-best method in all categories.

Table 2. Quantitative comparison on shape inversion between our method and four state-of-the-art methods

Method                            Chair                      Airplane                   Lamp
                                  CD (↓)  EMD (↓) LFD (↓)    CD (↓)  EMD (↓) LFD (↓)    CD (↓)   EMD (↓)  LFD (↓)
IM-NET [Chen and Zhang 2019]      4.968   10.42   2800.56    3.301   9.261   5024.65    20.253   20.403   6880.01
DeepSDF [Park et al. 2019]        4.657   9.066   2403.08    3.249   9.734   4939.56    17.540   18.918   6858.34
DualSDF [Hao et al. 2020]         8.254   13.01   2588.87    6.529   15.74   7097.09    35.125   26.692   6820.62
SPAGHETTI [Hertz et al. 2022]     2.837   7.663   1988.82    1.386   6.466   3637.33    —        —        —
Ours without Refinement           4.578   8.919   2358.98    2.459   6.816   3726.43    17.412   19.237   6955.02
Ours                              1.142   4.491   924.911    1.108   4.507   2508.55    5.834    7.201    3450.34

  • We follow the same setting as in the state-of-the-art methods. Our method outperforms (lowest CD, EMD, and LFD) all the other methods for all categories. The units of CD, EMD, and LFD are \(10^{-3}\), \(10^{-2}\), and 1, respectively. Notice that the quantitative evaluation for SPAGHETTI [Hertz et al. 2022] in the Lamp category is not shown here, as SPAGHETTI does not provide the pre-trained model of Lamp and the training code has not been officially released.

Qualitative evaluation for shape inversion. Figure 10 shows some visual comparisons on shape inversion. We can observe that our method can already produce plausible results without refinement; see Figure 10(f). By further introducing the shape-guided refinement, we can faithfully reproduce the fine details and complex structures in the input shapes; see, e.g., the thin structures at the aircraft’s propeller and the chair’s back shown in the fourth and fifth rows of Figure 10; note that they cannot be achieved by the existing works.

Fig. 10.

Fig. 10. Visual comparisons on shape inversion. Our method is able to produce more faithful inverted shapes (g) that are highly similar to the inputs (a), compared with existing works (b-e). Our inverted results exhibit fine details and complete structures; see, e.g., the chair pulleys and airplane propellers, which are typically very challenging for the existing works. Also, our method, even without further refinements (f), can still produce reasonable shape inversions. The results of the Lamp category are not provided for SPAGHETTI [Hertz et al. 2022], as the associated pre-trained model is not publicly available.

5.3 Shape Manipulation

Shape Interpolation. As Figure 11 shows, our method can produce smooth and plausible interpolations between two unseen shapes, thanks to the shape-guided refinement. From left to right, the source can morph smoothly towards the target; see especially the consistent changes in the armrests in the last row of Figure 11. These results manifest the superior capability of our framework to embed an unseen shape in a smooth and plausible latent space.

Fig. 11.

Fig. 11. High-quality shapes of various categories created by interpolating our refined latent codes. Note particularly the plausible structures and fine details in the smooth transition from the sources to the targets.

Part Replacement. Besides, we can select a part in an input shape and replace it with a corresponding part in the donor shape. As Figure 12 shows, we replace the blue part in each input (a) with the blue part in the corresponding donor (b) to generate new shapes (d). Note that the new shapes exhibit high fidelity, while preserving the geometric details of (a) & (b). See particularly the last row of Figure 12; the complex armrests and back in the input shape can be well preserved in our new shape, while the chair’s seat and legs are seamlessly connected with clean surfaces and sharp edges.

Fig. 12.

Fig. 12. Part replacement results. We replace the blue part in the original shape (a) with the blue part in the donor shape (b), thus generating new shapes (d). Compared with SPAGHETTI (c), our method can better preserve the original parts, e.g., legs and castors, and produce plausible shapes.

Further, we compare our method with SPAGHETTI [Hertz et al. 2022], the state-of-the-art implicit method for the same task. In particular, SPAGHETTI produces inverted shapes as a mixture of Gaussian distributions, each conditioned by a latent code; part replacement on the input can then be conducted by replacing some of these latent codes with the corresponding ones from another shape. However, their results are somewhat noisy in the connecting regions and struggle to preserve the geometry and details of the input and the part donor; see Figure 12(c). More visual comparisons on part replacement with SPAGHETTI [Hertz et al. 2022] and COALESCE [Yin et al. 2020] are provided in supplementary material Section F. Note also that for the examples shown in the main paper, we cannot provide comparisons with COALESCE, since it needs additional part-level annotations, which are not available for these examples.

Part-wise Interpolation. Also, our method enables us to select a part (region) in the source shape and interpolate it towards the target shape; see, e.g., Figure 13. During the interpolation, the green part of the source morphs smoothly towards the associated part in the target shape, while the remaining parts (marked in blue) are preserved. See the drawer’s knobs of the table in the second row and the chair’s legs in the third row of Figure 13; the geometry and topology of the untouched parts are preserved faithfully. Besides, the selected part of the intermediate results stays consistent with the untouched parts; see the connecting regions between the chair’s seat and legs in the last row of Figure 13.

Fig. 13.

Fig. 13. Part-wise interpolation results. Our method can interpolate the selected part (yellow for the tables and green for the chairs) in the sources (leftmost) towards the targets (rightmost). As shown, the intermediate results morph smoothly on the selected region, while preserving other parts.
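As a rough illustration of such region-aware editing, the sketch below blends two latent codes only inside a user-selected region. It assumes the latent code is a spatial feature volume aligned with the shape and that a binary region mask is available; both are simplifying assumptions for illustration, not the exact procedure in our framework.

```python
import torch

def partwise_interpolation(z_src, z_tgt, region_mask, steps=5):
    """Blend source and target latent volumes only inside the selected region
    (region_mask == 1); keep the rest of the source latent fixed.
    z_src, z_tgt: (C, D, H, W) latent volumes; region_mask: (1, D, H, W)."""
    codes = []
    for t in torch.linspace(0.0, 1.0, steps):
        blended = (1.0 - t) * z_src + t * z_tgt                  # whole-shape blend
        codes.append(region_mask * blended + (1.0 - region_mask) * z_src)
    return codes  # each code is then decoded to a shape by the generator
```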

Part-wise Re-generation. Further, our method allows us to specify a part in the input shape and replace it with a randomly-generated part; see Figure 14. The randomly-generated parts are diverse and plausible; see the table at the top left of Figure 14. The connecting regions in the tables remain consistent across all the re-generation results, while the re-generated legs of the tables exhibit diverse topological structures and geometric details.

Fig. 14.

Fig. 14. Part-wise re-generation results. We keep the blue parts fixed and randomly re-generate the remaining parts. Note the diverse structures and geometries in the re-generated parts. See the chair at the bottom right: the new parts that replace the original back and seat exhibit various styles, e.g., one with vertical bars and another with a round back. Also, the geometry of the legs remains consistent across the results.
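For intuition, one way to realize such region-constrained sampling with a trained diffusion model is to clamp the kept region to a suitably-noised copy of the known coefficients at every reverse step, in the spirit of diffusion inpainting [Lugmayr et al. 2022]. The sketch below is a hypothetical, simplified version; the denoiser signature, the mask, and the DDIM-style update are assumptions for illustration, not our exact implementation.

```python
import torch

@torch.no_grad()
def regenerate_region(denoiser, x_known, keep_mask, alphas_cumprod):
    """Re-generate the unmasked region of a coefficient volume while keeping
    the masked (keep_mask == 1) region fixed. `denoiser(x, t)` is assumed to
    predict the noise at step t; `alphas_cumprod` is the diffusion schedule."""
    T = alphas_cumprod.shape[0]
    x = torch.randn_like(x_known)
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_tensor(1.0)
        # Forward-diffuse the kept region to the current noise level.
        x_keep = a_t.sqrt() * x_known + (1.0 - a_t).sqrt() * torch.randn_like(x_known)
        # One simplified reverse step (DDIM-style, eta = 0) on the current sample.
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = denoiser(x, t_batch)
        x0_hat = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps
        # Compose: clamp the kept region, let the rest be re-generated.
        x = keep_mask * x_keep + (1.0 - keep_mask) * x
    # Restore the kept region exactly at the end.
    return keep_mask * x_known + (1.0 - keep_mask) * x
```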

More shape manipulation results. Please refer to supplementary material Section F for more results on shape interpolation, part replacement, part-wise interpolation, and part-wise re-generation.

5.4 Shape Reconstruction

Shape reconstruction from point clouds and single-view images. By utilizing the latent space learned by the encoder network, we can leverage our method to reconstruct implicit shapes from various forms of inputs. As suggested by Chen and Zhang [2019], we train another encoder that takes a point cloud or a single-view image as input and predicts a latent code that matches the one obtained by our trained encoder network in Section 4.2. Given an unseen input (a point cloud or a single-view image), we can then use the new encoder to produce a corresponding latent code and use this code to reconstruct the 3D shape.
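The sketch below illustrates one such training loop, assuming the shape encoder from Section 4.2 is frozen and the new encoder is regressed onto its latent codes with an MSE loss; the loss choice and the network interfaces are illustrative assumptions rather than the exact training setup.

```python
import torch
import torch.nn as nn

def train_step(new_encoder, frozen_shape_encoder, optimizer, batch):
    """One hypothetical training step: regress the new encoder's prediction
    onto the latent code produced by the already-trained shape encoder."""
    obs, coeff_volume = batch                 # point cloud (or image) + wavelet volume
    with torch.no_grad():
        target_code = frozen_shape_encoder(coeff_volume)   # code to match
    pred_code = new_encoder(obs)
    loss = nn.functional.mse_loss(pred_code, target_code)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```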

Figure 15 shows visual examples of our reconstruction results. We can reconstruct shapes with fine details and thin structures solely by predicting the latent codes. The results show that the learned latent space is highly smooth and covers various unseen inputs. Interestingly, our method can reconstruct some occluded parts in the single-view images, e.g., the occluded legs of the chairs, by leveraging the shape priors learned in the latent space.

Fig. 15.

Fig. 15. Shape reconstruction results. Our method can embed single-view images (top) and point clouds (bottom) into our latent space and faithfully reconstruct shapes that match the raw inputs. Note the complex topology structures and fine geometric details in the reconstructed shapes.

5.5 Novelty Analysis on Shape Generation

Next, we analyze whether our method can generate shapes that are not necessarily the same as the training-set shapes, i.e., that it does not simply memorize the training data. To do so, we use our method to generate 500 random shapes and retrieve the top-four most similar shapes in the training set via two different metrics, i.e., Chamfer Distance (CD) and Light Field Distance (LFD) [Chen et al. 2003]. Note that LFD is computed on images rendered from multiple views of each shape, so it focuses more on the visual similarity between shapes and is considered more robust for shape retrieval. For details on the metrics, please see supplementary material Section H.
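The retrieval protocol itself is straightforward; a minimal sketch is shown below, assuming a shape_distance function (e.g., CD or LFD) is available and that shapes are stored in whatever form that function expects. It is an illustration of the protocol, not the exact script used for this analysis.

```python
import numpy as np

def retrieve_top_k(generated, training_set, shape_distance, k=4):
    """For each generated shape, return the indices of the k most similar
    training shapes under the given distance (lower = more similar)."""
    retrievals = []
    for g in generated:
        dists = np.array([shape_distance(g, s) for s in training_set])
        retrievals.append(np.argsort(dists)[:k])
    return retrievals
```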

Figure 16 (top) shows a shape generated by our method, together with top-four most similar shapes retrieved from the training set by the CD and LFD metrics. Further, we show another ten examples in supplementary material Section G. Comparing our shapes with the retrieved ones, we can see that the shapes share similar structures, showing that our method is able to generate realistic-looking structures like those in the training set. Beyond that, our shapes exhibit noticeable differences in various local structures.

Fig. 16.

Fig. 16. Shape novelty analysis. Top: for a shape generated by our method (in green), we retrieve the top-four most similar shapes (in blue) in the training set by CD and LFD. Bottom: we generate 500 chairs using our method; for each chair, we retrieve the most similar shape in the training set by LFD; then, we plot the distribution of LFDs for all retrievals, showing that our method can generate shapes that are more similar to (low LFD) or more novel than (high LFD) the training set. Note that the generated shape at the 50th percentile is already not that similar to the associated training-set shape.

As mentioned earlier, a good generator should produce diverse shapes that are not necessarily the same as the training shapes. So, we further statistically analyze the novelty of our generated shapes relative to the training set. To do so, we use our method to generate 500 random chairs; for each generated chair shape, we use LFD to retrieve the most similar shape in the training set. Figure 16 (bottom) plots the distribution of LFDs between our generated shapes (in green) and retrieved shapes (in blue). Also, we show four shape pairs at various percentiles, revealing that shapes with larger LFDs are more different from the most similar shapes in the training set. From the LFD distribution, we can see that our method can learn a generation distribution that covers shapes in the training set (low LFD) and also generates novel and realistic-looking shapes that are more different (high LFD) from the training-set shapes.

5.6 Ablation Study

Ablation on shape generation. To evaluate the major components in shape generation, we successively ablate the full pipeline. First, we evaluate the effect of detail predictor on the generation performance. Then, we study the contributions of the diffusion model by replacing it with a variational autodecoder (VAD) [Zadeh et al. 2019; Hao et al. 2020; Cheng et al. 2022]. Last, we ablate the wavelet representation against directly producing the TSDF volumes with two settings. Specifically, the first setting directly generates a low-resolution TSDF volume of the same resolution as our coarse wavelet coefficient volume (\(46^3\)), whereas the second one applies a network of the same architecture as our detail predictor to upsample the TSDF volume to the same resolution as our detail wavelet coefficient volume (\(76^3\)).

First, from the results reported in Table 3 (first vs. second rows), we can see the capability of the detail predictor, which introduces a substantial improvement on all metrics. Second, replacing our generator with the VAD model leads to a significant performance degradation; see the first vs. third rows. Third, the ablation of the wavelet representation indicates that directly producing a low-resolution TSDF leads to a consistent performance drop in all metrics; see the first vs. fourth rows. In the second setting (first vs. last rows), though the upsampling network can improve the generation quality, there still exists a significant performance gap compared with our full pipeline in terms of all metrics, especially “1-NNA.”

Table 3.

Method                 | COV↑             | MMD↓             | 1-NNA ~50
                       | CD      EMD      | CD      EMD      | CD      EMD
Full Model             | 58.19   55.46    | 11.70   14.31    | 61.47   61.62
W/o detail predictor   | 54.20   50.96    | 12.32   14.54    | 62.46   62.57
VAD Generator          | 21.83   26.77    | 21.83   26.77    | 95.20   93.62
Low-resolution TSDF    | 50.51   50.67    | 12.83   15.24    | 68.69   68.29
High-resolution TSDF   | 51.33   51.11    | 12.22   14.42    | 64.53   64.09

  • The unit of CD is \(10^{-3}\) and the unit of EMD is \(10^{-2}\).

Table 3. Comparing our Full Generation Pipeline with Various Ablated Cases on the Chair Category
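For reference, the three generation metrics in Table 3 can be computed from pairwise distance matrices between the generated and reference sets. The sketch below is our own illustration of the commonly-used definitions from prior point-cloud generation works, not the exact evaluation code used here.

```python
import numpy as np

def coverage(d_gr):
    """COV: fraction of reference shapes that are the nearest neighbor of at
    least one generated shape (higher is better).
    d_gr[i, j] = distance from generated shape i to reference shape j."""
    matched = np.unique(d_gr.argmin(axis=1))
    return len(matched) / d_gr.shape[1]

def mmd(d_gr):
    """MMD: for each reference shape, distance to its closest generated shape,
    averaged over the reference set (lower is better)."""
    return d_gr.min(axis=0).mean()

def one_nna(d_gg, d_rr, d_gr):
    """1-NNA: leave-one-out 1-NN two-sample test accuracy; ~50% means the
    generated and reference sets are hard to tell apart."""
    d_gg = d_gg.copy()
    d_rr = d_rr.copy()
    np.fill_diagonal(d_gg, np.inf)   # exclude self-matches
    np.fill_diagonal(d_rr, np.inf)
    correct = 0
    for i in range(d_gg.shape[0]):   # generated samples: correct if 1-NN is generated
        correct += d_gg[i].min() < d_gr[i].min()
    for j in range(d_rr.shape[0]):   # reference samples: correct if 1-NN is reference
        correct += d_rr[j].min() < d_gr[:, j].min()
    return correct / (d_gg.shape[0] + d_rr.shape[0])
```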


Figure 17 provides visual ablation results on shape generation (top) and inversion (bottom). First, comparing Figures 17(a) and (b) shows that the absence of the detail predictor leads to a loss of fine details and thin structures. Second, replacing the diffusion model with the VAD model, as proposed in Zadeh et al. [2019], leads to a considerable performance drop, as indicated in Figure 17(c). Third, comparing Figures 17(b) and (d) shows that our compact wavelet representation outperforms a TSDF at the same resolution of \(46^3\) (the first setting). This is primarily due to the superior shape representation capability of our neural wavelet representation, whereas a TSDF at such a low resolution cannot capture the necessary information, especially the high-frequency components. In the second setting, although the upsampling network (Figure 17(e)) captures more details than (d), it still falls short of the full pipeline; compare Figures 17(a) and (e).

Fig. 17.

Fig. 17. Top: the visual ablation results of the shapes generated by (a) the full generation pipeline; (b) removing the detail predictor; (c) replacing the diffusion model with VAD [Zadeh et al. 2019; Hao et al. 2020]; (d) directly generating a low-resolution TSDF without the wavelet representation; and (e) producing a high-resolution TSDF by an upsampling network. Bottom: the visual ablation results of the shapes inverted by (g) the full inversion pipeline; (h) using the latent code from the encoder without refinement; (i) removing the detail predictor; (j) directly predicting a low-resolution TSDF ( \(46^3\) ); and (k) upsampling the TSDF to \(76^3\) resolution.

Besides, we explored the possibility of using a GAN [Goodfellow et al. 2014] (with the same 3D convolution network architecture) to generate the wavelet coefficients. However, we found that the model cannot converge well, even when adopting the loss function introduced in WGAN [Arjovsky et al. 2017]. This can be partially attributed to the complexity of the wavelet coefficients in the frequency domain, which makes it challenging for the GAN model to converge. For a fair comparison, we instead compare our diffusion model with the VAD model, which achieves a nearly monotonic decrease of the objective loss during training; see Figures 17(a) and (c).

Ablation on shape inversion. Next, we explore our shape inversion pipeline on the Chair category to evaluate its major components. First, we quantitatively evaluate the effect of directly reconstructing the input from the latent code predicted by the encoder, without the shape-guided refinement. Second, we evaluate the shape inversion performance with and without the detail predictor. Last, we evaluate the effectiveness of the wavelet representation for shape inversion by constructing baselines that predict the low- and high-resolution TSDFs, as introduced in the generation setting.

Table 4 reports the quantitative results, demonstrating the capability of our shape-guided refinement (first vs. second rows). Further, dropping the detail predictor largely degrades the performance (first vs. third rows).

Table 4.

Method                 | CD↓     | EMD↓    | LFD↓
Full Model             | 1.142   | 4.491   | 924.91
W/o Refinement         | 4.578   | 8.919   | 2358.98
W/o detail predictor   | 1.194   | 5.129   | 1200.04
Low-resolution TSDF    | 2.528   | 7.568   | 2103.94
High-resolution TSDF   | 2.461   | 7.092   | 1914.91

  • The units of CD, EMD, and LFD are \(10^{-3}\), \(10^{-2}\), and 1, respectively.

Table 4. Comparing our Full Inversion Pipeline with Various Ablated Cases on the Chair Category


Also, directly predicting the low-resolution TSDFs harms the inversion performance (“Full Model” vs. “Low-resolution TSDF”, i.e., first vs. fourth rows), which aligns with the above findings on shape generation. Similarly, introducing the upsampling network improves the shape inversion quality (“High-resolution TSDF” vs. “Low-resolution TSDF”), yet it still yields inferior performance compared with the full pipeline (first vs. last rows).

Furthermore, Figures 17(f–k) present visual ablation results on shape inversion. First, inversion using the latent code without refinement can already produce a shape reasonably similar to the input, but it is difficult to recover complex topological structures, as depicted in Figures 17(g) and (h). Second, when the detail predictor is ablated, there is a large drop in inversion quality, particularly on the thin and fine structures of the shape, as demonstrated in Figures 17(g) and (i). Third, directly predicting a low-resolution TSDF also proves to be inadequate, especially for structures like the thin bars at the chair’s back, as evidenced in Figures 17(i) and (j). Last, although adopting a high-resolution TSDF enhances the quality of the inverted shapes, as shown in Figures 17(j) and (k), it still struggles to produce shapes with fine details; compare Figures 17(g) and (k).

Details of ablation study. Please refer to supplementary material Section M for the implementation details of each ablation baseline.

5.7 Limitations and Future Works

Discussion on shape generation. While our method can generate diverse and realistic-looking shapes, the generated shapes may not meet the desired functionality. As Figure 18(a) shows, the generated chairs, despite looking interesting and structurally reasonable, have an exceptionally tall seat back and a low seat height for normal human bodies. In the future, we may incorporate functionality, e.g., [Blinn et al. 2021], into the shape generation process. Second, though our method can learn efficiently via the wavelet representation, it still requires a large number of shapes for training. So, for categories with few training samples, especially those with complex and very thin structures (see, e.g., Figure 18(b)), the generated shapes may exhibit artifacts such as broken parts and structures. Last, while our method can fit the data distribution better than existing methods, the generated shapes still conform to the structures and appearances in the training set. Exploring how to generate out-of-distribution shapes is an interesting but challenging future direction.

Fig. 18.

Fig. 18. Failure cases. Left: chairs that are unlikely to meet basic functionality requirements in the real world. Right: artifacts in complex and very thin structures.

Discussion on shape inversion and manipulation. Our method is able to invert and reconstruct shapes faithfully and to facilitate manipulations with high fidelity, yet it requires users to manually select regions for manipulation. An interesting future direction is to guide the shape manipulation by sketches or texts [Li et al. 2017a; Liu et al. 2022]. Also, while our compact wavelet representation maintains a good spatial correspondence with the original TSDF domain and enables various part-aware manipulations, inter-shape correspondences still rely mainly on a canonically-aligned dataset. We hope to establish stronger correspondences in the compact wavelet domain to enable more compelling downstream applications such as texture and detail transfer. Besides, the diffusion process requires a large number of time steps, so producing manipulated shapes is time-consuming, yet simply reducing the number of time steps leads to a significant quality drop. We plan to explore acceleration strategies to promote interactive shape manipulation.


6 CONCLUSION

This paper presents a new framework for 3D shape generation, inversion, and manipulation. Unlike prior works, we operate in the frequency domain. By decomposing the implicit function, in the form of a TSDF, using biorthogonal wavelets, we build a compact wavelet representation with a pair of coarse and detail coefficient volumes as an encoding of the 3D shape. Then, we formulate our generator upon a probabilistic diffusion model to learn to generate diverse shapes in the form of coarse coefficient volumes from noise samples, together with a detail predictor that further learns to generate compatible detail coefficient volumes for reconstructing fine details. Further, by introducing an encoder into the generation process, our framework enables faithful inversion of unseen shapes, shape interpolation, and a rich variety of region-aware manipulations. Both quantitative and qualitative experiments show the superior capabilities of our new approach on shape generation, inversion, and manipulation over the state-of-the-art methods.


ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments. We also thank Tianyu Wang for help with producing various visualizations.


REFERENCES

1. Abdal Rameen, Zhu Peihao, Mitra Niloy J., and Wonka Peter. 2021. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (SIGGRAPH) 40, 3 (2021), 121.
2. Achlioptas Panos, Diamanti Olga, Mitliagkas Ioannis, and Guibas Leonidas J. 2018. Learning representations and generative models for 3D point clouds. In Proceedings of International Conference on Machine Learning (ICML). 40–49.
3. Arjovsky Martin, Chintala Soumith, and Bottou Léon. 2017. Wasserstein generative adversarial networks. In Proceedings of International Conference on Machine Learning (ICML). 214–223.
4. Atzmon Matan and Lipman Yaron. 2020. SAL: Sign agnostic learning of shapes from raw data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2565–2574.
5. Bau David, Strobelt Hendrik, Peebles William, Wulff Jonas, Zhou Bolei, Zhu Jun-Yan, and Torralba Antonio. 2019. Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics (SIGGRAPH) 38, 4 (2019), 11.
6. Blinn Bryce, Ding Alexander, Ritchie Daniel, Jones R. Kenny, Sridhar Srinath, and Savva Manolis. 2021. Learning body-aware 3D shape generative models. arXiv preprint arXiv:2112.07022 (2021).
7. Cai Ruojin, Yang Guandao, Averbuch-Elor Hadar, Hao Zekun, Belongie Serge, Snavely Noah, and Hariharan Bharath. 2020. Learning gradient fields for shape generation. In European Conference on Computer Vision (ECCV). 364–381.
8. Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015).
9. Chen Ding-Yun, Tian Xiao-Pei, Shen Yu-Te, and Ouhyoung Ming. 2003. On visual similarity based 3D model retrieval. In Computer Graphics Forum, Vol. 22. 223–232.
10. Chen Zhiqin and Zhang Hao. 2019. Learning implicit fields for generative shape modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5939–5948.
11. Chen Zhang, Zhang Yinda, Genova Kyle, Fanello Sean, Bouaziz Sofien, Häne Christian, Du Ruofei, Keskin Cem, Funkhouser Thomas, and Tang Danhang. 2021. Multiresolution deep implicit functions for 3D shape representation. In IEEE International Conference on Computer Vision (ICCV). 13087–13096.
12. Cheng Zezhou, Chai Menglei, Ren Jian, Lee Hsin-Ying, Olszewski Kyle, Huang Zeng, Maji Subhransu, and Tulyakov Sergey. 2022. Cross-modal 3D shape generation and manipulation. In European Conference on Computer Vision (ECCV). 303–321.
13. Chibane Julian, Alldieck Thiemo, and Pons-Moll Gerard. 2020. Implicit functions in feature space for 3D shape reconstruction and completion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6970–6981.
14. Cohen Albert. 1992. Biorthogonal wavelets. Wavelets: A Tutorial in Theory and Applications 2 (1992), 123–152.
15. Cotter Fergal. 2020. Uses of Complex Wavelets in Deep Convolutional Neural Networks. Ph.D. Dissertation. University of Cambridge.
16. Daubechies Ingrid. 1990. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory 36, 5 (1990), 961–1005.
17. Dhariwal Prafulla and Nichol Alexander. 2021. Diffusion models beat GANs on image synthesis. Conference on Neural Information Processing Systems (NeurIPS) (2021), 8780–8794.
18. Fan Haoqiang, Su Hao, and Guibas Leonidas J. 2017. A point set generation network for 3D object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 605–613.
19. Fathony Rizal, Sahu Anit Kumar, Willmott Devin, and Kolter J. Zico. 2020. Multiplicative filter networks. In International Conference on Learning Representations (ICLR).
20. Gal Rinon, Bermano Amit, Zhang Hao, and Cohen-Or Daniel. 2020. MRGAN: Multi-rooted 3D shape generation with unsupervised part disentanglement. In IEEE International Conference on Computer Vision (ICCV). 2039–2048.
21. Gao Jun, Shen Tianchang, Wang Zian, Chen Wenzheng, Yin Kangxue, Li Daiqing, Litany Or, Gojcic Zan, and Fidler Sanja. 2022. GET3D: A generative model of high quality 3D textured shapes learned from images. In Conference on Neural Information Processing Systems (NeurIPS).
22. Genova Kyle, Cole Forrester, Sud Avneesh, Sarna Aaron, and Funkhouser Thomas. 2020. Local deep implicit functions for 3D shape. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4857–4866.
23. Girdhar Rohit, Fouhey David F., Rodriguez Mikel, and Gupta Abhinav. 2016. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision (ECCV). 484–499.
24. Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. Generative adversarial nets. In Conference on Neural Information Processing Systems (NeurIPS). 2672–2680.
25. Gropp Amos, Yariv Lior, Haim Niv, Atzmon Matan, and Lipman Yaron. 2020. Implicit geometric regularization for learning shapes. In Proceedings of International Conference on Machine Learning (ICML). 3569–3579.
26. Groueix Thibault, Fisher Matthew, Kim Vladimir G., Russell Bryan C., and Aubry Mathieu. 2018. A papier-mâché approach to learning 3D surface generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 216–224.
27. Hao Zekun, Averbuch-Elor Hadar, Snavely Noah, and Belongie Serge. 2020. DualSDF: Semantic shape manipulation using a two-level representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7631–7641.
28. Hertz Amir, Perel Or, Giryes Raja, Sorkine-Hornung Olga, and Cohen-Or Daniel. 2022. SPAGHETTI: Editing implicit shapes through part aware generation. ACM Transactions on Graphics (SIGGRAPH) 41, 4 (2022), 20.
29. Ho Jonathan, Jain Ajay, and Abbeel Pieter. 2020. Denoising diffusion probabilistic models. Conference on Neural Information Processing Systems (NeurIPS) (2020), 6840–6851.
30. Hui Ka-Hei, Li Ruihui, Hu Jingyu, and Fu Chi-Wing. 2022. Neural wavelet-domain diffusion for 3D shape generation. In Proceedings of SIGGRAPH ASIA. 9.
31. Hui Ka-Hei*, Li Ruihui*, Hu Jingyu, and Fu Chi-Wing (* joint first authors). 2022. Neural template: Topology-aware reconstruction and disentangled generation of 3D meshes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
32. Hui Le, Xu Rui, Xie Jin, Qian Jianjun, and Yang Jian. 2020. Progressive point cloud deconvolution generation network. In European Conference on Computer Vision (ECCV). 397–413.
33. Ibing Moritz, Lim Isaak, and Kobbelt Leif. 2021. 3D shape generation with grid-based implicit functions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 13559–13568.
34. Jiang Li, Shi Shaoshuai, Qi Xiaojuan, and Jia Jiaya. 2018. GAL: Geometric adversarial loss for single-view 3D-object reconstruction. In European Conference on Computer Vision (ECCV). 802–816.
35. Kangxue Yin, Zhiqin Chen, Siddhartha Chaudhuri, Matthew Fisher, Vladimir G. Kim, and Hao Zhang. 2020. COALESCE: Component assembly by learning to synthesize connections. In International Conference on 3D Vision (3DV). 61–70.
36. Kim Hyeongju, Lee Hyeonseung, Kang Woo Hyun, Lee Joun Yeop, and Kim Nam Soo. 2020. SoftFlow: Probabilistic framework for normalizing flow on manifolds. In Conference on Neural Information Processing Systems (NeurIPS). 16388–16397.
37. Kingma Diederik P. and Ba Jimmy. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
38. Kleineberg Marian, Fey Matthias, and Weichert Frank. 2020. Adversarial generation of continuous implicit shape representations. In Eurographics (Short Paper).
39. Li Changjian, Pan Hao, Liu Yang, Tong Xin, Sheffer Alla, and Wang Wenping. 2017a. BendSketch: Modeling freeform surfaces through 2D sketching. ACM Transactions on Graphics (SIGGRAPH) 36, 4 (2017), 114.
40. Li Jun, Xu Kai, Chaudhuri Siddhartha, Yumer Ersin, Zhang Hao, and Guibas Leonidas J. 2017b. GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (SIGGRAPH) 36, 4 (2017), 114.
41. Li Manyi and Zhang Hao. 2021. D\(^2\)IM-Net: Learning detail disentangled implicit fields from single images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10246–10255.
42. Li Ruihui, Li Xianzhi, Hui Ka-Hei, and Fu Chi-Wing. 2021. SP-GAN: Sphere-guided 3D shape generation and manipulation. ACM Transactions on Graphics (SIGGRAPH) 40, 4 (2021).
43. Lin Connor Z., Mitra Niloy J., Wetzstein Gordon, Guibas Leonidas, and Guerrero Paul. 2022. NeuForm: Adaptive overfitting for neural shape editing. In Conference on Neural Information Processing Systems (NeurIPS).
44. Liu Lingjie, Gu Jiatao, Lin Kyaw Zaw, Chua Tat-Seng, and Theobalt Christian. 2020. Neural sparse voxel fields. Conference on Neural Information Processing Systems (NeurIPS) (2020), 15651–15663.
45. Liu Shichen, Saito Shunsuke, Chen Weikai, and Li Hao. 2019. Learning to infer implicit surfaces without 3D supervision. Conference on Neural Information Processing Systems (NeurIPS) (2019), 12.
46. Liu Shi-Lin, Guo Hao-Xiang, Pan Hao, Wang Pengshuai, Tong Xin, and Liu Yang. 2021. Deep implicit moving least-squares functions for 3D reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1788–1797.
47. Liu Zhengzhe, Wang Yi, Qi Xiaojuan, and Fu Chi-Wing. 2022. Towards implicit text-guided 3D shape generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 17896–17906.
48. Lorensen William E. and Cline Harvey E. 1987. Marching Cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH, Vol. 21. 163–169.
49. Lugmayr Andreas, Danelljan Martin, Romero Andres, Yu Fisher, Timofte Radu, and Gool Luc Van. 2022. RePaint: Inpainting using denoising diffusion probabilistic models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 11461–11471.
50. Luo Andrew, Li Tianqin, Zhang Wen-Hao, and Lee Tai Sing. 2021. SurfGen: Adversarial 3D shape synthesis with explicit surface discriminators. In IEEE International Conference on Computer Vision (ICCV). 16238–16248.
51. Luo Shitong and Hu Wei. 2021. Diffusion probabilistic models for 3D point cloud generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2837–2845.
52. Mallat Stephane G. 1989. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 7 (1989), 674–693.
53. Martel Julien N. P., Lindell David B., Lin Connor Z., Chan Eric R., Monteiro Marco, and Wetzstein Gordon. 2021. ACORN: Adaptive coordinate networks for neural scene representation. ACM Transactions on Graphics (SIGGRAPH) 40, 4 (2021), 13.
54. Mescheder Lars, Oechsle Michael, Niemeyer Michael, Nowozin Sebastian, and Geiger Andreas. 2019. Occupancy networks: Learning 3D reconstruction in function space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4460–4470.
55. Mo Kaichun, Guerrero Paul, Yi Li, Su Hao, Wonka Peter, Mitra Niloy J., and Guibas Leonidas J. 2019. StructureNet: Hierarchical graph networks for 3D shape generation. ACM Transactions on Graphics (SIGGRAPH Asia) 38, 6 (2019), 242:1–242:19.
56. Nichol Alexander Quinn and Dhariwal Prafulla. 2021. Improved denoising diffusion probabilistic models. In Proceedings of International Conference on Machine Learning (ICML). 8162–8171.
57. Niemeyer Michael, Mescheder Lars, Oechsle Michael, and Geiger Andreas. 2020. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3504–3515.
58. Park Jeong Joon, Florence Peter, Straub Julian, Newcombe Richard, and Lovegrove Steven. 2019. DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 165–174.
59. Preechakul Konpat, Chatthee Nattanat, Wizadwongsa Suttisak, and Suwajanakorn Supasorn. 2022. Diffusion autoencoders: Toward a meaningful and decodable representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10619–10629.
60. Saragadam Vishwanath, Tan Jasper, Balakrishnan Guha, Baraniuk Richard G., and Veeraraghavan Ashok. 2022. MINER: Multiscale implicit neural representations. In European Conference on Computer Vision (ECCV).
61. Smith Edward J., Fujimoto Scott, Romero Adriana, and Meger David. 2019. GEOMetrics: Exploiting geometric structure for graph-encoded objects. In Proceedings of International Conference on Machine Learning (ICML). 5866–5876.
62. Smith Edward J. and Meger David. 2017. Improved adversarial systems for 3D object generation and reconstruction. In Conference on Robot Learning. PMLR, 87–96.
63. Sohl-Dickstein Jascha, Weiss Eric, Maheswaranathan Niru, and Ganguli Surya. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of International Conference on Machine Learning (ICML). 2256–2265.
64. Song Jiaming, Meng Chenlin, and Ermon Stefano. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
65. Takikawa Towaki, Litalien Joey, Yin Kangxue, Kreis Karsten, Loop Charles, Nowrouzezahrai Derek, Jacobson Alec, McGuire Morgan, and Fidler Sanja. 2021. Neural geometric level of detail: Real-time rendering with implicit 3D shapes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 11358–11367.
66. Tang Jiapeng, Han Xiaoguang, Pan Junyi, Jia Kui, and Tong Xin. 2019. A skeleton-bridged deep learning approach for generating meshes of complex topologies from single RGB images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4541–4550.
67. Tang Jiapeng, Han Xiaoguang, Tan Mingkui, Tong Xin, and Jia Kui. 2021. SkeletonNet: A topology-preserving solution for learning mesh reconstruction of object surfaces from RGB images. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 10 (2021), 6454–6471.
68. Tewari Ayush, Elgharib Mohamed, Bernard Florian, Seidel Hans-Peter, Pérez Patrick, Zollhöfer Michael, and Theobalt Christian. 2020. PIE: Portrait image embedding for semantic control. ACM Transactions on Graphics (SIGGRAPH Asia) 39, 6 (2020), 114.
69. Tov Omer, Alaluf Yuval, Nitzan Yotam, Patashnik Or, and Cohen-Or Daniel. 2021. Designing an encoder for StyleGAN image manipulation. ACM Transactions on Graphics (SIGGRAPH) 40, 4 (2021), 114.
70. Velho Luiz, Terzopoulos Demetri, and Gomes Jonas. 1994. Multiscale implicit models. In Proceedings of SIBGRAPI, Vol. 94. 93–100.
71. Wang Nanyang, Zhang Yinda, Li Zhuwen, Fu Yanwei, Liu Wei, and Jiang Yu-Gang. 2018. Pixel2Mesh: Generating 3D mesh models from single RGB images. In European Conference on Computer Vision (ECCV). 52–67.
72. Wei Fangyin, Sizikova Elena, Sud Avneesh, Rusinkiewicz Szymon, and Funkhouser Thomas. 2020. Learning to infer semantic parameters for 3D shape editing. In 2020 International Conference on 3D Vision (3DV). 434–442.
73. Wu Jiajun, Zhang Chengkai, Xue Tianfan, Freeman Bill, and Tenenbaum Josh. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Conference on Neural Information Processing Systems (NeurIPS). 82–90.
74. Xia Weihao, Zhang Yulun, Yang Yujiu, Xue Jing-Hao, Zhou Bolei, and Yang Ming-Hsuan. 2022. GAN inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
75. Xu Qiangeng, Wang Weiyue, Ceylan Duygu, Mech Radomir, and Neumann Ulrich. 2019. DISN: Deep implicit surface network for high-quality single-view 3D reconstruction. In Conference on Neural Information Processing Systems (NeurIPS). 490–500.
76. Xu Yifan, Fan Tianqi, Yuan Yi, and Singh Gurprit. 2020. Ladybird: Quasi-Monte Carlo sampling for deep implicit field based 3D reconstruction with symmetry. In European Conference on Computer Vision (ECCV). 248–263.
77. Yan Xingguang, Lin Liqiang, Mitra Niloy J., Lischinski Dani, Cohen-Or Daniel, and Huang Hui. 2022. ShapeFormer: Transformer-based shape completion via sparse representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6239–6249.
78. Yang Guandao, Cui Yin, Belongie Serge, and Hariharan Bharath. 2018. Learning single-view 3D reconstruction with limited pose supervision. In European Conference on Computer Vision (ECCV). 86–101.
79. Yang Guandao, Huang Xun, Hao Zekun, Liu Ming-Yu, Belongie Serge, and Hariharan Bharath. 2019. PointFlow: 3D point cloud generation with continuous normalizing flows. In IEEE International Conference on Computer Vision (ICCV). 4541–4550.
80. Zadeh Amir, Lim Yao-Chong, Liang Paul Pu, and Morency Louis-Philippe. 2019. Variational auto-decoder: A method for neural generative modeling from incomplete data. arXiv preprint arXiv:1903.00840 (2019).
81. Zhang Junzhe, Chen Xinyi, Cai Zhongang, Pan Liang, Zhao Haiyu, Yi Shuai, Yeo Chai Kiat, Dai Bo, and Loy Chen Change. 2021. Unsupervised 3D shape completion through GAN inversion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1768–1777.
82. Zhao Wenbin, Lei Jiabao, Wen Yuxin, Zhang Jianguo, and Jia Kui. 2021. Sign-agnostic implicit learning of surface self-similarities for shape modeling and reconstruction from raw point clouds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10256–10265.
83. Zheng Xin-Yang, Liu Yang, Wang Peng-Shuai, and Tong Xin. 2022. SDF-StyleGAN: Implicit SDF-based StyleGAN for 3D shape generation. In Eurographics Symposium on Geometry Processing (SGP).
84. Zhou Linqi, Du Yilun, and Wu Jiajun. 2021. 3D shape generation and completion through point-voxel diffusion. In IEEE International Conference on Computer Vision (ICCV). 5826–5835.
85. Zhu Jun-Yan, Krähenbühl Philipp, Shechtman Eli, and Efros Alexei A. 2016. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (ECCV). 597–613.
86. Zhu Rui, Galoogahi Hamed Kiani, Wang Chaoyang, and Lucey Simon. 2017. Rethinking Reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In IEEE International Conference on Computer Vision (ICCV). 57–65.
