PhD Thesis Defense: Dmitry Petrov, Structure-Aware Shape and Image Synthesis
Speaker
Dmitry Petrov
Abstract
In recent years, a variety of deep neural network-based architectures have been developed for 3D shape and image generation, with wide-ranging applications in computer-aided design, fabrication, architecture, art, and entertainment. While these methods can capture diverse macro-level appearances, they rarely model the 3D structure or topology of generated objects explicitly, relying instead on the representational power of the network to produce plausible-looking shapes or images. In my work, I introduce shape and image synthesis methods that model complex topological and geometric details, and that support interpretable control over the structure and geometry of generated objects.
(1) I propose ANISE, a new part-aware 3D shape reconstruction method based on neural implicit functions. Given a partial shape observation (an image or a point cloud), it reconstructs the shape as a combination of parts, each with its own geometric representation. I formulate shape reconstruction in two ways: as a union of part implicit functions, or by retrieving parts from a reference database and assembling them into the final shape. This approach allows modifying the final result either by moving parts or by swapping them via their part latent codes.
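The union-of-parts formulation can be illustrated with a toy sketch (this is not the ANISE code; the spherical "parts" stand in for per-part neural implicits decoded from part latent codes). A point is inside the shape if it is inside any part, which corresponds to taking the pointwise minimum over the part implicit functions:

```python
import numpy as np

def part_sphere(center, radius):
    """Implicit function for a spherical 'part' (toy stand-in for a
    per-part neural implicit): negative inside, positive outside."""
    center = np.asarray(center, dtype=float)
    def f(points):
        return np.linalg.norm(points - center, axis=-1) - radius
    return f

def union(parts):
    """Union of part implicits: min_i f_i(x) < 0 iff x is inside some part."""
    def f(points):
        return np.min([p(points) for p in parts], axis=0)
    return f

# Two overlapping spherical parts; moving or swapping one part changes
# only its own implicit, leaving the rest of the shape untouched.
shape = union([part_sphere([0, 0, 0], 1.0), part_sphere([1.5, 0, 0], 1.0)])
pts = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [5.0, 0.0, 0.0]])
inside = shape(pts) < 0  # [True, True, False]
```

The same min-combination applies unchanged when each part is a learned neural implicit rather than an analytic sphere.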
(2) I introduce GEM3D, a two-step 3D shape generation model that first generates a medial skeletal abstraction capturing the shape's coarse structure, and then infers and assembles a collection of locally supported neural implicit functions conditioned on the generated skeleton. This skeleton-based latent grid is more structure-aware than other irregular latent grid approaches, providing more interpretable support for latent codes in 3D space while remaining capable of representing complex, fine-grained topological structures. It also allows editing the resulting surface by manipulating the generated skeleton.
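A minimal sketch of the skeleton-conditioned idea (again not the GEM3D code): each skeleton point carries a latent code, reduced here to a single local radius, so the decoded surface is the envelope of locally supported implicits anchored on the skeleton. Moving skeleton points or changing their codes directly edits the surface:

```python
import numpy as np

def skeletal_implicit(skeleton_pts, radii):
    """Toy skeleton-based implicit: each skeleton point supports a local
    implicit (here a sphere of the given radius); the surface is their
    envelope, so the nearest skeleton point dominates at each query."""
    skeleton_pts = np.asarray(skeleton_pts, dtype=float)
    radii = np.asarray(radii, dtype=float)
    def f(points):
        # pairwise distances: (num_queries, num_skeleton_points)
        d = np.linalg.norm(points[:, None, :] - skeleton_pts[None, :, :], axis=-1)
        return np.min(d - radii[None, :], axis=1)
    return f

# A straight 3-point skeleton decoded into a capsule-like surface.
skel = [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
f = skeletal_implicit(skel, [0.3, 0.3, 0.3])
queries = np.array([[1.0, 0.1, 0.0], [1.0, 1.0, 0.0]])
inside = f(queries) < 0  # [True, False]
```

In the full model the per-point radius is replaced by a learned latent code and the local sphere by a neural decoder, but the skeleton still provides the interpretable spatial support.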
(3) Finally, I propose ShapeWords, an approach for synthesizing images from 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape-guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. I show that ShapeWords produces images that are more text-compliant and aesthetically plausible, while also maintaining 3D shape awareness.
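The token-blending mechanism can be sketched as follows (a hypothetical illustration, not the ShapeWords implementation): a shape encoder maps the target geometry into a few embeddings of the same width as the text tokens, and the concatenated sequence conditions a text-to-image model (not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
TOKEN_DIM = 8          # embedding width shared by text and shape tokens (assumed)
NUM_SHAPE_TOKENS = 2   # number of specialized shape tokens (assumed)

def encode_shape(point_cloud, proj):
    """Toy shape encoder: pool the point cloud to a global feature and
    project it to NUM_SHAPE_TOKENS embeddings of width TOKEN_DIM."""
    feat = point_cloud.mean(axis=0)                 # (3,) global feature
    return (proj @ feat).reshape(NUM_SHAPE_TOKENS, TOKEN_DIM)

proj = rng.normal(size=(NUM_SHAPE_TOKENS * TOKEN_DIM, 3))
text_tokens = rng.normal(size=(5, TOKEN_DIM))       # stand-in text embeddings
shape_tokens = encode_shape(rng.normal(size=(100, 3)), proj)

# Conditioning sequence seen by the image model: text tokens + shape tokens,
# so shape awareness and textual context are blended in one sequence.
conditioning = np.concatenate([text_tokens, shape_tokens], axis=0)  # (7, 8)
```

Because the shape tokens live in the text-embedding space, the downstream generator needs no architectural change to consume them.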
Advisor
Evangelos Kalogerakis