Controllable Generative Models

How do we efficiently control generative models to produce what we want—preserving identity, 3D structure, style—without sacrificing quality?

A generative model that can sample new content is impressive; one that produces exactly what a user has in mind is useful. Controlling generation requires aligning the model's latent structure with axes a person can articulate (identity, pose, style, lighting, geometry) without sacrificing the photorealism that made the model relevant in the first place. In practice, added control tends to cost image quality, so that tradeoff has to be managed.

This thread runs from Youssef Mejjati's PhD work on unsupervised attention for image-to-image translation, through compositional controls (object stamps, GaussiGAN's 3D Gaussian primitives from silhouettes alone), into 3DMM-conditioned face generation where Yiwen Huang's PhD now sits. Two recent moves matter: TaxFreeGAN closes the FID gap to unconditional StyleGAN under 3DMM conditioning, and our disentangling-3D work shows that the noise in CLIP's embedding space—not the disentanglement strategy—is what kills quality. R3GAN sits alongside this arc as our architectural reset: a principled relativistic loss that lets the modern GAN drop its bag of tricks.

Authors

Akin Caliskan · Darren Cosker · Aaron Gokaslan · Yiwen Huang · Berkay Kicanaoglu · Hyeongwoo Kim · Kwang In Kim · Atsunobu Kotani · Volodymyr Kuleshov · Youssef A. Mejjati · Isa Milefchik · Christian Richardt · Zejiang Shen · Michael Snower · Stefanie Tellex · Vikas Thamizharasan · Oliver Wang · Yue Wang · Xinjie Yi · Zhiqiu Yu · Qian Zhang

Papers in this thread

Unsupervised Attention-guided Image-to-Image Translation

Neural Information Processing Systems (NeurIPS), 2018

Jointly trains attention with generators and discriminators so unsupervised image-to-image translation can localise edits to objects without disturbing background or inter-object structure.

Generating Handwriting via Decoupled Style Descriptors

European Conference on Computer Vision (ECCV), 2020

Factors handwriting style into separate character-level and writer-level descriptors, letting the model generate new characters in a held-out writer's hand from only a few samples.

Generating Object Stamps

AI for Content Creation (AI4CC) @ CVPR, 2020

Splits conditional object insertion into a mask generator (shape, given a class and bounding box) and a texture generator (appearance, conditioned on the background), so the inserted object is both diverse in shape and consistent with its surroundings.

GaussiGAN: Controllable Image Synthesis with 3D Gaussians from Unposed Silhouettes

British Machine Vision Conference (BMVC) + AI for Content Creation (AI4CC) @ CVPR, 2021

Learns a coarse 3D object representation as a set of self-supervised anisotropic 3D Gaussians from unposed 2D masks alone, then uses it to drive controllable mask and texture synthesis with interactive posing.

Learning Physically-based Material and Lighting Decompositions for Face Editing

International Conference on Computational Visual Media (CVM), 2022

Estimates per-portrait surface normals, albedo, roughness, and a high-frequency lighting map, and decomposes diffuse and specular reflectance—so a downstream editor can relight a face from a single photograph.

Removing the Quality Tax in Controllable Face Generation

Winter Conference on Applications of Computer Vision (WACV) + AI for Content Creation (AI4CC) @ CVPR, 2024

Mathematically formalises the problem of 3DMM-conditioned face generation, then applies targeted fixes that close the FID gap to unconditional StyleGAN, so controllability no longer costs visible image quality.

Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation

2025

Builds disentangled 3D portrait generation on a frozen CLIP and a FLAME morphable model, then identifies CLIP's noisy embedding directions as the residual source of entanglement and damps them with a stochastic Jacobian regulariser.

The GAN is Dead; Long Live the GAN! A Modern GAN Baseline

Neural Information Processing Systems (NeurIPS), 2024

A regularised relativistic GAN loss with proven local convergence lets a minimalist StyleGAN2-derived architecture—stripped of the usual stabilisation tricks—beat StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST, and compete with diffusion models.
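
As a rough illustration of what "relativistic" means here, the sketch below scores each real sample against a paired fake one, so the discriminator is trained on the difference of the two logits rather than on each logit alone. It is a minimal pure-Python sketch of the general relativistic pairing idea, not the paper's implementation: the function names are hypothetical, the logits are plain floats, and the gradient-penalty regularisation mentioned above is omitted because it needs automatic differentiation.

```python
import math


def softplus(t: float) -> float:
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def _paired(real_logits, fake_logits):
    """Pair each real logit with one fake logit (hypothetical helper)."""
    pairs = list(zip(real_logits, fake_logits))
    if not pairs:
        raise ValueError("need at least one real/fake pair")
    return pairs


def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator loss: small when each real scores above its paired fake."""
    pairs = _paired(real_logits, fake_logits)
    return sum(softplus(f - r) for r, f in pairs) / len(pairs)


def relativistic_g_loss(real_logits, fake_logits):
    """Generator loss: small when each fake scores above its paired real."""
    pairs = _paired(real_logits, fake_logits)
    return sum(softplus(r - f) for r, f in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative logits only; in training these come from the discriminator.
    real = [2.0, 1.5]
    fake = [-2.0, 0.5]
    print("D loss:", relativistic_d_loss(real, fake))
    print("G loss:", relativistic_g_loss(real, fake))
```

Because both losses depend only on the difference between paired logits, adding the same constant to every discriminator output leaves them unchanged, which is the defining property of a relativistic loss.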
