What we’d like to find out about GANs that we don’t know yet.
By some metrics, research on Generative Adversarial Networks (GANs) has progressed substantially in the past 2 years.
Practical improvements to image synthesis models are being made
However, by other metrics, less has happened. For instance, there is still widespread disagreement about how GANs should be evaluated. Given that current image synthesis benchmarks seem somewhat saturated, we think now is a good time to reflect on research goals for this sub-field.
Lists of open problems have helped other fields with this
In addition to GANs, two other types of generative model are currently popular: Flow Models and Autoregressive Models
For concreteness, let’s temporarily focus on the difference in computational cost between GANs and Flow Models.
At first glance, Flow Models seem like they might make GANs unnecessary.
Flow Models allow for exact log-likelihood computation and exact inference,
so if training Flow Models and GANs had the same computational cost, GANs might not be useful.
A lot of effort is spent on training GANs,
so it seems like we should care about whether Flow Models make GANs obsolete
However, there seems to be a substantial gap between the computational cost of training GANs and Flow Models.
To estimate the magnitude of this gap, we can consider two models trained on datasets of human faces.
The GLOW model
Why are the Flow Models less efficient?
We see two possible reasons:
First, maximum likelihood training might be computationally harder to do than adversarial training.
In particular, if any element of your training set is assigned zero probability by your generative model,
you will be penalized infinitely harshly!
A GAN generator, on the other hand, is only penalized indirectly for assigning zero probability to training set elements,
and this penalty is less harsh.
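To make this contrast concrete, here is a minimal sketch in Python. The three-outcome distributions are invented for illustration. The sketch also leans on two standard results that are not spelled out above: maximizing likelihood is equivalent to minimizing KL(data || model), and the original GAN objective with an optimal discriminator corresponds (up to constants) to the Jensen-Shannon divergence. It is a toy comparison of the two penalties, not a model of real training dynamics.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q puts none
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence, which is bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy distributions: the model assigns zero probability
# to the second outcome, which does occur in the data.
data = [0.5, 0.5, 0.0]
model = [1.0, 0.0, 0.0]

likelihood_penalty = kl(data, model)  # infinite
gan_style_penalty = js(data, model)   # finite

print(likelihood_penalty, gan_style_penalty)
```

Under these assumptions, the likelihood-style penalty is infinite as soon as one data point gets zero mass, while the Jensen-Shannon penalty stays finite.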
Second, normalizing flows might be an inefficient way to represent certain functions.
Section 6.1 of
We’ve discussed the trade-off between GANs and Flow Models, but what about Autoregressive Models?
It turns out that Autoregressive Models can be expressed as Flow Models
(because they are both reversible
Model | Parallel | Efficient | Reversible |
---|---|---|---|
GANs | Yes | Yes | No |
Flow Models | Yes | No | Yes |
Autoregressive Models | No | Yes | Yes |
This brings us to our first open problem:
What are the fundamental trade-offs between GANs and other generative models?
In particular, can we make some sort of CAP Theorem
One way to approach this problem could be to study more models that are a hybrid of multiple model families.
This has been considered for hybrid GAN/Flow Models
We’re also not sure whether maximum likelihood training is necessarily harder than GAN training.
It’s true that placing zero mass on a training data point is not explicitly prohibited under the
GAN training loss, but it’s also true that a sufficiently powerful discriminator will be able
to do better than chance if the generator does this.
It does seem like GANs are learning distributions of low support in practice
Ultimately, we suspect that Flow Models are fundamentally less expressive per-parameter than arbitrary decoder functions, and we suspect that this is provable under certain assumptions.
Most GAN research focuses on image synthesis.
In particular, people train GANs on a handful of standard (in the Deep Learning community) image datasets:
MNIST
However, we’ve had to come to these conclusions through the laborious and noisy process of trying to train GANs on ever larger and more complicated datasets. In particular, we’ve mostly studied how GANs perform on the datasets that happened to be lying around for object recognition.
As with any science, we would like to have a simple theory that explains our experimental observations.
Ideally, we could look at a dataset, perform some computations without ever actually
training a generative model, and then say something like ‘this dataset will be
easy for a GAN to model, but not a VAE’.
There has been some progress on this topic
Given a distribution, what can we say about how hard it will be for a GAN to model that distribution?
We might ask the following related questions as well: What do we mean by ‘model the distribution’? Are we satisfied with a low-support representation, or do we want a true density model? Are there distributions that a GAN can never learn to model? Are there distributions that are learnable for a GAN in principle, but are not efficiently learnable, for some reasonable model of resource-consumption? Are the answers to these questions actually any different for GANs than they are for other generative models?
We propose two strategies for answering these questions:
Aside from applications like image-to-image translation
Despite these attempts, images are clearly the easiest domain for GANs. This leads us to the statement of the problem:
How can GANs be made to perform well on non-image data?
Does scaling GANs to other domains require new training techniques, or does it simply require better implicit priors for each domain?
We expect that GANs will eventually achieve image-synthesis-level success on other continuous data, but that doing so will require better implicit priors. Finding these priors will require thinking hard about what makes sense and is computationally feasible in a given domain.
For structured data or data that is not continuous, we’re less sure.
One approach might be to make both the generator and discriminator
be agents trained with reinforcement learning. Making this approach work could require
large-scale computational resources
Training GANs is different from training other neural networks because we simultaneously optimize
the generator and discriminator for opposing objectives.
Under certain assumptions
Unfortunately, it’s hard to prove interesting things about the fully general case. This is because the discriminator/generator’s loss is a non-convex function of its parameters. But all neural networks have this problem! We’d like some way to focus on just the problems created by simultaneous optimization. This brings us to our question:
When can we prove that GANs are globally convergent?
Which neural network convergence results can be applied to GANs?
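To isolate the difficulty that comes from simultaneous optimization alone, it can help to look at a toy problem with no non-convexity at all. The sketch below is our own illustration, not a GAN: it uses the bilinear game f(x, y) = x * y, where one player minimizes over x and the other maximizes over y. The step size and starting point are arbitrary assumptions.

```python
import math

def simultaneous_gradient_steps(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y.

    Returns the distance from the equilibrium (0, 0) after every step.
    """
    norms = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # df/dx = y, df/dy = x
        # Both players update at once, each using the other's old value.
        x, y = x - lr * grad_x, y + lr * grad_y
        norms.append(math.hypot(x, y))
    return norms

norms = simultaneous_gradient_steps(1.0, 1.0)
print(norms[0], norms[-1])
```

In this toy game each simultaneous step multiplies the distance from the equilibrium by sqrt(1 + lr^2), so the iterates spiral outward even though each player's own objective is linear in its parameter.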
There has been nontrivial progress on this question. Broadly speaking, there are three existing techniques, all of which have generated promising results but none of which has been studied to completion:
When it comes to evaluating GANs, there are many proposals but little consensus. Suggestions include:
Those are just a small fraction of the proposed GAN evaluation schemes. Although the Inception Score and FID are relatively popular, GAN evaluation is clearly not a settled issue. Ultimately, we think that confusion about how to evaluate GANs stems from confusion about when to use GANs. Thus, we have bundled those two questions into one:
When should we use GANs instead of other generative models?
How should we evaluate performance in those contexts?
What should we use GANs for?
If you want an actual density model, GANs probably aren’t the best choice.
There is now good experimental evidence that GANs learn a ‘low support’ representation of the target dataset
Rather than worrying too much about this,
How should we evaluate GANs on these perceptual tasks?
Ideally, we would just use a human judge, but this is expensive.
A cheap proxy is to see if a classifier can distinguish between real and fake examples.
This is called a classifier two-sample test (C2ST).
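As a rough illustration of the idea, here is a minimal sketch of such a test in Python. It assumes samples are tuples of floats and uses a 1-nearest-neighbour classifier purely for simplicity; the function name `c2st_accuracy` and the Gaussian toy data are ours, and a real evaluation would use a stronger classifier on real and generated images.

```python
import random

def c2st_accuracy(real, fake, seed=0):
    """Held-out accuracy of a 1-nearest-neighbour classifier on real vs. fake.

    Accuracy near 0.5 means the classifier cannot tell the two sample sets
    apart; accuracy near 1.0 means they are easy to distinguish.
    """
    rng = random.Random(seed)
    data = [(x, 1) for x in real] + [(x, 0) for x in fake]
    rng.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    correct = 0
    for x, label in test:
        # Predict the label of the closest training point (squared Euclidean distance).
        nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))
        correct += nearest[1] == label
    return correct / len(test)

# Toy data: "real" samples, a second draw from the same distribution,
# and a clearly shifted distribution standing in for a poor generator.
rng = random.Random(1)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
same = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
shifted = [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(200)]

acc_same = c2st_accuracy(real, same)
acc_shifted = c2st_accuracy(real, shifted)
print(acc_same, acc_shifted)
```

In this sketch the score for the matching distribution should hover around chance, while the shifted one should be classified almost perfectly.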
Ideally, we’d have a holistic evaluation that isn’t dominated by a single factor.
One approach might be to make a critic that is blind to the dominant defect.
But once we do this, some other defect may dominate, requiring a new critic, and so on.
If we do this iteratively, we could get a kind of ‘Gram-Schmidt procedure for critics’,
creating an ordered list of the most important defects and
critics that ignore them.
Perhaps this can be done by performing PCA on the critic activations
Finally, we could evaluate with human judges despite the expense.
This would allow us to measure the thing that we actually care about.
This kind of approach can be made less expensive by predicting human answers and only interacting with a real human when
the prediction is uncertain
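One way to picture this is the sketch below. Everything in it is hypothetical: `predict_prob_real` stands in for a learned predictor of the human's answer, `ask_human` stands in for a real human query, and the uncertainty margin of 0.2 is an arbitrary choice.

```python
def evaluate_with_selective_human(samples, predict_prob_real, ask_human, margin=0.2):
    """Label samples as real-looking (True) or not (False).

    The predictor is trusted when it is confident; the (expensive) human
    is consulted only when the predicted probability is within `margin` of 0.5.
    Returns the labels and the number of human queries made.
    """
    labels, queries = [], 0
    for sample in samples:
        p = predict_prob_real(sample)
        if abs(p - 0.5) < margin:
            labels.append(ask_human(sample))  # uncertain: fall back to the human
            queries += 1
        else:
            labels.append(p >= 0.5)  # confident: use the prediction
    return labels, queries

# Toy usage: each "sample" is just its own predicted probability,
# and the stand-in human answers True for anything above 0.5.
samples = [0.05, 0.45, 0.55, 0.95]
labels, queries = evaluate_with_selective_human(
    samples,
    predict_prob_real=lambda s: s,
    ask_human=lambda s: s > 0.5,
)
print(labels, queries)
```

Here only the two borderline samples reach the human, which is the source of the cost saving.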
Large minibatches have helped to scale up image classification
At first glance, it seems like the answer should be yes: after all, the discriminator in most GANs is just an image classifier. Larger batches can accelerate training if it is bottlenecked by gradient noise. However, GANs have a separate bottleneck that classifiers don’t: the training procedure can diverge. Thus, we can state our problem:
How does GAN training scale with batch size?
How big a role does gradient noise play in GAN training?
Can GAN training be modified so that it scales better with batch size?
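For intuition about the gradient-noise part of the question, here is a small sketch under simplifying assumptions: per-example gradients are modeled as scalar draws from a fixed Gaussian (a made-up stand-in, not measured from any GAN), and a minibatch gradient is their mean. It only illustrates how the noise in the estimate shrinks with batch size; it says nothing about divergence.

```python
import random
import statistics

def minibatch_gradient_variance(per_example_grads, batch_size, n_batches=2000, seed=0):
    """Empirical variance of the minibatch-mean gradient for a given batch size."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.sample(per_example_grads, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.variance(estimates)

# Hypothetical per-example gradients: true gradient 1.0 plus noise with std 2.0.
rng = random.Random(0)
grads = [rng.gauss(1.0, 2.0) for _ in range(10_000)]

var_small = minibatch_gradient_variance(grads, batch_size=4)
var_large = minibatch_gradient_variance(grads, batch_size=64)
print(var_small, var_large)
```

In this idealized setting the variance falls roughly in proportion to 1 / batch size, which is why larger batches help when noise is the bottleneck.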
There’s some evidence that increasing minibatch size improves quantitative results and reduces training time
Can alternative training procedures make better use of large batches?
Optimal Transport GANs
Finally, asynchronous SGD
It’s well known
Since the GAN discriminator is an image classifier, one might worry about it suffering from adversarial examples.
Despite the large bodies of literature on GANs and adversarial examples,
there doesn’t seem to be much work on how they relate.
How does the adversarial robustness of the discriminator affect GAN training?
How can we begin to think about this problem? Consider a fixed discriminator D. An adversarial example for D would exist if there were a generator sample G(z) correctly classified as fake and a small perturbation p such that G(z) + p is classified as real. With a GAN, the concern would be that the gradient update for the generator would yield a new generator G’ where G’(z) = G(z) + p.
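The definition above can be made concrete with a toy example. The sketch below assumes a fixed linear-logistic discriminator with made-up weights, and it picks the perturbation p with a sign-of-the-gradient step (in the style of the fast gradient sign method), which is our choice for illustration and is not prescribed by the text.

```python
import math

def discriminator(x, w, b):
    """Toy fixed discriminator D(x) = sigmoid(w . x + b); output > 0.5 means 'real'."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def sign(v):
    return (v > 0) - (v < 0)

# Hypothetical discriminator parameters and generator sample G(z).
w, b = [1.0, -2.0], -0.1
g_z = [0.0, 0.0]

# Small perturbation p: for a linear logit, the gradient with respect to x is w,
# so stepping along sign(w) raises D as fast as possible per unit of max-norm.
eps = 0.1
p = [eps * sign(wi) for wi in w]
perturbed = [xi + pi for xi, pi in zip(g_z, p)]

d_before = discriminator(g_z, w, b)
d_after = discriminator(perturbed, w, b)
print(d_before, d_after)
```

In this toy case G(z) is classified as fake, while G(z) + p, which differs by at most 0.1 per coordinate, is classified as real.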
Is this concern realistic?
We would like to thank Colin Raffel, Ben Poole, Eric Jang, Dustin Tran, Alex Kurakin, David Berthelot, Aurko Roy, Ian Goodfellow, and Matt Hoffman for helpful discussions and feedback. We would especially like to single out Chris Olah, who provided substantial feedback on the text and help with editing.