1 Introduction
In generative adversarial networks (GANs) (Goodfellow et al., 2014) a generator network is trained to produce samples from a given target distribution. To achieve this, a discriminator network is employed to distinguish between “real” samples from the dataset and “fake” samples from the generator network. The discriminator’s feedback is used by the generator to improve its output. While GANs have become highly effective at synthesising realistic examples even for complex data such as natural images (Radford et al., 2015; Karras et al., 2018), they typically rely on large training datasets, which are not available for many tasks. So far, it remains unclear how incomplete observations could be used for training, which further limits the amount of available data especially in applications such as image segmentation (Cordts et al., 2016) or audio source separation (Stoller et al., 2018), where annotated examples are rare compared to individual input or output examples. Furthermore, large discriminator networks operating on the joint distribution are more difficult to train, as they have to consider dependencies between all input dimensions (Karaletsos, 2016).
In this paper, we adapt the standard GAN framework to enable training from incomplete observations. To achieve this, we split the discriminator into multiple “marginal” discriminators, each modelling a separate set of dimensions of the input. As this modification on its own would ignore any dependencies between these parts, we incorporate two additional “dependency discriminators”, each focusing only on inter-part relationships. We show how the outputs from these marginal and dependency discriminators can be recombined and used to estimate the same density ratios as in the original GAN framework, which enables training any generator network in an unmodified form. In contrast to previous GANs, however, our approach only requires full observations to train the smaller dependency model and can leverage much bigger, simpler datasets to train the marginal discriminators, which enables the generator to model the marginal distributions more accurately. Additionally, prior knowledge about the marginals and dependencies can be incorporated into the architecture of each discriminator. Compared to other approaches that rely on imputation models to handle incomplete observations (Pu et al., 2018; Yoon et al., 2018), our approach is designed for cases where the pattern of missing data is known, which enables us to construct a factorisation scheme that completely eliminates any overlap between sub-discriminators. This way, subcomponents in our approach require considerably less capacity, which further limits the need for large datasets. Finally, our approach can be extended to the conditional generation setting in a straightforward way.
In our experiments, we apply our approach (“FactorGAN”; implementation available at https://github.com/f90/FactorGAN) to two image generation tasks (Sections 4.1 and 4.2), image segmentation (Section 4.3) and audio source separation (Section 4.4), and observe improved performance in missing data scenarios compared to a GAN. For image segmentation, we also compare to the CycleGAN (Zhu et al., 2017), which does not require images to be paired with their segmentation maps. However, it relies on an additional loss, which assumes a one-to-one mapping between inputs and outputs and needs to be balanced against the GAN loss via a hyperparameter. FactorGAN instead learns a probabilistic mapping from inputs to outputs from a mixture of paired and unpaired examples using a single adversarial objective with a known optimal solution for the generator, and reaches a much higher segmentation accuracy even with only 25 paired samples.
2 Method
After a brief summary of GANs in Section 2.1, we introduce our method to learn from missing data in Section 2.2, and present variants for conditional generation (2.3) and independent outputs (2.4).
2.1 Generative adversarial networks
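To model a probability distribution $p_x$ over $x \in \mathbb{R}^d$, we follow the standard GAN framework and introduce a generator model $G_\theta$ that maps an $n$-dimensional input $z \sim p_z$ to a $d$-dimensional sample $G_\theta(z)$, resulting in the generator distribution $q_x$. To train $G_\theta$ such that $q_x$ approximates the real data density $p_x$, a discriminator $D_\phi$ is trained to estimate whether a given sample is real or generated:
$$\arg\max_{\phi} \; \mathbb{E}_{x \sim p_x} \log D_\phi(x) + \mathbb{E}_{x \sim q_x} \log\left(1 - D_\phi(x)\right) \qquad (1)$$
In the non-parametric limit (Goodfellow et al., 2014), $D_\phi(x)$ approaches $\tilde{D}(x) := \frac{p_x(x)}{p_x(x) + q_x(x)}$ at every point $x$. The generator is updated based on the discriminator’s estimate of $\tilde{D}$. In this paper, we use the alternative loss function for $G_\theta$ as proposed by Goodfellow et al. (2014):
$$\arg\max_{\theta} \; \mathbb{E}_{z \sim p_z} \log D_\phi(G_\theta(z)) \qquad (2)$$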
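As a minimal sketch of the two objectives in Equations (1) and (2), the following estimates them from lists of discriminator outputs. The function names and the plain-list inputs are illustrative assumptions, not part of the FactorGAN implementation.

```python
import math


def discriminator_objective(d_real, d_fake):
    """Monte Carlo estimate of Equation (1), to be maximised by the discriminator.

    d_real: discriminator outputs in (0, 1) for real samples.
    d_fake: discriminator outputs in (0, 1) for generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_objective(d_fake):
    """Monte Carlo estimate of Equation (2), the alternative generator loss.

    The generator maximises the mean of log D(G(z)) over generated samples.
    """
    return sum(math.log(d) for d in d_fake) / len(d_fake)


def optimal_discriminator(p, q):
    """Discriminator output in the non-parametric limit: p / (p + q)."""
    return p / (p + q)
```

For example, at a point where the real density is three times the generator density, `optimal_discriminator(0.3, 0.1)` evaluates to 0.75.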
2.2 Adaptation to missing data
In the following we consider the case that incomplete observations are available in addition to our regular dataset (i.e. simpler yet larger datasets). In particular, we partition the set of input dimensions of $x$ into $K$ ($K \le d$) non-overlapping subsets $\mathcal{D}_1, \ldots, \mathcal{D}_K$. For each $i \in \{1, \ldots, K\}$, an incomplete (“marginal”) observation $x^i$ can be drawn from $p_x^i$, which is obtained from $p_x$ after marginalising out all dimensions not in $\mathcal{D}_i$. Analogously, $q_x^i$ denotes the $i$-th marginal distribution of the generator $G_\theta$. Next, we extend the existing GAN framework such that we can employ the additional incomplete observations. In this context, a main hurdle is that a standard GAN discriminator is trained with samples from the full joint $p_x$. To eliminate this restriction, we note that $\tilde{D}(x) = \frac{p_x(x)}{p_x(x) + q_x(x)}$ can be mapped to a “joint density ratio” $h(x) := \frac{p_x(x)}{q_x(x)}$ by applying the bijective function $a \mapsto \frac{a}{1 - a}$. For our approach, we exploit that this joint density ratio can be factorised into a product of density ratios:
$$h(x) = \frac{p_x(x)}{q_x(x)} = \frac{c_P(x)}{c_Q(x)} \prod_{i=1}^{K} \frac{p_x^i(x^i)}{q_x^i(x^i)}, \quad \text{with } c_P(x) = \frac{p_x(x)}{\prod_{i=1}^{K} p_x^i(x^i)}, \; c_Q(x) = \frac{q_x(x)}{\prod_{i=1}^{K} q_x^i(x^i)} \qquad (3)$$
Each “marginal density ratio” $\frac{p_x^i(x^i)}{q_x^i(x^i)}$ captures the generator’s output quality for one marginal variable $x^i$, while the $c_P$ and $c_Q$ terms describe the dependency structure between marginal variables in the real and generated distribution, respectively. We can estimate each density ratio independently by training a “sub-discriminator” network, and combine their outputs for an estimate of $h(x)$, as we will show in the following.
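The factorisation of the joint density ratio into marginal ratios and dependency terms can be checked numerically on a small discrete example. The sketch below is only an illustration under assumed settings (two discrete parts, random probability tables); none of the names come from the original implementation.

```python
import random


def random_joint(n_a, n_b, rng):
    """Random strictly positive joint probability table over two discrete parts."""
    weights = [[rng.random() + 0.1 for _ in range(n_b)] for _ in range(n_a)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def marginals(joint):
    """Marginal distributions of the first and second part."""
    first = [sum(row) for row in joint]
    second = [sum(col) for col in zip(*joint)]
    return first, second


def dependency_term(joint):
    """Joint divided by the product of its marginals, for every cell."""
    first, second = marginals(joint)
    return [[joint[a][b] / (first[a] * second[b]) for b in range(len(second))]
            for a in range(len(first))]


rng = random.Random(0)
p = random_joint(3, 4, rng)  # stands in for the real distribution
q = random_joint(3, 4, rng)  # stands in for the generator distribution
p1, p2 = marginals(p)
q1, q2 = marginals(q)
c_p, c_q = dependency_term(p), dependency_term(q)

# Compare the direct ratio p/q with the factorised form in every cell.
max_error = 0.0
for a in range(3):
    for b in range(4):
        direct = p[a][b] / q[a][b]
        factorised = (c_p[a][b] / c_q[a][b]) * (p1[a] / q1[a]) * (p2[b] / q2[b])
        max_error = max(max_error, abs(direct - factorised))

print(max_error)  # differs from zero only by floating-point rounding
```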
Estimating the marginal density ratios:
To estimate $\frac{p_x^i(x^i)}{q_x^i(x^i)}$ for each $i$, we train a “marginal discriminator network” $D_{\phi_i}$ with parameters $\phi_i$ to determine whether a marginal sample $x^i$ is real or generated following the GAN discriminator loss in Equation (1), with samples drawn from $p_x^i$ and $q_x^i$ instead of $p_x$ and $q_x$, respectively. This allows making use of the additional incomplete observations. In the non-parametric limit, $D_{\phi_i}(x^i)$ will approach $\frac{p_x^i(x^i)}{p_x^i(x^i) + q_x^i(x^i)}$, so that we can use $\frac{D_{\phi_i}(x^i)}{1 - D_{\phi_i}(x^i)}$ as an estimate of $\frac{p_x^i(x^i)}{q_x^i(x^i)}$.
Estimation of $c_P(x)$ and $c_Q(x)$:
Note that $c_P(x)$ and $c_Q(x)$ are also density ratios, this time containing a distribution over $x$ in both the numerator (the joint) and the denominator (the product of its marginals), the main difference being that in the latter the individual parts are independent from each other. To approximate the ratio $c_P(x)$, we can apply the same principles as above and train a “p-dependency discriminator” $D^P_{\phi_P}$ to distinguish samples from the two distributions, i.e. to discriminate real joint samples from samples where the individual parts are real but were drawn independently of each other (i.e. the individual parts might not originate from the same real joint sample). Again, in the non-parametric limit, its response approaches $\frac{p_x(x)}{p_x(x) + \prod_{i=1}^{K} p_x^i(x^i)}$ and thus $c_P(x)$ can be approximated via $\frac{D^P_{\phi_P}(x)}{1 - D^P_{\phi_P}(x)}$. Analogously, the $c_Q(x)$ term is estimated with a “q-dependency discriminator” $D^Q_{\phi_Q}$; here, we compare joint generator samples with samples where the individual parts were shuffled across several generated samples (to implement the independence assumption).
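The shuffling step used to create "independent" samples for a dependency discriminator can be sketched as below. This is a hedged illustration: the batch layout (a list of tuples, one entry per part) and the function name `shuffle_parts` are assumptions, and with a finite batch some shuffled samples may by chance keep their original partner.

```python
import random


def shuffle_parts(batch, rng):
    """Permute every part independently across the batch.

    batch: list of samples, each a tuple (part_1, ..., part_K).
    Returns a new list in which the k-th parts are a random permutation of
    the original k-th parts, so each marginal is preserved while the
    pairing between parts is broken.
    """
    num_parts = len(batch[0])
    columns = []
    for k in range(num_parts):
        column = [sample[k] for sample in batch]
        rng.shuffle(column)  # in-place permutation of the k-th parts
        columns.append(column)
    return [tuple(parts) for parts in zip(*columns)]


rng = random.Random(0)
joint_batch = [("upper%d" % i, "lower%d" % i) for i in range(4)]
independent_batch = shuffle_parts(joint_batch, rng)
```

The joint batch would serve as the "real" class and the shuffled batch as the "fake" class when training the corresponding dependency discriminator.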
Joint discriminator sample complexity:
In contrast to $c_Q(x)$, where the generator provides an infinite number of samples, estimating $c_P(x)$ without overfitting to the limited number of joint training samples can be challenging. While standard GANs suffer from the same difficulty, our factorisation into specialised sub-units allows for additional opportunities to improve the sample complexity. In particular, we can design the architecture of the p-dependency discriminator to incorporate prior knowledge about the dependency structure. For example, if only certain features of a marginal variable influence the dependencies, we can limit the input to the p-dependency discriminator to these features instead of the full marginal sample to prevent overfitting.
Combining the discriminators:
As the marginal and the p- and q-dependency sub-discriminators provide estimates of their respective density ratios, we can multiply them and apply the inverse mapping $h \mapsto \frac{h}{1 + h}$ to obtain the desired ratio $\frac{p_x(x)}{p_x(x) + q_x(x)}$, following Equation (3). We describe a numerically stable and simple implementation in the supplementary material, involving only a linear combination of pre-activation sub-discriminator outputs followed by a sigmoid (see Section 6.4 for details and proof). The time for a generator update step grows linearly with the number of marginals $K$, assuming the time to update each of the marginal discriminators remains constant.
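The linear combination of pre-activation outputs can be sketched as follows. The sketch assumes every sub-discriminator ends in a sigmoid, so that in the non-parametric limit its pre-activation (logit) equals the log of the density ratio it estimates; the function names are hypothetical and the code is not taken from the supplementary material.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def combined_discriminator(marginal_logits, p_dep_logit, q_dep_logit):
    """Joint discriminator output from sub-discriminator pre-activations.

    Multiplying density ratios corresponds to adding their logs, and dividing
    by the q-dependency term corresponds to subtracting its logit. The final
    sigmoid maps the log of the joint ratio h to h / (1 + h).
    """
    joint_logit = sum(marginal_logits) + p_dep_logit - q_dep_logit
    return sigmoid(joint_logit)


# Example with assumed ratios: marginals 2.0 and 0.5, p-term 3.0, q-term 1.5,
# so the joint ratio is 2.0 * 0.5 * 3.0 / 1.5 = 2.0.
output = combined_discriminator([math.log(2.0), math.log(0.5)],
                                math.log(3.0), math.log(1.5))
```

Working in logit space avoids forming the individual ratios explicitly, which could overflow when a sub-discriminator is very confident.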
2.3 Adaptation to conditional generation
Conditional generation, such as image segmentation or inpainting, can be performed with GANs by using a generator $G_\theta$ that maps a conditional input $x^1$ and noise to an output $x^2$, resulting in an output density $q_\theta(x^2 \mid x^1)$. We can view $x^1$ and $x^2$ as parts of a joint variable $x$ with distribution $p_x$, which leads to the equivalent task of matching $p_x$ to the joint generator distribution $q_x(x) := q_\theta(x^2 \mid x^1)\, p_x^1(x^1)$. In a conditional GAN, the discriminator needs to distinguish between joint samples from $p_x$ and $q_x$, which requires “paired” samples from $p_x$ and is inefficient as the inputs $x^1$ are the same in both $p_x$ and $q_x$. In contrast, applying our factorisation principle from Equation (3) to $x^1$ and $x^2$ yields
$$\frac{p_x(x)}{q_x(x)} = \frac{c_P(x)}{c_Q(x)} \, \frac{p_x^1(x^1)}{p_x^1(x^1)} \, \frac{p_x^2(x^2)}{q_x^2(x^2)} = \frac{c_P(x)}{c_Q(x)} \, \frac{p_x^2(x^2)}{q_x^2(x^2)}, \qquad (4)$$
suggesting the use of a p- and a q-dependency discriminator to model the input-output relationship, and a marginal discriminator over $x^2$ that matches aggregate generator predictions from $q_x^2$ to real output examples from $p_x^2$. Note that we do not need a marginal discriminator for $x^1$, which increases computational efficiency. This adaptation can also involve additionally partitioning $x^2$ into multiple partial observations as shown in Equation (3).
2.4 Adaptation to independent marginals
In case the marginals can be assumed to be completely independent, one can remove the p-dependency discriminator from our framework, since $c_P(x) = 1$ for all inputs $x$. This approach can be useful in the conditional setting, when each output is related to the input but their marginals are independent from each other. In this context, our approach is related to adversarial ICA (Brakel and Bengio, 2017). Note that the q-dependency discriminator still needs to be trained on the full generator outputs if the generator should not introduce unwanted dependencies between the marginals.
2.5 Further extensions
There are many more ways of partitioning the joint distribution into marginals. We discuss two additional variants of our approach (hierarchical and autoregressive FactorGANs) in Section 6.3 of the supplementary material.
3 Related work
Yoon et al. (2018) randomly mask the inputs to a GAN generator so it learns to impute missing values, but not to generate joint observations from scratch like the FactorGAN. Pu et al. (2018) use GANs for joint distribution modelling by training a generator for each possible factorisation of the joint distribution, which enables flexible missing data imputation. However, the number of generators required grows with the number of marginals $K$ into which the joint is partitioned, so the approach is prohibitively slow for large $K$. In contrast, we assume either all parts or exactly one part of the variable of interest is observed, allowing the discriminators to be factorised without introducing functional redundancies between individual parts that create computational overhead. Karaletsos (2016) proposes adversarial inference on local factors of a high-dimensional joint distribution and factorises both generator and discriminator based on independence assumptions given by a Bayesian network, whereas we keep a joint sample generator and model all dependencies.
While our approach is not limited to conditional generation, we briefly review related approaches in the following. The “CycleGAN” (Zhu et al., 2017) exploits unpaired samples by assuming a one-to-one mapping between the domains and using bidirectional generators (along with Gan et al. (2017)), while FactorGAN makes no such assumptions and instead uses paired examples to learn the dependency structure. Brakel and Bengio (2017) perform independent component analysis in an adversarial fashion, using a discriminator to identify correlations, similarly to a q-dependency discriminator, to enforce the separator outputs to be independent. While similar, our method is fully adversarial and extends this framework with a p-dependency discriminator to enable modelling of dependencies. For audio source separation, GANs have been used to match the outputs of a source separation model to real source signals, but source dependencies were either ignored (Zhang et al., 2017) or modelled with an additional supervised mean squared error loss (Stoller et al., 2018), which lacks a unified objective with known optimal solution as provided by the FactorGAN framework.
4 Experiments
To validate our method, we compare our FactorGAN with the regular GAN approach, both for unsupervised generation as well as supervised prediction tasks. To investigate whether FactorGAN can make use of additional partial observations, we vary the proportion of the training samples available for joint sampling (paired), while using the rest to sample from the marginals (unpaired). We train all models using a single NVIDIA GTX 1080 GPU. The code to reproduce all experiments can be found in the supplementary material.
Training procedure
For stable training, we employ spectral normalisation (Miyato et al., 2018) on each discriminator network to ensure they satisfy a Lipschitz condition. Since the overall output used for training the generator is simply a linear combination of the individual discriminators (see Section 6.4 in the supplementary material), the generator gradients are also constrained in magnitude accordingly. Unless otherwise noted, we use an Adam optimiser with learning rate and a batch size of for training all models. We perform two discriminator updates after each generator update.
4.1 Paired MNIST
Our first experiment involves “Paired MNIST”, a synthetic dataset of low complexity whose dependencies between marginals can be easily controlled. More precisely, we generate a paired version of the original MNIST dataset (http://yann.lecun.com/exdb/mnist/) by creating samples that contain a pair of vertically stacked digit images. With a probability of $\lambda$, the lower digit chosen during random generation is the same as the upper one, and different otherwise. For FactorGAN, we model the distributions of upper and lower digits as individual marginal distributions ($K = 2$).
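The label-pair sampling described above might look like the following sketch. The parameter name `p_same`, the uniform choice of the upper digit, and the uniform choice among the remaining digits in the "different" case are assumptions made for illustration; the text only states that the lower digit matches the upper one with a given probability and differs otherwise.

```python
import random


def sample_digit_pair(p_same, rng, num_classes=10):
    """Draw (upper, lower) digit labels for one Paired MNIST sample.

    p_same: probability that the lower digit equals the upper digit.
    """
    upper = rng.randrange(num_classes)  # assumed: upper digit chosen uniformly
    if rng.random() < p_same:
        lower = upper
    else:
        # assumed: lower digit chosen uniformly among all other digits
        lower = rng.choice([d for d in range(num_classes) if d != upper])
    return upper, lower


rng = random.Random(0)
pairs = [sample_digit_pair(0.9, rng) for _ in range(5)]
```

Each label pair would then be turned into an image by stacking one MNIST image of the upper class on top of one image of the lower class.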
Experimental setup
We compare the normal GAN with our FactorGAN, also including a variant without p-dependency discriminator that assumes marginals to be independent (“FactorGAN-no-cp”). We conduct the experiment with two settings of $\lambda$ and also vary the amount of training samples available in paired form, while keeping the others as marginal samples only usable by FactorGAN. For both generators and discriminators, we used simple multi-layer perceptrons (MLPs) (Tables 1 and 2, see supplementary material).
To evaluate the quality of generated digits, we adopt the “Fréchet Inception Distance” (FID) as metric (Heusel et al., 2017). It is based on estimating the distance between the distributions of hidden layer activations of a pre-trained ImageNet object detection model for real and fake examples. To adapt the metric to MNIST data, we pre-train a classifier to predict MNIST digits (see Table 3 in supplementary material) on the training set for epochs, obtaining a test accuracy of . We input the top and bottom digits in each sample separately to the classifier and collect the activations from the last hidden layer (FC1) to compute FIDs for the top and bottom digits, respectively. We use the average of both FIDs to measure the overall output quality of the marginals (lower value is better).
Since the only dependencies in the data are digit correlations controlled by $\lambda$, we can evaluate how well FactorGAN models these dependencies. We compute $p_D(i, j)$ as the probability for a real sample to have digit $i$ at the top and digit $j$ at the bottom, along with marginal probabilities $p_D^t(i)$ and $p_D^b(j)$ (and analogously $q_D$ for generated data). Since we do not have ground truth digit labels for the generated samples, we instead use the class with highest probability according to the pre-trained classifier. We encode the dependency as a ratio between a joint and the product of its marginals, where the ratios for real and generated data are ideally the same. Therefore, we take their absolute difference for all digit combinations as evaluation metric (lower is better):
(5) 
Note that the metric computes how well dependencies in the real data are modelled by a generator, but not whether it introduces any additional unwanted dependencies such as top and bottom digits sharing stroke thickness, and thus presents only a necessary condition for a good generator.
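A rough sketch of such a dependency metric is given below. It assumes that the absolute differences are averaged over all digit combinations and that every marginal probability is non-zero; both are assumptions for illustration, and the exact normalisation used in Equation (5) is not restated here. Function names are hypothetical.

```python
def dependency_ratios(joint):
    """Joint probability divided by the product of its marginals, per cell.

    joint[i][j]: probability of digit i at the top and digit j at the bottom.
    Assumes all marginal probabilities are strictly positive.
    """
    top = [sum(row) for row in joint]
    bottom = [sum(col) for col in zip(*joint)]
    return [[joint[i][j] / (top[i] * bottom[j]) for j in range(len(bottom))]
            for i in range(len(top))]


def dependency_distance(real_joint, generated_joint):
    """Mean absolute difference between real and generated dependency ratios."""
    real = dependency_ratios(real_joint)
    fake = dependency_ratios(generated_joint)
    diffs = [abs(r - f)
             for real_row, fake_row in zip(real, fake)
             for r, f in zip(real_row, fake_row)]
    return sum(diffs) / len(diffs)  # assumed: average over all combinations


# Two-class toy example: perfectly correlated real data, independent generator.
real = [[0.5, 0.0], [0.0, 0.5]]
generated = [[0.25, 0.25], [0.25, 0.25]]
distance = dependency_distance(real, generated)
```

In practice the joint tables would be estimated by counting classifier predictions for the top and bottom digits over many real and generated samples.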
Results
The results of our experiment are shown in Figure 1. Since FactorGAN-no-cp trains on all samples independently of the number of paired observations, both FID and the dependency metric are constant. As expected, FactorGAN-no-cp delivers good digit quality. With regards to dependency modelling, it performs well when the digits are independent (as it assumes independence) and badly otherwise.
FactorGAN outperforms GAN with small numbers of paired samples in terms of FID by exploiting the additional unpaired samples, although this gap closes as both models eventually have access to the same amount of data. FactorGAN also consistently improves in modelling the digit dependencies with an increasing number of paired observations. For , this also applies to the normal GAN, although its performance is much worse for small sample sizes as it introduces unwanted digit dependencies. Additionally, its performance appears unstable for , where it achieves the best results for a small number of paired examples. Further improvements in this setting could be gained by incorporating prior knowledge about the nature of these dependencies into the pdependency discriminator to increase its sample efficiency, but this is left for future work.
4.2 Image pair generation
In this section, we use GAN and FactorGAN for generating pairs of images in an unsupervised way to evaluate how well FactorGAN models more complex data distributions.
Datasets
We use the “Cityscapes” dataset (Cordts et al., 2016) and the “Edges2Shoes” dataset (Isola et al., 2017). To keep the outputs in a continuous domain, we treat the segmentation maps in the Cityscapes dataset as RGB images, instead of a set of discrete categorical labels. Each input and output image is downsampled to pixels as a preprocessing step to reduce computational complexity and to ensure stable GAN training.
Experimental setup
We define the distributions of input as well as output images as marginal distributions. Therefore, FactorGAN uses two marginal discriminators and a p and qdependency discriminator. All discriminators employ a convolutional architecture shown in Table 5 with and (see supplementary material). To control for the impact of discriminator size, we also train a GAN with twice the number of filters in each discriminator layer to match its size with the combined size of the FactorGAN discriminators. The same convolutional generator shown in Table 4 in the supplementary material is used for GAN and FactorGAN. Each image pair is concatenated along the channel dimension to form one sample, so that for the Cityscapes and for the Edges2Shoes dataset (since edge maps are greyscale). We make either , , or all training samples available in paired form, to investigate whether FactorGAN can improve upon GAN by exploiting the remaining unpaired samples or match its quality if there are none.
For evaluation, we randomly assign of validation data to a “test-train” and the rest to a “test-test” partition. We train an LSGAN discriminator (Mao et al., 2017) with the architecture shown in Table 5 in the supplementary material (but half the filters in each layer) on the test-train partition for epochs to distinguish real from generated samples, before measuring its loss on the test-test partition. We continuously sample from the generator during training and testing instead of using a fixed set of samples to better approximate the true generator distribution. As evaluation metric, we use the average test loss over training runs, which was shown to correlate with subjective ratings of visual quality (Im et al., 2018) and also with our own quality judgements throughout this study. A larger value indicates better performance, as we use a flipped sign compared to Im et al. (2018). While the quantitative results appear indicative of output quality, accurate GAN evaluation is still an open problem and so we encourage the reader to judge generated examples in the supplementary material.
Results
Our FactorGAN achieves better or similar output quality compared to the GAN baseline in all cases, as seen in Figure 2. For the Edges2Shoes dataset, the performance gains are most pronounced for small numbers of paired samples. On the more complex Cityscapes dataset, FactorGAN outperforms GAN by a large margin independent of training set size, even when the discriminators are closely matched in size. This suggests that FactorGAN converges with fewer training iterations for , although the exact cause is unclear and should be investigated in future work.
We show some generated examples in Figure 3. Due to the small number of available paired samples, we observe a strong mode collapse of the GAN in Figure 3(a), while FactorGAN provides high-fidelity, diverse outputs, as shown in Figure 3(b). Similar observations can be made for the Cityscapes dataset when using 100 paired samples (see Section 6.5.2 in supplementary material).
4.3 Image segmentation
Our approach extends to the case of conditional generation (see Section 2.3), so we tackle a complex and important image segmentation task on the Cityscapes dataset, where we ask the generator to predict a segmentation map for a city scene (instead of generating both from scratch as in Section 4.2).
Experimental setup
We downsample the scenes and segmentation maps to pixels and use a U-Net architecture (Ronneberger et al., 2015) shown in Table 6 in supplementary material with and as segmentation model. For FactorGAN, we use one marginal discriminator to match the distribution of real and fake segmentation maps to ensure realistic predictions, which enables training with isolated city scenes and segmentation maps. To ensure the correct predictions for each city scene, a p- and a q-dependency discriminator learn the input-output relationship using joint samples, both employing a convolutional architecture shown in Table 5 (see supplementary material). Note that as in Section 4.2, we output segmentation maps in the continuous RGB space instead of performing classification. In addition to the MSE in the RGB space, we compute the widely used pixel-wise classification accuracy (Cordts et al., 2016) by assigning each output pixel to the class whose colour has the lowest Euclidean distance in RGB space.
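The nearest-colour assignment used for the pixel-wise accuracy can be sketched as follows. The palette, the pixel representation as RGB tuples and the function names are hypothetical; the actual Cityscapes class colours are not reproduced here.

```python
def nearest_class(pixel, palette):
    """Index of the palette colour closest to the pixel in RGB space.

    Squared Euclidean distance is used, which gives the same ranking as the
    Euclidean distance itself.
    """
    def squared_distance(colour):
        return sum((a - b) ** 2 for a, b in zip(pixel, colour))

    return min(range(len(palette)), key=lambda c: squared_distance(palette[c]))


def pixel_accuracy(predicted_pixels, true_classes, palette):
    """Fraction of predicted pixels assigned to their ground-truth class."""
    hits = sum(nearest_class(p, palette) == t
               for p, t in zip(predicted_pixels, true_classes))
    return hits / len(true_classes)


# Hypothetical three-class palette and a few predicted RGB values.
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0)]
predicted = [(10, 5, 0), (200, 30, 20), (20, 240, 10)]
accuracy = pixel_accuracy(predicted, [0, 1, 1], palette)
```

Here the third pixel is closest to the green class but labelled as the red class, so two of the three pixels count as correct.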
Results
The results in Figure 4 demonstrate that our approach can exploit additional unpaired samples to deliver better MSE and accuracy than a GAN and less noisy outputs as seen in Figure 5. While the CycleGAN reaches accuracy (Zhu et al., 2017) treating all samples as unpaired, FactorGAN offers an increase to accuracy when only 25 samples are paired, although other factors such as the choice of discriminator architecture or GAN loss might also affect this difference.
4.4 Audio source separation
We apply our method to audio source separation as another conditional generation task to investigate whether it transfers across application domains. Specifically, we conduct experiments on separating music signals into singing voice and accompaniment, which are detailed in the supplementary material in Section 6.2. Similarly to image segmentation in Section 4.3, we find that FactorGAN outperforms the normal GAN regarding separation quality, suggesting that our factorisation is indeed useful across problem domains.
5 Discussion
We find that FactorGAN outperforms GAN across all experiments when additional incomplete samples are available, especially when they are abundant in comparison to the number of joint samples. When using only joint observations, FactorGAN should be expected to match the GAN in quality, and it does so quite closely in most of our experiments. Surprisingly, it outperforms GAN in some scenarios such as image segmentation, even when discriminator sizes are matched. We do not fully understand this phenomenon yet, and it should be investigated in the future.
Since the p-dependency discriminator does not rely on generator samples that change during training, it could be pre-trained to reduce computation time, but this led to sudden training instabilities in our experiments. We suspect that this is due to a mismatch between training and testing conditions for the p-dependency discriminator since it is trained on real but evaluated on fake data, and neural networks can yield overly confident predictions outside the support of the training set (Gal and Ghahramani, 2016). Therefore, we expect classifiers with better uncertainty calibration to alleviate this issue.
6 Conclusion
In this paper, we demonstrated how a joint distribution can be factorised into a set of marginals and dependencies, giving rise to the FactorGAN – a GAN in which the discriminator is split into parts that can be independently trained with incomplete observations. For both generation and conditional prediction tasks in multiple domains, we find that FactorGAN outperforms the standard GAN when additional incomplete observations are available. For Cityscapes scene segmentation in particular, FactorGAN achieves a much higher accuracy than the fully unsupervised CycleGAN, while requiring only of all examples to be annotated.
Factorising discriminators enables incorporating more prior knowledge into the design of neural architectures in GANs, which could improve empirical results in applied domains. The presented factorisation is generally applicable independent of model choice, so it can be readily integrated into many existing GAN-based approaches. Since the joint density can be factorised in different ways, multiple extensions are conceivable depending on the particular application (as shown in Section 6.3 in the supplementary material). This paper derives FactorGAN from the original GAN proposed by Goodfellow et al. (2014) by exploiting the probabilistic view of the optimal discriminator. Adapting the FactorGAN to alternative GAN objectives (such as the Wasserstein GAN (Arjovsky et al., 2017)) might be possible as well.
References
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Brakel and Bengio (2017) Philemon Brakel and Yoshua Bengio. Learning independent features with adversarial nets for nonlinear ICA. arXiv preprint arXiv:1710.05050, 2017.

 Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1050–1059, 2016.
 Gan et al. (2017) Zhe Gan, Liqun Chen, Weiyao Wang, Yuchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence Carin. Triangle generative adversarial networks. In Advances in Neural Information Processing Systems, pages 5247–5256, 2017.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
 Im et al. (2018) Daniel Jiwoong Im, Allan He Ma, Graham W. Taylor, and Kristin Branson. Quantitatively Evaluating GANs With Divergences Proposed for Training. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
 Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
 Karaletsos (2016) Theofanis Karaletsos. Adversarial message passing for graphical models. arXiv preprint arXiv:1612.05048, 2016.
 Karras et al. (2018) Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv preprint arXiv:1812.04948, 2018.
 Mao et al. (2017) Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, 2017.
 Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. Proceedings of the International Conference on Learning Representations (ICLR), 2018.

 Mogren (2016) Olof Mogren. C-RNN-GAN: A continuous recurrent neural network with adversarial training. In Constructive Machine Learning Workshop (CML) at NIPS 2016, page 1, 2016.
 Pu et al. (2018) Yunchen Pu, Shuyang Dai, Zhe Gan, Weiyao Wang, Guoyin Wang, Yizhe Zhang, Ricardo Henao, and Lawrence Carin. JointGAN: Multi-domain joint distribution learning with generative adversarial nets. In Proceedings of the International Conference on Machine Learning (ICML), pages 4151–4160, 2018.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Rafii et al. (2017) Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The Musdb18 Corpus For Music Separation, December 2017.
 Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical image computing and computer-assisted intervention, pages 234–241, 2015.
 Sønderby et al. (2017) Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
 Stoller et al. (2018) Daniel Stoller, Sebastian Ewert, and Simon Dixon. Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2391–2395, 2018.
 Vincent et al. (2006) Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing, 14(4):1462–1469, 2006.
 Yoon et al. (2018) Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. GAIN: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920, 2018.
 Zhang et al. (2017) Ning Zhang, Junchi Yan, and Yu Chen Zhou. Unsupervised Audio Source Separation via Spectrum Energy Preserved Wasserstein Learning. CoRR, abs/1711.04121, 2017.
 Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
Appendix
6.1 Tables
Layer  Input shape  Outputs  Output shape  Activation 

FC  ReLU  
FC  ReLU  
FC  Sigmoid 
Layer  Input shape  Outputs  Output shape  Activation 

FC  LeakyReLU  
FC  LeakyReLU  
FC   
Layer  Input shape  Filter size  Stride  Outputs  Output shape  Activation 

Conv    
AvgPool  LeakyReLU  
Conv    
AvgPool  LeakyReLU  
FC1      LeakyReLU  
FC2       
Layer  Input shape  Filter size  Stride  Outputs  Output shape  Activation 

ConvT  ReLU  
ConvT  ReLU  
ConvT  ReLU  
ConvT  ReLU  
ConvT  ReLU  
Conv  Sigmoid 
Layer  Input shape  Filter size  Stride  Outputs  Output shape  Activation 

Conv  LeakyReLU  
Conv  LeakyReLU  
Conv  LeakyReLU  
Conv  LeakyReLU  
Conv  LeakyReLU  
FC      1  LeakyReLU 
Layer  Input (shape)  Outputs  Output shape 

DoubleConv1  
MP1  
DoubleConv2  
MP2  
DoubleConv3  
MP3  
DoubleConv4  
MP4  
DoubleConv5  
FC  
Concat  DoubleConv5    
UpConv  
Concat  DoubleConv4  
Conv  
UpConv  
Concat  DoubleConv3  
Conv  
UpConv  
Concat  DoubleConv2  
Conv  
UpConv  
Concat  DoubleConv1  
Conv  
Conv 
Layer  Input shape  Outputs  Output shape 

Conv  
BatchNorm & ReLU    
Conv  
BatchNorm & ReLU   
6.2 Audio source separation experiment
For our audio source separation experiment, our generator $G$ takes a music spectrogram $m$ along with noise $z$ and maps it to an estimate of the accompaniment and vocal spectra $a$ and $v$, implicitly defining an output probability $q(a, v \mid m)$. We define the joint real and generated distributions that should be matched as $p(a, v, m)$ and $q(a, v, m) = q(a, v \mid m)\, p(m)$. Since the source signals in our dataset are simply added in the time domain to produce the mixture, this approximately applies to the spectrogram as well, so we assume that $m = a + v$. We can constrain our generator to make predictions that always satisfy this condition, thereby taking care of the input-output relationship manually, similarly to Sønderby et al. [2017]. Instead of predicting the sources directly, a mask $b$ with values in the range $[0, 1]$ is computed, and the accompaniment and vocals are estimated as $b \odot m$ and $(1 - b) \odot m$, respectively. As a result, $q(m \mid a, v) = p(m \mid a, v)$, so we can simplify the joint density ratio to
$$\frac{p(a, v, m)}{q(a, v, m)} = \frac{p(m \mid a, v)\, p(a, v)}{q(m \mid a, v)\, q(a, v)} = \frac{p(a, v)}{q(a, v)} = \frac{c_P(a, v)}{c_Q(a, v)} \, \frac{p(a)}{q(a)} \, \frac{p(v)}{q(v)}, \tag{6}$$
where $c_P(a, v) = \frac{p(a, v)}{p(a)\, p(v)}$ and $c_Q$ is defined analogously using $q$, meaning that the discriminator(s) in the GAN and the FactorGAN only require $(a, v)$ pairs, but not the mixture $m$ as additional input, as the correct input-output relationship is already incorporated into the generator. Furthermore, the last equality suggests a FactorGAN application with one marginal discriminator for each source along with dependency discriminators to model source dependencies.
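The masking constraint described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiments: the function name `masked_sources`, the array shape, and the choice of which source receives the mask (rather than its complement) are assumptions made for this example, and random logits stand in for the output of the generator network.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, used here to squash logits into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def masked_sources(mixture, mask_logits):
    """Split a non-negative mixture magnitude spectrogram into two source estimates.

    The two estimates sum to the mixture by construction, so the
    input-output relationship never has to be learned by a discriminator.
    """
    mask = sigmoid(mask_logits)            # element-wise mask with values in (0, 1)
    accompaniment = mask * mixture         # assumed assignment: mask -> accompaniment
    vocals = (1.0 - mask) * mixture        # complement of the mask -> vocals
    return accompaniment, vocals

rng = np.random.default_rng(0)
mixture = rng.random((256, 64))                # hypothetical (frequency bins, time frames)
mask_logits = rng.normal(size=mixture.shape)   # stand-in for the network output

accompaniment, vocals = masked_sources(mixture, mask_logits)

# The additive mixing assumption holds exactly for every mask.
assert np.allclose(accompaniment + vocals, mixture)
```

Because the constraint holds for any mask, the discriminators only ever need to judge the pair of source estimates.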
Dataset
We use MUSDB [Rafii et al., 2017] as the multitrack dataset for our experiment, featuring 100 songs for training and 50 songs for testing. Each song is downsampled to kHz before spectrogram magnitudes are computed, using an STFT with a sample window and a sample hop (this results in frequency bins, but we discard the bin with the highest frequency to obtain a power of 2 and thus avoid padding issues in our network architectures). Snippets with timeframes each are created by cropping each song’s full spectrogram at regular intervals of timeframes. Thus, the generator only separates snippets and outputs predictions of the same shape. However, this does not change the derivation presented in Equation (6), and longer inputs at test time can be processed by partitioning them into snippets and concatenating the model predictions.
Experimental setup
For our generator, we use the U-Net architecture detailed in Table 6 with and . We use the convolutional discriminator described in Table 5 (see supplementary material) with , and . The source dependency discriminators take two sources as input via concatenation along the channel dimension, so they use .
In each experiment, we vary the number of training songs whose snippets are available for paired training between , and and compare between GAN and FactorGAN. The spectrograms predicted on the test set are converted to audio with the inverse STFT by reusing the phase from the mixture, and then evaluated using the signal-to-distortion ratio (SDR), a well-established evaluation metric for source separation [Vincent et al., 2006].
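The phase-reuse step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `reuse_mixture_phase` and the small random arrays are hypothetical, and the inverse STFT itself is left out, since any standard routine can be applied to the resulting complex spectrogram.

```python
import numpy as np

def reuse_mixture_phase(est_magnitude, mixture_stft):
    """Attach the phase of the complex mixture STFT to an estimated source magnitude."""
    phase = np.angle(mixture_stft)              # phase of each time-frequency bin
    return est_magnitude * np.exp(1j * phase)   # complex spectrogram of the estimate

rng = np.random.default_rng(0)

# Hypothetical complex mixture STFT with shape (frequency bins, time frames).
mixture_stft = rng.normal(size=(8, 5)) + 1j * rng.normal(size=(8, 5))

# Stand-in for a predicted source magnitude: half of the mixture magnitude.
est_magnitude = 0.5 * np.abs(mixture_stft)

est_stft = reuse_mixture_phase(est_magnitude, mixture_stft)

# The magnitude is the predicted one, while the phase is that of the mixture.
assert np.allclose(np.abs(est_stft), est_magnitude)
```

The complex array `est_stft` would then be passed to an inverse STFT to obtain the audio signal that is scored with SDR.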
Results
Figure 6 shows our separation results. Compared to a GAN, the separation performance is significantly higher using FactorGAN. As expected, FactorGAN improves slightly with more paired examples, which is not the case for the GAN – here we find that the vocal output becomes too quiet when increasing the number of songs for training, possibly a sign of mode collapse. Similarly to the results seen in the image pair generation experiments, we suspect that the FactorGAN discriminator might approximate the joint density more closely than the GAN discriminator due to its use of multiple discriminators, although the reasons for this are not yet understood.
6.3 Possible extensions
We can decompose the joint density ratio in other ways than shown in Equation 3 in the paper. In the following, we discuss two additional possibilities.
6.3.1 Hierarchical FactorGAN
The decomposition of the joint density ratio could be applied recursively, splitting the obtained marginals further into “sub-marginals” and their dependencies, which could be repeated multiple times. In addition to training with incomplete observations where only a single part is given, this also allows making use of samples where only sub-parts of these parts are given and is thus more flexible than a single factorisation as used in the standard FactorGAN.
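As a demonstration, we split each marginal $x^i$ further into a group of $J_i$ marginals, $x^{i,1}, \ldots, x^{i,J_i}$, and their dependencies, without further recursion for simplicity:
$$\frac{p_x(x)}{q_x(x)} = \frac{c_P(x)}{c_Q(x)} \prod_{i=1}^{K} \left[ \frac{c_P^i(x^i)}{c_Q^i(x^i)} \prod_{j=1}^{J_i} \frac{p_{i,j}(x^{i,j})}{q_{i,j}(x^{i,j})} \right] \tag{7}$$
$c_P^i$ and $c_Q^i$ are dependency terms analogous to $c_P$ and $c_Q$, but only defined on marginal variable $x^i$, whose “sub-marginals” are denoted by $x^{i,1}, \ldots, x^{i,J_i}$.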
Such a hierarchical decomposition might also be beneficial if the data is known to be generated from a hierarchical process. We leave the empirical exploration of this concept to future work.
6.3.2 Autoregressive FactorGAN
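For a multidimensional variable $x = (x^1, \ldots, x^T)$ composed of $T$ elements arranged in a sequence, such as time series data, the joint density ratio can also be decomposed in a causal, autoregressive fashion:
$$\frac{p_x(x)}{q_x(x)} = \prod_{i=1}^{T} \frac{c_P^i(x^1, \ldots, x^i)}{c_Q^i(x^1, \ldots, x^i)} \, \frac{p_i(x^i)}{q_i(x^i)} \tag{8}$$
$$= \prod_{i=1}^{T} \frac{p(x^i \mid x^1, \ldots, x^{i-1})}{q(x^i \mid x^1, \ldots, x^{i-1})} \tag{9}$$
Note that $c_P^i$ is defined here as $\frac{p(x^1, \ldots, x^i)}{p(x^1, \ldots, x^{i-1})\, p_i(x^i)}$ ($c_Q^i$ analogously using $q$). Equation (8) suggests an autoregressive version of FactorGAN in which the generator output quality at each timestep is evaluated using a marginal discriminator that estimates $\frac{p_i(x^i)}{q_i(x^i)}$ combined with dependency discriminators that model the dependency between the current and all past timesteps.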
The final product formulation in Equation (9) reveals a close similarity to autoregressive models and suggests a modification of the normal GAN with an autoregressive discriminator that rates an input at each timestep given the previous ones. Using a derivation analogous to the one shown in Section 6.4, this implies taking the unnormalised discriminator outputs at each timestep, summing them, and applying a sigmoid nonlinearity to obtain the overall estimate of the probability that the input is real. A similar implementation was used before in Mogren [2016], attempting to stabilise GAN training with recurrent neural networks as discriminators, but for the first time, we provide a rigorous theoretical justification for this practice here.
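The summation of per-timestep outputs can be checked numerically on a toy problem. The sketch below assumes that the discriminator is optimal at every timestep, so that its unnormalised output equals the log-ratio of the real and generated conditional probabilities. The two first-order Markov chains, their parameters, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_probs(x, init, trans):
    """Per-timestep log-probabilities of a sequence under a first-order Markov chain.

    For such a chain, conditioning on all past timesteps is the same as
    conditioning on the previous one.
    """
    terms = [np.log(init[x[0]])]
    for prev, cur in zip(x[:-1], x[1:]):
        terms.append(np.log(trans[prev, cur]))
    return np.array(terms)

# Hypothetical real (p) and generated (q) distributions over binary sequences.
p_init, p_trans = np.array([0.6, 0.4]), np.array([[0.9, 0.1], [0.3, 0.7]])
q_init, q_trans = np.array([0.5, 0.5]), np.array([[0.5, 0.5], [0.5, 0.5]])

x = [0, 0, 1, 1, 0]

log_p = conditional_log_probs(x, p_init, p_trans)
log_q = conditional_log_probs(x, q_init, q_trans)

# Assumed optimal unnormalised output at each timestep: log-ratio of conditionals.
logits = log_p - log_q

# Autoregressive combination: sum the per-timestep logits, then apply a sigmoid.
d_combined = sigmoid(logits.sum())

# Reference value: output of an optimal discriminator on the full sequence.
p_x, q_x = np.exp(log_p.sum()), np.exp(log_q.sum())
assert np.isclose(d_combined, p_x / (p_x + q_x))
```

In this toy setting the summed outputs reproduce the output of an optimal discriminator on the whole sequence, which is the property the product formulation relies on.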
6.4 Discriminator combination
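Definition 6.1.
Sigmoid discriminator output. Let $D_i(x^i) = \sigma(d_i(x^i))$ for all $i \in \{1, \ldots, K\}$, where $\sigma$ is the sigmoid function and $d_i$ is the unnormalised discriminator output; analogously define $D_P(x) = \sigma(d_P(x))$ and $D_Q(x) = \sigma(d_Q(x))$.
Definition 6.2.
Combined discriminator. Let $D_C(x) = \sigma\left(d_P(x) - d_Q(x) + \sum_{i=1}^{K} d_i(x^i)\right)$ be the output of the combined discriminator that is used for training the generator using Equation 2.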
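A small numerical check can make the idea of combining discriminators concrete. The sketch below assumes two parts, discrete distributions given as probability tables, and that every discriminator is at its optimum, so that each unnormalised output equals the corresponding log density ratio. The variable names (`d1`, `d2`, `dP`, `dQ`) and the particular combination rule (dependency output on real data, minus dependency output on generated data, plus the marginal outputs) are assumptions of this example rather than a statement of the exact definitions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def random_joint(rng, shape):
    """Random strictly positive joint probability table over two discrete parts."""
    table = rng.random(shape) + 0.1
    return table / table.sum()

rng = np.random.default_rng(0)
P = random_joint(rng, (3, 4))   # hypothetical real joint distribution
Q = random_joint(rng, (3, 4))   # hypothetical generated joint distribution

# Marginals of each part under the real and generated distributions.
p1, p2 = P.sum(axis=1), P.sum(axis=0)
q1, q2 = Q.sum(axis=1), Q.sum(axis=0)

# Assumed optimal unnormalised outputs (log density ratios).
d1 = np.log(p1 / q1)                  # marginal discriminator for part 1
d2 = np.log(p2 / q2)                  # marginal discriminator for part 2
dP = np.log(P / np.outer(p1, p2))     # real joint vs. product of real marginals
dQ = np.log(Q / np.outer(q1, q2))     # generated joint vs. product of generated marginals

# Combine in logit space, then apply a single sigmoid.
combined = sigmoid(dP - dQ + d1[:, None] + d2[None, :])

# Reference value: optimal discriminator trained on the full joint distributions.
assert np.allclose(combined, P / (P + Q))
```

Under these assumptions the combined output matches that of an optimal discriminator on the joint distribution, even though no single discriminator compares the real and generated joints directly.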