1 Introduction
Recent advancements in computer vision have resulted in significant improvements for image classification systems [13], [19], [14], [39]. In particular, the rise of Deep Convolutional Neural Networks has resulted in classification error rates surpassing human-level performance [12]. These promising results enable their potential use in many real-world applications. However, when deployed in a real-world scenario, such systems are likely to observe samples from classes not seen during training (i.e., unknown classes, also referred to as "unknown unknowns" [35]). Since traditional training methods follow a closed-set assumption, classification systems observing any unknown-class samples are forced to recognize them as one of the known classes. As a result, the performance of these systems suffers, as evidenced by Jain et al. with a digit recognition example. Hence, it becomes critical for a classification model to correctly identify test samples as either known or unknown. This problem setting, identifying test samples as known/unknown while simultaneously correctly classifying all known classes, is referred to as open-set recognition [35]. Fig. 1 illustrates a typical example of classification in the open-set problem setting.
In an open-set problem setting, it is challenging to identify unknown samples due to the incomplete knowledge of the world during training (i.e., only the known classes are accessible). To overcome this problem, many open-set methods in the literature [4], [36], [40], [38] adopt recognition-score-based thresholding models. However, when using these models, one needs to address two key questions: 1) what is a good score for open-set identification (i.e., identifying a sample as known or unknown)? and, given a score, 2) what is a good operating threshold for the model?
There have been many methods that explore these questions in the context of traditional classifiers such as Support Vector Machines [35], [36], Nearest Neighbors [16], [3] and Sparse Representation [40]. However, these questions are relatively unexplored in the context of deep neural networks [38], [4], [24], [8], [7]. Even though deep neural networks are powerful at learning highly discriminative representations, they still suffer from performance degradation in the open-set setting [4]. In a naive approach, one could apply a thresholding model on the SoftMax scores. However, as shown by the experiments in [4], that model is suboptimal for open-set identification. A few methods have been proposed to better adapt SoftMax scores to the open-set setting. Bendale et al. proposed a calibration strategy to update SoftMax scores using extreme value modeling [4]. Other strategies, by Ge et al. [8] and Neal et al. [24], follow a data augmentation technique using Generative Adversarial Networks (GANs) [10]: GANs are used to synthesize open-set samples, which are later used for fine-tuning to adapt the SoftMax/OpenMax scores to the open-set setting. Shu et al. [38] introduced a novel sigmoid-based loss function for training the neural network to obtain better scores for open-set identification.
All of these methods modify the SoftMax scores so that they can perform open-set identification while maintaining classification accuracy. However, it is extremely challenging to find a single score measure that can do both. In contrast to these methods, the proposed approach divides the training procedure for open-set recognition, using class conditional autoencoders, into two sub-tasks: 1) closed-set classification, and 2) open-set identification. These sub-tasks are trained separately in a stage-wise manner. Experiments show that this approach provides good open-set identification scores and that a good operating threshold can be found using the proposed training and testing strategy.
In summary, this paper makes the following contributions:


A novel method for open-set recognition is proposed, with novel training and testing algorithms based on class conditioned autoencoders.

We show that dividing the open-set problem into sub-tasks can help learn better open-set identification scores.

Extensive experiments are conducted on various image classification datasets and comparisons are performed against several recent state-of-the-art approaches. Furthermore, we analyze the effectiveness of the proposed method through ablation experiments.
2 Related Work
Open-set Recognition. Open-set recognition methods can be broadly classified into two categories: traditional methods and neural network-based methods. Traditional methods are based on classification models such as Support Vector Machines (SVMs), Nearest Neighbors, Sparse Representation, etc. Scheirer et al. [36] extended the SVM for open-set recognition by calibrating the decision scores using the extreme value distribution. Specifically, Scheirer et al. [36] utilized two SVM models, one for identifying a sample as unknown (referred to as CAP models) and the other for traditional closed-set classification. PRM Junior et al. [15] proposed a nearest neighbor-based open-set recognition model utilizing the neighbor similarity as a score for open-set identification. PRM Junior et al. later also presented a specialized SVM obtained by constraining the bias term to be negative; in the case of the Radial Basis Function kernel, this strategy yields an open-set recognition model. Zhang et al. [40] proposed an extension of the Sparse Representation-based Classification (SRC) algorithm for open-set recognition. Specifically, they model the residuals from SRC using the Generalized Pareto extreme value distribution to get a score for open-set identification.
Among neural network-based methods, one of the earliest works, by Bendale et al. [4], introduced an open-set recognition model based on "activation vectors" (i.e., the penultimate layer of the network). Bendale et al. utilized meta-recognition for multi-class classification by modeling the distance from the "mean activation vector" using the extreme value distribution. SoftMax scores are calibrated using these models for each class; the updated scores, termed OpenMax, are then used for open-set identification. Ge et al. [8] introduced a data augmentation approach called G-OpenMax. They generate unknown samples from the known-class training data using GANs and use them to fine-tune the closed-set classification model, which helps improve the performance of both SoftMax- and OpenMax-based deep networks. With a similar motivation, Neal et al. [24] proposed a data augmentation strategy called counterfactual image generation. This strategy also utilizes GANs to generate images that resemble known-class images but belong to unknown classes. In another approach, Shu et al. [38] proposed a novel sigmoid activation-based loss function to train the neural network. Additionally, they perform score analysis on the final-layer activations to find an operating threshold, which is helpful for open-set identification. There are some variations of open-set recognition that relax its formulation in the form of anomaly detection [26], [27], [30], [31], [28], etc., but in this paper we focus only on the general open-set recognition problem.
Extreme Value Theory. Extreme value modeling is a branch of statistics that deals with modeling statistical extremes. The use of extreme value theory in vision tasks largely concerns post-recognition score analysis [29], [36]. Often, for a given recognition model, the threshold to reject/accept lies in the overlap region of the extremes of the match and non-match score distributions [37]. In such cases, it makes sense to model the tails of the match and non-match recognition scores with one of the extreme value distributions. Hence, many visual recognition methods, including some described above, utilize extreme value models to further improve performance [40], [36]. In the proposed approach as well, the tails of the open-set identification scores are modeled using the extreme value distribution to find the optimal threshold for operation.
3 Proposed Method
The proposed approach divides the open-set recognition problem into two sub-tasks, namely, closed-set classification and open-set identification. The training procedures for these tasks are shown in Fig. 2 as stage 1 and stage 2. Stage 3 in Fig. 2 provides an overview of the proposed approach at inference. In what follows, we present the details of these stages.
3.1 Closed-set Training (Stage 1)
Given a batch of images $X = \{x_1, \ldots, x_N\}$ and their corresponding labels $\{y_1, \ldots, y_N\}$, where $N$ is the batch size and $y_i \in \{1, \ldots, k\}$, the encoder ($\mathcal{F}$) and the classifier ($\mathcal{C}$), with parameters $\Theta_f$ and $\Theta_c$ respectively, are trained using the following cross-entropy loss,

(1)  $\mathcal{L}_c(\{\Theta_f, \Theta_c\}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} \mathbb{I}_{y_i}(j)\, \log p_j(x_i)$

where $\mathbb{I}_{y_i}$ is an indicator function for the label $y_i$ (i.e., a one-hot encoded vector) and $p(x_i) = \mathcal{C}(\mathcal{F}(x_i))$ is the predicted probability score vector; $p_j(x_i)$ is the probability of the sample being from the $j$-th class.

3.2 Open-set Training (Stage 2)
There are two major parts in open-set training: conditional decoder training, followed by EVT modeling of the reconstruction errors. In this stage, the encoder and classifier weights are frozen and do not change during optimization.
3.2.1 Conditional Decoder Training
For any batch described in Sec. 3.1, $\mathcal{F}$ is used to extract the latent vectors as $z_i = \mathcal{F}(x_i)$. This latent vector batch is conditioned following the work by Perez et al. [32], called FiLM. FiLM influences the input feature map by applying a feature-wise linear modulation (hence the name FiLM) based on conditioning information. For an input feature $z$ and a vector $l$ containing conditioning information, the modulation can be given as,

(2)  $\tilde{z} = \gamma_l \odot z + \beta_l$

(3)  $\gamma_l = H_\gamma(l), \qquad \beta_l = H_\beta(l)$

where $H_\gamma$ and $H_\beta$ are neural networks with parameters $\Theta_\gamma$ and $\Theta_\beta$. The tensors $z$, $\gamma_l$ and $\beta_l$ have the same shape, and $\odot$ represents the Hadamard product. The class label is used for conditioning, and the resulting vector is referred to as the label condition vector $l_y$ in this paper. The notation $z(l_y)$ is used to describe the latent vector $z$ conditioned on the label condition vector $l_y$, i.e., $z(l_y) = \tilde{z}$. The decoder ($\mathcal{G}$, with parameters $\Theta_g$) is expected to perfectly reconstruct the original input when conditioned on the label condition vector matching the class identity of the input, referred to here as the match condition vector ($l^m$); in this role, the model can be viewed as a traditional autoencoder. However, here $\mathcal{G}$ is additionally trained to poorly reconstruct the original input when conditioned on a label condition vector that does not match the class identity of the input, referred to here as the non-match condition vector ($l^{nm}$). The importance of this additional constraint on the decoder is discussed in Sec. 3.2.2, while modeling the reconstruction errors using EVT. For the rest of this paper, we use the superscripts $m$ and $nm$ to indicate match and non-match, respectively.
Now, for a given input $x_i$ from the batch, let $l^m_{y_i}$ and $l^{nm}_{y_j}$ (for any $y_j \neq y_i$ randomly sampled from the known labels) be its corresponding match and non-match condition vectors. The feed-forward path for stage 2 can be summarized through the following equations,

$z_i = \mathcal{F}(x_i), \qquad \tilde{x}^{m}_i = \mathcal{G}(z_i(l^m_{y_i})), \qquad \tilde{x}^{nm}_i = \mathcal{G}(z_i(l^{nm}_{y_j}))$

Following the above feed-forward path, the loss functions used in the second stage of training to train the decoder ($\mathcal{G}$, with parameters $\Theta_g$) and the conditioning layer (with parameters $\Theta_\gamma$ and $\Theta_\beta$) are given as follows,
(4)  $\mathcal{L}^{m}_r(\{\Theta_g, \Theta_\gamma, \Theta_\beta\}) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \tilde{x}^{m}_i \rVert_1$

(5)  $\mathcal{L}^{nm}_r(\{\Theta_g, \Theta_\gamma, \Theta_\beta\}) = \frac{1}{N} \sum_{i=1}^{N} \lVert x^{nm}_i - \tilde{x}^{nm}_i \rVert_1$

(6)  $\min_{\{\Theta_g, \Theta_\gamma, \Theta_\beta\}} \; \alpha\, \mathcal{L}^{m}_r + (1-\alpha)\, \mathcal{L}^{nm}_r$

Here, the loss function $\mathcal{L}^{m}_r$ corresponds to the constraint that the output generated using the match condition vector $l^m_{y_i}$ should be a perfect reconstruction of $x_i$. The loss function $\mathcal{L}^{nm}_r$ corresponds to the constraint that the output generated using a non-match condition vector $l^{nm}_{y_j}$ should have poor reconstruction. To enforce the latter condition, another batch $\{x^{nm}_1, \ldots, x^{nm}_N\}$ is sampled from the training data such that the new batch does not have class identity consistent with the match condition vector. This in effect achieves the goal of poor reconstruction when conditioned on $l^{nm}$. This conditioning strategy, in a way, emulates open-set behavior (as will be discussed further in Sec. 3.2.2): the network is specifically trained to produce poor reconstructions when the class identity of an input image does not match the condition vector. So, when encountering an unknown-class test sample, ideally none of the condition vectors would match the input image's class identity, resulting in poor reconstructions for all condition vectors. In contrast, for a known test sample, one of the condition vectors will match the input image's class identity, producing a near-perfect reconstruction for that particular condition vector. Hence, training with the non-match loss helps the network adapt better to the open-set setting. Here, $\mathcal{L}^{m}_r$ and $\mathcal{L}^{nm}_r$ are weighted with $\alpha$ and $1-\alpha$, respectively.
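A minimal numeric sketch of the stage-2 objective, assuming per-sample L1 reconstruction errors and the $\alpha$-weighted combination of the match and non-match losses described above. The function names are illustrative and the inputs are flat vectors standing in for images:

```python
def l1_error(x, x_hat):
    # Mean absolute (L1) reconstruction error between an input and its
    # reconstruction.
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def stage2_loss(x, x_hat_m, x_nm, x_hat_nm, alpha=0.9):
    # Alpha-weighted combination of the match loss (error of the
    # match-conditioned reconstruction against the true input x) and the
    # non-match loss (error of the non-match-conditioned reconstruction
    # against a wrong-class image x_nm). Pulling the non-match output
    # toward x_nm makes reconstruction of the true input poor under
    # non-match conditioning.
    return alpha * l1_error(x, x_hat_m) + (1 - alpha) * l1_error(x_nm, x_hat_nm)
```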
3.2.2 EVT Modeling
Extreme Value Theory. Extreme value theory is often used in visual recognition systems and is an effective tool for modeling post-training scores [36], [37]. It has been used in many applications such as finance, railway track inspection, etc. [23], [1], [9], as well as open-set recognition [4], [36], [40]. In this paper we follow the Pickands-Balkema-de Haan formulation [33], [2] of the extreme value theorem. It considers modeling probabilities conditioned on a random variable exceeding a high threshold. For a given random variable $W$ with cumulative distribution function (CDF) $F_W(w)$, the conditional CDF for any exceedance $w$ over the threshold $u$ is defined as,

$F_U(w) = P(W \le u + w \mid W > u) = \frac{F_W(u + w) - F_W(u)}{1 - F_W(u)}$

where $P(\cdot)$ denotes the probability measure. Now, given i.i.d. samples $\{W_1, \ldots, W_n\}$, the extreme value theorem [33] states that, for a large class of underlying distributions and a large enough $u$, $F_U$ can be well approximated by the Generalized Pareto Distribution (GPD),
(7)  $G(w; \zeta, \mu) = \begin{cases} 1 - \left(1 + \frac{\zeta w}{\mu}\right)^{-1/\zeta}, & \zeta \neq 0 \\ 1 - e^{-w/\mu}, & \zeta = 0 \end{cases}$

such that $-\infty < \zeta < +\infty$, $0 < \mu < +\infty$ and $w \ge 0$ (with $1 + \zeta w/\mu > 0$). $G(\cdot)$ is the CDF of the GPD; for $\zeta = 0$ it reduces to the exponential distribution with parameter $\mu$, and for $\zeta > 0$ it takes the form of the Pareto distribution [6].

Parameter Estimation. When modeling the tail of any distribution as a GPD, the main challenge is finding the tail threshold $u$ beyond which the conditional CDF is modeled. However, it is possible to find an estimate $\hat{u}$ using the mean excess function (MEF), i.e., $E[W - u \mid W > u]$ [37]. It has been shown that for the GPD, the MEF holds a linear relationship with $u$. Many researchers use this property of the GPD to estimate the value of $\hat{u}$ [37], [29]. Here, the algorithm for finding $\hat{u}$ introduced in [29] for the GPD is adopted with minor modifications; see [29], [37] for more details regarding the MEF and tail parameter estimation. After getting an estimate $\hat{u}$, since from the extreme value theorem [33] we know that the set $\{w_i \mid w_i > \hat{u}\}$ follows the GPD, the remaining parameters, i.e. $\zeta$ and $\mu$, can be estimated using maximum likelihood estimation [11], except in some rarely observed cases [5].

3.2.3 Threshold Calculation
After the training procedure described in Sec. 3.1 and Sec. 3.2, sets of match and non-match reconstruction errors are created from the training set and its corresponding match and non-match condition vectors. Let $r^m_i$ be the match reconstruction error and $r^{nm}_i$ be the non-match reconstruction error for the input $x_i$; then the sets of match and non-match errors can be calculated as,

$S_m = \{\, r^m_i = \lVert x_i - \tilde{x}^{m}_i \rVert_1 \,\}, \qquad S_{nm} = \{\, r^{nm}_i = \lVert x_i - \tilde{x}^{nm}_i \rVert_1 \,\}$
Typical histograms of $S_m$ (the set of match reconstruction errors) and $S_{nm}$ (the set of non-match reconstruction errors) are shown in Fig. 3(a). Note that the elements of these sets are calculated solely based on what is observed during training (i.e., without utilizing any unknown samples). Fig. 3(b) shows the normalized histograms of the reconstruction errors observed during inference for test samples from the known class set and the unknown class set. Comparing the plots in Fig. 3, it can be observed that the distributions of $S_m$ and $S_{nm}$ computed during training provide a good approximation of the error distributions observed during inference for test samples from the known and unknown sets. This observation also validates that non-match training emulates an open-set test scenario (as discussed in Sec. 3.2), where the input does not match any of the class labels. This motivates us to use $S_m$ and $S_{nm}$ to find an operating threshold for open-set recognition, i.e., to decide whether a test sample is known or unknown.
Now, it is safe to assume that the optimal operating threshold $\tau^*$ lies in the region $[\min(S_{nm}),\; \max(S_m)]$. The underlying distributions of $S_m$ and $S_{nm}$ are not known. However, as explained in Sec. 3.2.2, it is possible to model the tail of $S_m$ (right tail) and the tail of $S_{nm}$ (left tail) with GPDs, $G_m$ and $G_{nm}$, each being a conditional CDF. Though the GPD is only defined for modeling maxima, before fitting the left tail of $S_{nm}$ we apply the inverse transform $S_{nm} \rightarrow -S_{nm}$. Assuming the prior probability of observing unknown samples is $p_u$, the probability of error can be formulated as a function of the threshold $\tau$,

$P_{error}(\tau) = (1 - p_u)\, P(r^m > \tau) + p_u\, P(r^{nm} < \tau)$

Minimizing the above expression gives an operating threshold that minimizes the probability of error for a given model; it can be found by a simple line search for $\tau^*$ in the range $[\min(S_{nm}),\; \max(S_m)]$. Here, the accurate estimation of $\tau^*$ depends on how well $G_m$ and $G_{nm}$ represent the known and unknown error distributions. It also depends on the prior probability $p_u$; the effect of this prior will be discussed further in Sec. 4.3.
3.3 Open-set Testing by k-Inference (Stage 3)
Here, we introduce the open-set testing algorithm for the proposed method. The testing procedure is described in Algo. 1 and an overview is also shown in Fig. 2. This testing strategy involves conditioning the decoder $k$ times, once for each of the $k$ possible condition vectors, to get $k$ reconstruction errors. Hence, it is referred to as the $k$-inference algorithm.
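A schematic of the $k$-inference decision rule, simplifying Algo. 1: the `encoder`, `decoder` and `classifier` arguments are placeholder callables standing in for the trained networks, and `tau` is the operating threshold from Sec. 3.2.3.

```python
def l1_error(x, x_hat):
    # Mean absolute (L1) reconstruction error.
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def k_inference(x, encoder, decoder, condition_vectors, tau, classifier):
    # Condition the decoder on each of the k known-class condition vectors
    # and record the reconstruction error for each.
    z = encoder(x)
    errors = [l1_error(x, decoder(z, l)) for l in condition_vectors]
    # If even the best-matching condition yields an error above the
    # operating threshold, no known class explains the input: reject.
    if min(errors) > tau:
        return "unknown"
    return classifier(z)  # otherwise fall back to the closed-set prediction
```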
4 Experiments and Results
In this section, we evaluate the performance of the proposed approach and compare it with state-of-the-art open-set recognition methods. In the experiments in Sec. 4.2, we measure the ability of the algorithm to identify test samples as known or unknown without considering an operating threshold. In the second set of experiments, in Sec. 4.3, we measure the overall performance (evaluated using F-measure) of the open-set recognition algorithm. Additionally, through ablation experiments, we analyze the contribution of each component of the proposed method.
4.1 Implementation Details
We use the Adam optimizer [17] with a fixed learning rate and batch size. The parameter $\alpha$, described in Sec. 3.2, is set equal to 0.9. For all experiments, the conditioning layer networks $H_\gamma$ and $H_\beta$ are single-layer fully connected neural networks. Another important factor affecting open-set performance is the openness of the problem. Defined by Scheirer et al. [35], it quantifies how open the problem setting is,

(8)  $O = 1 - \sqrt{\dfrac{2 \times N_{train}}{N_{test} + N_{target}}}$

where $N_{train}$ is the number of classes seen during training, $N_{test}$ is the number of classes that will be observed during testing, and $N_{target}$ is the number of target classes that need to be correctly recognized during testing. We evaluate performance over multiple openness values, depending on the experiment and dataset.
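Assuming the standard openness formulation from Scheirer et al. [35] (Eq. 8), the computation can be sketched as:

```python
import math

def openness(n_train, n_test, n_target):
    # Openness O = 1 - sqrt(2 * N_train / (N_test + N_target)),
    # following Scheirer et al. [35]. O = 0 is the closed-set case;
    # O grows toward 1 as more unknown classes appear at test time.
    return 1.0 - math.sqrt(2.0 * n_train / (n_test + n_target))
```

For example, under the protocol of Sec. 4.2.1 with 6 known classes and 4 additional unknown classes at test time (10 test classes), and assuming the 6 known classes are also the target classes, `openness(6, 10, 6)` gives roughly 0.134, i.e., about 13.4% openness.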
4.2 Experiment I: Open-set Identification
The evaluation protocol defined in [24] is considered, and the area under the ROC curve (AUROC) is used as the evaluation metric. AUROC provides a calibration-free measure and characterizes the performance of a given score by varying the threshold. The encoder, decoder and classifier architectures for this experiment are similar to those used by [24] in their experiments. Following the protocol in [24], we report the AUROC averaged over five randomized trials.

4.2.1 Datasets
Method  MNIST  SVHN  CIFAR10  CIFAR+10  CIFAR+50  TinyImageNet 
SoftMax  0.978  0.886  0.677  0.816  0.805  0.577 
OpenMax [4] (CVPR’16)  0.981  0.894  0.695  0.817  0.796  0.576 
GOpenMax [8] (BMVC’17)  0.984  0.896  0.675  0.827  0.819  0.580 
OSRCI [24] (ECCV’18)  0.988  0.910  0.699  0.838  0.827  0.586 
Proposed Method  0.989  0.922  0.895  0.955  0.937  0.748 
Here, we provide a summary of these protocols for each dataset:
MNIST, SVHN, CIFAR10. For MNIST [21], SVHN [25] and CIFAR10 [18], the openness of the problem is set by randomly sampling 6 known classes and 4 unknown classes.
CIFAR+10, CIFAR+50. For the CIFAR+$N$ experiments, 4 classes are sampled from CIFAR10 for training, and $N$ non-overlapping classes, sampled from the CIFAR100 dataset [18], are used as the unknowns. The openness of the problem therefore differs between CIFAR+10 and CIFAR+50.
TinyImageNet. For experiments with TinyImageNet [20], 20 known classes and 180 unknown classes are randomly sampled for evaluation.
4.2.2 Comparison with state-of-the-art
For comparing the open-set identification performance, we consider the following methods:
I. SoftMax: The SoftMax score of the predicted class is used for open-set identification.
II. OpenMax [4]: The score of the additional $(k+1)$-th (unknown) class and the score of the predicted class are used for open-set identification.
III. G-OpenMax [8]: A data augmentation technique, which utilizes OpenMax scores after training the network with generated data.
IV. OSRCI [24]: Another data augmentation technique, called counterfactual image generation, is used to train the network for $(k+1)$-class classification. We refer to this method as Open-set Recognition using Counterfactual Images (OSRCI). The score of the additional $(k+1)$-th class is used for open-set identification.
Results corresponding to this experiment are shown in Table 2. As seen from this table, the proposed method outperforms the other methods, showing that the open-set identification training in the proposed approach learns better scores for identifying unknown classes. From the results, we see that our method produces only a minor improvement on the digits datasets compared to the other recent methods, mainly because results on the digits datasets are almost saturated. On the other hand, our method performs significantly better than the other recent methods on object datasets such as CIFAR and TinyImageNet.
4.3 Experiment II: Open-set Recognition
This experiment evaluates the overall open-set recognition performance using the F-measure. For this experiment we consider the LFW face dataset [22]. We extend the protocol introduced in [35] for open-set face recognition on LFW. In total, 12 classes containing more than 50 images each are considered as known classes and divided into training and testing splits with an 80/20 ratio. The image size is kept at 64×64. Since LFW has 5717 classes, we vary the openness by taking 18 to 5705 unknown classes during testing. Since many classes contain only one image, instead of random sampling we sort the classes according to the number of images per class and add them sequentially to increase the openness. It is obvious that with an increase in openness, the probability of observing unknown samples will also increase. Hence, it is reasonable to assume that the prior probability $p_u$ will be a function of openness, and we set $p_u$ accordingly for this experiment.

4.3.1 Comparison with state-of-the-art
For comparing the open-set recognition performance, we consider the following methods:
I. W-SVM (PAMI'14): W-SVM as formulated in [35], which trains a Weibull-calibrated SVM classifier for open-set recognition.
II. SROR (PAMI'16): SROR as formulated in [40], which uses a sparse representation-based framework for open-set recognition.
III. DOC (EMNLP'16): DOC utilizes a novel sigmoid-based loss function to train a deep neural network [38].
For a fair comparison with these methods, we use features extracted from the encoder ($\mathcal{F}$) to train W-SVM and SROR. For DOC, the encoder ($\mathcal{F}$) is trained with the loss function proposed in [38]. Experiments on LFW are performed using a U-Net [34] inspired encoder-decoder architecture. More details regarding the network architecture are included in the supplementary material.

Results corresponding to this experiment are shown in Fig. 4(a). From this figure, we can see that the proposed approach remains relatively stable as openness increases, outperforming all other methods. One interesting trend is that DOC initially performs better than the statistical methods such as W-SVM and SROR; however, for openness greater than 50%, its performance suffers significantly. The statistical methods, though initially poorer than DOC, remain relatively stable and perform better than DOC as the openness increases (especially beyond 50%).
4.3.2 Ablation Study
In this section, we present an ablation analysis of the proposed approach on the LFW face dataset. The contribution of individual components to the overall performance is reported by creating multiple baselines of the proposed approach. Starting with the simplest baseline, i.e., thresholding the SoftMax probabilities of a closed-set model, each component is added, building up to the proposed approach. Detailed descriptions of these baselines are given as follows:
I. CLS: The encoder and the classifier are trained for $k$-class classification. Samples with a predicted probability score less than 0.5 are classified as unknown.
II. CLS+DEC: In this baseline, the networks $\mathcal{F}$, $\mathcal{C}$ and the decoder $\mathcal{G}$ are trained as described in Sec. 3, except that $\mathcal{G}$ is only trained with the match loss function, $\mathcal{L}^m_r$. Samples with reconstruction error above 95% of the maximum training reconstruction error observed are classified as unknown.
III. Naive: Here, the networks $\mathcal{F}$, $\mathcal{C}$ and $\mathcal{G}$ and the conditioning layer networks ($H_\gamma$ and $H_\beta$) are trained as described in Sec. 3, but instead of modeling the scores using EVT as described in Sec. 3.2.2, the threshold is estimated directly from the raw reconstruction errors.
IV. Proposed method ($p_u$ = 0.5): $\mathcal{F}$, $\mathcal{C}$, $\mathcal{G}$ and the conditioning layer networks ($H_\gamma$ and $H_\beta$) are trained as described in Sec. 3, and the prior probability of observing unknowns used to find the threshold is set to $p_u$ = 0.5.
V. Proposed method: The method proposed in this paper, with $p_u$ set as described in Sec. 4.3.
Results corresponding to the ablation study are shown in Fig. 4(b). Being a simple SoftMax thresholding baseline, CLS has the weakest performance. When the match loss function ($\mathcal{L}^m_r$) is added, as in CLS+DEC, open-set identification is performed using reconstruction scores; however, since this baseline follows a heuristic way of thresholding, its performance degrades rapidly as openness increases. The addition of the non-match loss function ($\mathcal{L}^{nm}_r$), as in the Naive baseline, helps find a threshold value without relying on heuristics. As seen from Fig. 4(b), the performance of the Naive baseline remains relatively stable with increasing openness, showing the importance of the loss function $\mathcal{L}^{nm}_r$. The proposed method with $p_u$ fixed to 0.5 introduces EVT modeling of the reconstruction errors to calculate a better operating threshold. As can also be seen from Fig. 4(b), this strategy improves over finding the threshold from raw score values, showing the importance of applying EVT models to the reconstruction errors. Finally, if $p_u$ is set as a function of openness, as in the proposed method, there is a marginal improvement over the fixed-$p_u$ baseline, showing the benefit of setting $p_u$ as a function of openness. It is interesting to note that for large openness values, both the fixed-$p_u$ baseline and the proposed method achieve similar performance.

5 Conclusion
We presented an open-set recognition algorithm based on class conditioned autoencoders and introduced novel training and testing strategies for these networks. It was also shown that dividing open-set recognition into sub-tasks helps learn a better score for open-set identification. During training, conditional reconstruction constraints are enforced, which helps in learning approximations of the known and unknown score distributions from the reconstruction errors. These approximations were then used to calculate an operating threshold for the model. Since inference for a single sample needs $k$ feed-forward passes, the method suffers from increased test time. However, the proposed approach performs well across multiple image classification datasets, providing significant improvements over many state-of-the-art open-set algorithms. In future research, generative models such as GANs, VAEs or flow-based models can be explored to extend this method.
Acknowledgements
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 201414071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government.
References
 [1] Isabel Fraga Alves and Cláudia Neves. Extreme value distributions. In International encyclopedia of statistical science, pages 493–496. Springer, 2011.
 [2] August A Balkema and Laurens De Haan. Residual life time at great age. The Annals of probability, pages 792–804, 1974.

 [3] Abhijit Bendale and Terrance Boult. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1893–1902, 2015.
 [4] Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1563–1572, 2016.
 [5] Vartan Choulakian and Michael A Stephens. Goodness-of-fit tests for the generalized Pareto distribution. Technometrics, 43(4):478–484, 2001.
 [6] Herbert Aron David and Haikady Navada Nagaraja. Order statistics. Wiley Online Library, 1970.
 [7] Akshay Raj Dhamija, Manuel Günther, and Terrance Boult. Reducing network agnostophobia. In Advances in Neural Information Processing Systems, pages 9175–9186, 2018.
 [8] Zong-Yuan Ge, Sergey Demyanov, Zetao Chen, and Rahil Garnavi. Generative OpenMax for multi-class open set classification. arXiv preprint arXiv:1707.07418, 2017.
 [9] Xavier Gibert, Vishal M Patel, and Rama Chellappa. Deep multi-task learning for railway track inspection. IEEE Transactions on Intelligent Transportation Systems, 18(1):153–164, 2017.
 [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [11] Scott D Grimshaw. Computing maximum likelihood estimates for the generalized pareto distribution. Technometrics, 35(2):185–191, 1993.

 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [14] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
 [15] Pedro Ribeiro Mendes Júnior, Terrance E Boult, Jacques Wainer, and Anderson Rocha. Specialized support vector machines for open-set recognition. arXiv preprint arXiv:1606.03802, 2016.
 [16] Pedro R Mendes Júnior, Roberto M de Souza, Rafael de O Werneck, Bernardo V Stein, Daniel V Pazinato, Waldir R de Almeida, Otávio AB Penatti, Ricardo da S Torres, and Anderson Rocha. Nearest neighbors distance ratio open-set classifier. Machine Learning, 106(3):359–386, 2017.
 [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [20] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 2015.
 [21] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
 [22] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
 [23] Gerald A Meehl, Thomas Karl, David R Easterling, Stanley Changnon, Roger Pielke Jr, David Changnon, Jenni Evans, Pavel Ya Groisman, Thomas R Knutson, Kenneth E Kunkel, et al. An introduction to trends in extreme weather and climate events: observations, socioeconomic impacts, terrestrial ecological impacts, and model projections. Bulletin of the American Meteorological Society, 81(3):413–416, 2000.
 [24] Lawrence Neal, Matthew Olson, Xiaoli Fern, WengKeen Wong, and Fuxin Li. Open set learning with counterfactual images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 613–628, 2018.
 [25] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [26] Poojan Oza and Vishal M Patel. Active authentication using an auto-encoder regularized CNN-based one-class classifier. arXiv preprint arXiv:1903.01031, 2019.
 [27] Poojan Oza and Vishal M Patel. One-class convolutional neural network. IEEE Signal Processing Letters, 26(2):277–281, 2019.
 [28] Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. OCGAN: One-class novelty detection using GANs with constrained latent representations. arXiv preprint arXiv:1903.08550, 2019.
 [29] Pramuditha Perera and Vishal M Patel. Extreme value analysis for mobile active user authentication. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 346–353. IEEE, 2017.
 [30] Pramuditha Perera and Vishal M Patel. Learning deep features for oneclass classification. arXiv preprint arXiv:1801.05365, 2018.
 [31] Pramuditha Perera and Vishal M Patel. Deep transfer learning for multiple class novelty detection. arXiv preprint arXiv:1903.02196, 2019.
 [32] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
 [33] James Pickands III et al. Statistical inference using extreme order statistics. the Annals of Statistics, 3(1):119–131, 1975.
 [34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
 [35] Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757–1772, 2013.
 [36] Walter J Scheirer, Lalit P Jain, and Terrance E Boult. Probability models for open set recognition. IEEE transactions on pattern analysis and machine intelligence, 36(11):2317–2324, 2014.
 [37] Zhixin Shi, Frederick Kiefer, John Schneider, and Venu Govindaraju. Modeling biometric systems using the general pareto distribution (gpd). In Biometric Technology for Human Identification V, volume 6944, page 69440O. International Society for Optics and Photonics, 2008.
 [38] Lei Shu, Hu Xu, and Bing Liu. DOC: Deep open classification of text documents. arXiv preprint arXiv:1709.08716, 2017.
 [39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [40] He Zhang and Vishal M Patel. Sparse representation-based open set recognition. IEEE transactions on pattern analysis and machine intelligence, 39(8):1690–1696, 2017.
6 Supplementary Material for C2AE: Class Conditioned Auto-Encoder for Open-set Recognition
This document contains the supplementary material for the paper C2AE: Class Conditioned Auto-Encoder for Open-set Recognition. Due to the space limitations in the submitted paper, we provide some additional details regarding the proposed method here.
6.1 Toy Examples
To visualize the decision boundaries learned by the proposed approach, we perform a few experiments with two-dimensional toy data. For these experiments the encoder, decoder and classifier architectures are FC(2)-Sig-FC(5)-Sig, FC(5)-Sig-FC(2) and FC(5)-Sig-FC(2), respectively. Here, FC(T) indicates a fully-connected layer with T hidden units and Sig the sigmoid activation. We train these networks using the proposed approach on three variations of two-dimensional datasets, namely TwoGauss, FourGauss and UniGauss. TwoGauss and FourGauss consist of two and four 2D Gaussians, respectively, with different means and the same variance, whereas UniGauss has one class drawn from a 2D Gaussian and the other class uniformly distributed. As can be seen from Fig. 5, the proposed approach learns tight boundaries surrounding the data points and identifies all of the remaining space as unknown.
6.2 Results
Here we present the AUROC table for open-set identification with standard deviation values. Standard deviation values are not available for CIFAR+10, CIFAR+50 and TinyImageNet, as those values are taken from [24].

| Method | MNIST | SVHN | CIFAR-10 | CIFAR+10 | CIFAR+50 | TinyImageNet |
|---|---|---|---|---|---|---|
| SoftMax | 0.978 ± 0.002 | 0.886 ± 0.006 | 0.677 ± 0.032 | 0.816 | 0.805 | 0.577 |
| OpenMax (CVPR'16) | 0.981 ± 0.002 | 0.894 ± 0.008 | 0.695 ± 0.032 | 0.817 | 0.796 | 0.576 |
| G-OpenMax (BMVC'17) | 0.984 ± 0.001 | 0.896 ± 0.006 | 0.675 ± 0.035 | 0.827 | 0.819 | 0.580 |
| OSRCI (ECCV'18) | 0.988 ± 0.001 | 0.910 ± 0.006 | 0.699 ± 0.029 | 0.838 | 0.827 | 0.586 |
| Proposed Method | 0.989 ± 0.002 | 0.922 ± 0.009 | 0.895 ± 0.008 | 0.955 ± 0.006 | 0.937 ± 0.004 | 0.748 ± 0.005 |
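AUROC values such as those in the table above can be computed directly from per-sample recognition scores. The following is a minimal pure-Python sketch using the Mann-Whitney U formulation; the scores below are hypothetical, and under the proposed method the score would be a negated reconstruction error, since lower error indicates a known sample.

```python
def auroc(known_scores, unknown_scores):
    """Area under the ROC curve for open-set identification: the
    probability that a randomly drawn known sample scores higher
    than a randomly drawn unknown one (ties count as 0.5)."""
    wins = 0.0
    for k in known_scores:
        for u in unknown_scores:
            if k > u:
                wins += 1.0
            elif k == u:
                wins += 0.5
    return wins / (len(known_scores) * len(unknown_scores))

# Hypothetical recognition scores: higher means "more likely known".
known = [0.9, 0.8, 0.75, 0.6]
unknown = [0.7, 0.4, 0.3, 0.2]
print(auroc(known, unknown))  # 0.9375
```

Unlike accuracy at a fixed threshold, this measure is threshold-free, which is why it is the standard comparison metric for open-set identification.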
6.3 Histogram Progression
Fig. 6 and Fig. 7 show the evolution of the reconstruction errors during training. The reconstruction errors for match, non-match, known and unknown samples are shown at iterations 1, 5k and 500k. As can be seen from Fig. 6(a), since the network is initialized with random weights, the reconstruction errors for match and non-match conditioning are not discriminative. However, since the network is trained to produce discriminative reconstructions for match and non-match conditioning, the reconstruction errors become increasingly separated as training progresses, as seen in Fig. 6(b) and Fig. 6(c). As a result, the known and unknown reconstruction errors follow the same trend as match and non-match, as evident from Fig. 7. The SVHN dataset is used for generating the normalized histograms of the match, non-match, known and unknown reconstruction errors.
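The separation visible in these histograms is what makes a threshold on reconstruction error usable at test time. The following is a simplified sketch of such a decision rule; the per-class errors and the threshold `tau` are hypothetical, and the paper's actual operating threshold is obtained via extreme value modeling, which is omitted here.

```python
def openset_decide(cond_errors, tau):
    """Given one test sample's reconstruction errors under each
    known-class conditioning, predict the class whose conditioning
    reconstructs it best; if even that best error exceeds the
    operating threshold tau, reject the sample as unknown."""
    best_class = min(range(len(cond_errors)), key=lambda c: cond_errors[c])
    if cond_errors[best_class] > tau:
        return "unknown"
    return best_class

# Hypothetical errors: a known sample matching class 1, and an
# unknown sample that no conditioning reconstructs well.
print(openset_decide([0.80, 0.05, 0.74], tau=0.2))  # 1
print(openset_decide([0.65, 0.71, 0.58], tau=0.2))  # unknown
```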
6.4 Network Architecture
The network architecture for the LFW experiments is shown in Fig. 8. It is a U-Net-inspired architecture with a FiLM conditioning layer in the middle. The network architecture is as follows:
C(64)-C(128)-C(256)-C(512)-C(1024)-FiLM-DC(2048)-DC(1024)-DC(512)-DC(256)-DC(124)-DC(3)-Tanh.
Here, C(T) represents a T-channel convolution layer followed by instance normalization and leaky ReLU activation, and DC(T) represents a T-channel transposed convolution layer followed by instance normalization and upsampling. The FiLM layer is a conditioning layer which modulates the feature maps from C(1024) with linear modulation parameters γ and β of size 1024×2×2, computed from the label conditioning vector. The convolution blocks are used as the encoder and the transposed convolution blocks as the decoder. As explained in the proposed approach, the encoder weights are frozen during stage-2 training. The classifier network for the experiments with the LFW dataset is a single-layer fully-connected network with 12 hidden units (the same as the number of known classes).
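As a sketch of what the FiLM layer computes, the following applies a per-channel scale and shift to toy feature maps. Plain Python lists with 2 channels of 2×2 maps stand in for the 1024-channel tensors, and `film`, `gamma` and `beta` are illustrative names rather than the paper's code.

```python
def film(feature_maps, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift every value
    of channel c by the conditioning-derived parameters, computing
    gamma[c] * x + beta[c] elementwise."""
    out = []
    for c, fmap in enumerate(feature_maps):
        out.append([[gamma[c] * v + beta[c] for v in row] for row in fmap])
    return out

# Toy input: 2 channels of 2x2 feature maps.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
y = film(x, gamma=[2.0, -1.0], beta=[0.5, 0.0])
print(y[0])  # [[2.5, 4.5], [6.5, 8.5]]
```

In the full model, γ and β are predicted from the label conditioning vector, so the same decoder can be driven toward a faithful reconstruction (match conditioning) or a poor one (non-match conditioning).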