1 Introduction
Automated theorem provers (ATPs) and proof assistants have helped with the formal verification of many well-known mathematical theorems, the most notable examples being the proofs of the four colour theorem [12] and the Kepler conjecture [13]. However, they face two significant limitations.
The first is that they can only employ formalised mathematics. Most of the corpus of mathematical results, although written in a relatively formal manner, is still only available as natural language text, and there exist no efficient semantic or formal parsers to translate it into machine language. This makes many important theorems (especially those published recently) unavailable to automated systems. Secondly, there is no formal method of emulating human intuition in choosing the relevant, already known facts (i.e. premise selection) and strategies for the proof. This makes even simple and intuitive proofs intractable for ATPs.
Recent advancements in another field of artificial intelligence, machine learning, and in particular in neural networks (also known by their rebranded name: deep learning), give us hope that these issues can be resolved. This is because they have proved to be very successful in fields previously reserved almost exclusively for formal methods, such as strategy board games, most notably the game of Go [3]. The first attempt to employ deep learning in automated theorem proving was made in [1], where neural network models are trained on pairs of first-order logic axioms and conjectures to determine which axiom is most likely to be relevant in constructing an automated proof by the prover. In this paper we build on the foundation laid therein, showing that selecting an appropriate representation of premises can greatly simplify the problem, allowing us to use much simpler neural networks and, consequently, to make the decision in a much shorter time.
In the context of theorem proving, deep learning techniques have also recently been used, for example, in [2, 5, 6, 7, 10, 11, 22, 25].
The code for the neural network architectures presented in this paper, as well as for the processing of the input data, is available at [20].
2 Datasets
For this experiment we use a dataset of 32,524 examples collected and organised by Josef Urban [24] for the Mizar40 experiment [18] and the DeepMath experiment [1]. Each example is of the form:
C tptp-formula + tptp-formula ... + tptp-formula - tptp-formula ... - tptp-formula
where tptp-formula is the standard first-order logic TPTP representation, C indicates a conjecture, + a premise that is needed for an ATP proof of C, and - a premise that is not required for the proof but was highly ranked by a nearest-neighbour algorithm trained on the proofs. For practical reasons dictated by the theory of machine learning, there is roughly the same number of useful and redundant facts associated with each conjecture. In total, we have 102,442 unique formulae across this dataset: 32,524 conjectures and 69,918 axioms. Every conjecture has 16 axioms assigned to it on average, with the minimum being 2 and the maximum 270. We take each conjecture and its corresponding axioms and form pairs (conjecture, axiom), which constitute our positive and negative examples (522,528 in total). In [1]
two alternative data representations were adopted: character-level and word-level. Both of these, however, are problematic. Premises have a variable number of characters (5 to 84,299, with mean 304.5) and words/tokens (2 to 21,251, with mean 107.4), so they have to be either truncated or padded with zeros. The character representation is given by an 80-dimensional one-hot encoding with a maximum sequence length of 2,048 characters, and the word representation is obtained by encoding axiom embeddings computed by the previously trained character-level model, and generating pseudo-random vectors (of the same dimension) to encode tokens such as brackets and operators. The maximum number of words is limited to 500. The resulting datasets are sparse and high-dimensional, and some important information is lost by restricting the maximal number of words or characters. This obstructs the performance of machine learning algorithms applied to them and, in the case of artificial neural networks, imposes serious limitations on the network architecture. In the
next section we present our approach to tackling this problem.

3 Distributed representation of formulae signatures
First, we limit the information obtained from each formula to the functor symbols, ignoring variable symbols (since they are essentially arbitrarily chosen characters), brackets, quantifiers, connectives, the equality symbol, etc. We will show that this information is sufficient to obtain accuracy similar to or higher than that obtained by the earlier models. There are 13,217 unique functor symbols across all the formulae in our dataset. Thus, to each of these functor symbols we can assign a unique positive integer less than or equal to 13,217, and then, for any formula in our dataset, we can represent its functional signature by a 13,217-dimensional vector whose i-th coordinate is equal to the number of occurrences, in the scope of the formula, of the function associated with the integer i. But this does not really solve the problem present in previous approaches. Each formula usually contains only a handful of functions, and hence, in this setting, it would be represented by a sparse and long vector.
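As an illustration, the count-vector construction above can be sketched as follows. As an assumption, we treat any lowercase identifier immediately followed by an opening bracket as a functor symbol, which drops TPTP variables (capitalised), quantifiers, connectives and brackets; a real extraction would use a proper TPTP parser.

```python
import re
from collections import Counter

def functor_counts(formula):
    # Heuristic: a functor is a lowercase identifier followed by "("
    return Counter(re.findall(r"\b([a-z]\w*)\s*\(", formula))

def signature_vector(formula, index):
    """index maps each functor symbol to a coordinate 0..n-1."""
    vec = [0] * len(index)
    for sym, cnt in functor_counts(formula).items():
        if sym in index:
            vec[index[sym]] = cnt
    return vec

index = {"f": 0, "g": 1, "p": 2}
print(signature_vector("! [X] : p(f(X), f(g(X)))", index))  # → [2, 1, 1]
```

In the full dataset the index would range over all 13,217 functor symbols, and most coordinates of each vector would be zero, which is exactly the sparsity problem discussed above.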
This phenomenon is very common, especially in natural language processing, and it is known as the curse of dimensionality. It can be solved by a distributed representation of features, and there are several algorithms which can efficiently create such a representation, for example the neural probabilistic language model from [4] or the t-SNE technique from [21]. However, these methods are normally applied to textual (and hence temporal) data, and rely on the concept of a context, which is not defined for formulae signatures. We must therefore modify it to suit our setting.

Let
B be a finite set of real, linearly independent vectors, let V and W be real vector spaces, which we will call input and output, respectively, and let f : V → W. Suppose that we know the values f(v) for some (or for all) arguments v, but we do not explicitly know what f is. The essence of machine learning is to determine f, or to find its approximation, and consequently to find the values f(v) which were previously unknown. Usually f cannot be (easily) represented algebraically, but we can find a good approximation of f as a composition of simpler functions. This, in turn, is the essence of neural network methods.

Suppose we have a task T: to approximate a function f : V → W, and we do it by a neural network N, which can be represented as a composition of functions:

N = h_n ∘ h_{n-1} ∘ ⋯ ∘ h_1 ≈ f,
where ≈ denotes approximate equality with respect to some fixed cost function on W, and h_1, …, h_n are called hidden layers (and they often are composite functions themselves).
If the network performs well after some training, we may assume that the first k layers preserve and pass on to the later layers some crucial information about the input set, needed to complete task T. Thus, we may fix the parameters of h_1, …, h_k and only train those of h_{k+1}, …, h_n, regarding (h_k ∘ ⋯ ∘ h_1)(V) as the input set of a new neural network N′ = h_n ∘ ⋯ ∘ h_{k+1},
Now, let V′ = (h_k ∘ ⋯ ∘ h_1)(V) and let U be some (other) real vector space. Assume that we have a new task T′: to approximate a function g : V → U. If we decide to solve task T′ using neural networks, we need to remember that there is a positive relationship between the number of parameters of the network and the dimensionality of the input space. If the latter is big, then we must either choose a simpler neural network architecture (potentially damaging its accuracy) or devote more time and hardware resources to the training process, which is not always possible. In practice this is bypassed by dimensionality-reducing data preprocessing, by training only the top layers of the network in the later phases of the training, or by loading pre-trained layers (from some other task), fixing them as the initial layers of our network model, and only training the layers on top of them. The pragmatic motivation for obtaining a lower-dimensional embedding of the input space, as well as the advantages of obtaining it either during the main training process or beforehand, as a separate learning task, is described in the context of image classification and natural language processing, for example, in [9].
Given that B forms a basis for V, we may solve a simplified version of task T′ in the following way. Every element v in V can be represented as

v = c_1 b_1 + ⋯ + c_r b_r

for some constants c_1, …, c_r and distinct vectors b_1, …, b_r ∈ B. We use the pre-trained layers to define

e(b) = (h_k ∘ ⋯ ∘ h_1)(b)

for all b ∈ B. Then we can approximate g with a neural network N″, whose input space is given by V′, subject to the constraint

g(v) ≈ N″(c_1 e(b_1) + ⋯ + c_r e(b_r)).
Although we are still approximating g, if

dim V′ < dim V,

then this embedding will reduce the dimensionality of the input for a neural network, allowing for a more robust architecture of the network N″, as compared to networks using V as the input. We can also experiment with several different network architectures, without having to obtain a new, lower-dimensional embedding each time. And since B is a basis for V, training the layers h_1, …, h_k on it will be faster than training all the layers (h_1, …, h_n) on a training set from V before freezing the first k of them; that is, provided that the cardinality of B is smaller than the cardinality of this training set.
In natural language processing we usually start with a vocabulary (i.e. a list of words) L. It can be represented as the canonical basis for ℝ^{|L|}, that is, the set of |L|-dimensional vectors with all entries equal to 0 except for the i-th entry, where i is the index of the word in L which corresponds to the given vector. Since such vocabularies are normally immensely large, before any language processing task it is good to find a lower-dimensional, dense representation of L. This is usually done by extracting features from the temporal context of each word. If we want to mimic the same strategy for functional signatures of logical expressions, we must first define what a context is in this setting, since a functional signature, unlike a sequence of words, is not a temporal object.
First, note that if P is a premise, then we can represent its functional signature by

s(P) = a_1 u_1 + ⋯ + a_n u_n,

where n is the total number of unique function symbols across some corpus of premises' functional signatures which contains P, u_i is, again, the unit vector corresponding to the i-th function, and a_i is the number of occurrences of this function in the scope of P. Now, let f_i and f_j be the functions corresponding to u_i and u_j respectively, for i ≠ j. If there exists a premise Q such that both f_i and f_j occur in Q, then we say that f_j is in the context of f_i. We may represent the frequency distribution of functions in the context of f_i by a vector g(u_i), whose j-th coordinate is proportional to the number of occurrences of f_j in the contexts of f_i.
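A minimal sketch of this context distribution, with signatures given as symbol-count mappings. Whether a functor's own repeated occurrences count towards its context is not pinned down above, so excluding them here is our assumption.

```python
from collections import Counter

def context_distribution(signatures, f):
    """Frequency distribution of functors co-occurring with f."""
    ctx = Counter()
    for sig in signatures:
        if f in sig:
            for sym, cnt in sig.items():
                if sym != f:          # assumption: exclude f itself
                    ctx[sym] += cnt
    total = sum(ctx.values())
    return {sym: cnt / total for sym, cnt in ctx.items()} if total else {}

premises = [{"f": 2, "g": 1}, {"f": 1, "h": 3}, {"g": 2, "h": 1}]
print(context_distribution(premises, "f"))  # → {'g': 0.25, 'h': 0.75}
```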
We want to approximate g by a neural network with two hidden layers:

g(u) ≈ σ_2(W_2 · σ_1(W_1 · u + c_1) + c_2),

where W_1 and W_2 are m × n and n × m matrices respectively, c_1 and c_2 are m- and n-dimensional vectors, σ_1 and σ_2 are activation functions applied elementwise, and m < n. After training this network, we may use this new, lower-dimensional representation for the functional signature s(P) of a premise P:

σ_1(W_1 · s(P) + c_1).

In the case when σ_1 is an invertible function, we may even use:

W_1 · s(P) + c_1. (∗)

This is because approximating a quantity from σ_1(W_1 · s(P) + c_1) is equivalent to approximating it from W_1 · s(P) + c_1, whenever σ_1^{-1} is the inverse of σ_1.
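A minimal NumPy sketch of this two-hidden-layer network and of the invertible-activation shortcut, with toy dimensions (the experiments use 13,217 input coordinates and an embedding size of 256). The weights here are random and untrained, and the names follow our exposition, not the code in [20].

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                        # number of functors, embedding size
W1, c1 = rng.normal(size=(m, n)), np.zeros(m)
W2, c2 = rng.normal(size=(n, m)), np.zeros(n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def network(s):
    """Full network: maps a signature to a predicted context distribution."""
    return softmax(W2 @ np.tanh(W1 @ s + c1) + c2)

def embedding(s):
    """Lower-dimensional representation: since tanh is invertible, the
    affine pre-activation carries the same information as the hidden layer."""
    return W1 @ s + c1

s = np.array([0, 2, 1, 0, 0, 0], dtype=float)  # a toy functional signature
print(embedding(s).shape)  # (3,)
```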
In our experimental setting, B is the set of all one-hot encoded functor symbols (n = 13,217), m = 256, σ_1 is the hyperbolic tangent function, and σ_2 is the softmax function. We use the Keras library [8] to create the neural network models for the dimensionality-reduced embeddings, as well as the premise selection model in the next section. We initialise the weight matrices using He uniform initialisation [14]. We train the network on the 13,217-dimensional identity matrix, taking batches of 4,096 training examples, for 150 epochs, using the RMSprop algorithm [15] with a decay of the learning rate.

Usually, we train a model on some set of training data so that we can produce an estimate of some unknown function, which can later be used to predict the values of this function for data points which were previously unavailable. Here, we know the contextual distribution for all the functor symbols, and hence all the values of g, and we use the network model simply to find a less complex approximation of g. For that reason we do not split the data into training and validation sets (also, doing so would effectively exclude some parameters from training, as the set B is linearly independent). And since we want to approximate g as accurately as possible, given the fixed number of network parameters, overfitting is not discouraged. We do, however, shuffle the data after every epoch, to allow for more distributed features in the lower-dimensional representation of our dataset. The accuracy of this network, with respect to categorical cross-entropy, reaches 84% after the training.

Since the set of all functional signatures contains only linear combinations of one-hot encoded representations of functor symbols (the set B), and because σ_1 is an invertible function, we can use the resulting affine representation as the lower-dimensional representation for functional signatures of premises in our dataset, to develop a premise selection model in the next section.
Alternatively, we could have used autoencoders (see [16, 17]), with the same network architecture, to obtain a lower-dimensional representation of functional signatures directly. That is, we would want to find tensors W_1, W_2, c_1, c_2 (with the same shapes as above) such that

σ_2(W_2 · σ_1(W_1 · s(P) + c_1) + c_2) ≈ s(P)

for all premises P in our dataset. This naïve approach saves us the trouble of finding contextual distributions for all the symbols, and normally it would be a more natural choice of dimensionality reduction technique. However, empirically, the premise selection model presented in the next section is less accurate if it uses this representation of the data. Nonetheless, we have included the implementation of this alternative approach in [20], should an interested reader wish to verify this.
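The autoencoder alternative can be sketched with the same toy shapes; only the forward pass and the reconstruction objective are shown, the weights are random and untrained, and all names are ours rather than those used in [20].

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3
W1, c1 = rng.normal(size=(m, n)) * 0.1, np.zeros(m)
W2, c2 = rng.normal(size=(n, m)) * 0.1, np.zeros(n)

def encode(s):
    return np.tanh(W1 @ s + c1)   # lower-dimensional representation

def decode(h):
    return W2 @ h + c2            # attempt to rebuild the signature

def reconstruction_loss(s):
    """Mean squared error between a signature and its reconstruction;
    training would minimise this over all premises."""
    return float(np.mean((decode(encode(s)) - s) ** 2))

s = np.array([0, 2, 1, 0, 0, 0], dtype=float)
print(encode(s).shape)  # (3,)
```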
4 Premise selection model
From the set of 32,524 conjectures and 69,918 axioms, we form a set of 522,528 positive and negative examples, by concatenating the new 256-dimensional signatures of the corresponding conjectures and axioms. The resulting set may be represented as a 522,528 × 512 matrix (considerably smaller than the 522,528 × 26,434 representation of the full functional signatures, had the dimensionality reduction not been applied). From this matrix we randomly select 470,275 rows (90%) for training, and use the remaining 52,253 to form a test set (10%). We standardise the data by computing the mean and standard deviation along each column of the training set, subtracting this mean from each corresponding column, and dividing by the standard deviation.
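The column-wise standardisation described above can be sketched as follows; note that the statistics are computed on the training rows only and then reused on the test split. The toy data here is random, for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
train, test = data[:90], data[90:]

mean = train.mean(axis=0)
std = train.std(axis=0)
train_std = (train - mean) / std
test_std = (test - mean) / std   # training statistics reused on the test set

print(np.allclose(train_std.mean(axis=0), 0.0, atol=1e-9))  # True
```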
We develop several variants of neural networks with two hidden, densely connected layers. The first layer has 64, 128, 256, 512 or 1024 output units, and the number of output units of the second layer lies in the same range, provided that it is never bigger than the number of output units of the first layer. The activation function for these layers is the rectified linear unit (ReLU). Both of these hidden layers are followed by a dropout layer [23], with a dropout rate of 0.5, to reduce overfitting of the model. The output layer activation function is the logistic sigmoid, returning the predicted probability that the tested premise is relevant for some proof of the tested conjecture. During the development stage, we also extract 10% of the training data for validation of the models. The models are trained for up to 1,500 epochs, on batches of 4,096 examples, using the Adam optimiser [19] with respect to the logistic loss function. The training data is shuffled after each epoch. The test results are presented in the table below.
Layer 2 \ Layer 1         64         128        256        512        1024
  64    loss            0.5418     0.5295     0.5173     0.5292     0.5687
        accuracy        72.21%     72.75%     73.52%     74.57%     75.19%
        # of param.     37,057     73,985     147,841    295,553    590,977
 128    loss                       0.5315     0.5158     0.5224     0.5523
        accuracy                   72.94%     73.73%     74.49%     75.47%
        # of param.                82,305     164,353    328,449    656,641
 256    loss                                  0.5195     0.5135     0.5347
        accuracy                              73.63%     74.19%     75.32%
        # of param.                           197,377    394,241    787,969
 512    loss                                             0.5095     0.5166
        accuracy                                         74.58%     75.34%
        # of param.                                      525,825    1,050,625
1024    loss                                                        0.5024
        accuracy                                                    75.76%
        # of param.                                                 1,575,937
As we can see above, the smallest model (64 output units for each of the hidden layers) reaches the lowest accuracy of these fifteen models. But it does so with comparatively few trainable parameters, which means that it can be trained in a short time and makes predictions quickly. The largest model (1024 output units for each of the hidden layers) has the highest accuracy, but it requires the largest number of parameters, so naturally it is significantly slower. Furthermore, it is also more prone to overfitting, which can be seen in the graphs below.
Given the trade-off between the number of parameters (and hence the computation time) and the accuracy of the model, one should choose the most suitable model carefully. We chose four of these models and trained them again, this time for 2,500 epochs and without extracting any validation data. The results are presented below.
loss  0.5385  0.5194  0.5127  0.4895 

accuracy  72.14%  73.74%  74.73%  76.45% 
false negatives  13.5%  12.35%  9.32%  11.0% 
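One plausible reading of the false-negatives row above is the fraction of truly useful premises (label 1) that the model rejects; the 0.5 probability threshold and the exact definition are our assumptions, chosen for illustration.

```python
def false_negative_rate(probs, labels, threshold=0.5):
    """Fraction of positive examples whose predicted probability
    falls below the decision threshold."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    if not positives:
        return 0.0
    missed = sum(1 for p in positives if p < threshold)
    return missed / len(positives)

probs = [0.9, 0.2, 0.7, 0.4, 0.8]
labels = [1, 0, 1, 1, 0]
print(false_negative_rate(probs, labels))  # one of three positives missed
```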
5 Conclusion and discussion
It is clear that, thanks to dimensionality reduction, we can create a neural network model that performs the premise selection task very swiftly and with relatively high accuracy. Nevertheless, it seems that, in other applications, deep learning achieves even better results, so we may ask: how could the above approach be improved? First of all, we need to realise that the choice of negative examples may have influenced the performance negatively. Machine learning algorithms generally require an equal number of positive and negative examples, so that the model is not biased towards predicting one more often than the other. But while producing positive examples is trivial (provided that we have a valid proof of the conjecture), and we empirically see that the algorithm seldom misclassifies positive examples as negative (see the table above), the same cannot be said about negative examples. The fact that we have no proof of a given conjecture relying on some axiom does not imply that there exists no proof depending on this axiom. Obviously, we could include, as negative examples, axioms from a completely different theory, ensuring that they are almost certainly useless. But this would only weaken our model, as positive and negative examples should be of a similar nature, so that the model can focus on those features which really decide whether a given axiom is useful or not. So far, there does not seem to be any good solution to this dilemma.
Another problem is the fact that, when we focus solely on the functional signatures of premises, we completely ignore the logical structure of the statements, and hence the relations between functions. This does not happen if we use a character-level representation (or even a word-level representation with tokens like brackets also treated as words). But for the neural network to clearly identify these relations, it would have to be very deep, and hence computationally inefficient, given that the input is also high-dimensional in this setting. Another way of dealing with this issue is to substitute the statements with graphs, where vertices represent objects, and edges the relations between them. But this setting also requires complicated neural networks, obstructing its performance.
The dimensionality reduction that we adopted in this paper is a wonderful tool, which allows us to greatly decrease the time required to make predictions, but it can also mean the loss of essential information, often required to make these predictions. In natural language processing, it is very likely that the blank space in the statement
This is a glass of an orange _____.
ought to be filled with the word 'juice', indicating that words can often easily be deduced just from the context, and that the loss of information in switching from one-hot to context embeddings is negligible. If we also have the sentence
This is a glass of an apple _____.
then it will probably be filled with the same word. So 'orange' and 'apple' will have similar context embeddings, without us explicitly telling the computer that they are both fruits. Whether or not a similar phenomenon occurs in functional signatures is debatable. Perhaps in the future more natural embeddings will emerge.
And finally, having discussed the issues with the input data, let us deliberate on the network architecture. Let us start by emphasising that using convolutional or recurrent networks is unsuitable in this setting. The ordering of functions inside the functional signature (and thus also inside the lower-dimensional embedding) is arbitrary (we simply used alphabetical order), so there is no theoretical justification for the use of convolutional neural networks, whose purpose is to identify local patterns between neighbouring objects. In practice, too, their performance appears to be inferior to that of fully connected networks for this task, when trained and tested on the same data. Temporal architectures are unsuitable for a similar reason, i.e. there is no clear temporal ordering within the functional signatures. It is possible, however, to slightly improve the performance of the model by including more hidden, densely connected layers. But this, while decreasing the training loss and increasing the accuracy, also increases overfitting and the prediction time, making their introduction counterproductive.
6 Acknowledgement
The authors of this article would like to thank the UK Engineering and Physical Sciences Research Council (EPSRC) and the School of Computer Science at the University of Manchester for their financial support.
References
 [1] A. A. Alemi, F. Chollet, N. Een, G. Irving, Ch. Szegedy, J. Urban, DeepMath  Deep Sequence Models for Premise Selection, Proceedings of the 30^{th} International Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems 29, (2016), pp. 2243–2251.
 [2] M. Allamanis, P. Chanthirasegaran, P. Kohli, Ch. Sutton, Learning Continuous Semantic Representations of Symbolic Expressions, https://arxiv.org/abs/1611.01423, (2017).
 [3] AlphaGO, DeepMind, https://deepmind.com/research/alphago/.
 [4] Y. Bengio, R. Ducharme, P. Vincent, Ch. Jauvin, A Neural Probabilistic Language Model, Journal of Machine Learning Research 3, (2003), pp. 1137–1155.
 [5] Ch.H. Cai, SLDRDL: A Framework for SLDResolution with Deep Learning, https://arxiv.org/abs/1705.02210, (2017).
 [6] Ch.H. Cai, D. Ke, Y. Xu, K. Su, Learning of Humanlike Algebraic Reasoning Using Deep Feedforward Neural Networks, The Computing Research Repository, https://arxiv.org/abs/1704.07503, (2017).
 [7] P. Chojecki, DeepAlgebra  An Outline of a Program, International Conference on Intelligent Computer Mathematics, Lecture Notes in Computer Science book series 10383, (2017), pp. 1–8.
 [8] F. Chollet, Keras, https://keras.io/, (2015).
 [9] F. Chollet, Deep Learning with Python, Manning (2018).
 [10] F. Chollet, C. Kaliszyk, Ch. Szegedy, HolStep: A Machine Learning Dataset for Higherorder Logic Theorem Proving, 5^{th} International Conference on Learning Representations, https://arxiv.org/abs/1703.00426, (2017).
 [11] J. Deng, J. Wang, M. Wang, Y. Tang, Premise Selection for Theorem Proving by Deep Graph Embedding, Proceedings of the 31^{st} International Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems 30, (2017), pp. 2243–2251.
 [12] G. Gonthier, Formal Proof – The FourColor Theorem, Notices of the American Mathematical Society 55 (11), (2008), pp. 1382–1393.
 [13] T. Hales et al., A formal proof of the Kepler conjecture, Forum of Mathematics, Pi 5 (e2), (2017).
 [14] K. He, X. Zhang, Sh. Ren, J. Sun, Delving Deep into Rectifiers: Surpassing HumanLevel Performance on ImageNet Classification, https://arxiv.org/abs/1502.01852, (2015).
 [15] G. Hinton, https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, (2016).
 [16] Ch.-Y. Liou, J.-Ch. Huang, W.-Ch. Yang, Modelling word perception using the Elman network, Neurocomputing 71 (16–18), Elsevier (2008), pp. 3150–3157.
 [17] Ch.-Y. Liou, J.-Ch. Huang, W.-Ch. Yang, Autoencoder for words, Neurocomputing 139, Elsevier (2014), pp. 84–96.

 [18] C. Kaliszyk, J. Urban, MizAR 40 for Mizar 40, Journal of Automated Reasoning 55 (3), (2015), pp. 245–256.
 [19] D. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 3^{rd} International Conference on Learning Representations, https://arxiv.org/abs/1412.6980, (2015).
 [20] A. S. Kucik, https://github.com/AndrzejKucik/premiseselectionnn.
 [21] L. van der Maaten, G. Hinton, Visualizing Data using tSNE, Journal of Machine Learning Research 9, (2008), pp. 2579–2605.
 [22] K. Peng, D. Ma, TreeStructure CNN for Automated Theorem Proving, International Conference on Neural Information Processing, Lecture Notes in Computer Science 10635, (2017), pp. 3–12.
 [23] G. Hinton, A. Krizhevsky, R. Salakhutdinov, N. Srivastava, I. Sutskever, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15, (2014), pp. 1929–1958.
 [24] J. Urban, Deep learning for math, https://github.com/JUrban/deepmath.
 [25] D. Whalen, Holophrasm: a neural Automated Theorem Prover for higherorder logic, https://arxiv.org/abs/1608.02644, (2016).