In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning,[1] is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning.[1][2][3] Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.[1][4]
A knowledge graph is a collection of entities , relations , and facts .[5] A fact is a triple that denotes a link between the head and the tail of the triple. Another notation that is often used in the literature to represent a triple (or fact) is . This notation is called resource description framework (RDF).[1][5] A knowledge graph represents the knowledge related to a specific domain; leveraging this structured representation, it is possible to infer a piece of new knowledge from it after some refinement steps.[6] However, nowadays, people have to deal with the sparsity of data and the computational inefficiency to use them in a real-world application.[3][7]
The embedding of a knowledge graph translates each entity and relation of a knowledge graph, into a vector of a given dimension , called embedding dimension.[7] In the general case, we can have different embedding dimensions for the entities and the relations .[7] The collection of embedding vectors for all the entities and relations in the knowledge graph can then be used for downstream tasks.
A knowledge graph embedding is characterized by four different aspects:[1]
All the different knowledge graph embedding models follow roughly the same procedure to learn the semantic meaning of the facts.[7] First of all, to learn an embedded representation of a knowledge graph, the embedding vectors of the entities and relations are initialized to random values.[7] Then, starting from a training set until a stop condition is reached, the algorithm continuously optimizes the embeddings.[7] Usually, the stop condition is given by the overfitting over the training set.[7] For each iteration, is sampled a batch of size from the training set, and for each triple of the batch is sampled a random corrupted fact—i.e., a triple that does not represent a true fact in the knowledge graph.[7] The corruption of a triple involves substituting the head or the tail (or both) of the triple with another entity that makes the fact false.[7] The original triple and the corrupted triple are added in the training batch, and then the embeddings are updated, optimizing a scoring function.[5][7] At the end of the algorithm, the learned embeddings should have extracted the semantic meaning from the triples and should correctly predict unseen true facts in the knowledge graph.[5]
The following is the pseudocode for the general embedding procedure.[9][7]
algorithm Compute entity and relation embeddings is input: The training set , entity set , relation set , embedding dimension output: Entity and relation embeddings initialization: the entities and relations embeddings (vectors) are randomly initialized while stop condition do // From the training set randomly sample a batch of size b for each in do // sample a corrupted fact of triple end for Update embeddings by minimizing the loss function end while
These indexes are often used to measure the embedding quality of a model. The simplicity of the indexes makes them very suitable for evaluating the performance of an embedding algorithm even on a large scale.[10] Given as the set of all ranked predictions of a model, it is possible to define three different performance indexes: Hits@K, MR, and MRR.[10]
Hits@K or in short, H@K, is a performance index that measures the probability to find the correct prediction in the first top K model predictions.[10] Usually, it is used .[10] Hits@K reflects the accuracy of an embedding model to predict the relation between two given triples correctly.[10]
Hits@K
Larger values mean better predictive performances.[10]
Mean rank is the average ranking position of the items predicted by the model among all the possible items.[10]
The smaller the value, the better the model.[10]
Mean reciprocal rank measures the number of triples predicted correctly.[10] If the first predicted triple is correct, then 1 is added, if the second is correct is summed, and so on.[10]
Mean reciprocal rank is generally used to quantify the effect of search algorithms.[10]
The larger the index, the better the model.[10]
Knowledge graph completion (KGC) is a collection of techniques to infer knowledge from an embedded knowledge graph representation.[11] In particular, this technique completes a triple inferring the missing entity or relation.[11] The corresponding sub-tasks are named link or entity prediction (i.e., guessing an entity from the embedding given the other entity of the triple and the relation), and relation prediction (i.e., forecasting the most plausible relation that connects two entities).[11]
Triple Classification is a binary classification problem.[1] Given a triple, the trained model evaluates the plausibility of the triple using the embedding to determine if a triple is true or false.[11] The decision is made with the model score function and a given threshold.[11] Clustering is another application that leverages the embedded representation of a sparse knowledge graph to condense the representation of similar semantic entities close in a 2D space.[4]
The use of knowledge graph embedding is increasingly pervasive in many applications. In the case of recommender systems, the use of knowledge graph embedding can overcome the limitations of the usual reinforcement learning.[12][13] Training this kind of recommender system requires a huge amount of information from the users; however, knowledge graph techniques can address this issue by using a graph already constructed over a prior knowledge of the item correlation and using the embedding to infer from it the recommendation.[12] Drug repurposing is the use of an already approved drug, but for a therapeutic purpose different from the one for which it was initially designed.[14] It is possible to use the task of link prediction to infer a new connection between an already existing drug and a disease by using a biomedical knowledge graph built leveraging the availability of massive literature and biomedical databases.[14] Knowledge graph embedding can also be used in the domain of social politics.[4]
Given a collection of triples (or facts) , the knowledge graph embedding model produces, for each entity and relation present in the knowledge graph a continuous vector representation.[7] is the corresponding embedding of a triple with and , where is the embedding dimension for the entities, and for the relations.[7] The score function of a given model is denoted by and measures the distance of the embedding of the head from the embedding of tail given the embedding of the relation, or in other words, it quantifies the plausibility of the embedded representation of a given fact.[5]
Rossi et al. propose a taxonomy of the embedding models and identifies three main families of models: tensor decomposition models, geometric models, and deep learning models.[5]
The tensor decomposition is a family of knowledge graph embedding models that use a multi-dimensional matrix to represent a knowledge graph,[1][5][17] that is partially knowable due to the gaps of the knowledge graph describing a particular domain thoroughly.[5] In particular, these models use a three-way (3D) tensor, which is then factorized into low-dimensional vectors that are the entities and relations embeddings.[5][17] The third-order tensor is a suitable methodology to represent a knowledge graph because it records only the existence or the absence of a relation between entities,[17] and for this reason is simple, and there is no need to know a priori the network structure,[15] making this class of embedding models light, and easy to train even if they suffer from high-dimensional and sparsity of data.[5][17]
This family of models uses a linear equation to embed the connection between the entities through a relation.[1] In particular, the embedded representation of the relations is a bidimensional matrix.[5] These models, during the embedding procedure, only use the single facts to compute the embedded representation and ignore the other associations to the same entity or relation.[18]
The geometric space defined by this family of models encodes the relation as a geometric transformation between the head and tail of a fact.[5] For this reason, to compute the embedding of the tail, it is necessary to apply a transformation to the head embedding, and a distance function is used to measure the goodness of the embedding or to score the reliability of a fact.[5]
Geometric models are similar to the tensor decomposition model, but the main difference between the two is that they have to preserve the applicability of the transformation in the geometric space in which it is defined.[5]
This class of models is inspired by the idea of translation invariance introduced in word2vec.[7] A pure translational model relies on the fact that the embedding vector of the entities are close to each other after applying a proper relational translation in the geometric space in which they are defined.[18] In other words, given a fact, when the embedding of head is added to the embedding of relation, the expected result should be the embedding of the tail.[5] The closeness of the entities embedding is given by some distance measure and quantifies the reliability of a fact.[17]
It is possible to associate additional information to each element in the knowledge graph and their common representation facts.[1] Each entity and relation can be enriched with text descriptions, weights, constraints, and others in order to improve the overall description of the domain with a knowledge graph.[1] During the embedding of the knowledge graph, this information can be used to learn specialized embeddings for these characteristics together with the usual embedded representation of entities and relations, with the cost of learning a more significant number of vectors.[5]
This family of models, in addition or in substitution of a translation they employ a rotation-like transformation.[5]
This group of embedding models uses deep neural network to learn patterns from the knowledge graph that are the input data.[5] These models have the generality to distinguish the type of entity and relation, temporal information, path information, underlay structured information,[18] and resolve the limitations of distance-based and semantic-matching-based models in representing all the features of a knowledge graph.[1] The use of deep learning for knowledge graph embedding has shown good predictive performance even if they are more expensive in the training phase, angry of data, and often required a pre-trained embedding representation of knowledge graph coming from a different embedding model.[1][5]
This family of models, instead of using fully connected layers, employs one or more convolutional layers that convolve the input data applying a low-dimensional filter capable of embedding complex structures with few parameters by learning nonlinear features.[1][5][18]
This family of models uses capsule neural networks to create a more stable representation that is able to recognize a feature in the input without losing spatial information.[5] The network is composed of convolutional layers, but they are organized in capsules, and the overall result of a capsule is sent to a higher-capsule decided by a dynamic process routine.[5]
This class of models leverages the use of recurrent neural network.[5] The advantage of this architecture is to memorize a sequence of fact, rather than just elaborate single events.[40]
The machine learning task for knowledge graph embedding that is more often used to evaluate the embedding accuracy of the models is the link prediction.[1][3][5][6][7][18] Rossi et al.[5] produced an extensive benchmark of the models, but also other surveys produces similar results.[3][7][18][25] The benchmark involves five datasets FB15k,[9] WN18,[9] FB15k-237,[41] WN18RR,[36] and YAGO3-10.[42] More recently, it has been discussed that these datasets are far away from real-world applications, and other datasets should be integrated as a standard benchmark.[43]
Dataset name | Number of different entities | Number of different relations | Number of triples |
---|---|---|---|
FB15k[9] | 14951 | 1345 | 584,113 |
WN18[9] | 40943 | 18 | 151,442 |
FB15k-237[41] | 14541 | 237 | 310,116 |
WN18RR[36] | 40943 | 11 | 93,003 |
YAGO3-10[42] | 123182 | 37 | 1,089,040 |
Model name | Memory complexity | FB15K (Hits@10) | FB15K (MR) | FB15K (MRR) | FB15K - 237 (Hits@10) | FB15K - 237 (MR) | FB15K - 237 (MRR) | WN18 (Hits@10) | WN18 (MR) | WN18 (MRR) | WN18RR (Hits@10) | WN18RR (MR) | WN18RR (MRR) | YAGO3-10 (Hits@10) | YAGO3-10 (MR) | YAGO3-10 (MRR) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DistMul[19] | 0.863 | 173 | 0.784 | 0.490 | 199 | 0.313 | 0.946 | 675 | 0.824 | 0.502 | 5913 | 0.433 | 0.661 | 1107 | 0.501 | |
ComplEx[20] | 0.905 | 34 | 0.848 | 0.529 | 202 | 0.349 | 0.955 | 3623 | 0.949 | 0.521 | 4907 | 0.458 | 0.703 | 1112 | 0.576 | |
HolE[23] | 0.867 | 211 | 0.800 | 0.476 | 186 | 0.303 | 0.949 | 650 | 0.938 | 0.487 | 8401 | 0.432 | 0.651 | 6489 | 0.502 | |
ANALOGY[21] | 0.837 | 126 | 0.726 | 0.353 | 476 | 0.202 | 0.944 | 808 | 0.934 | 0.380 | 9266 | 0.366 | 0.456 | 2423 | 0.283 | |
SimplE[22] | 0.836 | 138 | 0.726 | 0.343 | 651 | 0.179 | 0.945 | 759 | 0.938 | 0.426 | 8764 | 0.398 | 0.631 | 2849 | 0.453 | |
TuckER[24] | 0.888 | 39 | 0.788 | 0.536 | 162 | 0.352 | 0.958 | 510 | 0.951 | 0.514 | 6239 | 0.459 | 0.680 | 2417 | 0.544 | |
MEI[26] | 0.552 | 145 | 0.365 | 0.551 | 3268 | 0.481 | 0.709 | 756 | 0.578 | |||||||
MEIM[27] | 0.557 | 137 | 0.369 | 0.577 | 2434 | 0.499 | 0.716 | 747 | 0.585 | |||||||
TransE[9] | 0.847 | 45 | 0.628 | 0.497 | 209 | 0.310 | 0.948 | 279 | 0.646 | 0.495 | 3936 | 0.206 | 0.673 | 1187 | 0.501 | |
STransE[32] | 0.796 | 69 | 0.543 | 0.495 | 357 | 0.315 | 0.934 | 208 | 0.656 | 0.422 | 5172 | 0.226 | 0.073 | 5797 | 0.049 | |
CrossE[33] | 0.862 | 136 | 0.702 | 0.470 | 227 | 0.298 | 0.950 | 441 | 0.834 | 0.449 | 5212 | 0.405 | 0.654 | 3839 | 0.446 | |
TorusE[34] | 0.839 | 143 | 0.746 | 0.447 | 211 | 0.281 | 0.954 | 525 | 0.947 | 0.535 | 4873 | 0.463 | 0.474 | 19455 | 0.342 | |
RotatE[35] | 0.881 | 42 | 0.791 | 0.522 | 178 | 0.336 | 0.960 | 274 | 0.949 | 0.573 | 3318 | 0.475 | 0.570 | 1827 | 0.498 | |
ConvE[36] | 0.849 | 51 | 0.688 | 0.521 | 281 | 0.305 | 0.956 | 413 | 0.945 | 0.507 | 4944 | 0.427 | 0.657 | 2429 | 0.488 | |
ConvKB[38] | 0.408 | 324 | 0.211 | 0.517 | 309 | 0.230 | 0.948 | 202 | 0.709 | 0.525 | 3429 | 0.249 | 0.604 | 1683 | 0.420 | |
ConvR[37] | 0.885 | 70 | 0.773 | 0.526 | 251 | 0.346 | 0.958 | 471 | 0.950 | 0.526 | 5646 | 0.467 | 0.673 | 2582 | 0.527 | |
CapsE[39] | 0.217 | 610 | 0.087 | 0.356 | 405 | 0.160 | 0.950 | 233 | 0.890 | 0.559 | 720 | 0.415 | 0 | 60676 | 0.000 | |
RSN[40] | 0.870 | 51 | 0.777 | 0.444 | 248 | 0.280 | 0.951 | 346 | 0.928 | 0.483 | 4210 | 0.395 | 0.664 | 1339 | 0.511 |