Datalog
Paradigm	Logic, Declarative
Family	Prolog
First appeared	1977
Typing discipline	Weak
Dialects	Datomic, .QL, Soufflé, XTDB, etc.
Influenced by	Prolog
Influenced	SQL

Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from Prolog. It is often used as a query language for deductive databases. Datalog has been applied to problems in data integration, networking, program analysis, and more.

Example

A Datalog program consists of facts, which are statements that are held to be true, and rules, which say how to deduce new facts from known facts. For example, here are two facts that mean xerces is a parent of brooke and brooke is a parent of damocles:

parent(xerces, brooke).
parent(brooke, damocles).

The names are written in lowercase because strings beginning with an uppercase letter stand for variables. Here are two rules:

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

The :- symbol is read as "if", and the comma is read "and", so these rules mean:

X is an ancestor of Y if X is a parent of Y.
X is an ancestor of Y if X is a parent of some Z, and Z is an ancestor of Y.

The meaning of a program is defined to be the set of all of the facts that can be deduced using the initial facts and the rules. This program's meaning is given by the following facts:

parent(xerces, brooke).
parent(brooke, damocles).
ancestor(xerces, brooke).
ancestor(brooke, damocles).
ancestor(xerces, damocles).

Some Datalog implementations don't deduce all possible facts, but instead answer queries:

?- ancestor(xerces, X).

This query asks: Who are all the X that xerces is an ancestor of? For this example, it would return brooke and damocles.
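The example can be mimicked outside Datalog as a rough illustration. The following Python sketch (the names `parents`, `ancestors`, and `query_descendants` are hypothetical and not part of any Datalog system) applies the two rules until no new pair appears and then answers the query by filtering the result:

```python
# Facts from the example, stored as (parent, child) pairs.
parents = {("xerces", "brooke"), ("brooke", "damocles")}

def ancestors(parent_facts):
    """Compute ancestor/2 from parent/2 by applying both rules until nothing new appears."""
    # Rule 1: ancestor(X, Y) :- parent(X, Y).
    result = set(parent_facts)
    while True:
        # Rule 2: ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
        new = {(x, y) for (x, z) in parent_facts for (z2, y) in result if z == z2}
        if new <= result:
            return result
        result |= new

def query_descendants(person, ancestor_facts):
    """Answer ?- ancestor(person, X). by listing every matching X."""
    return sorted(y for (x, y) in ancestor_facts if x == person)

print(query_descendants("xerces", ancestors(parents)))  # ['brooke', 'damocles']
```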

Comparison to relational databases

The non-recursive subset of Datalog is closely related to query languages for relational databases, such as SQL. The following table maps between Datalog, relational algebra, and SQL concepts:

Datalog	Relational algebra	SQL
Relation	Relation	Table
Fact	Tuple	Row
Rule	n/a	Materialized view
Query	Select	Query

More formally, non-recursive Datalog corresponds precisely to unions of conjunctive queries, or equivalently, negation-free relational algebra.

Schematic translation from non-recursive Datalog into SQL

s(x, y).
t(y).
r(A, B) :- s(A, B), t(B).

CREATE TABLE s (
  z0 TEXT NOT NULL,
  z1 TEXT NOT NULL,
  PRIMARY KEY (z0, z1)
);
CREATE TABLE t (
  z0 TEXT NOT NULL PRIMARY KEY
);
INSERT INTO s VALUES ('x', 'y');
INSERT INTO t VALUES ('y');
CREATE VIEW r AS
SELECT s.z0, s.z1
FROM s, t
WHERE s.z1 = t.z0;
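The translation can be tried out with an in-memory SQLite database. The sketch below uses Python's standard `sqlite3` module and writes the column constraint in its standard SQL spelling, `NOT NULL`; the extra row `('a', 'b')` is an added assumption, included only to show that the view filters out tuples with no match in `t`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE s (z0 TEXT NOT NULL, z1 TEXT NOT NULL, PRIMARY KEY (z0, z1));
CREATE TABLE t (z0 TEXT NOT NULL PRIMARY KEY);
INSERT INTO s VALUES ('x', 'y');
INSERT INTO t VALUES ('y');
-- r(A, B) :- s(A, B), t(B).
CREATE VIEW r AS SELECT s.z0, s.z1 FROM s, t WHERE s.z1 = t.z0;
""")

# Hypothetical extra fact s(a, b); it has no partner in t, so it must not appear in r.
conn.execute("INSERT INTO s VALUES ('a', 'b')")

rows = conn.execute("SELECT * FROM r").fetchall()
print(rows)  # [('x', 'y')]
```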

Syntax

A Datalog program consists of a list of rules (Horn clauses).^[1] If constant and variable are two countable sets of constants and variables respectively and relation is a countable set of predicate symbols, then the following BNF grammar expresses the structure of a Datalog program:

<program> ::= <rule> <program> | ""
<rule> ::= <atom> ":-" <atom-list> "."
<atom> ::= <relation> "(" <term-list> ")"
<atom-list> ::= <atom> | <atom> "," <atom-list> | ""
<term> ::= <constant> | <variable>
<term-list> ::= <term> | <term> "," <term-list> | ""

Atoms are also referred to as literals. The atom to the left of the :- symbol is called the head of the rule; the atoms to the right are the body. Every Datalog program must satisfy the condition that every variable that appears in the head of a rule also appears in the body (this condition is sometimes called the range restriction).^[1]^[2]
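The range restriction is a purely syntactic check. As a minimal sketch, assuming atoms are represented as `(relation, terms)` pairs, rules as `(head, body)` pairs, and variables are recognized by an initial capital letter (all of these representation choices are illustrative), it can be tested as follows:

```python
def is_variable(term):
    """Capitalization convention: a term starting with an uppercase letter is a variable."""
    return term[:1].isupper()

def variables(atom):
    """Return the set of variables among an atom's terms; an atom is (relation, terms)."""
    _relation, terms = atom
    return {t for t in terms if is_variable(t)}

def is_range_restricted(rule):
    """Check that every variable in the head also occurs somewhere in the body."""
    head, body = rule
    body_vars = set()
    for atom in body:
        body_vars |= variables(atom)
    return variables(head) <= body_vars

ok_rule = (("ancestor", ("X", "Y")), [("parent", ("X", "Z")), ("ancestor", ("Z", "Y"))])
bad_rule = (("r", ("X", "Y")), [("s", ("X",))])  # Y never appears in the body

print(is_range_restricted(ok_rule))   # True
print(is_range_restricted(bad_rule))  # False
```

Under this check a rule with an empty body is accepted only if its head is ground, since an empty body contains no variables.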

There are two common conventions for variable names: capitalizing variables, or prefixing them with a question mark ?.^[3]

Note that under this definition, Datalog includes neither negation nor aggregates; see § Extensions for more information about those constructs.

Rules with empty bodies are called facts. For example, the following rule is a fact:

r(x) :- .

The set of facts is called the extensional database or EDB of the Datalog program. The set of tuples computed by evaluating the Datalog program is called the intensional database or IDB.

Syntactic sugar

Many implementations of logic programming extend the above grammar to allow writing facts without the :-, like so:

r(x).

Some also allow writing 0-ary relations without parentheses, like so:

p :- q.

These are merely abbreviations (syntactic sugar); they have no impact on the semantics of the program.

Semantics

Main article: Syntax and semantics of logic programming

Herbrand universe, base, and model of a Datalog program
Program	edge(x, y). edge(y, z). path(A, B) :- edge(A, B). path(A, C) :- path(A, B), edge(B, C).
Herbrand universe	x, y, z
Herbrand base	edge(x, x), edge(x, y), ..., edge(z, z), path(x, x), ..., path(z, z)
Herbrand model	edge(x, y), edge(y, z), path(x, y), path(y, z), path(x, z)

There are three widely used approaches to the semantics of Datalog programs: model-theoretic, fixed-point, and proof-theoretic. These three approaches can be proven equivalent.^[4]

An atom is called ground if none of its subterms are variables. Intuitively, each of the semantics defines the meaning of a program to be the set of all ground atoms that can be deduced from the rules of the program, starting from the facts.

Model theoretic

A rule is called ground if all of its atoms (head and body) are ground. A ground rule R₁ is a ground instance of another rule R₂ if R₁ is the result of a substitution of constants for all the variables in R₂. The Herbrand base of a Datalog program is the set of all ground atoms that can be made with the constants appearing in the program. The Herbrand model of a Datalog program is the smallest subset of the Herbrand base such that, for each ground instance of each rule in the program, if the atoms in the body of the rule are in the set, then so is the head.^[5] The model-theoretic semantics define the minimal Herbrand model to be the meaning of the program.
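For a small program the Herbrand base can be enumerated directly. The following Python sketch does so for the edge/path program used in this section, assuming atoms are encoded as `(relation, arguments)` pairs (an illustrative encoding), and checks that the minimal model is a subset of the base:

```python
from itertools import product

constants = ["x", "y", "z"]          # the Herbrand universe of the program
arities = {"edge": 2, "path": 2}     # relation name -> number of arguments

def herbrand_base(constants, arities):
    """Enumerate every ground atom that can be built from the constants and relations."""
    return {
        (rel, args)
        for rel, n in arities.items()
        for args in product(constants, repeat=n)
    }

base = herbrand_base(constants, arities)

# The minimal Herbrand model of the edge/path program.
model = {
    ("edge", ("x", "y")), ("edge", ("y", "z")),
    ("path", ("x", "y")), ("path", ("y", "z")), ("path", ("x", "z")),
}

print(len(base))      # 18: two binary relations over three constants
print(model <= base)  # True
```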

Fixed-point

Let $I$ be the power set of the Herbrand base of a program P. The immediate consequence operator for P is a map $T$ from $I$ to $I$ that adds all of the new ground atoms that can be derived from the rules of the program in a single step. The least-fixed-point semantics define the least fixed point of $T$ to be the meaning of the program; this coincides with the minimal Herbrand model.^[6]

The fixpoint semantics suggest an algorithm for computing the minimal model: Start with the set of ground facts in the program, then repeatedly add consequences of the rules until a fixpoint is reached. This algorithm is called naïve evaluation.
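As an illustration, the sketch below hard-codes one common formulation of the immediate consequence operator for the program `edge(x, y). edge(y, z). path(A, B) :- edge(A, B). path(A, C) :- path(A, B), edge(B, C).` and iterates it from the empty set. The encoding of atoms as triples and the function names are assumptions made for this example:

```python
EDGES = {("edge", "x", "y"), ("edge", "y", "z")}

def T(interp):
    """One application of the immediate consequence operator for the edge/path program."""
    out = set(EDGES)  # facts are rules with empty bodies, so they are always derived
    # path(A, B) :- edge(A, B).
    out |= {("path", a, b) for (rel, a, b) in interp if rel == "edge"}
    # path(A, C) :- path(A, B), edge(B, C).
    out |= {
        ("path", a, c)
        for (r1, a, b) in interp if r1 == "path"
        for (r2, b2, c) in interp if r2 == "edge" and b2 == b
    }
    return out

def least_fixed_point(op):
    """Iterate op starting from the empty set until the result stops changing."""
    current = set()
    while True:
        nxt = op(current)
        if nxt == current:
            return current
        current = nxt

lfp = least_fixed_point(T)
print(sorted(lfp))
```

The successive stages contain 0, 2, 4, and 5 atoms; the last stage is unchanged by a further application of `T` and equals the minimal Herbrand model.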

Proof-theoretic

The proof-theoretic semantics defines the meaning of a Datalog program to be the set of facts with corresponding proof trees. Intuitively, a proof tree shows how to derive a fact from the facts and rules of a program.
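A proof tree can be made concrete by recording, for each derived fact, the rule and premises that first produced it. The following Python sketch does this for the ancestor rules from the example section; the labels `rule1` and `rule2` and the helper names are hypothetical:

```python
def derive_with_proofs(parent):
    """Derive ancestor facts bottom-up, remembering one justification per fact."""
    proofs = {}
    # ancestor(X, Y) :- parent(X, Y).
    for (x, y) in parent:
        proofs[("ancestor", x, y)] = ("rule1", [("parent", x, y)])
    changed = True
    while changed:
        changed = False
        # ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
        for (x, z) in parent:
            for (_rel, z2, y) in list(proofs):
                if z2 == z and ("ancestor", x, y) not in proofs:
                    proofs[("ancestor", x, y)] = (
                        "rule2", [("parent", x, z), ("ancestor", z, y)])
                    changed = True
    return proofs

def render(atom, proofs, depth=0):
    """Return the proof tree of an atom as indented lines; parent facts are leaves."""
    lines = ["  " * depth + "%s(%s, %s)" % atom]
    if atom in proofs:
        _rule, premises = proofs[atom]
        for premise in premises:
            lines += render(premise, proofs, depth + 1)
    return lines

parent = {("xerces", "brooke"), ("brooke", "damocles")}
proofs = derive_with_proofs(parent)
print("\n".join(render(("ancestor", "xerces", "damocles"), proofs)))
```

Because a justification is stored only the first time a fact is derived, every premise was derived earlier than its conclusion, so the rendered tree is finite.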

One might be interested in knowing whether or not a particular ground atom appears in the minimal Herbrand model of a Datalog program, perhaps without caring much about the rest of the model. A top-down reading of the proof trees described above suggests an algorithm for computing the results of such queries. This reading informs the SLD resolution algorithm, which forms the basis for the evaluation of Prolog.

Evaluation

There are many different ways to evaluate a Datalog program, with different performance characteristics.

Bottom-up evaluation strategies

Bottom-up evaluation strategies start with the facts in the program and repeatedly apply the rules until either some goal or query is established, or until the complete minimal model of the program is produced.

Naïve evaluation

Naïve evaluation mirrors the fixpoint semantics for Datalog programs. Naïve evaluation uses a set of "known facts", which is initialized to the facts in the program. It proceeds by repeatedly enumerating all ground instances of each rule in the program. If each atom in the body of the ground instance is in the set of known facts, then the head atom is added to the set of known facts. This process is repeated until a fixed point is reached, and no more facts may be deduced. Naïve evaluation produces the entire minimal model of the program.^[7]
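The procedure can be sketched in a few lines of Python. The version below is a deliberately unoptimized illustration that enumerates ground instances exactly as described; the data representation (atoms as `(relation, terms)` pairs, rules as `(head, body)` pairs) and all function names are assumptions made for the example. Newly derived facts become visible within the same pass, which does not change the fixed point that is reached:

```python
from itertools import product

def is_variable(term):
    """Capitalization convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def substitute(atom, subst):
    """Replace each variable in the atom by its value under the substitution."""
    relation, terms = atom
    return (relation, tuple(subst.get(t, t) for t in terms))

def ground_instances(rule, constants):
    """Yield every ground instance of a rule over the given constants."""
    head, body = rule
    names = sorted({t for _, terms in [head, *body] for t in terms if is_variable(t)})
    for values in product(constants, repeat=len(names)):
        subst = dict(zip(names, values))
        yield substitute(head, subst), [substitute(a, subst) for a in body]

def naive_eval(facts, rules):
    """Add rule heads whose bodies are known until no new fact can be added."""
    atoms = list(facts) + [a for head, body in rules for a in [head, *body]]
    constants = sorted({t for _, terms in atoms for t in terms if not is_variable(t)})
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for head, body in ground_instances(rule, constants):
                if head not in known and all(a in known for a in body):
                    known.add(head)
                    changed = True
    return known

facts = {("parent", ("xerces", "brooke")), ("parent", ("brooke", "damocles"))}
rules = [
    (("ancestor", ("X", "Y")), [("parent", ("X", "Y"))]),
    (("ancestor", ("X", "Y")), [("parent", ("X", "Z")), ("ancestor", ("Z", "Y"))]),
]
model = naive_eval(facts, rules)
print(len(model))  # 5
```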

Semi-naïve evaluation

Semi-naïve evaluation is a bottom-up evaluation strategy that can be asymptotically faster than naïve evaluation.^[8]
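The central idea is to join recursive rules only against the facts that were new in the previous round (often called the delta) rather than against everything known so far. The Python sketch below shows this for the ancestor rules only; it is a simplified illustration with hypothetical names, and rules with several recursive body atoms need a more careful treatment that is not shown:

```python
def semi_naive_ancestor(parent):
    """Compute ancestor/2 from parent/2, joining only against last round's new facts."""
    # ancestor(X, Y) :- parent(X, Y).   (non-recursive, applied once)
    ancestor = set(parent)
    delta = set(parent)  # facts that were new in the previous round
    while delta:
        # ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).   (ancestor restricted to delta)
        derived = {(x, y) for (x, z) in parent for (z2, y) in delta if z == z2}
        delta = derived - ancestor
        ancestor |= delta
    return ancestor

chain = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(semi_naive_ancestor(chain)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

The loop stops as soon as a round derives nothing new, so it also terminates on cyclic input.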

Performance considerations

Naïve and semi-naïve evaluation both evaluate recursive Datalog rules by repeatedly applying them to a set of known facts until a fixed point is reached. In each iteration, rules are only run for "one step", i.e., non-recursively. As mentioned above, each non-recursive Datalog rule corresponds precisely to a conjunctive query. Therefore, many of the techniques from database theory used to speed up conjunctive queries are applicable to bottom-up evaluation of Datalog, such as

Index selection^[10]
Query optimization, especially join order^[11]^[12]
Join algorithms
Selection of data structures used to store relations; common choices include hash tables and B-trees, other possibilities include disjoint set data structures (for storing equivalence relations),^[13] bries (a variant of tries),^[14] binary decision diagrams,^[15] and even SMT formulas^[16]

Many such techniques are implemented in modern bottom-up Datalog engines such as Soufflé. Some Datalog engines integrate SQL databases directly.^[17]

Bottom-up evaluation of Datalog is also amenable to parallelization. Parallel Datalog engines are generally divided into two paradigms:

In the shared-memory, multi-core setting, Datalog engines execute on a single node. Coordination between threads may be achieved using locking or lock-free data structures. The shared-memory setting may be further divided into single instruction, multiple data and multiple instruction, multiple data paradigms:
- Datalog engines that execute on graphics processing units fall into the SIMD paradigm.^[18]
- Datalog engines using OpenMP^[19] are instances of the MIMD paradigm.
In the shared-nothing setting, Datalog engines execute on a cluster of nodes. Such engines generally operate by splitting relations into disjoint subsets based on a hash function, performing computations (joins) on each node, and then exchanging newly-generated tuples over the network.^[20] Examples include Datalog engines based on MPI,^[9] Hadoop,^[21] and Spark.^[22]
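The hash-partitioning scheme used in the shared-nothing setting can be simulated on a single machine. In the sketch below, which is only a toy model with hypothetical function names, both inputs of the join in `path(A, C) :- path(A, B), edge(B, C)` are partitioned on the join key `B`, so each simulated node can join its own partitions without consulting the others:

```python
import zlib

def partition(tuples, key_index, n_nodes):
    """Split a relation into n_nodes disjoint subsets by hashing one column."""
    parts = [set() for _ in range(n_nodes)]
    for t in tuples:
        node = zlib.crc32(t[key_index].encode()) % n_nodes
        parts[node].add(t)
    return parts

def distributed_join(path, edge, n_nodes):
    """Join path(A, B) with edge(B, C) node by node and collect the (A, C) results."""
    path_parts = partition(path, 1, n_nodes)  # B is the second column of path
    edge_parts = partition(edge, 0, n_nodes)  # B is the first column of edge
    result = set()
    for local_path, local_edge in zip(path_parts, edge_parts):
        # Tuples with equal B hash to the same node, so a local join suffices.
        result |= {(a, c) for (a, b) in local_path for (b2, c) in local_edge if b == b2}
    return result

path = {("x", "y"), ("y", "z"), ("w", "y")}
edge = {("y", "z"), ("z", "u")}
print(sorted(distributed_join(path, edge, n_nodes=3)))
# [('w', 'z'), ('x', 'z'), ('y', 'u')]
```

In a real engine the union at the end corresponds to exchanging the newly generated tuples over the network, after which they are repartitioned for the next iteration.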

Top-down evaluation strategies

SLD resolution is sound and complete for Datalog programs.

Magic sets

Top-down evaluation strategies begin with a query or goal. Bottom-up evaluation strategies can answer queries by computing the entire minimal model and matching the query against it, but this can be inefficient if the answer only depends on a small subset of the entire model. The magic sets algorithm takes a Datalog program and a query, and produces a more efficient program that computes the same answer to the query while still using bottom-up evaluation.^[23] A variant of the magic sets algorithm has been shown to produce programs that, when evaluated using semi-naïve evaluation, are as efficient as top-down evaluation.^[24]
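The effect of the transformation can be seen on the ancestor program with a query whose first argument is bound, such as `?- ancestor(xerces, X).` The Python sketch below evaluates a hand-transformed program bottom-up; the auxiliary relation `magic` collects the constants that are relevant to the query, and every rule is guarded by it. The rewriting was done by hand for this one query and the names are illustrative; the general algorithm, which works on adorned programs, is not shown:

```python
def magic_ancestor(parent, start):
    """Bottom-up evaluation of the ancestor program rewritten for ?- ancestor(start, X)."""
    magic = {start}      # magic(start).
    ancestor = set()
    changed = True
    while changed:
        changed = False
        # magic(Z) :- magic(X), parent(X, Z).
        new_magic = {z for (x, z) in parent if x in magic} - magic
        # ancestor(X, Y) :- magic(X), parent(X, Y).
        new_anc = {(x, y) for (x, y) in parent if x in magic}
        # ancestor(X, Y) :- magic(X), parent(X, Z), ancestor(Z, Y).
        new_anc |= {
            (x, y)
            for (x, z) in parent if x in magic
            for (z2, y) in ancestor if z2 == z
        }
        new_anc -= ancestor
        if new_magic or new_anc:
            magic |= new_magic
            ancestor |= new_anc
            changed = True
    answers = sorted(y for (x, y) in ancestor if x == start)
    return answers, ancestor

parent = {
    ("xerces", "brooke"), ("brooke", "damocles"),
    ("alice", "bob"), ("bob", "carol"),  # hypothetical second family, irrelevant to the query
}
answers, derived = magic_ancestor(parent, "xerces")
print(answers)       # ['brooke', 'damocles']
print(len(derived))  # 3
```

Evaluating the original program on the same input would derive six ancestor facts, three of which concern the unrelated family and play no part in answering the query.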

Complexity

The decision problem formulation of Datalog evaluation is as follows: Given a Datalog program $P$ split into a set of facts (EDB) $E$ and a set of rules $R$ , and a ground atom $A$ , is $A$ in the minimal model of $P$ ? In this formulation, there are three variations of the computational complexity of evaluating Datalog programs:^[25]

The data complexity is the complexity of the decision problem when $A$ and $E$ are inputs and $R$ is fixed.
The program complexity is the complexity of the decision problem when $A$ and $R$ are inputs and $E$ is fixed.
The combined complexity is the complexity of the decision problem when $A$ , $E$ , and $R$ are inputs.

With respect to data complexity, the decision problem for Datalog is P-complete. With respect to program complexity, the decision problem is EXPTIME-complete. In particular, evaluating Datalog programs always terminates; Datalog is not Turing-complete.

Some extensions to Datalog do not preserve these complexity bounds. Extensions implemented in some Datalog engines, such as algebraic data types, can even make the resulting language Turing-complete.

Extensions

Several extensions have been made to Datalog, e.g., to support negation, aggregate functions, inequalities, to allow object-oriented programming, or to allow disjunctions as heads of clauses. These extensions have significant impacts on the language's semantics and on the implementation of a corresponding interpreter.

Datalog is a syntactic subset of Prolog, disjunctive Datalog, answer set programming, DatalogZ, and constraint logic programming. When evaluated as an answer set program, a Datalog program yields a single answer set, which is exactly its minimal model.^[26]

Many implementations of Datalog extend Datalog with additional features; see § Datalog engines for more information.

Aggregation

Datalog can be extended to support aggregate functions.^[27]

Notable Datalog engines that implement aggregation include:

Negation

Further information: Syntax and semantics of logic programming § Extending Datalog with negation

Adding negation to Datalog complicates its semantics, leading to whole new languages and strategies for evaluation. For example, the language that results from adding negation with the stable model semantics is exactly answer set programming.

Stratified negation can be added to Datalog while retaining its model-theoretic and fixed-point semantics. Notable Datalog engines that implement stratified negation include:

Comparison to Prolog

Unlike in Prolog, statements of a Datalog program can be stated in any order. Datalog does not have Prolog's cut operator. This makes Datalog a fully declarative language.

In contrast to Prolog, Datalog

disallows complex terms as arguments of predicates, e.g., p(x, y) is admissible but not p(f(x), y),
disallows negation,
requires that every variable that appears in the head of a clause also appear in a literal in the body of the clause.

This article deals primarily with Datalog without negation (see also Syntax and semantics of logic programming § Extending Datalog with negation). However, stratified negation is a common addition to Datalog; the following list contrasts Prolog with Datalog with stratified negation. Datalog with stratified negation

also disallows complex terms as arguments of predicates,
requires that every variable that appears in the head of a clause also appear in a positive (i.e., not negated) atom in the body of the clause,
requires that every variable appearing in a negative literal in the body of a clause also appear in some positive literal in the body of the clause.^[30]
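These conditions allow a program with stratified negation to be evaluated one stratum at a time: a negated relation is computed completely before any rule that negates it is applied. The Python sketch below illustrates this for the hypothetical program `reachable(a). reachable(Y) :- reachable(X), edge(X, Y). unreachable(X) :- node(X), not reachable(X).`, in which the positive atom `node(X)` binds the variable used in the negative literal:

```python
nodes = {"a", "b", "c", "d"}
edges = {("a", "b"), ("b", "c")}

def reachable_from(start, edges):
    """Stratum 1: reachable(start). reachable(Y) :- reachable(X), edge(X, Y)."""
    reached = {start}
    frontier = {start}
    while frontier:
        frontier = {b for (a, b) in edges if a in frontier} - reached
        reached |= frontier
    return reached

# Stratum 1 is computed to completion first.
reachable = reachable_from("a", edges)

# Stratum 2: unreachable(X) :- node(X), not reachable(X).
# The negation is a simple membership test because reachable is already final.
unreachable = {x for x in nodes if x not in reachable}

print(sorted(reachable))    # ['a', 'b', 'c']
print(sorted(unreachable))  # ['d']
```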

Expressiveness

Datalog generalizes many other query languages. For instance, conjunctive queries and union of conjunctive queries can be expressed in Datalog. Datalog can also express regular path queries.

The boundedness problem for Datalog asks, given a Datalog program, whether it is bounded, i.e., the maximal recursion depth reached when evaluating the program on an input database can be bounded by some constant. In other words, this question asks whether the Datalog program could be rewritten as a nonrecursive Datalog program, or, equivalently, as a union of conjunctive queries. Solving the boundedness problem on arbitrary Datalog programs is undecidable,^[31] but it can be made decidable by restricting to some fragments of Datalog.

Datalog engines

Systems that implement languages inspired by Datalog, whether compilers, interpreters, libraries, or embedded DSLs, are referred to as Datalog engines. Datalog engines often implement extensions of Datalog, extending it with additional data types, foreign function interfaces, or support for user-defined lattices. Such extensions may allow for writing non-terminating or otherwise ill-defined programs.

Here is a short list of systems that are either based on Datalog or provide a Datalog interpreter:

Free software/open source

Written in	Name	Try it online	External Database	Description	Licence
C	XSB			A logic programming and deductive database system for Unix and Microsoft Windows with tabling giving Datalog-like termination and efficiency, including incremental evaluation^[32]	GNU LGPL
C++	Coral^[33]			A deductive database system written in C++ with semi-naïve datalog evaluation. Developed 1988-1997.	custom licence, free for non-commercial use
	DLV^[34]			A Datalog extension that supports disjunctive head clauses.	custom licence, free for academic and non-commercial educational use, as well as for use by non-profit organisations^[35]
	Inter4QL^[36]			an open-source command-line interpreter of Datalog-like 4QL query language implemented in C++ for Windows, Mac OS X and Linux. Negation is allowed in heads and bodies of rules as well as in recursion	GNU GPL v3
	RDFox^[37]		in-memory	A high-performance RDF triple store with OWL and Datalog reasoning. Implements the FBF algorithm for incremental evaluation, extending Datalog to include stratified negation and equality. Capable of running in a high availability setup.	custom licence, free for non-commercial use^[38]
	Soufflé	Yes	file, in-memory, sqlite3	an open-source Datalog engine that has a compiler translating Datalog to high-performance, parallel C++ code and a high-performance interpreter; specifically designed for complex Datalog queries over large data sets as encountered in the context of static program analysis	UPL v1.0
Clojure	Cascalog		Hadoop	a Clojure library for querying data stored on Hadoop clusters	Apache
	Clojure Datalog			a contributed library implementing aspects of Datalog	Eclipse Public License 1.0
	XTDB (formerly Crux)	Yes	Apache Kafka	A general-purpose database with an "unbundled" architecture, using log-centric streaming of documents and transactions to achieve significant architectural flexibility and elegant horizontal scaling. Pluggable components include Kafka, RocksDB and LMDB. Indexes are bitemporal to support point-in-time Datalog queries by default. Java and HTTP APIs are provided.	MIT License
	Datascript		in-memory	Immutable database and Datalog query engine that runs in the browser	Eclipse Public License 1.0
	Datalevin		LMDB	A fork of Datascript that is optimized for the LMDB durable storage	Eclipse Public License 1.0
	Datahike		file, in-memory	A fork of Datascript with a durable backend using a hitchhiker tree.	Eclipse Public License 1.0
	Naga/Asami		file, in-memory	A combination of a graph database (Asami), and a rules processing system (Naga) that evaluates native Datalog syntax and executes using the database. Runs in browsers (memory), on the JVM (memory/files), or natively (memory/files).	Eclipse Public License 1.0
Erlang	Datalog			The library is designed to query and formalise relation of n-ary streams using datalog. It implements an ad-hoc query engine using simplified version of general logic programming paradigm. The library facilitates development of data integration, information exchange and semantic web applications.	Apache v2
Go	Mangle			Mangle is a programming language for deductive database programming. It is an extension of Datalog, with various extensions like aggregation, function calls and optional type-checking.	Apache v2
Haskell	Dyna^[39]			Dyna is a declarative programming language for statistical AI programming. The language is based on Datalog, supports both forward and backward chaining, and incremental evaluation.	GNU AGPL v3
Java	AbcDatalog^[40]			AbcDatalog is an open-source implementation of the logic programming language Datalog written in Java. It provides ready-to-use implementations of common Datalog evaluation algorithms, as well as some experimental multi-threaded evaluation engines. It supports language features beyond core Datalog such as explicit (dis-)unification of terms and stratified negation. Additionally, AbcDatalog is designed to be easily extensible with new evaluation engines and new language features.	BSD
	IRIS^[41]			IRIS extends Datalog with function symbols, built-in predicates, locally stratified or un-stratified logic programs (using the well-founded semantics), unsafe rules and XML schema data types	GNU LGPL v2.1
	Jena			a Semantic Web framework which includes a Datalog implementation as part of its general purpose rule engine, which provides OWL and RDFS support.^[42]	Apache v2
	SociaLite^[43]			SociaLite is a datalog variant for large-scale graph analysis developed in Stanford	Apache v2
	Graal^[44]			Graal is a Java toolkit dedicated to querying knowledge bases within the framework of existential rules, aka Datalog+/-.	CeCILL v2.1
	Flix	Yes		A functional and logic programming language inspired by Datalog extended with user-defined lattices and monotone filter/transfer functions.	Apache v2
Lua	Datalog^[45]	Yes^[46]		a lightweight deductive database system.	GNU LGPL
OCaml	datalog^[47]			An in-memory datalog implementation for OCaml featuring bottom-up and top-down algorithms.	BSD 2-clause
Prolog	DES^[48]			an open-source implementation to be used for teaching Datalog in courses	GNU LGPL
Python	pyDatalog		11 dialects of SQL	adds logic programming to Python's toolbox. It can run logic queries on databases or Python objects, and use logic clauses to define the behavior of Python classes.	GNU LGPL
Racket	Datalog for Racket^[49]				GNU LGPL
Racket	Datafun^[50]			Generalized Datalog on Semilattices	GNU LGPL
Ruby	bloom / bud			A Ruby DSL for programming with data-centric constructs, based on the Dedalus extension of Datalog which adds a temporal dimension to the logic.	BSD 3-Clause
Rust	Crepe			Crepe is a library that allows you to write declarative logic programs in Rust, with a Datalog-like syntax. It provides a procedural macro that generates efficient, safe code and interoperates seamlessly with Rust programs. It also supports extensions like stratified negation, semi-naive evaluation, and calling external functions within Datalog rules.	MIT License / Apache 2.0
	Datafrog			Datafrog is a lightweight Datalog engine intended to be embedded in other Rust programs.	MIT License / Apache 2.0
	TerminusDB		In-memory	TerminusDB is an open source graph database and document store. Designed for collaboratively building data-intensive applications and knowledge graphs.	Apache v2
	DDlog^[51]			DDlog is an incremental, in-memory, typed Datalog engine. It is well suited for writing programs that incrementally update their output in response to input changes. The DDlog programmer specifies the desired input-output mapping in a declarative manner, using a dialect of Datalog. The DDlog compiler then synthesizes an efficient incremental implementation in Rust. DDlog is based on the differential dataflow^[52] library. It offers bindings for Java, C, and Go.	MIT License
Tcl	tclbdd^[53]			Implementation based on binary decision diagrams. Built to support development of an optimizing compiler for Tcl.	BSD
Other or Unknown Languages	bddbddb^[54]			an implementation of Datalog done at Stanford University. It is mainly used to query Java bytecode including points-to analysis on large Java programs. It uses BDDs internally.	GNU LGPL
	ConceptBase^[55]			a deductive and object-oriented database system based on a Datalog query evaluator : Prolog for triggered procedures and rewrites, axiomatized Datalog called « Telos » for (meta)modeling. It is mainly used for conceptual modeling and metamodeling	BSD 2-Clause

Non-free software

Datomic is a distributed database designed to enable scalable, flexible and intelligent applications, running on new cloud architectures. It uses Datalog as the query language.
FoundationDB provides a free-of-charge database binding for pyDatalog, with a tutorial on its use.^[56]
Leapsight Semantic Dataspace (LSD) is a distributed deductive database that offers high availability, fault tolerance, operational simplicity, and scalability. LSD uses Leaplog (a Datalog implementation) for querying and reasoning and was created by Leapsight.^[57]
LogicBlox, a commercial implementation of Datalog used for web-based retail planning and insurance applications.
Profium Sense is a native RDF compliant graph database written in Java. It provides Datalog evaluation support of user defined rules.
.QL, a commercial object-oriented variant of Datalog created by Semmle for analyzing source code to detect security vulnerabilities.^[58]
SecPAL, a security policy language developed by Microsoft Research.^[59]
Stardog is a graph database, implemented in Java. It provides support for RDF and all OWL 2 profiles providing extensive reasoning capabilities, including datalog evaluation.
StrixDB: a commercial RDF graph store, SPARQL compliant with Lua API and Datalog inference capabilities. Could be used as httpd (Apache HTTP Server) module or standalone (although beta versions are under the Perl Artistic License 2.0).

Uses and influence

Datalog is quite limited in its expressivity. It is not Turing-complete, and doesn't include basic data types such as integers or strings. This parsimony is appealing from a theoretical standpoint, but it means Datalog per se is rarely used as a programming language or knowledge representation language.^[60] Most Datalog engines implement substantial extensions of Datalog. However, Datalog has a strong influence on such implementations, and many authors don't bother to distinguish them from Datalog as presented in this article. Accordingly, the applications discussed in this section include applications of realistic implementations of Datalog-based languages.

Datalog has been applied to problems in data integration, information extraction, networking, security, cloud computing and machine learning.^[61]^[62] Google has developed an extension to Datalog for big data processing.^[63]

Datalog has seen application in static program analysis.^[64] The Soufflé dialect has been used to write pointer analyses for Java and a control-flow analysis for Scheme.^[65]^[66] Datalog has been integrated with SMT solvers to make it easier to write certain static analyses.^[67] The Flix dialect is also suited to writing static program analyses.^[68]

Some widely used database systems include ideas and algorithms developed for Datalog. For example, the SQL:1999 standard includes recursive queries, and the Magic Sets algorithm (initially developed for the faster evaluation of Datalog queries) is implemented in IBM's DB2.^[69]

History

The origins of Datalog date back to the beginning of logic programming, but it became prominent as a separate area around 1977 when Hervé Gallaire and Jack Minker organized a workshop on logic and databases.^[70] David Maier is credited with coining the term Datalog.^[71]

References

Ceri, S.; Gottlob, G.; Tanca, L. (March 1989). "What you always wanted to know about Datalog (and never dared to ask)" (PDF). IEEE Transactions on Knowledge and Data Engineering. 1 (1): 146–166. CiteSeerX 10.1.1.210.1118. doi:10.1109/69.43410. ISSN 1041-4347.
Abiteboul, S.; Hull, R.; Vianu, V. (1995). Foundations of Databases. Reading, Mass.: Addison-Wesley. ISBN 0-201-53771-0. OCLC 30546436.