In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then, positive-definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in Fourier analysis, probability theory, operator theory, complex function theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problems, information theory, and other areas.

Definition

Let ${\mathcal {X}}$ be a nonempty set, sometimes referred to as the index set. A symmetric function $K:{\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R}$ is called a positive-definite (p.d.) kernel on ${\mathcal {X}}$ if

$\sum _{i=1}^{n}\sum _{j=1}^{n}c_{i}c_{j}K(x_{i},x_{j})\geq 0$ (1.1)

holds for all $x_{1},\dots ,x_{n}\in {\mathcal {X}}$, $n\in \mathbb {N} ,c_{1},\dots ,c_{n}\in \mathbb {R}$.

In probability theory, a distinction is sometimes made between positive-definite kernels, for which equality in (1.1) implies $c_{i}=0\;(\forall i)$ , and positive semi-definite (p.s.d.) kernels, which do not impose this condition. Note that this is equivalent to requiring that every finite matrix constructed by pairwise evaluation, $\mathbf {K} _{ij}=K(x_{i},x_{j})$ , has either entirely positive (p.d.) or nonnegative (p.s.d.) eigenvalues.
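
Condition (1.1) can be probed numerically by evaluating the quadratic form on a finite sample. The following is a minimal sketch, not a proof: the helper names (`gram_matrix`, `quadratic_form`), the sample points, and the two kernels are assumptions chosen for this illustration. A single negative value certifies that a kernel is not p.d., whereas nonnegative values on random coefficients are merely consistent with positive definiteness.

```python
import math
import random


def gram_matrix(kernel, points):
    """Pairwise evaluation K_ij = kernel(x_i, x_j)."""
    return [[kernel(x, y) for y in points] for x in points]


def quadratic_form(gram, coeffs):
    """Left-hand side of (1.1): sum_i sum_j c_i c_j K(x_i, x_j)."""
    n = len(coeffs)
    return sum(coeffs[i] * coeffs[j] * gram[i][j]
               for i in range(n) for j in range(n))


def gaussian(x, y, sigma=1.0):
    """Gaussian kernel on the real line (a known p.d. kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def abs_difference(x, y):
    """|x - y| is symmetric but NOT positive definite."""
    return abs(x - y)


points = [0.0, 0.7, 1.5, 3.0]
rng = random.Random(0)  # fixed seed so the run is reproducible

# Smallest value of the quadratic form over random coefficient vectors.
gauss_gram = gram_matrix(gaussian, points)
gauss_min = min(
    quadratic_form(gauss_gram, [rng.uniform(-1.0, 1.0) for _ in points])
    for _ in range(1000)
)

# Explicit counterexample for |x - y|: points 0 and 1, c = (1, -1).
counter = quadratic_form(gram_matrix(abs_difference, [0.0, 1.0]), [1.0, -1.0])

print(gauss_min >= -1e-12)  # True (small tolerance for rounding)
print(counter)              # -2.0
```

The counterexample shows why symmetry alone is not enough: the Gram matrix of $|x-y|$ on the points 0 and 1 has a negative eigenvalue, so (1.1) fails for $c=(1,-1)$.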

In the mathematical literature, kernels are usually complex-valued functions. That is, a complex-valued function $K:{\mathcal {X}}\times {\mathcal {X}}\to \mathbb {C}$ is called a Hermitian kernel if $K(x,y)={\overline {K(y,x)}}$ and positive definite if for every finite set of points $x_{1},\dots ,x_{n}\in {\mathcal {X}}$ and any complex numbers $\xi _{1},\dots ,\xi _{n}\in \mathbb {C}$,

$\sum _{i=1}^{n}\sum _{j=1}^{n}\xi _{i}{\overline {\xi }}_{j}K(x_{i},x_{j})\geq 0$

where ${\overline {\xi }}_{j}$ denotes the complex conjugate of $\xi _{j}$.^[1] In the rest of this article we assume real-valued functions, which is the common practice in applications of p.d. kernels.

Some general properties

For a family of p.d. kernels $(K_{i})_{i\in \mathbb {N} },\ \ K_{i}:{\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R}$:
- The conical sum $\sum _{i=1}^{n}\lambda _{i}K_{i}$ is p.d., given $\lambda _{1},\dots ,\lambda _{n}\geq 0$.
- The product $K_{1}^{a_{1}}\dots K_{n}^{a_{n}}$ is p.d., given $a_{1},\dots ,a_{n}\in \mathbb {N}$.
- The limit $K=\lim _{n\to \infty }K_{n}$ is p.d. if the limit exists.
If $({\mathcal {X}}_{i})_{i=1}^{n}$ is a sequence of sets, and $(K_{i})_{i=1}^{n},\ \ K_{i}:{\mathcal {X}}_{i}\times {\mathcal {X}}_{i}\to \mathbb {R}$ a sequence of p.d. kernels, then both $K((x_{1},\dots ,x_{n}),(y_{1},\dots ,y_{n}))=\prod _{i=1}^{n}K_{i}(x_{i},y_{i})$ and $K((x_{1},\dots ,x_{n}),(y_{1},\dots ,y_{n}))=\sum _{i=1}^{n}K_{i}(x_{i},y_{i})$ are p.d. kernels on ${\mathcal {X}}={\mathcal {X}}_{1}\times \dots \times {\mathcal {X}}_{n}$.
Let ${\mathcal {X}}_{0}\subset {\mathcal {X}}$. Then the restriction $K_{0}$ of $K$ to ${\mathcal {X}}_{0}\times {\mathcal {X}}_{0}$ is also a p.d. kernel.
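
The closure under conical sums follows directly from (1.1). A short worked derivation is sketched below; to avoid a clash with the number $n$ of kernels, the number of sample points is written $m$, which is a notational assumption made only for this example.

```latex
\sum_{i,j=1}^{m} c_i c_j \Bigl( \sum_{k=1}^{n} \lambda_k K_k(x_i,x_j) \Bigr)
  = \sum_{k=1}^{n} \lambda_k
    \underbrace{\sum_{i,j=1}^{m} c_i c_j K_k(x_i,x_j)}_{\geq 0 \text{ by (1.1)}}
  \;\geq\; 0 ,
  \qquad \lambda_1,\dots,\lambda_n \geq 0 .
```

The product rule is less immediate: it rests on the Schur product theorem, which states that the entrywise (Hadamard) product of two positive semi-definite matrices is again positive semi-definite. The restriction property needs no computation, since every finite subset of ${\mathcal {X}}_{0}$ is also a finite subset of ${\mathcal {X}}$.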

Examples of p.d. kernels

Common examples of p.d. kernels defined on Euclidean space $\mathbb {R} ^{d}$ include:
- Linear kernel: $K(\mathbf {x} ,\mathbf {y} )=\mathbf {x} ^{T}\mathbf {y} ,\quad \mathbf {x} ,\mathbf {y} \in \mathbb {R} ^{d}$.
- Polynomial kernel: $K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{T}\mathbf {y} +r)^{n},\quad \mathbf {x} ,\mathbf {y} \in \mathbb {R} ^{d},r\geq 0,n\geq 1$.
- Gaussian kernel (RBF kernel): $K(\mathbf {x} ,\mathbf {y} )=e^{-{\frac {\|\mathbf {x} -\mathbf {y} \|^{2}}{2\sigma ^{2}}}},\quad \mathbf {x} ,\mathbf {y} \in \mathbb {R} ^{d},\sigma >0$.
- Laplacian kernel: $K(\mathbf {x} ,\mathbf {y} )=e^{-\alpha \|\mathbf {x} -\mathbf {y} \|},\quad \mathbf {x} ,\mathbf {y} \in \mathbb {R} ^{d},\alpha >0$.
- Abel kernel: $K(x,y)=e^{-\alpha |x-y|},\quad x,y\in \mathbb {R} ,\alpha >0$.
- Kernel generating Sobolev spaces $W_{2}^{k}(\mathbb {R} ^{d})$: $K(x,y)=\|x-y\|_{2}^{k-{\frac {d}{2}}}B_{k-{\frac {d}{2}}}(\|x-y\|_{2})$, where $B_{\nu }$ is the Bessel function of the third kind.
- Kernel generating Paley–Wiener space: $K(x,y)=\operatorname {sinc} (\alpha (x-y)),\quad x,y\in \mathbb {R} ,\alpha >0$.
If $H$ is a Hilbert space, then its corresponding inner product $(\cdot ,\cdot )_{H}:H\times H\to \mathbb {R}$ is a p.d. kernel. Indeed, we have $\sum _{i,j=1}^{n}c_{i}c_{j}(x_{i},x_{j})_{H}=\left(\sum _{i=1}^{n}c_{i}x_{i},\sum _{j=1}^{n}c_{j}x_{j}\right)_{H}=\left\|\sum _{i=1}^{n}c_{i}x_{i}\right\|_{H}^{2}\geq 0$
Kernels defined on $\mathbb {R} _{+}^{d}$ and histograms: Histograms are frequently encountered in real-world applications. Observations are often available in the form of nonnegative vectors of counts, which, if normalized, yield histograms of frequencies. It has been shown^[2] that the following family of squared metrics, respectively the Jensen divergence, the $\chi$-square, the total variation, and two variations of the Hellinger distance: $\psi _{JD}=H\left({\frac {\theta +\theta '}{2}}\right)-{\frac {H(\theta )+H(\theta ')}{2}},$ $\psi _{\chi ^{2}}=\sum _{i}{\frac {(\theta _{i}-\theta _{i}')^{2}}{\theta _{i}+\theta _{i}'}},\quad \psi _{TV}=\sum _{i}\left|\theta _{i}-\theta _{i}'\right|,$ $\psi _{H_{1}}=\sum _{i}\left|{\sqrt {\theta _{i}}}-{\sqrt {\theta _{i}'}}\right|,\quad \psi _{H_{2}}=\sum _{i}\left|{\sqrt {\theta _{i}}}-{\sqrt {\theta _{i}'}}\right|^{2},$ can be used to define p.d. kernels using the formula $K(\theta ,\theta ')=e^{-\alpha \psi (\theta ,\theta ')},\alpha >0.$
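
As an illustration of the histogram construction, the sketch below builds the kernel $e^{-\alpha \psi _{\chi ^{2}}}$ for a few small count vectors and evaluates the quadratic form of (1.1) on random coefficients. The function names, the count data, the value $\alpha =1$, and the convention that bins empty in both histograms contribute zero are all assumptions made for this example.

```python
import math
import random


def chi_square(theta, theta_prime):
    """psi_chi2 = sum_i (t_i - t'_i)^2 / (t_i + t'_i); empty bins contribute 0."""
    total = 0.0
    for a, b in zip(theta, theta_prime):
        if a + b > 0.0:
            total += (a - b) ** 2 / (a + b)
    return total


def histogram_kernel(theta, theta_prime, alpha=1.0, psi=chi_square):
    """K(theta, theta') = exp(-alpha * psi(theta, theta')), alpha > 0."""
    return math.exp(-alpha * psi(theta, theta_prime))


def normalize(counts):
    """Turn a vector of nonnegative counts into a histogram of frequencies."""
    s = float(sum(counts))
    return [c / s for c in counts]


# Hypothetical count vectors with three bins each.
histograms = [normalize(c) for c in ([3, 1, 0], [1, 1, 2], [0, 4, 1], [2, 2, 2])]
gram = [[histogram_kernel(p, q) for q in histograms] for p in histograms]

# Evaluate the quadratic form of (1.1) on random coefficient vectors.
rng = random.Random(1)
worst = min(
    sum(c[i] * c[j] * gram[i][j] for i in range(4) for j in range(4))
    for c in ([rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(1000))
)
print(worst >= -1e-12)  # nonnegative up to rounding, consistent with (1.1)
```

Passing a different `psi` (for instance a total-variation or Hellinger function written in the same style) would give the other kernels of the family.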

History

Connection with reproducing kernel Hilbert spaces and feature maps

Further information: Reproducing kernel Hilbert space

Positive-definite kernels provide a framework that encompasses some basic Hilbert space constructions. In the following we present a tight relationship between positive-definite kernels and two mathematical objects, namely reproducing kernel Hilbert spaces and feature maps.

Let $X$ be a set, $H$ a Hilbert space of functions $f:X\to \mathbb {R}$ , and $(\cdot ,\cdot )_{H}:H\times H\to \mathbb {R}$ the corresponding inner product on $H$ . For any $x\in X$ the evaluation functional $e_{x}:H\to \mathbb {R}$ is defined by $f\mapsto e_{x}(f)=f(x)$ . We first define a reproducing kernel Hilbert space (RKHS):

Definition: Space $H$ is called a reproducing kernel Hilbert space if the evaluation functionals are continuous.

Every RKHS has a special function associated to it, namely the reproducing kernel:

Definition: A reproducing kernel is a function $K:X\times X\to \mathbb {R}$ such that, writing $K_{x}(\cdot )=K(x,\cdot )$,
$K_{x}(\cdot )\in H,\forall x\in X$, and

$(f,K_{x})_{H}=f(x)$, for all $f\in H$ and $x\in X$.
The latter property is called the reproducing property.

The following result shows equivalence between RKHS and reproducing kernels:

Theorem — Every reproducing kernel $K$ induces a unique RKHS, and every RKHS has a unique reproducing kernel.

The connection between positive definite kernels and RKHSs is given by the following theorem:

Theorem — Every reproducing kernel is positive-definite, and every positive definite kernel defines a unique RKHS, of which it is the unique reproducing kernel.

Thus, given a positive-definite kernel $K$ , it is possible to build an associated RKHS with $K$ as a reproducing kernel.

As stated earlier, positive definite kernels can be constructed from inner products. This fact can be used to connect p.d. kernels with another interesting object that arises in machine learning applications, namely the feature map. Let $F$ be a Hilbert space, and $(\cdot ,\cdot )_{F}$ the corresponding inner product. Any map $\Phi :X\to F$ is called a feature map, and $F$ is then called the feature space. It is easy to see^[10] that every feature map defines a unique p.d. kernel by $K(x,y)=(\Phi (x),\Phi (y))_{F}.$ Indeed, positive definiteness of $K$ follows from the p.d. property of the inner product. On the other hand, every p.d. kernel, and its corresponding RKHS, has many associated feature maps. For example, let $F=H$ and $\Phi (x)=K_{x}$ for all $x\in X$. Then $(\Phi (x),\Phi (y))_{F}=(K_{x},K_{y})_{H}=K(x,y)$, by the reproducing property. This suggests a new look at p.d. kernels as inner products in appropriate Hilbert spaces; in other words, p.d. kernels can be viewed as similarity maps that quantify how similar two points $x$ and $y$ are through the value $K(x,y)$. Moreover, through the equivalence of p.d. kernels and their corresponding RKHSs, every feature map can be used to construct an RKHS.
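
A small concrete case may help. For the polynomial kernel with $r=0$ and $n=2$ on $\mathbb {R} ^{2}$, the sketch below writes down two explicit feature maps and checks that both reproduce $K(x,y)=(x^{T}y)^{2}$. The function names and test points are assumptions for this illustration; the two maps show that a kernel does not determine its feature map uniquely.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel K(x, y) = (x^T y)^2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), a map from R^2 into F = R^3."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def feature_map_alt(x):
    """A different map, into R^4, that induces the same kernel."""
    return (x[0] ** 2, x[0] * x[1], x[1] * x[0], x[1] ** 2)


def inner(u, v):
    """Euclidean inner product on the feature space."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, y)                           # (1*3 + 2*(-1))^2 = 1
k_feature = inner(feature_map(x), feature_map(y))       # equal up to rounding
k_alt = inner(feature_map_alt(x), feature_map_alt(y))   # equal as well
print(k_direct, k_feature, k_alt)
```

Evaluating the kernel directly needs only one inner product in $\mathbb {R} ^{2}$, whereas the feature-space route first constructs higher-dimensional vectors; both give the same similarity value.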

Kernels and distances

Kernel methods are often compared to distance-based methods such as nearest neighbors. In this section we discuss parallels between their two respective ingredients, namely kernels $K$ and distances $d$.

Here by a distance function between each pair of elements of some set ${\mathcal {X}}$, we mean a metric defined on that set, i.e. any nonnegative-valued function $d$ on ${\mathcal {X}}\times {\mathcal {X}}$ which satisfies

$d(x,y)\geq 0$ , and $d(x,y)=0$ if and only if $x=y$ ,
$d(x,y)=d(y,x),$
$d(x,z)\leq d(x,y)+d(y,z).$

One link between distances and p.d. kernels is given by a particular kind of kernel, called a negative definite kernel, and defined as follows:

Definition: A symmetric function $\psi :{\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R}$ is called a negative definite (n.d.) kernel on ${\mathcal {X}}$ if
$\sum _{i,j=1}^{n}c_{i}c_{j}\psi (x_{i},x_{j})\leq 0$ (1.4)

holds for any $n\in \mathbb {N} ,x_{1},\dots ,x_{n}\in {\mathcal {X}},$ and $c_{1},\dots ,c_{n}\in \mathbb {R}$ such that $\sum _{i=1}^{n}c_{i}=0$.

The parallel between n.d. kernels and distances is the following: whenever an n.d. kernel vanishes on the set $\{(x,x):x\in {\mathcal {X}}\}$, and is zero only on this set, then its square root is a distance for ${\mathcal {X}}$.^[11] At the same time, not every distance corresponds to an n.d. kernel. This is only true for Hilbertian distances, where a distance $d$ is called Hilbertian if one can embed the metric space $({\mathcal {X}},d)$ isometrically into some Hilbert space.

On the other hand, n.d. kernels can be identified with a subfamily of p.d. kernels known as infinitely divisible kernels. A nonnegative-valued kernel $K$ is said to be infinitely divisible if for every $n\in \mathbb {N}$ there exists a positive-definite kernel $K_{n}$ such that $K=(K_{n})^{n}$.

Another link is that a p.d. kernel induces a pseudometric, where the first constraint on the distance function is loosened to allow $d(x,y)=0$ for $x\neq y$. Given a positive-definite kernel $K$, we can define a distance function as: $d(x,y)={\sqrt {K(x,x)-2K(x,y)+K(y,y)}}$
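
The formula for the kernel-induced distance can be read as the norm of a difference in the RKHS. The derivation below is a sketch that assumes the canonical feature map $x\mapsto K_{x}$ and the reproducing property from the section on reproducing kernel Hilbert spaces; the Gaussian case in the last line is included only as a worked instance.

```latex
d(x,y)^2 = \| K_x - K_y \|_H^2
         = (K_x, K_x)_H - 2\,(K_x, K_y)_H + (K_y, K_y)_H
         = K(x,x) - 2K(x,y) + K(y,y) .

% Worked instance: Gaussian kernel, for which K(x,x) = K(y,y) = 1
d(x,y)^2 = 2 - 2\, e^{-\frac{\|x-y\|^2}{2\sigma^2}} .
```

Symmetry and the triangle inequality are inherited from the Hilbert space norm. The distance vanishes exactly when $K_{x}=K_{y}$, which can happen for $x\neq y$, and this is why in general only a pseudometric is obtained.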

Some applications

Kernels in machine learning

Further information: Kernel method

Positive-definite kernels, through their equivalence with reproducing kernel Hilbert spaces (RKHS), are particularly important in the field of statistical learning theory because of the celebrated representer theorem, which states that every minimizer of a regularized empirical risk functional over an RKHS can be written as a linear combination of the kernel function evaluated at the training points. This is a practically useful result, as it reduces the empirical risk minimization problem from an infinite-dimensional to a finite-dimensional optimization problem.
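
To make the reduction to finitely many unknowns concrete, the sketch below uses kernel ridge regression as one specific instance; the squared loss, the penalty $\lambda \|f\|_{H}^{2}$, the Gaussian kernel, the toy data, and all function names are assumptions chosen for this example and are not prescribed by the theorem. Under these assumptions the minimizer has the form $f(x)=\sum _{i}\alpha _{i}K(x_{i},x)$, where $\alpha$ solves the $n\times n$ linear system $(\mathbf {K} +\lambda n\mathbf {I} )\alpha =y$.

```python
import math


def gaussian(x, y, sigma=1.0):
    """Gaussian kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit(xs, ys, lam, kernel=gaussian):
    """Coefficients alpha solving (K + lam * n * I) alpha = y."""
    n = len(xs)
    k = [[kernel(xi, xj) + (lam * n if i == j else 0.0)
          for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    return solve(k, ys)


def predict(x, xs, alpha, kernel=gaussian):
    """f(x) = sum_i alpha_i K(x_i, x): a finite kernel expansion."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))


xs = [0.0, 1.0, 2.0, 3.0]   # toy training inputs
ys = [0.0, 1.0, 4.0, 9.0]   # toy training responses
alpha = fit(xs, ys, lam=1e-8)

# With a very small lam the fitted function nearly interpolates the data.
print([round(predict(x, xs, alpha), 3) for x in xs])
```

The function is searched for in an infinite-dimensional space, yet only the four coefficients in `alpha` are ever computed; larger values of `lam` trade fidelity to the data for a smaller RKHS norm.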

Kernels in probabilistic models

There are several different ways in which kernels arise in probability theory.

Nondeterministic recovery problems: Assume that we want to find the response $f(x)$ of an unknown model function $f$ at a new point $x$ of a set ${\mathcal {X}}$, provided that we have a sample of input-response pairs $(x_{i},f_{i})=(x_{i},f(x_{i}))$ given by observation or experiment. The response $f_{i}$ at $x_{i}$ is not a fixed function of $x_{i}$ but rather a realization of a real-valued random variable $Z(x_{i})$. The goal is to get information about the function $E[Z(x)]$, which replaces $f$ in the deterministic setting. For two elements $x,y\in {\mathcal {X}}$ the random variables $Z(x)$ and $Z(y)$ will in general not be uncorrelated, because if $x$ is close to $y$ the random experiments described by $Z(x)$ and $Z(y)$ will often show similar behaviour. This is described by a covariance kernel $K(x,y)=E[Z(x)\cdot Z(y)]$. Such a kernel exists and is positive-definite under weak additional assumptions. Now a good estimate for $Z(x)$ can be obtained by using kernel interpolation with the covariance kernel, ignoring the probabilistic background completely.

Assume now that a noise variable $\epsilon (x)$, with zero mean and variance $\sigma ^{2}$, is added to the response at $x$, such that the noise is independent for different $x$ and independent of $Z$. Then the problem of finding a good estimate for $f$ is identical to the one above, but with a modified kernel given by $K(x,y)=E[Z(x)\cdot Z(y)]+\sigma ^{2}\delta _{xy}$.

Density estimation by kernels: The problem is to recover the density $f$ of a multivariate distribution over a domain ${\mathcal {X}}$ from a large sample $x_{1},\dots ,x_{n}\in {\mathcal {X}}$, possibly including repetitions. Where the sampling points lie dense, the true density function must take large values. A simple density estimate is obtained by counting the number of samples in each cell of a grid and plotting the resulting histogram, which yields a piecewise constant density estimate. A better estimate can be obtained by using a nonnegative translation-invariant kernel $K$ with total integral equal to one and defining ${\hat {f}}(x)={\frac {1}{n}}\sum _{i=1}^{n}K\left({\frac {x-x_{i}}{h}}\right)$ as a smooth estimate, where $h>0$ is a scale (bandwidth) parameter. On ${\mathcal {X}}=\mathbb {R} ^{d}$ the sum is additionally divided by $h^{d}$ so that the estimate integrates to one.
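
A one-dimensional sketch of such a kernel density estimate follows. The Gaussian profile, the sample values, the bandwidth $h=0.5$, and the function names are assumptions for this illustration; the factor $1/h$ is included so that the estimate integrates to one, which the Riemann sum at the end checks numerically.

```python
import math


def gaussian_profile(u):
    """Nonnegative kernel on the real line with total integral one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Density estimate (1 / (n h)) * sum_i K((x - x_i) / h) in one dimension."""
    n = len(samples)
    return sum(gaussian_profile((x - xi) / h) for xi in samples) / (n * h)


samples = [0.0, 0.5, 0.5, 1.0, 3.0]   # repetitions are allowed
h = 0.5                               # bandwidth, chosen by hand here

# Riemann sum over a wide grid: the estimate should integrate to about one.
step = 0.01
grid = [-10.0 + step * k for k in range(2301)]
total = sum(kde(x, samples, h) for x in grid) * step
print(round(total, 3))  # 1.0

# The estimate is larger where the samples cluster than in the gap.
print(kde(0.5, samples, h) > kde(2.0, samples, h))  # True
```

A smaller bandwidth produces a spikier estimate that follows individual samples, while a larger one smooths them together; choosing `h` is a modelling decision that this sketch does not address.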

Numerical solution of partial differential equations

Further information: Meshfree methods

One of the largest application areas of so-called meshfree methods is the numerical solution of PDEs. Some of the popular meshfree methods are closely related to positive-definite kernels, such as the meshless local Petrov–Galerkin (MLPG) method, the reproducing kernel particle method (RKPM), and smoothed-particle hydrodynamics (SPH). These methods use radial basis kernels for collocation.^[12]

Stinespring dilation theorem

Further information: Stinespring dilation theorem

Other applications

In the literature on computer experiments ^[13] and other engineering experiments, one increasingly encounters models based on p.d. kernels, RBFs or kriging. One such topic is response surface methodology. Other types of applications that boil down to data fitting are rapid prototyping and computer graphics. Here one often uses implicit surface models to approximate or interpolate point cloud data.

Applications of p.d. kernels in various other branches of mathematics are in multivariate integration, multivariate optimization, and in numerical analysis and scientific computing, where one studies fast, accurate and adaptive algorithms ideally implemented in high-performance computing environments.^[14]

References

^ Berezanskij, Jurij Makarovič (1968). Expansions in eigenfunctions of selfadjoint operators. Providence, RI: American Mathematical Soc. pp. 45–47. ISBN 978-0-8218-1567-0.
^ Hein, M. and Bousquet, O. (2005). "Hilbertian metrics and positive definite kernels on probability measures". In Ghahramani, Z. and Cowell, R., editors, Proceedings of AISTATS 2005.
^ Mercer, J. (1909). "Functions of positive and negative type and their connection with the theory of integral equations". Philosophical Transactions of the Royal Society of London, Series A 209, pp. 415–446.
^ Hilbert, D. (1904). "Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen I", Gött. Nachrichten, math.-phys. Kl. (1904), pp. 49–91.
^ Young, W. H. (1909). "A note on a class of symmetric functions and on a theorem required in the theory of integral equations", Philos. Trans. Roy. Soc. London, Ser. A, 209, pp. 415–446.
^ Moore, E.H. (1916). "On properly positive Hermitian matrices", Bull. Amer. Math. Soc. 23, 59, pp. 66–67.
^ Moore, E.H. (1935). "General Analysis, Part I", Memoirs Amer. Philos. Soc. 1, Philadelphia.
^ Krein, M. (1949/1950). "Hermitian-positive kernels on homogeneous spaces I and II" (in Russian), Ukrain. Mat. Z. 1(1949), pp. 64–98, and 2(1950), pp. 10–59. English translation: Amer. Math. Soc. Translations Ser. 2, 34 (1963), pp. 69–164.
^ Loève, M. (1960). "Probability theory", 2nd ed., Van Nostrand, Princeton, N.J.
^ Rosasco, L. and Poggio, T. (2015). "A Regularization Tour of Machine Learning – MIT 9.520 Lecture Notes" Manuscript.
^ Berg, C., Christensen, J. P. R., and Ressel, P. (1984). "Harmonic Analysis on Semigroups". Number 100 in Graduate Texts in Mathematics, Springer Verlag.
^ Schaback, R. and Wendland, H. (2006). "Kernel Techniques: From Machine Learning to Meshless Methods", Cambridge University Press, Acta Numerica (2006), pp. 1–97.
^ Haaland, B. and Qian, P. Z. G. (2010). "Accurate emulators for large-scale computer experiments", Ann. Stat.
^ Gumerov, N. A. and Duraiswami, R. (2007). "Fast radial basis function interpolation via preconditioned Krylov iteration". SIAM J. Scient. Computing 29/5, pp. 1876–1899.