Gaussian Process Regression for Data Fulfilling Linear Differential Equations with Localized Sources

Specialized Gaussian process regression is presented for data that are known to fulfill a given linear differential equation with vanishing or localized sources. The method allows estimation of system parameters as well as strength and location of point sources. It is applicable to a wide range of data from measurement and simulation. The underlying principle is the well-known invariance of the Gaussian probability distribution under linear operators, in particular differentiation. In contrast to approaches with a generic covariance function/kernel, we restrict the Gaussian process to generate only solutions of the homogeneous part of the differential equation. This requires specialized kernels with a direct correspondence of certain kernel hyperparameters to parameters in the underlying equation and leads to more reliable regression results with less training data. Inhomogeneous contributions from linear superposition of point sources are treated via a linear model over fundamental solutions. Maximum likelihood estimates for hyperparameters and source positions are obtained by nonlinear optimization. For differential equations representing laws of physics, the present approach generates only physically possible solutions, and estimated hyperparameters represent physical properties. After a general derivation, modeling of source-free data and parameter estimation are demonstrated for Laplace's equation and the heat/diffusion equation. Finally, the Helmholtz equation with point sources is treated, representing scalar wave data such as acoustic pressure in the frequency domain.


Introduction
The larger context of the present work is the goal of constructing reduced-complexity models as emulators or surrogates that retain mathematical and physical properties of the underlying system. In recent terminology, such approaches are examples of "physics-informed machine learning". As with usual numerical models, the aim here is to represent infinite systems by exploiting finite information in some optimal sense. In the spirit of structure-preserving numerics, one tries to move errors to the "right place" so as to retain laws such as conservation of mass, energy, or momentum. Here, we treat data known to fulfill a given linear differential equation. This article is an extended version of a conference paper [1] presented at the MaxEnt workshop 2019. The revised text adds hyperparameter optimization, results for the heat equation, and detailed comparisons to existing methods.
This article deals with Gaussian process (GP) regression on data with additional information known in the form of linear, generally partial differential equations (PDEs). We restrict the presentation to real-valued scalar fields; the more general case of complex-valued fields and vector fields is left open for future investigations in this context. A GP with mean m(x) and covariance function or kernel k(x, x′) is denoted as

u(x) ∼ G(m(x), k(x, x′)).  (1)

The choice of an appropriate kernel k(x, x′) restricts realizations of (1) to respect regularity properties of u(x) such as continuity or characteristic length scales. Often the regularity of u does not appear by chance, but rather reflects an underlying law. We will exploit such laws in the construction and application of GPs describing u for the case of linear (partial) differential equations

L̂u(x) = q(x),  (2)

where L̂ is a linear differential operator and q(x) is a source term. In the laws of physics, the dimensions of x usually consist of space and/or time. Physical scalar fields u include, e.g., the electrostatic potential Φ, temperature T, or pressure p. Corresponding laws include Gauss' law of electrostatics for Φ with weighted Laplacian L̂ = ε∆, thermodynamics for T with heat/diffusion operator L̂ = ∂/∂t − D∆, and frequency-domain acoustics for p with Helmholtz operator L̂ = ∆ + k₀². These operators contain free parameters, namely permittivity ε, diffusivity D, and wavenumber k₀, respectively. While ε may be absorbed inside q in a uniform material model of electrostatics, estimation of parameters such as D or k₀ is useful for material characterization.
Consider first the source-free (homogeneous) case

L̂u_h(x) = 0.  (3)

An unknown field u_h(x) that fulfills (3) shall be modeled by the Gaussian process

u_h(x) ∼ G(0, k(x, x′)).  (4)

Application of a linear operator L̂ yields a modified Gaussian process

L̂u_h(x) ∼ G(0, L̂k(x, x′)L̂′),  (5)

where L̂′ acts from the right side with respect to x′. In order to fulfill (3) we require (5) to vanish identically, i.e., to yield a deterministic zero. Consequently, the kernel k(x, x′) needs to satisfy

L̂k(x, x′)L̂′ = 0.  (6)

A discussion of the derivation of such kernels is found in Section 2. For the general case (2), with unknown source density q(x), we introduce a linear model with basis functions ϕ_i(x) and a normally distributed prior with mean q₀ and prior covariance Σ_q for the coefficients q_i representing source strengths,

q ∼ N(q₀, Σ_q),  (7)
q(x) = Σ_i q_i ϕ_i(x).  (8)

For a particular solution u_p(x) fulfilling the inhomogeneous Equation (2) with source model (8), a linear model induced by the operator L̂ follows as

u_p(x) = Σ_i q_i h_i(x).  (9)

Here, the coefficients q_i remain the same as in (8), and the new basis functions h_i(x) fulfill the differential equation L̂h_i(x) = ϕ_i(x) with source density ϕ_i(x). In the case of point monopole sources ϕ_i(x) = δ(x − x_{q_i}) placed at positions x_{q_i}, each h_i(x) represents a fundamental solution evaluated for the respective source, so

h_i(x) = G(x, x_{q_i}),  (10)

where G(x, x_{q_i}) is a Green's function for the operator L̂. In the remaining work with localized sources we take this approach. As G(x, x_{q_i}) is usually only available for simple geometries and boundary conditions, the discussed linear model alone is limited in its application. We can, however, represent much more general fields by a superposition of a locally source-free background u_h(x) and point source contributions u_p(x). Boundary conditions induced by external sources are then covered by u_h(x), and internal sources entering u_p(x) are treated via simple free-field Green's functions. Following the technique of [16] discussed in [15] (Chapter 2.7), the superposition u(x) = u_h(x) + u_p(x) of the GP u_h(x) and the linear model u_p(x) is distributed according to the Gaussian process

u(x) ∼ G(h(x)ᵀq₀, k(x, x′) + h(x)ᵀΣ_q h(x′)),  (11)

with h(x) ≡ (h_1(x), h_2(x), . . . )ᵀ. We will now verify that (11) indeed models the original differential Equation (2) correctly, thereby generalizing the analysis for a deterministic source density in [3]. With L̂k(x, x′)L̂′ = 0 and L̂h_i(x) = ϕ_i(x), we obtain

L̂u(x) ∼ G(ϕ(x)ᵀq₀, ϕ(x)ᵀΣ_q ϕ(x′)).  (12)

This is indeed the GP representing the linear source model (8) that we assumed and yields a consistent representation of u(x) and q(x) inside (2). Using the limit of a vague prior with q₀ = 0 and |Σ_q⁻¹| → 0, i.e., minimum information / infinite prior covariance [15,16], the posteriors for mean ū and covariance matrix cov(u, u) based on given training data y = u(X) + ε with measurement noise variance σ_n² are

ū(X*) = K*ᵀK_y⁻¹y + Rᵀq̄,  (13)
cov(u(X*), u(X*)) = K** − K*ᵀK_y⁻¹K* + Rᵀ(HK_y⁻¹Hᵀ)⁻¹R,  (14)

where X = (x₁, x₂, . . . , x_N) contains the training points and X* = (x*₁, x*₂, . . . , x*_{N*}) the evaluation or test points. Functions of X and X* are to be understood as vectors or matrices resulting from evaluation at the different positions, i.e., ū(X*) ≡ (ū(x*₁), ū(x*₂), . . . , ū(x*_{N*})) is a tuple of predicted expectation values. The matrix K ≡ k(X, X) is the covariance of the training data with entries K_ij ≡ k(x_i, x_j), K_y ≡ K + σ_n²I is its noisy variant, K* ≡ k(X, X*) and K** ≡ k(X*, X*) are the cross- and test-point covariances, H ≡ h(X) and H* ≡ h(X*) collect the source basis functions evaluated at training and test points, and R ≡ H* − HK_y⁻¹K*. The posterior mean and covariance of the source strengths are given from the linear model [16] in the limit of a vague prior as

q̄ = (HK_y⁻¹Hᵀ)⁻¹HK_y⁻¹y,  (15)
cov(q, q) = (HK_y⁻¹Hᵀ)⁻¹.  (16)

In the absence of sources, the matrix R vanishes, and (13) and (14) reduce to the posteriors of a GP with zero prior mean and are directly used to model homogeneous solutions u_h(x) of (3).
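In practice, Equations (13)-(16) translate into a few lines of linear algebra. The following is a minimal numpy sketch, not the authors' implementation; matrix names mirror the notation above, and a Cholesky factorization replaces explicit inversion of K_y for numerical stability.

    import numpy as np

    def gp_posterior_with_sources(K, Ks, Kss, H, Hs, y, sigma_n):
        # Posterior of GP + linear source model in the vague-prior limit,
        # Eqs. (13)-(16). K = k(X, X), Ks = k(X, X*), Kss = k(X*, X*);
        # H = h(X) and Hs = h(X*) have one row per source basis function.
        Ky = K + sigma_n**2 * np.eye(len(y))
        L = np.linalg.cholesky(Ky)
        solve = lambda B: np.linalg.solve(L.T, np.linalg.solve(L, B))
        alpha = solve(y)                                  # K_y^{-1} y
        A = H @ solve(H.T)                                # H K_y^{-1} H^T
        q_mean = np.linalg.solve(A, H @ alpha)            # Eq. (15)
        q_cov = np.linalg.inv(A)                          # Eq. (16)
        R = Hs - H @ solve(Ks)                            # R = H* - H K_y^{-1} K*
        u_mean = Ks.T @ alpha + R.T @ q_mean              # Eq. (13)
        u_cov = Kss - Ks.T @ solve(Ks) + R.T @ q_cov @ R  # Eq. (14)
        return u_mean, u_cov, q_mean, q_cov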

Construction of Kernels for Homogeneous PDEs
For the representation of solutions u_h(x) of homogeneous differential Equations (3), the weight-space view ([15] Chapter 2.1) of Gaussian process regression is useful. There the kernel k is represented via a tuple φ(x) = (φ₁(x), φ₂(x), . . . ) of basis functions φ_i(x) that underlie a linear regression model

u(x) = wᵀφ(x).  (17)

Bayesian inference starting from a Gaussian prior with covariance matrix Σ_p for the weights w yields a Mercer kernel

k(x, x′) = φ(x)ᵀΣ_p φ(x′).  (18)

The existence of such a representation is guaranteed by Mercer's theorem in the context of reproducing kernel Hilbert spaces (RKHS) [14]. More generally, one can also define kernels on an uncountably infinite number of basis functions in analogy to (17) via

u(x) = φ̂w ≡ ⟨φ(x, ζ), w(ζ)⟩,  (19)

where φ̂ is a linear operator acting on elements w(ζ) of an infinite-dimensional weight space parametrized by an auxiliary index variable ζ that may be multidimensional. We represent φ̂ via an inner product ⟨φ(x, ζ), w(ζ)⟩ in the respective function space, given by an integral over ζ. The infinite-dimensional analog of the prior covariance matrix is a prior covariance operator Σ̂_p that defines the kernel as a bilinear form

k(x, x′) = φ̂Σ̂_p φ̂′† ≡ ∫∫ φ(x, ζ) Σ_p(ζ, ζ′) φ(x′, ζ′) dζ dζ′.  (20)

Kernels of the form (20) are known as convolution kernels. Such a kernel is at least positive semidefinite, and positive definiteness follows in the case of linearly independent basis functions φ(x, ζ) [14]. For the treatment of PDEs, possible choices of index variables in Equation (18) or Equation (20) include separation constants of analytical solutions, or the frequency variable of an integral transform. In accordance with [10], using basis functions that satisfy the underlying PDE, a probabilistic meshless method (PMM) is constructed. In particular, if ζ parameterizes positions of sources, and the basis in (20) is chosen to be a fundamental solution/Green's function G(x, ζ) of the PDE, one may call the resulting scheme a probabilistic method of fundamental solutions (pMFS). In [10], sources are placed across the whole computational domain, and the resulting kernel is called natural. Here, we will instead place sources in the exterior to fulfill the homogeneous interior problem, as in the classical MFS [12-14]. Technically, this is achieved by setting Σ_p(ζ, ζ′) = 0 whenever ζ or ζ′ lies in the interior. For discrete sources localized at ζ = ζ_i one obtains again discrete basis functions φ_i(x) = G(x, ζ_i).
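To make the construction (18) concrete, consider a small sketch under illustrative assumptions: we pick harmonic polynomials Re (x + iy)^m and Im (x + iy)^m as basis functions, each of which solves Laplace's equation exactly, so the resulting Mercer kernel satisfies condition (6) by construction. The basis choice and all names are assumptions of this sketch.

    import numpy as np

    def phi_harmonic(x, m_max=6):
        # Harmonic polynomial basis: Re[(x + iy)^m], Im[(x + iy)^m] for
        # m = 0..m_max; every element is an exact solution of Laplace's eq.
        z = (x[0] + 1j * x[1]) ** np.arange(m_max + 1)
        return np.concatenate([z.real, z.imag])

    def mercer_kernel(x, xp, Sigma_p):
        # Eq. (18): k(x, x') = phi(x)^T Sigma_p phi(x').
        return phi_harmonic(x) @ Sigma_p @ phi_harmonic(xp)

    Sigma_p = np.eye(14)  # flat prior over the 2*(m_max + 1) = 14 weights
    print(mercer_kernel(np.array([0.3, 0.1]), np.array([-0.2, 0.4]), Sigma_p))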

Application Cases
Here, the general results described in the previous sections are applied to specific equations. First, a specialized kernel fulfilling the given linear differential equation is constructed according to (18), and second, numerical experiments on physical examples are performed, comparing the specialized kernel to a squared exponential kernel. Regression is based on values measured at a set of sampling points x_i and may also include optimization of hyperparameters θ appearing as auxiliary variables inside the kernel k(x, x′; θ). The optimization step is, as usual, performed such that the marginal likelihood of the GP is maximized (maximum likelihood or ML values). In the Bayesian sense, this corresponds to a maximum a-posteriori (MAP) estimate for a flat prior. Accordingly, θ_ML is fixed rather than providing a joint probability distribution including θ as random variables. We note that, depending on the setting, this choice may lead to underestimation of uncertainties in the reconstruction of u(x), in particular for sparse, low-quality measurements.
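The ML step amounts to minimizing the negative log marginal likelihood of the training data. A minimal sketch follows; the kernel-builder signature kernel(X, X, theta) is an assumption of this sketch, not a fixed interface, and θ contains, e.g., the length scale ℓ, diffusivity D, or wavenumber k₀ in the examples below.

    import numpy as np
    from scipy.optimize import minimize

    def negative_log_likelihood(theta, X, y, kernel, sigma_n):
        # -log p(y | X, theta) for a zero-mean GP with noise variance sigma_n^2.
        K = kernel(X, X, theta) + sigma_n**2 * np.eye(len(y))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
                + 0.5 * len(y) * np.log(2 * np.pi))

    # theta_ML = minimize(negative_log_likelihood, theta0,
    #                     args=(X, y, kernel, sigma_n)).x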

Laplace's Equation in Two Dimensions
First, we explore the construction of kernels fulfilling (6) for a homogeneous problem in a finite- and an infinite-dimensional index space, depending on the mode of separation. Consider Laplace's equation

∆u(x) = 0.  (21)

In contrast to the Helmholtz equation, Laplace's equation has no scale, i.e., it permits all length scales in the solution. In the 2D case using polar coordinates (r, θ), the Laplacian becomes

∆ = ∂²/∂r² + (1/r)∂/∂r + (1/r²)∂²/∂θ².  (22)

A well-known family of solutions for this problem based on separation of variables is

u(r, θ) = r^{±m}e^{±imθ},  (23)

with separation constant m, leading to the real-valued combinations r^m cos(mθ), r^m sin(mθ), r^{−m} cos(mθ), r^{−m} sin(mθ).
As our aim is to work in bounded regions, we discard the solutions with negative exponent that diverge at r = 0. Choosing a diagonal prior that weights sine and cosine terms equivalently [9] and introducing a length scale ℓ as a free parameter, we obtain a kernel according to (18) with basis functions

φ_m(x) ∈ {(r/ℓ)^m cos(mθ), (r/ℓ)^m sin(mθ)}.  (24)

A flat prior σ_m² = σ_u² for all polar harmonics and the characteristic length scale ℓ as another hyperparameter yields

k(x, x′) = σ_u² Σ_{m=0}^∞ (rr′/ℓ²)^m cos(m(θ − θ′))  (25)
         = σ_u² (1 − (rr′/ℓ²)cos(θ − θ′)) / (1 − 2(rr′/ℓ²)cos(θ − θ′) + (rr′/ℓ²)²).  (26)

This kernel is not stationary, but isotropic around a fixed coordinate origin. Introducing a mirror point x̄′ with polar angle θ̄′ = θ′ and radius r̄′ = ℓ²/r′, we notice that (26) can be written as

k(x, x′) = σ_u² r̄′(r̄′ − r cos(θ − θ′)) / |x − x̄′|²,  (27)

making a dipole singularity apparent at x = x̄′. In addition, k is normalized to σ_u² at x = 0. Choosing ℓ > R₀ larger than the radius R₀ of a circle centered in the origin and enclosing the computational domain, we have r̄′ = ℓ²/r′ ≥ ℓ²/R₀ > ℓ > R₀. Thus, all mirror points and the according singularities are moved outside the domain. This behavior is illustrated in Figure 1, where computing the covariance kernel with respect to the point x′ = (0.8, 0) leads to mirror-point distances > 1 everywhere inside the unit circle.
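A quick numerical check makes this concrete: evaluating the closed form (26) and applying a finite-difference Laplacian in either argument should give zero. A sketch with the illustrative values ℓ = 2 and points inside the unit disk:

    import numpy as np

    def laplace_polar_kernel(x, xp, sigma_u=1.0, ell=2.0):
        # Closed form (26) of the geometric series over polar harmonics;
        # valid for r r'/ell^2 < 1, i.e., ell larger than the domain radius.
        r, t = np.hypot(x[0], x[1]), np.arctan2(x[1], x[0])
        rp, tp = np.hypot(xp[0], xp[1]), np.arctan2(xp[1], xp[0])
        rho, c = r * rp / ell**2, np.cos(t - tp)
        return sigma_u**2 * (1 - rho * c) / (1 - 2 * rho * c + rho**2)

    def laplacian(f, x, h=1e-4):
        # Five-point finite-difference Laplacian to verify harmonicity.
        ex, ey = np.array([h, 0.0]), np.array([0.0, h])
        return (f(x + ex) + f(x - ex) + f(x + ey) + f(x - ey) - 4 * f(x)) / h**2

    xp = np.array([-0.5, 0.4])
    print(laplacian(lambda x: laplace_polar_kernel(x, xp), np.array([0.3, 0.2])))
    # ~ 0 up to discretization error: the kernel is harmonic in each argument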
Choosing a slowly decaying σ_m² = σ_u²/m instead, excluding m = 0 and adding a constant term, yields a logarithmic kernel [9] with

k(x, x′) = σ_u² (1 + Σ_{m=1}^∞ (1/m)(rr′/ℓ²)^m cos(m(θ − θ′)))  (28)
         = σ_u² (1 − ½ ln(1 − 2(rr′/ℓ²)cos(θ − θ′) + (rr′/ℓ²)²)).  (29)

Instead of a dipole singularity, that expression features a monopole singularity at x = x̄′ that is again avoided by placing it outside the domain for any pair of x and x′ (Figure 1). Using instead Cartesian coordinates x, y to separate the Laplacian provides harmonic functions

u(x, y) = e^{±κx}e^{±iκy}.  (30)
Here, all solutions take finite values at finite positions, so we do not have to exclude any of them a priori. Introducing, again, a diagonal covariance operator in (20) and taking the real part yields

k(x, x′) = σ_u² Re ∫ σ²(κ) e^{κ(x+x′)} e^{iκ(y−y′)} dκ.  (31)

Setting σ²(κ) ≡ e^{−2κ²ℓ²} with a characteristic length scale ℓ, together with a possible rotation angle θ₀ of the coordinate frame, yields the kernel

k(x, x′) = σ_u² exp(((x + x′)² − (y − y′)²)/(8ℓ²)) cos((x + x′)(y − y′)/(4ℓ²))  (32)

in the rotated coordinates. Other sign combinations do not yield a positive definite kernel. Similar to the polar kernel (27) before, we could not obtain a fully stationary expression that depends only on differences between the coordinates of x and x′. In the numerical experiment shown in Figures 2 and 3, the solution inside the domain Ω is represented with errors below 5%. This is also reflected in the error predicted by the posterior variance of the GP, which remains small in the region enclosed by the measurement points. The analogy in classical analysis is the theorem that the solution of a homogeneous elliptic equation is fully determined by its boundary values.
In comparison, a reconstruction using a generic squared exponential kernel

k(x, x′) = σ_u² exp(−|x − x′|²/(2ℓ²))  (33)

yields a much worse approximation quality in Figures 2 and 3. This is in contrast to earlier investigations [1], where a fixed length scale hyperparameter ℓ = 2 was used. Although the specialized GP with kernel (27) could identify this length scale during hyperparameter optimization, using the generic kernel (33) leads to an underestimation of ℓ. The generic kernel requires twice the number of training points to achieve a similar fit quality and profits from scattered training points, as it has no information about the nature of the boundary value problem (Figures 4 and 5).
In addition, the posterior covariance of that reconstruction is not able to capture the vanishing error inside the enclosed domain due to the given boundary data.

Figure 3: For 15 training points on a circle (top) and for quasi-random points (bottom). As the generic squared exponential kernel does not fulfill the given differential equation, the source density does not vanish in the domain even for a larger number of training points.

Heat Equation: Physical Parameter Estimation
Let us now consider the 1D homogeneous heat/diffusion equation over position x and time t,

∂u/∂t − D ∂²u/∂x² = 0,  (34)

for (x, t) ∈ R × R⁺. Here, the diffusivity D is a physical parameter determining how fast solutions spread in space. Integrating the fundamental solution

G(x, t; ξ, τ) = exp(−(x − ξ)²/(4D(t − τ))) / √(4πD(t − τ))  (35)

from ξ = −∞ to ∞ at τ = 0, i.e., placing sources everywhere in space at a single initial time, and adding a scale hyperparameter σ_u leads to the convolution kernel

k(x, t; x′, t′) = σ_u² exp(−(x − x′)²/(4D(t + t′))) / √(4πD(t + t′)).  (36)

In terms of x, this is a stationary squared exponential kernel and the natural kernel over the domain x ∈ R. The kernel broadens with increasing t and t′. Nonstationarity in time can also be considered natural for the heat equation, as its solutions show a preferred time direction on each side of the singularity at t = 0. The only difference of (36) to the fundamental solution (35) is the positive sign between t and t′. As both t and t′ are positive, k is guaranteed to take finite values and, in contrast to (35), does not become singular at (x, t) = (x′, t′).
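A sketch of kernel (36) together with a finite-difference check of the defining property (6), here the heat-equation residual in the first argument (parameter values are illustrative):

    import numpy as np

    def heat_kernel(x, t, xp, tp, D=0.1, sigma_u=1.0):
        # Eq. (36): integral of two fundamental solutions over all initial
        # source positions; note t + t' where (35) has t - tau.
        s = 4.0 * D * (t + tp)
        return sigma_u**2 * np.exp(-(x - xp)**2 / s) / np.sqrt(np.pi * s)

    # Check that (d/dt - D d^2/dx^2) k = 0 in (x, t) by finite differences:
    h = 1e-4
    k = lambda x, t: heat_kernel(x, t, 0.3, 0.7)
    u_t = (k(0.1, 0.5 + h) - k(0.1, 0.5 - h)) / (2 * h)
    u_xx = (k(0.1 + h, 0.5) - 2 * k(0.1, 0.5) + k(0.1 - h, 0.5)) / h**2
    print(u_t - 0.1 * u_xx)  # ~ 0 up to discretization error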
As for the Laplace equation, it is also convenient to define a non-stationary kernel by cutting out a domain that is known to be free of sources. In case heat sources are known to exist only left of the origin, we evaluate the integral over the fundamental solutions over (−∞, 0) to obtain

k_{(−∞,0)}(x, t; x′, t′) = k(x, t; x′, t′) · ½ erfc(z),  (37)
z = (xt′ + x′t) / √(4Dtt′(t + t′)),  (38)

where erfc = 1 − erf is defined via the error function erf. Choosing instead a source-free domain interval (a, b), we integrate over R\(a, b) and obtain

k_{R\(a,b)}(x, t; x′, t′) = k(x, t; x′, t′) · (1 − ½ erf((b − μ)/(√2 s)) + ½ erf((a − μ)/(√2 s))),  (39)

with μ = (xt′ + x′t)/(t + t′) and s² = 2Dtt′/(t + t′). Incorporating the prior knowledge that there are no sources inside the domain is expected to improve the reconstruction.
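For general source regions, the same kernels can also be obtained by direct numerical quadrature of the product of fundamental solutions, avoiding closed-form erf expressions altogether. A minimal scipy sketch (function names are illustrative):

    import numpy as np
    from scipy.integrate import quad

    def fundamental(x, t, xi, D):
        # Fundamental solution (35) with a source at (xi, tau = 0).
        return np.exp(-(x - xi)**2 / (4 * D * t)) / np.sqrt(4 * np.pi * D * t)

    def heat_kernel_cut(x, t, xp, tp, a, b, D=0.1, sigma_u=1.0):
        # Kernel (39) by quadrature over the source region R \ (a, b).
        f = lambda xi: fundamental(x, t, xi, D) * fundamental(xp, tp, xi, D)
        left, _ = quad(f, -np.inf, a)
        right, _ = quad(f, b, np.inf)
        return sigma_u**2 * (left + right)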
As a physical example, we consider a rod with temperatures held fixed at its two ends and a given initial temperature distribution. We model this as an initial-boundary value problem for (34) on the interval x ∈ (0, 1) with Dirichlet boundary data u(0, t) = 1 and u(1, t) = 0. As initial condition, we set u(x, 0) = 0 everywhere except at the left end, where u(0, 0) = 1. The actual diffusivity is chosen as D = 0.1, and we let u(x, t) evolve from t₀ = 0 until t₁ = 1. With increasing t the initial conditions are smoothed out as u approaches the stationary solution u(x, t → ∞) = 1 − x. Measurements of u are performed at three positions x = 0, 0.1, 1 at four times t = 10⁻⁵, 0.25, 0.5, 0.75, yielding 12 training points in total. In Figure 6, the resulting reconstruction of u(x, t = 0.125) is plotted for each of the three kernels defined above. Kernel (39), allowing initial sources on both sides of the interval, yields the best reconstruction. Furthermore, it is the only one that reproduces meaningful uncertainty bands based on the 95% confidence interval ū ± 1.96σ, whereas the ones for (36) and (37) span the whole plot domain. Estimation of the diffusivity D is also most reliable with kernel (39). The according negative log likelihood can be seen in the right plot of Figure 6. Although all three kernels produce well-posed optimization problems, only (39) has its minimum at the correct position D = 0.1.
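The diffusivity estimate itself reduces to a one-dimensional profile of the negative log likelihood over D. A sketch reusing heat_kernel_cut from above, with X holding the 12 (x, t) training pairs and y the measured values (both assumed given):

    import numpy as np

    def nll_of_D(D, X, y, sigma_n=1e-2):
        # Negative log likelihood (up to a constant) for kernel (39) with
        # the source-free interval (a, b) = (0, 1) of the rod example.
        K = np.array([[heat_kernel_cut(xi, ti, xj, tj, 0.0, 1.0, D)
                       for xj, tj in X] for xi, ti in X])
        K += sigma_n**2 * np.eye(len(y))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.log(np.diag(L)).sum()

    # D_ML: minimum over a grid, e.g. np.linspace(0.02, 0.3, 50); with
    # kernel (39) the minimum is expected near the true value D = 0.1.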
The reason why kernel (39) is required is clear from the statement of the problem: keeping u fixed on both sides of the interval can only be achieved by restricting the heat flux in a predefined way, which requires sources on both sides at t = 0. However, the domain itself should not contain any heat sources at any time. If we had placed an open boundary condition on the right side, kernel (37) would have been the more natural choice instead.

Helmholtz Equation: Source and Wavenumber Reconstruction
Finally, to demonstrate the full method, we consider the Helmholtz equation with sources,

(∆ + k₀²)u(x) = q(x).  (40)

In 1D, solutions of the homogeneous equation are given by linear combinations of cos(k₀x) and sin(k₀x). Choosing a diagonal prior in (18) leads to a stationary kernel

k(x, x′) = σ_u² cos(k₀(x − x′)),  (41)

as presented in [11]. For the two-dimensional case in polar coordinates, a family of solutions based on separation of variables is

u(r, θ) = {J_m(k₀r), Y_m(k₀r)} × {cos(mθ), sin(mθ)},  (42)

where J_m and Y_m are Bessel functions of the first and second kind, respectively. Similar to the simpler 1D case, by applying Neumann's addition theorem we obtain a specialized kernel

k(x, x′) = σ_u² J₀(k₀|x − x′|).  (43)

In the 3D case, one would proceed in a similar fashion with spherical Bessel functions, which yields the kernel σ_u² sin(k₀|x − x′|)/(k₀|x − x′|) that was already postulated in [11]. In contrast to the case of Laplace's equation in the previous section, these source-free Helmholtz kernels do not possess singularities at any finite distance from the origin, i.e., there are no virtual exterior sources in the Mercer kernel (20). As a consequence, they provide smoothing regularization on the order of the wavelength λ₀ = 2π/k₀ to reconstructed fields and boundary conditions, which may or may not be desired. Internal sources at positions x_{q_k} are linearly modeled according to (10) with basis functions

h_k(x) ∝ H_0^(2)(k₀|x − x_{q_k}|),  (44)

where H_0^(2) is the Hankel function of the second kind, i.e., the free-field Green's function of the 2D Helmholtz operator up to a constant factor. The method of source strength reconstruction is improved compared to [11], as it now constitutes a linear problem according to (15). Nonlinear optimization is instead applied to σ_u and the wavenumber k₀ as free hyperparameters to be estimated during the GP regression. The set-up is the same as in [11]: a 2D cavity with various boundary conditions and two sound sources of strengths 0.5 and 1.0, respectively. Results for the sound pressure fulfilling (40) are normalized to a maximum of p/p₀ = 1. We compare three variants of GP regression for these data (a code sketch for variant (1) follows the list):

(1) Superposition of a specialized-kernel GP for the homogeneous part u_h and a linear source model for u_p.
(2) Superposition of a generic squared-exponential kernel GP for u_h and a linear source model for u_p.
(3) Generic squared-exponential kernel GP model for the full field u.
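The two ingredients of variant (1), the specialized kernel (43) and the point-source basis (44), can be sketched as follows. The Green's-function prefactor i/4 and the restriction to real parts are conventions assumed for this sketch rather than taken from [11]:

    import numpy as np
    from scipy.special import j0, hankel2

    def helmholtz_kernel(x, xp, k0, sigma_u=1.0):
        # Source-free 2D Helmholtz kernel, Eq. (43).
        return sigma_u**2 * j0(k0 * np.linalg.norm(np.subtract(x, xp)))

    def source_basis(x, x_sources, k0):
        # Point-source basis h_i(x) from Hankel-function fundamental
        # solutions, cf. Eq. (44); prefactor and real part are assumptions.
        r = np.linalg.norm(np.subtract(x, x_sources), axis=-1)
        return np.real(0.25j * hankel2(0, k0 * r))

    # Stacking source_basis over training and test points gives the matrices
    # H and H* entering Eqs. (13)-(16) for source strength estimation.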
Naturally, after the presented analysis, only variant (1) can be the "correct" way of regression for this kind of data from a PDE with point sources. Variant (2) is a "hybrid" that should be able to identify point sources, while polluting the source-free part with volumetric contributions. Including variant (2) helps to separate the effect of this pollution from the effect of adding a linear source model. Variant (3) is expected to show worse performance than (1) and (2), as neither the source-free part nor the singularities of u at the point source positions can be modeled correctly. Figure 7 shows the local absolute field reconstruction error based on 12 training data points with artificial noise of σ_n = 0.01. Hyperparameters k₀ and σ_u are set to optimized ML values, and σ_n is fixed to its actual value. The upper left plot shows results for variant (1) with the specialized kernel (43). Variant (3) with a generic squared exponential kernel (33) of length scale ℓ = π/(√2 k₀) to model u yields a much higher field reconstruction error, as depicted in the lower left of Figure 7. The field reconstruction using the generic kernel is improved when a linear model for the inhomogeneous term is included (variant (2)), as shown in the upper right of Figure 7. However, the original differential Equation (40) is only fulfilled exactly when using a specialized kernel with L̂k(x, x′)L̂′ = 0. As expected, variant (1) produces the best reconstruction at a given number of training points (Figure 8). There, the first 12 points are chosen as marked in Figure 7, and more points are generated from a quasi-random Halton sequence. The obtained negative log likelihood (Figure 7, lower right), depending on k₀ with σ_u at its ML value, demonstrates the well-posedness of estimating k₀, which has the physical meaning of a wavenumber. Variants (2) and (3) lead to a slightly less peaked estimate for a spatial length scale hyperparameter ℓ without a direct physical interpretation.

Figure 7: Local field reconstruction error for the three variants and source estimates via (15) and (16). Negative log likelihood (bottom right) with optimum k₀^ML = 9.19 for the specialized kernel (solid line), squared exponential kernel with linear source model (dashed), and squared exponential kernel alone (dash-dotted).
For estimation of source positions, nonlinear optimization is applied to the source positions as free hyperparameters within the given boundaries, employing the evolutionary algorithm CMA-ES [17]. The results of source strength and position estimation using (15) and (16) in the configuration with 12 training points are given in Table 1. Both estimates match the exact values reasonably well. With an increasing number of training data the reconstruction becomes more accurate, stagnating at an error between 0.1% and 1% and showing the advantage of the specialized kernel more clearly (Figures 8 and 9). The relative L₂ error in source positions for the specialized and the generic squared exponential kernel with linear source model is depicted in the left plot of Figure 9. Again, results from the specialized kernel are usually more accurate and stable compared to using a squared exponential kernel for the source-free part of the field at a given number of training points.
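Position estimation thus wraps the likelihood evaluation into a global optimizer. The paper employs CMA-ES [17]; in the following sketch scipy's differential evolution stands in, and nll is an assumed function that unpacks candidate positions, rebuilds H via source_basis, and returns the GP negative log likelihood:

    import numpy as np
    from scipy.optimize import differential_evolution

    def estimate_positions(nll, n_sources, bounds_xy):
        # Global minimization of the negative log likelihood over the
        # flattened source coordinates; bounds_xy = [(x_min, x_max),
        # (y_min, y_max)] for the cavity (an assumption of this sketch).
        result = differential_evolution(nll, bounds=bounds_xy * n_sources,
                                        seed=0, tol=1e-8)
        return result.x.reshape(n_sources, 2)  # one (x, y) pair per source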

Summary and Outlook
A framework for the application of Gaussian process regression to data from underlying linear partial differential equations with localized sources has been presented. The method is based on the superposition of a Gaussian process that generates exact solutions of the homogeneous equation, complemented by a linear model for sources. For the homogeneous part, specialized kernels are constructed from fundamental solutions via Mercer's theorem. For source contributions, fundamental solutions are used as basis functions in the linear model. Examples of suitable kernels have been given for Laplace's equation, the heat equation, and the Helmholtz equation. Regression has been shown to yield better results than a squared exponential kernel at the same number of training points in the considered application cases. Advantages of the specialized kernel approach are the possibility to represent the exact absence of sources as well as the physical interpretability of hyperparameters. This comes at the cost of requiring non-standard, possibly nonstationary kernels. The presented method has been demonstrated to accurately estimate system parameters such as diffusivity and wavenumber, as well as the position and strength of point sources, using only around 10 training data points in two-dimensional domains.
In a next step, reconstruction of vector fields via GPs could be formulated, taking laws such as Maxwell's equations or Hamilton's equations of motion into account. A starting point could be squared exponential kernels for divergence- and curl-free vector fields [18]. Such kernels have been used in [19] to perform statistical reconstruction, and the authors of [20] apply them to GPs for source identification in the Laplace/Poisson equation. To model Hamiltonian dynamics in phase space, vector-valued GPs could possibly be extended to represent not only volume-preserving (divergence-free) maps but to retain full symplectic properties, conserving all integrals of motion such as energy or momentum.