STT 997 — Lecture 03

Gaussian Processes, Projection View of Kriging, and Covariance Functions
Wednesday, January 21, 2026

1. Gaussian Prediction as Projection

Assume all Gaussian random variables have mean zero.

Let

$ Y_0 $ be the target random variable (e.g., value at a new location),
$ Y = (Y_1, \dots, Y_n)^\top $ be observed Gaussian variables,
$ V = \operatorname{Cov}(Y, Y) $,
$ k = \operatorname{Cov}(Y_0, Y) $.

Then the Gaussian conditional expectation is $\mathbb{E}[Y_0 \mid Y] = k^\top V^{-1} Y.$

This is not merely the best linear predictor: for jointly Gaussian variables it is the best predictor of any form in the mean-squared-error sense.
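As a minimal numerical sketch of the formula $k^\top V^{-1} Y$, the snippet below computes the conditional mean for a small example. The function name `gaussian_conditional_mean` and all numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def gaussian_conditional_mean(k, V, y):
    """Return k^T V^{-1} y, the conditional mean of Y_0 given Y = y (zero-mean case)."""
    # Solve V w = k instead of forming V^{-1} explicitly.
    weights = np.linalg.solve(V, k)
    return weights @ y

# Hypothetical inputs (not from the lecture).
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])   # Cov(Y, Y)
k = np.array([0.8, 0.3])     # Cov(Y_0, Y)
y = np.array([1.2, -0.4])    # observed values of Y

y0_hat = gaussian_conditional_mean(k, V, y)
```

Solving the linear system rather than inverting $V$ is a common numerical choice; both give the same predictor.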

2. Hilbert Space Setup

Define the linear space $\mathcal{H} = \left\{ \sum_{i=1}^n c_i Y_i : c_i \in \mathbb{R} \right\} \subset L^2(\Omega, \mathcal{F}, \mathbb{P}).$

Equip $ L^2 $ with inner product $\langle X, Z \rangle = \mathbb{E}[XZ].$

Then:

$ L^2 $ is a Hilbert space,
$ \mathcal{H} $ is a finite-dimensional closed subspace,
$ Y_0 \in L^2 $, but generally $ Y_0 \notin \mathcal{H} $.

3. Projection Interpretation

We project $ Y_0 $ onto $ \mathcal{H} $.

Define $\hat Y_0 = \operatorname{Proj}_{\mathcal{H}}(Y_0).$

This projection satisfies $\|Y_0 - \hat Y_0\|^2 = \min_{Z \in \mathcal{H}} \|Y_0 - Z\|^2.$

Geometric properties:

$ Y_0 - \hat Y_0 \perp \mathcal{H} $
Orthogonality corresponds to uncorrelatedness: $\mathbb{E}[(Y_0 - \hat Y_0) Z] = 0 \quad \forall Z \in \mathcal{H}.$

Decomposition: $Y_0 = \hat Y_0 + (Y_0 - \hat Y_0),$ $\|Y_0\|^2 = \|\hat Y_0\|^2 + \|Y_0 - \hat Y_0\|^2.$

4. Claim: Conditional Expectation = Projection

Claim $\hat Y_0 = k^\top V^{-1} Y = \operatorname{Proj}_{\mathcal{H}}(Y_0).$

Proof Sketch

Let $ Z = c^\top Y \in \mathcal{H} $. Then $\langle Y_0 - k^\top V^{-1} Y, Z \rangle = \mathbb{E}[(Y_0 - k^\top V^{-1} Y)(c^\top Y)].$

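Compute, using $\mathbb{E}[Y_0 Y] = k$ and $\mathbb{E}[Y Y^\top] = V$: $\mathbb{E}[(Y_0 - k^\top V^{-1} Y)(c^\top Y)] = c^\top k - c^\top V V^{-1} k = c^\top \left( k - V V^{-1} k \right) = 0.$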

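Thus the residual $ Y_0 - k^\top V^{-1} Y $ is orthogonal to $ \mathcal{H} $. Since $ k^\top V^{-1} Y \in \mathcal{H} $, it is the projection of $ Y_0 $ onto $ \mathcal{H} $.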
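The orthogonality argument can be checked numerically. The sketch below builds a hypothetical joint covariance for $(Y_0, Y_1, Y_2, Y_3)$ (an assumption made for illustration), extracts $k$ and $V$, and verifies that $k - V V^{-1} k$ vanishes, so the residual is uncorrelated with every $c^\top Y$.

```python
import numpy as np

# Hypothetical joint covariance of (Y_0, Y_1, Y_2, Y_3); A A^T + I is positive definite.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
joint_cov = A @ A.T + np.eye(4)

k = joint_cov[0, 1:]    # Cov(Y_0, Y)
V = joint_cov[1:, 1:]   # Cov(Y, Y)

w = np.linalg.solve(V, k)   # weights V^{-1} k, so that hat Y_0 = w^T Y

# Cov(Y_0 - w^T Y, Y) = k - V w, which should be zero up to rounding.
residual_cov = k - V @ w

# For an arbitrary Z = c^T Y in H, the inner product <Y_0 - hat Y_0, Z> is c^T (k - V w).
c = rng.normal(size=3)
inner_product = c @ residual_cov
```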

5. Prediction Error Variance

The mean squared prediction error is $\mathbb{E}[(Y_0 - \hat Y_0)^2] = \|Y_0 - \hat Y_0\|^2 = \operatorname{Var}(Y_0) - k^\top V^{-1} k.$
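The variance formula follows by expanding the square with $\hat Y_0 = k^\top V^{-1} Y$ and using $\mathbb{E}[Y Y_0] = k$ and $\mathbb{E}[Y Y^\top] = V$ (with $k$ treated as a column vector):

$$
\begin{aligned}
\mathbb{E}[(Y_0 - k^\top V^{-1} Y)^2]
&= \operatorname{Var}(Y_0) - 2 k^\top V^{-1} \mathbb{E}[Y Y_0] + k^\top V^{-1} \mathbb{E}[Y Y^\top] V^{-1} k \\
&= \operatorname{Var}(Y_0) - 2 k^\top V^{-1} k + k^\top V^{-1} k \\
&= \operatorname{Var}(Y_0) - k^\top V^{-1} k.
\end{aligned}
$$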

Sanity check:
The prediction error variance can be no larger than the marginal variance $\operatorname{Var}(Y_0)$, since $k^\top V^{-1} k \ge 0$.
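A short sketch of this sanity check, under assumed numbers: the reduction $k^\top V^{-1} k$ is nonnegative because $V^{-1}$ is positive definite, and it is zero when $Y_0$ is uncorrelated with $Y$. The function name and values are hypothetical.

```python
import numpy as np

def prediction_error_variance(var_y0, k, V):
    """Var(Y_0) - k^T V^{-1} k for zero-mean jointly Gaussian (Y_0, Y)."""
    return var_y0 - k @ np.linalg.solve(V, k)

# Hypothetical values, chosen so the joint covariance of (Y_0, Y) is positive definite.
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])
k = np.array([0.8, 0.3])
var_y0 = 1.0

mspe = prediction_error_variance(var_y0, k, V)

# With k = 0 the observations carry no information and nothing is gained.
mspe_uncorrelated = prediction_error_variance(var_y0, np.zeros(2), V)
```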

6. Kriging with Measurement Error

Observation model: $Y(s) = X(s) + \varepsilon(s),$ where:

$ X(s) $ is the latent spatial process,
$ \varepsilon(s) $ is white noise: $ \varepsilon(s) \sim \mathcal{N}(0, \tau^2) $, independent across locations,
$ X \perp \varepsilon $ (the noise is independent of the latent process).

If $ X(s) $ has kernel $ K $, then $K_Y(s, s') = K(s, s') + \tau^2 \mathbf{1}_{\{s = s'\}}.$

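For observations $ Y(s_1), \dots, Y(s_n) $, with $ V = [K(s_i, s_j)]_{i,j} $ now denoting the covariance matrix of the latent values: $\Sigma = \operatorname{Cov}(Y, Y) = V + \tau^2 I.$

Prediction of the latent signal: $\hat X(s_0) = \mathbb{E}[X(s_0) \mid Y] = k^\top \Sigma^{-1} Y,$ where $ k = \operatorname{Cov}(X(s_0), Y) $ has entries $ K(s_0, s_i) $.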

Important: $\hat X(s_i) \neq Y(s_i)$ in general, unless $\tau^2 = 0$. With measurement error the predictor smooths the data rather than interpolating them.
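The effect of the noise variance $\tau^2$ can be seen in a small sketch. The squared exponential kernel, the 1-D locations, and the data values below are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(s, t, length_scale=1.0):
    """Squared exponential kernel on 1-D locations (an assumed choice for K)."""
    d = s[:, None] - t[None, :]
    return np.exp(-d**2 / (2.0 * length_scale**2))

def predict_latent(s_obs, y_obs, s_new, tau2, length_scale=1.0):
    """Return k^T (V + tau^2 I)^{-1} y for each location in s_new."""
    V = sq_exp_kernel(s_obs, s_obs, length_scale)       # latent covariance at the data sites
    Sigma = V + tau2 * np.eye(len(s_obs))               # covariance of the noisy observations
    K_new = sq_exp_kernel(s_new, s_obs, length_scale)   # each row is k^T for one s_0
    return K_new @ np.linalg.solve(Sigma, y_obs)

# Hypothetical data.
s_obs = np.array([0.0, 1.0, 2.5])
y_obs = np.array([0.3, -0.2, 0.9])

# Predict the latent signal at the observed locations themselves.
x_hat_noisy = predict_latent(s_obs, y_obs, s_obs, tau2=0.25)  # smooths the data
x_hat_exact = predict_latent(s_obs, y_obs, s_obs, tau2=0.0)   # reproduces the data
```

With $\tau^2 = 0$ the predictor returns the observations at the data sites; with $\tau^2 > 0$ it does not.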

7. Covariance / Kernel Functions

Linear Kernel

$K(x, x') = x^\top x'$

Rarely used in spatial statistics: it is not stationary, and the covariance does not decay with distance.

Gaussian (Squared Exponential) Kernel

$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right).$

Infinitely smooth
Length-scale $ \ell $ controls decay
Often too smooth for real data

Exponential Kernel

$K(x, x') = \exp\left( -\frac{\|x - x'\|}{\ell} \right).$

Rougher sample paths
Special case of Matérn
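The three kernels above can be compared with a minimal sketch. The stationary kernels are written as functions of the distance $r = \|x - x'\|$; the function names and evaluation points are hypothetical.

```python
import numpy as np

def linear_kernel(x, x_prime):
    """K(x, x') = x^T x'."""
    return float(np.dot(x, x_prime))

def gaussian_kernel(r, length_scale=1.0):
    """Squared exponential kernel as a function of distance r."""
    r = np.asarray(r, dtype=float)
    return np.exp(-r**2 / (2.0 * length_scale**2))

def exponential_kernel(r, length_scale=1.0):
    """Exponential kernel as a function of distance r."""
    r = np.asarray(r, dtype=float)
    return np.exp(-r / length_scale)

r = np.array([0.0, 0.1, 1.0, 3.0])
gauss_vals = gaussian_kernel(r)    # flat near r = 0, then decays very fast
expo_vals = exponential_kernel(r)  # drops linearly near r = 0, heavier tail

# The linear kernel depends on position, not on distance:
# both pairs below are one unit apart but have different covariances.
near_origin = linear_kernel([1.0, 0.0], [2.0, 0.0])
far_from_origin = linear_kernel([5.0, 0.0], [6.0, 0.0])
```

Near $r = 0$ the Gaussian kernel stays closer to 1 than the exponential kernel, which is the covariance-level counterpart of smoother versus rougher sample paths.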

8. Matérn Kernel

Smoothness parameter $ \nu $:

$ \nu = \tfrac12 $: exponential kernel
Larger $ \nu $: smoother processes
$ \nu \to \infty $: Gaussian kernel

Spectral density: $S(\omega) \propto \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-(\nu + d/2)}, \quad \omega \in \mathbb{R}^d.$

Properties:

Valid in all dimensions
Parameters: variance, $ \ell $, $ \nu $
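For half-integer $\nu$ the Matérn kernel has well-known closed forms. The sketch below implements $\nu \in \{1/2, 3/2, 5/2\}$ in the parametrization matching the spectral density above (scaled distance $\sqrt{2\nu}\, r / \ell$); the function name `matern` and the restriction to these three values are choices made here for illustration.

```python
import numpy as np

def matern(r, length_scale=1.0, nu=1.5, variance=1.0):
    """Matern covariance at distance r for nu in {0.5, 1.5, 2.5} (closed forms only)."""
    r = np.asarray(r, dtype=float)
    a = np.sqrt(2.0 * nu) * r / length_scale   # scaled distance
    if nu == 0.5:
        poly = 1.0                       # exponential kernel exp(-r / l)
    elif nu == 1.5:
        poly = 1.0 + a
    elif nu == 2.5:
        poly = 1.0 + a + a**2 / 3.0
    else:
        raise ValueError("this sketch only covers nu = 0.5, 1.5, 2.5")
    return variance * poly * np.exp(-a)

# At a fixed distance the values move toward the Gaussian-kernel value as nu grows.
r = 1.0
vals = [float(matern(r, nu=nu)) for nu in (0.5, 1.5, 2.5)]
gaussian_limit = float(np.exp(-r**2 / 2.0))
```

General $\nu$ requires a modified Bessel function of the second kind, which this sketch deliberately omits.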

9. Estimation of Kernel Parameters

Methods

Maximum Likelihood Estimation (MLE)
Bayesian inference (priors on parameters)

Log-likelihood (up to an additive constant): $$ \ell(\theta) = -\tfrac12 \log|\Sigma_\theta| - \tfrac12 Y^\top \Sigma_\theta^{-1} Y. $$

Gradient identity: $\frac{\partial}{\partial \theta} \log|\Sigma| = \operatorname{tr}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta} \right).$
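The Gaussian log-likelihood (without its additive constant) and its gradient can be sketched as follows. The covariance model, an exponential kernel with length-scale $\theta$ plus a fixed nugget, is an assumption made for illustration. The quadratic-form term also uses the standard identity $\partial \Sigma^{-1} / \partial\theta = -\Sigma^{-1} (\partial \Sigma / \partial\theta) \Sigma^{-1}$, which is not stated in the notes. All names and data are hypothetical.

```python
import numpy as np

def sigma(theta, dists, tau2=0.1):
    """Assumed model: exponential kernel with length-scale theta plus nugget tau2."""
    return np.exp(-dists / theta) + tau2 * np.eye(len(dists))

def dsigma_dtheta(theta, dists):
    """Elementwise derivative of exp(-d / theta) with respect to theta."""
    return np.exp(-dists / theta) * dists / theta**2

def log_likelihood(theta, dists, y):
    """-1/2 log|Sigma| - 1/2 y^T Sigma^{-1} y (additive constant dropped)."""
    S = sigma(theta, dists)
    _, log_det = np.linalg.slogdet(S)
    return -0.5 * log_det - 0.5 * y @ np.linalg.solve(S, y)

def log_likelihood_grad(theta, dists, y):
    """Analytic derivative using d log|S| = tr(S^{-1} dS)."""
    S = sigma(theta, dists)
    dS = dsigma_dtheta(theta, dists)
    alpha = np.linalg.solve(S, y)                    # Sigma^{-1} y
    trace_term = np.trace(np.linalg.solve(S, dS))    # tr(Sigma^{-1} dSigma)
    return -0.5 * trace_term + 0.5 * alpha @ dS @ alpha

# Hypothetical 1-D locations and observations.
s = np.array([0.0, 0.7, 1.5, 3.0])
dists = np.abs(s[:, None] - s[None, :])
y = np.array([0.5, -0.1, 0.8, 0.2])

theta, h = 1.2, 1e-6
grad = log_likelihood_grad(theta, dists, y)
fd_grad = (log_likelihood(theta + h, dists, y) - log_likelihood(theta - h, dists, y)) / (2 * h)
```

Comparing the analytic gradient with a central finite difference is a cheap way to catch sign errors before handing the function to an optimizer.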

10. Important Caveat: Consistency

Not all covariance parameters are consistently estimable
Some parameters do not affect prediction
Flat likelihood surfaces indicate weak identifiability
This motivated the move away from early least-squares variogram fitting

11. Stationarity and Isotropy

Stationarity: covariance depends only on the displacement $ s - s' $
Isotropy: covariance depends only on the distance $ \|s - s'\| $

Empirical estimation:

Pairwise differences
Bin by distance
Fit theoretical curve (early approach)

Modern practice uses likelihood-based methods instead.
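For reference, the first two steps of the early empirical approach (pairwise differences, binning by distance) can be sketched as below. The estimator shown is the classical method-of-moments semivariogram, averaging $\tfrac12 (Y(s_i) - Y(s_j))^2$ within each distance bin. The function name, bin edges, and data are hypothetical, and the curve-fitting step is not shown.

```python
import numpy as np

def empirical_semivariogram(locs, values, bin_edges):
    """Average 0.5 * (Y(s_i) - Y(s_j))^2 over pairs whose distance falls in each bin."""
    locs = np.asarray(locs, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)            # each unordered pair once
    dist = np.linalg.norm(locs[i] - locs[j], axis=1)
    half_sq_diff = 0.5 * (values[i] - values[j])**2
    which_bin = np.digitize(dist, bin_edges) - 1        # bin b covers [edge_b, edge_{b+1})
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)                     # NaN marks empty bins
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = which_bin == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            gamma[b] = half_sq_diff[mask].mean()
    return gamma, counts

# Hypothetical 2-D locations and observed values.
locs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
values = [1.0, 2.0, 0.0, 3.0]
gamma, counts = empirical_semivariogram(locs, values, bin_edges=[0.0, 1.2, 2.5])
```

Pairs farther apart than the last bin edge are ignored in this sketch.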

Closing Perspective

Kriging is orthogonal projection in $ L^2 $
Conditional expectation = projection
Covariance kernels encode geometry
Kernels act as linear operators on function spaces, with associated eigenfunctions

(Functional analysis ideas will appear later in the course.)
