STT 997 — Lecture 03

Gaussian Processes, Projection View of Kriging, and Covariance Functions
Wednesday, January 21, 2026


1. Gaussian Prediction as Projection

Assume all random variables are jointly Gaussian with mean zero.

Let

  • $ Y_0 $ be the target random variable (e.g., value at a new location),
  • $ Y = (Y_1, \dots, Y_n)^\top $ be observed Gaussian variables,
  • $ V = \operatorname{Cov}(Y, Y) $,
  • $ k = \operatorname{Cov}(Y_0, Y) $.

Then the Gaussian conditional expectation is \(\mathbb{E}[Y_0 \mid Y] = k^\top V^{-1} Y.\)

This is not merely the best linear predictor: for jointly Gaussian variables it is the best predictor of any form, linear or not, in the mean squared error sense.
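As a small numerical illustration of the formula above, the following sketch evaluates $ k^\top V^{-1} Y $ with NumPy. The covariance values, the observed vector, and the helper name `gaussian_predictor` are made up for the example. A linear solve is used in place of an explicit inverse, which is a common numerical choice rather than something the formula requires.

```python
import numpy as np

def gaussian_predictor(k, V, y):
    """Return k^T V^{-1} y using a linear solve instead of an explicit inverse."""
    return k @ np.linalg.solve(V, y)

# Hypothetical values: covariance of (Y_1, Y_2), cross-covariance with Y_0,
# and one observed realization of Y.
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])
k = np.array([0.8, 0.3])
y = np.array([1.2, -0.4])

y0_hat = gaussian_predictor(k, V, y)
print(f"E[Y_0 | Y = y] = {y0_hat:.4f}")
```

The prediction is a weighted sum of the observations with weights $ V^{-1} k $, so the same weights can be reused for any new realization of $ Y $.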


2. Hilbert Space Setup

Define the linear space \(\mathcal{H} = \left\{ \sum_{i=1}^n c_i Y_i : c_i \in \mathbb{R} \right\} \subset L^2(\Omega, \mathcal{F}, \mathbb{P}).\)

Equip $ L^2 $ with inner product \(\langle X, Z \rangle = \mathbb{E}[XZ].\)

Then:

  • $ L^2 $ is a Hilbert space,
  • $ \mathcal{H} $ is a finite-dimensional closed subspace,
  • $ Y_0 \in L^2 $, but generally $ Y_0 \notin \mathcal{H} $.

3. Projection Interpretation

We project $ Y_0 $ onto $ \mathcal{H} $.

Define \(\hat Y_0 = \operatorname{Proj}_{\mathcal{H}}(Y_0).\)

This projection satisfies \(\|Y_0 - \hat Y_0\|^2 = \min_{Z \in \mathcal{H}} \|Y_0 - Z\|^2.\)

Geometric properties:

  • $ Y_0 - \hat Y_0 \perp \mathcal{H} $
  • Orthogonality corresponds to uncorrelatedness: \(\mathbb{E}[(Y_0 - \hat Y_0) Z] = 0 \quad \forall Z \in \mathcal{H}.\)

Decomposition: \(Y_0 = \hat Y_0 + (Y_0 - \hat Y_0),\) \(\|Y_0\|^2 = \|\hat Y_0\|^2 + \|Y_0 - \hat Y_0\|^2.\)
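In coordinates, the minimization can be worked out directly. This is a short supplementary step that assumes $ V $ is invertible and uses $ k $ and $ V $ from Section 1. Writing $ Z = c^\top Y $ and using mean zero,

$$ \|Y_0 - c^\top Y\|^2 = \operatorname{Var}(Y_0) - 2\, c^\top k + c^\top V c. $$

The gradient with respect to $ c $ is $ -2k + 2Vc $. Setting it to zero gives the normal equations $ Vc = k $, so the minimizing coefficients are $ c = V^{-1} k $.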


4. Claim: Conditional Expectation = Projection

Claim \(\hat Y_0 = k^\top V^{-1} Y = \operatorname{Proj}_{\mathcal{H}}(Y_0).\)

Proof Sketch

Let $ Z = c^\top Y \in \mathcal{H} $. Then \(\langle Y_0 - k^\top V^{-1} Y, Z \rangle = \mathbb{E}[(Y_0 - k^\top V^{-1} Y)(c^\top Y)].\)

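Compute: since the means are zero, \(\mathbb{E}[Y_0 Y] = k\) and \(\mathbb{E}[Y Y^\top] = V\), so the inner product equals \(c^\top k - c^\top V V^{-1} k = c^\top \left( k - V V^{-1} k \right) = 0.\)

Thus the residual $ Y_0 - k^\top V^{-1} Y $ is orthogonal to every element of $ \mathcal{H} $. Since $ k^\top V^{-1} Y $ itself lies in $ \mathcal{H} $, it is the projection of $ Y_0 $ onto $ \mathcal{H} $.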
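The orthogonality argument can also be checked numerically. The sketch below uses a hypothetical joint covariance for $ (Y_0, Y_1, Y_2) $ and works purely with second moments, so no sampling is involved. It verifies that the residual is uncorrelated with an arbitrary element of $ \mathcal{H} $ and that the Pythagorean decomposition of $ \|Y_0\|^2 $ holds.

```python
import numpy as np

# Hypothetical joint second moments of (Y_0, Y_1, Y_2); all means are zero.
var0 = 1.0                               # Var(Y_0) = ||Y_0||^2
k = np.array([0.8, 0.3])                 # Cov(Y_0, Y)
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])               # Cov(Y, Y)

w = np.linalg.solve(V, k)                # projection weights V^{-1} k

# E[(Y_0 - w^T Y) Y] = k - V w: covariance of the residual with each Y_i.
resid_cov = k - V @ w

# Inner product of the residual with an arbitrary Z = c^T Y in H.
c = np.array([2.0, -3.0])
inner = c @ resid_cov

# Pythagorean decomposition: ||Y_0||^2 = ||Y0_hat||^2 + ||Y_0 - Y0_hat||^2.
norm_hat_sq = w @ V @ w
norm_resid_sq = var0 - 2 * (w @ k) + w @ V @ w

print(resid_cov, inner, norm_hat_sq + norm_resid_sq)
```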


5. Prediction Error Variance

The mean squared prediction error is \(\mathbb{E}[(Y_0 - \hat Y_0)^2] = \|Y_0 - \hat Y_0\|^2 = \operatorname{Var}(Y_0) - k^\top V^{-1} k.\)

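Sanity check:
The prediction error variance can never exceed the marginal variance $ \operatorname{Var}(Y_0) $, because $ k^\top V^{-1} k \ge 0 $ when $ V $ is positive definite.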
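A minimal sketch of this sanity check, with made-up covariance values and a hypothetical helper `prediction_variance`. It also covers the extreme case $ k = 0 $, where the observations carry no information about $ Y_0 $ and the error variance equals the marginal variance.

```python
import numpy as np

def prediction_variance(var0, k, V):
    """Var(Y_0) - k^T V^{-1} k, the mean squared prediction error."""
    return var0 - k @ np.linalg.solve(V, k)

# Hypothetical covariance values.
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])
k = np.array([0.8, 0.3])
var0 = 1.0

mspe = prediction_variance(var0, k, V)
mspe_uncorrelated = prediction_variance(var0, np.zeros(2), V)

print(f"MSPE with data: {mspe:.4f}, marginal variance: {var0:.4f}")
```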


6. Kriging with Measurement Error

Observation model: \(Y(s) = X(s) + \varepsilon(s),\) where:

  • $ X(s) $ is the latent spatial process,
  • $ \varepsilon(s) \sim \mathcal{N}(0, \tau^2) $ is white noise, independent across locations,
  • $ X $ and $ \varepsilon $ are independent.

If $ X(s) $ has kernel $ K $, then \(K_Y(s, s') = K(s, s') + \tau^2 \mathbf{1}_{\{s = s'\}}.\)

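For observations $ Y(s_1), \dots, Y(s_n) $: \(\Sigma = V + \tau^2 I,\) where $ V_{ij} = K(s_i, s_j) $ is the covariance matrix of the latent values.

Prediction of the latent signal: \(\hat X(s_0) = \mathbb{E}[X(s_0) \mid Y] = k^\top \Sigma^{-1} Y,\) where $ k_i = \operatorname{Cov}(X(s_0), Y(s_i)) = K(s_0, s_i) $.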

Important: \(\hat X(s_i) \neq Y(s_i) \quad \text{unless } \tau^2 = 0.\)
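The difference between smoothing and interpolation can be seen in a small example. The sketch below assumes one-dimensional sites and a squared-exponential kernel for the latent process. The sites, data, and function names are hypothetical. With $ \tau^2 > 0 $ the predictions at the observed sites differ from the data, and with $ \tau^2 = 0 $ they reproduce the data exactly.

```python
import numpy as np

def sq_exp(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D site arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * ell**2))

def krige_latent(s_obs, y, s_new, tau2, ell=1.0):
    """Predict X(s_new) from noisy data: k^T (V + tau^2 I)^{-1} Y."""
    V = sq_exp(s_obs, s_obs, ell)             # covariance of latent values
    k = sq_exp(s_new, s_obs, ell)             # rows: new sites, cols: data sites
    Sigma = V + tau2 * np.eye(len(s_obs))     # covariance of the noisy data
    return k @ np.linalg.solve(Sigma, y)

# Hypothetical sites and observations.
s = np.array([0.0, 1.0, 2.5])
y = np.array([0.3, -0.1, 0.8])

x_hat_noisy = krige_latent(s, y, s, tau2=0.25)   # smooths the data
x_hat_exact = krige_latent(s, y, s, tau2=0.0)    # interpolates the data

print("tau^2 = 0.25:", x_hat_noisy)
print("tau^2 = 0   :", x_hat_exact)
```

Note that the cross-covariance $ k $ contains no $ \tau^2 $ term even when the new site coincides with an observed site, because the target is the latent $ X $, not a new noisy observation.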


7. Covariance / Kernel Functions

Linear Kernel

\(K(x, x') = x^\top x'.\)

Rarely used in spatial statistics: the covariance depends on the inner product rather than on $ \|x - x'\| $, so correlation does not decay with distance.


Gaussian (Squared Exponential) Kernel

\(K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right).\)

  • Infinitely smooth
  • Length-scale $ \ell $ controls decay
  • Often too smooth for real data

Exponential Kernel

\(K(x, x') = \exp\left( -\frac{\|x - x'\|}{\ell} \right).\)

  • Rougher sample paths (not mean-square differentiable)
  • Special case of Matérn ($ \nu = \tfrac12 $)
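The three kernels above can be written as short NumPy functions. This is a sketch with hypothetical function names and example points, assuming inputs are stored as rows of a 2-D array. The checks at the end confirm that the Gaussian and exponential kernel matrices are symmetric with unit diagonal and have no negative eigenvalues beyond rounding error.

```python
import numpy as np

def pairwise_dist(X, Z):
    """Euclidean distances between rows of X (n x d) and rows of Z (m x d)."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def linear_kernel(X, Z):
    return X @ Z.T

def gaussian_kernel(X, Z, ell=1.0):
    return np.exp(-pairwise_dist(X, Z)**2 / (2 * ell**2))

def exponential_kernel(X, Z, ell=1.0):
    return np.exp(-pairwise_dist(X, Z) / ell)

# Hypothetical locations in the plane.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0]])

K_lin = linear_kernel(X, X)
K_gauss = gaussian_kernel(X, X)
K_exp = exponential_kernel(X, X)

for name, K in [("gaussian", K_gauss), ("exponential", K_exp)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```

The diagonal of `K_lin` is $ \|x\|^2 $, which grows with distance from the origin, whereas the other two kernels depend only on $ \|x - x'\| $.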

8. Matérn Kernel

Smoothness parameter $ \nu $:

  • $ \nu = \tfrac12 $: exponential kernel
  • Larger $ \nu $: smoother processes
  • $ \nu \to \infty $: Gaussian kernel

Spectral density: \(S(\omega) \propto \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-(\nu + d/2)}, \quad \omega \in \mathbb{R}^d.\)

Properties:

  • Valid in all dimensions
  • Parameters: variance, $ \ell $, $ \nu $
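For half-integer $ \nu $ the Matérn covariance has a simple closed form. The sketch below implements the cases $ \nu = \tfrac12, \tfrac32, \tfrac52 $, assuming the common parameterization in which the scaled distance is $ \sqrt{2\nu}\, r / \ell $ (consistent with the $ 2\nu/\ell^2 $ term in the spectral density above). The function name and the evaluation grid are hypothetical, and general $ \nu $ (which needs a modified Bessel function) is deliberately left out.

```python
import numpy as np

def matern(r, nu, ell=1.0, variance=1.0):
    """Matérn covariance at distance r, closed forms for nu in {0.5, 1.5, 2.5}."""
    r = np.asarray(r, dtype=float)
    a = np.sqrt(2.0 * nu) * r / ell          # scaled distance
    if nu == 0.5:
        poly = 1.0                           # exponential kernel
    elif nu == 1.5:
        poly = 1.0 + a
    elif nu == 2.5:
        poly = 1.0 + a + a**2 / 3.0
    else:
        raise ValueError("closed form implemented only for nu = 0.5, 1.5, 2.5")
    return variance * poly * np.exp(-a)

r = np.linspace(0.0, 3.0, 7)
values = {nu: matern(r, nu) for nu in (0.5, 1.5, 2.5)}
gaussian = np.exp(-r**2 / 2.0)               # the nu -> infinity limit, ell = 1

for nu, v in values.items():
    print(f"nu = {nu}:", np.round(v, 3))
print("gaussian: ", np.round(gaussian, 3))
```

Comparing the printed rows shows the $ \nu = \tfrac12 $ case coinciding with $ \exp(-r/\ell) $ and the curves moving toward the Gaussian kernel as $ \nu $ increases.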

9. Estimation of Kernel Parameters

Methods

  • Maximum Likelihood Estimation (MLE)
  • Bayesian inference (priors on parameters)

Log-likelihood (up to an additive constant): $$ \ell(\theta) = -\tfrac12 \log|\Sigma_\theta| - \tfrac12 Y^\top \Sigma_\theta^{-1} Y. $$

Gradient identity: \(\frac{\partial}{\partial \theta} \log|\Sigma| = \operatorname{tr}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta} \right).\)
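Both formulas can be checked on a toy problem. The sketch below assumes an exponential kernel with a nugget on one-dimensional sites. The data, parameter values, and function names are hypothetical. It evaluates the log-likelihood through a Cholesky factor and compares the trace identity for $ \partial \log|\Sigma| / \partial \ell $ against a central finite difference.

```python
import numpy as np

def exp_cov(s, ell, tau2):
    """Sigma_theta: exponential kernel on 1-D sites plus nugget tau2."""
    d = np.abs(s[:, None] - s[None, :])
    return np.exp(-d / ell) + tau2 * np.eye(len(s))

def log_lik(y, Sigma):
    """Gaussian log-likelihood, omitting the constant -(n/2) log(2 pi)."""
    L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
    alpha = np.linalg.solve(L, y)            # L^{-1} y, so alpha.alpha = y^T Sigma^{-1} y
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * logdet - 0.5 * (alpha @ alpha)

def dlogdet_dell(s, ell, tau2):
    """tr(Sigma^{-1} dSigma/d ell) for the covariance defined in exp_cov."""
    d = np.abs(s[:, None] - s[None, :])
    dSigma = np.exp(-d / ell) * d / ell**2   # derivative of exp(-d/ell) in ell
    return np.trace(np.linalg.solve(exp_cov(s, ell, tau2), dSigma))

# Hypothetical sites, data, and parameter values.
s = np.array([0.0, 0.7, 1.5, 2.2, 3.0])
y = np.array([0.5, -0.2, 0.1, 0.9, -0.4])
ell, tau2, h = 1.2, 0.1, 1e-5

analytic = dlogdet_dell(s, ell, tau2)
fd = (np.linalg.slogdet(exp_cov(s, ell + h, tau2))[1]
      - np.linalg.slogdet(exp_cov(s, ell - h, tau2))[1]) / (2 * h)

print("log-likelihood:", log_lik(y, exp_cov(s, ell, tau2)))
print("trace identity:", analytic, " finite difference:", fd)
```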


10. Important Caveat: Consistency

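  • Not all covariance parameters are consistently estimable, for example when observations become dense within a fixed domain
  • Some parameters (or combinations of them) have little effect on prediction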
  • Flat likelihood surfaces indicate weak identifiability
  • This motivated the move away from early least-squares variogram fitting

11. Stationarity and Isotropy

  • Stationarity: covariance depends only on the displacement $ s - s' $
  • Isotropy: covariance depends only on the distance $ \|s - s'\| $

Empirical (variogram) estimation:

  • Compute squared pairwise differences $ (Y(s_i) - Y(s_j))^2 $
  • Bin by distance
  • Fit theoretical curve (early approach)
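The first two steps above can be sketched as follows. This assumes one-dimensional sites and the classical method-of-moments estimator, which averages half the squared differences within each distance bin. The function name, bin edges, and toy data are hypothetical, and the curve-fitting step is not shown.

```python
import numpy as np

def empirical_variogram(s, y, bin_edges):
    """Average of 0.5 * (y_i - y_j)^2 over pairs, grouped by distance bin."""
    i, j = np.triu_indices(len(s), k=1)      # each unordered pair once
    dist = np.abs(s[i] - s[j])
    half_sq = 0.5 * (y[i] - y[j])**2
    which = np.digitize(dist, bin_edges) - 1 # bin index; out of range -> -1 or n_bins
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)          # NaN marks empty bins
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = which == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            gamma[b] = half_sq[mask].mean()
    return gamma, counts

# Toy data on a regular grid.
s = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 1.0])
edges = np.array([0.5, 1.5, 2.5, 3.5])

gamma, counts = empirical_variogram(s, y, edges)
print("gamma:", gamma, "pairs per bin:", counts)
```

Bins at large distances contain few pairs, so their estimates are noisy.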

Modern practice uses likelihood-based methods instead.


Closing Perspective

  • Kriging is orthogonal projection in $ L^2 $
  • Conditional expectation = projection
  • Covariance kernels encode geometry
  • Kernels define linear integral operators on function spaces; their eigenfunctions give a natural basis

(Functional analysis ideas will appear later in the course.)

