STT 997 — Lecture 03
Gaussian Processes, Projection View of Kriging, and Covariance Functions
Wednesday, January 21, 2026
1. Gaussian Prediction as Projection
Assume all Gaussian random variables have mean zero.
Let
- $ Y_0 $ be the target random variable (e.g., value at a new location),
- $ Y = (Y_1, \dots, Y_n)^\top $ be observed Gaussian variables,
- $ V = \operatorname{Cov}(Y, Y) $, assumed invertible,
- $ k = \operatorname{Cov}(Y, Y_0) \in \mathbb{R}^n $ (a column vector).
Then the Gaussian conditional expectation is \(\mathbb{E}[Y_0 \mid Y] = k^\top V^{-1} Y.\)
This is not merely the best linear predictor: for jointly Gaussian variables it is the best predictor in the mean-squared-error sense among all functions of $ Y $.
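As a minimal numerical sketch of the formula above: the helper name `gaussian_predict` and the numbers in `V`, `k`, and `y` are made up for illustration and are not from the lecture. The predictor is computed with a linear solve rather than an explicit inverse.

```python
import numpy as np

def gaussian_predict(V, k, y):
    """Return k^T V^{-1} y using a linear solve instead of inverting V."""
    weights = np.linalg.solve(V, k)   # V^{-1} k (V is symmetric)
    return weights @ y

# Illustrative (hypothetical) covariances for three observed variables.
V = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
k = np.array([0.6, 0.8, 0.6])         # Cov(Y, Y_0)
y = np.array([0.3, -0.1, 0.4])        # observed values of Y

y0_hat = gaussian_predict(V, k, y)
print(y0_hat)
```

Solving the system `V w = k` once gives the kriging weights `w`, which can then be reused for any observed vector `y`.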
2. Hilbert Space Setup
Define the linear space \(\mathcal{H} = \left\{ \sum_{i=1}^n c_i Y_i : c_i \in \mathbb{R} \right\} \subset L^2(\Omega, \mathcal{F}, \mathbb{P}).\)
Equip $ L^2 $ with inner product \(\langle X, Z \rangle = \mathbb{E}[XZ].\)
Then:
- $ L^2 $ is a Hilbert space,
- $ \mathcal{H} $ is a finite-dimensional closed subspace,
- $ Y_0 \in L^2 $, but generally $ Y_0 \notin \mathcal{H} $.
3. Projection Interpretation
We project $ Y_0 $ onto $ \mathcal{H} $.
Define \(\hat Y_0 = \operatorname{Proj}_{\mathcal{H}}(Y_0).\)
This projection satisfies \(\|Y_0 - \hat Y_0\|^2 = \min_{Z \in \mathcal{H}} \|Y_0 - Z\|^2.\)
Geometric properties:
- $ Y_0 - \hat Y_0 \perp \mathcal{H} $
- Orthogonality corresponds to uncorrelatedness: \(\mathbb{E}[(Y_0 - \hat Y_0) Z] = 0 \quad \forall Z \in \mathcal{H}.\)
Decomposition: \(Y_0 = \hat Y_0 + (Y_0 - \hat Y_0),\) \(\|Y_0\|^2 = \|\hat Y_0\|^2 + \|Y_0 - \hat Y_0\|^2.\)
4. Claim: Conditional Expectation = Projection
Claim \(\hat Y_0 = k^\top V^{-1} Y = \operatorname{Proj}_{\mathcal{H}}(Y_0).\)
Proof Sketch
Let $ Z = c^\top Y \in \mathcal{H} $. Then \(\langle Y_0 - k^\top V^{-1} Y, Z \rangle = \mathbb{E}[(Y_0 - k^\top V^{-1} Y)(c^\top Y)].\)
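Compute: \(= \mathbb{E}[Y_0 Y^\top] c - k^\top V^{-1} \mathbb{E}[Y Y^\top] c = k^\top c - k^\top V^{-1} V c = 0.\)
Thus the difference is orthogonal to every element of $ \mathcal{H} $. Since $ k^\top V^{-1} Y $ itself lies in $ \mathcal{H} $, it is the projection of $ Y_0 $ onto $ \mathcal{H} $.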
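The orthogonality argument can also be checked with exact covariance algebra, without simulation. The sketch below assumes a hypothetical joint covariance matrix `C` of $ (Y_0, Y_1, Y_2, Y_3) $ with illustrative values. It writes the residual as a linear combination `a` of the joint vector, then evaluates its covariance with each $ Y_j $.

```python
import numpy as np

# Hypothetical joint covariance of (Y_0, Y_1, Y_2, Y_3); values are illustrative.
var0 = 1.0
k = np.array([0.6, 0.8, 0.6])
V = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
C = np.block([[np.array([[var0]]), k[None, :]],
              [k[:, None], V]])

w = np.linalg.solve(V, k)            # projection weights V^{-1} k
a = np.concatenate(([1.0], -w))      # residual Y_0 - w^T Y written as a^T (Y_0, Y)

resid_cov_with_Y = a @ C[:, 1:]      # Cov(Y_0 - Y0_hat, Y_j) for each j
resid_var = a @ C @ a                # ||Y_0 - Y0_hat||^2
proj_var = w @ V @ w                 # ||Y0_hat||^2

print(resid_cov_with_Y)              # numerically zero
print(var0, proj_var + resid_var)    # Pythagorean decomposition
```

The residual is uncorrelated with every observed variable, and the two squared norms add up to $ \|Y_0\|^2 $, as in the decomposition of Section 3.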
5. Prediction Error Variance
The mean squared prediction error is \(\mathbb{E}[(Y_0 - \hat Y_0)^2] = \|Y_0 - \hat Y_0\|^2 = \operatorname{Var}(Y_0) - k^\top V^{-1} k.\)
Sanity check:
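The prediction error variance can never exceed the marginal variance $ \operatorname{Var}(Y_0) $: since $ V^{-1} $ is positive definite, $ k^\top V^{-1} k \ge 0 $.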
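A small sketch of the error-variance formula, with a hypothetical helper `prediction_mse` and illustrative numbers. It also shows a consequence of the projection view: conditioning on more variables enlarges the subspace being projected onto, so the error cannot grow.

```python
import numpy as np

def prediction_mse(var0, V, k):
    """Return Var(Y_0) - k^T V^{-1} k."""
    return var0 - k @ np.linalg.solve(V, k)

# Illustrative (hypothetical) covariances.
var0 = 1.0
V = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
k = np.array([0.6, 0.8, 0.6])

mse_all = prediction_mse(var0, V, k)                  # uses Y_1, Y_2, Y_3
mse_two = prediction_mse(var0, V[:2, :2], k[:2])      # uses only Y_1, Y_2
print(mse_two, mse_all)
```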
6. Kriging with Measurement Error
Observation model: \(Y(s) = X(s) + \varepsilon(s),\) where:
- $ X(s) $ is the latent spatial process,
- $ \varepsilon(s) $ is Gaussian white noise: $ \varepsilon(s) \sim \mathcal{N}(0, \tau^2) $, independent across locations,
- $ X \perp \varepsilon $ (here $ \perp $ means the signal and noise processes are independent).
If $ X(s) $ has kernel $ K $, then \(K_Y(s, s') = K(s, s') + \tau^2 \mathbf{1}_{\{s = s'\}}.\)
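For observations $ Y(s_1), \dots, Y(s_n) $ at distinct locations, let $ V = [K(s_i, s_j)]_{i,j} $ now denote the covariance matrix of the latent process. Then \(\Sigma = \operatorname{Cov}(Y, Y) = V + \tau^2 I.\)
Prediction of the latent signal, with $ k = \operatorname{Cov}(Y, X(s_0)) = [K(s_0, s_i)]_i $: \(\hat X(s_0) = \mathbb{E}[X(s_0) \mid Y] = k^\top \Sigma^{-1} Y.\)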
Important: \(\hat X(s_i) \neq Y(s_i) \quad \text{unless } \tau^2 = 0.\)
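The difference between smoothing ($ \tau^2 > 0 $) and exact interpolation ($ \tau^2 = 0 $) can be seen in a short sketch. It assumes one-dimensional locations and an exponential kernel with unit variance. The function names, locations, and data values are hypothetical.

```python
import numpy as np

def exp_kernel(s, t, ell=1.0):
    """Exponential kernel matrix between 1-D location arrays s and t."""
    return np.exp(-np.abs(s[:, None] - t[None, :]) / ell)

def predict_latent(s_obs, y, s_new, tau2, ell=1.0):
    """Return E[X(s_new) | Y] = k^T (V + tau2 I)^{-1} y for each new site."""
    Sigma = exp_kernel(s_obs, s_obs, ell) + tau2 * np.eye(len(s_obs))
    K_new = exp_kernel(s_new, s_obs, ell)      # each row is k^T for one new site
    return K_new @ np.linalg.solve(Sigma, y)

s_obs = np.array([0.0, 1.0, 2.5])              # illustrative locations
y = np.array([1.0, -0.5, 0.8])                 # illustrative noisy observations

smoothed = predict_latent(s_obs, y, s_obs, tau2=0.25)
interpolated = predict_latent(s_obs, y, s_obs, tau2=0.0)
print(smoothed)        # differs from y
print(interpolated)    # reproduces y
```

With $ \tau^2 = 0 $ the predictor at an observed site returns the observation itself. With $ \tau^2 > 0 $ it returns a smoothed value instead.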
7. Covariance / Kernel Functions
Linear Kernel
\(K(x, x') = x^\top x'\) Rarely used in spatial statistics, since correlation does not decay with distance.
Gaussian (Squared Exponential) Kernel
\(K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right).\)
- Infinitely smooth
- Length-scale $ \ell $ controls decay
- Often too smooth for real data
Exponential Kernel
\(K(x, x') = \exp\left( -\frac{\|x - x'\|}{\ell} \right).\)
- Rougher sample paths
- Special case of Matérn
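The three kernels above can be sketched as Gram-matrix functions. The function names and the sample points are assumptions made for illustration. Inputs are arrays with one point per row.

```python
import numpy as np

def pairwise_dist(X, Z):
    """Euclidean distances between rows of X and rows of Z."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def linear_kernel(X, Z):
    return X @ Z.T

def gaussian_kernel(X, Z, ell=1.0):
    return np.exp(-pairwise_dist(X, Z) ** 2 / (2.0 * ell ** 2))

def exponential_kernel(X, Z, ell=1.0):
    return np.exp(-pairwise_dist(X, Z) / ell)

# Illustrative points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
K_lin = linear_kernel(X, X)
K_gauss = gaussian_kernel(X, X)
K_exp = exponential_kernel(X, X)

# A valid covariance matrix has no negative eigenvalues (up to round-off).
for name, K in [("linear", K_lin), ("gaussian", K_gauss), ("exponential", K_exp)]:
    print(name, np.linalg.eigvalsh(K).min())
```

At the same distance and length-scale, the Gaussian kernel stays closer to 1 near the origin and then drops faster than the exponential kernel, which is one way to see the difference in smoothness.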
8. Matérn Kernel
Smoothness parameter $ \nu $:
- $ \nu = \tfrac12 $: exponential kernel
- Larger $ \nu $: smoother processes
- $ \nu \to \infty $: Gaussian kernel
Spectral density: \(S(\omega) \propto \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-(\nu + d/2)}, \quad \omega \in \mathbb{R}^d.\)
Properties:
- Valid in all dimensions
- Parameters: variance, $ \ell $, $ \nu $
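For half-integer $ \nu $ the Matérn correlation has a simple closed form. The sketch below implements only $ \nu \in \{1/2, 3/2, 5/2\} $ with unit variance, using the standard parametrization in which the distance is scaled by $ \sqrt{2\nu}/\ell $. The function name `matern` is an assumption for illustration.

```python
import numpy as np

def matern(r, ell=1.0, nu=0.5):
    """Matern correlation at distance r for nu in {0.5, 1.5, 2.5}."""
    r = np.asarray(r, dtype=float)
    if nu == 0.5:
        return np.exp(-r / ell)                      # exponential kernel
    if nu == 1.5:
        a = np.sqrt(3.0) * r / ell
        return (1.0 + a) * np.exp(-a)
    if nu == 2.5:
        a = np.sqrt(5.0) * r / ell
        return (1.0 + a + a ** 2 / 3.0) * np.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")

r = np.linspace(0.0, 3.0, 7)
for nu in (0.5, 1.5, 2.5):
    print(nu, matern(r, nu=nu))
print("gaussian", np.exp(-r ** 2 / 2.0))             # the nu -> infinity limit
```

General $ \nu $ requires a modified Bessel function of the second kind, which is omitted here to keep the example dependency-free.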
9. Estimation of Kernel Parameters
Methods
- Maximum Likelihood Estimation (MLE)
- Bayesian inference (priors on parameters)
Log-likelihood (up to an additive constant; here $ \ell(\theta) $ denotes the log-likelihood, not the length-scale): \(\ell(\theta) = -\tfrac12 \log|\Sigma_\theta| - \tfrac12 Y^\top \Sigma_\theta^{-1} Y.\)
Gradient identity: \(\frac{\partial}{\partial \theta} \log|\Sigma| = \operatorname{tr}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta} \right).\)
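A sketch of how the log-likelihood and the log-determinant gradient might be evaluated in practice. It assumes one-dimensional locations, an exponential kernel plus nugget, and differentiation with respect to the length-scale only. All function names and data values are hypothetical. The Cholesky factor gives both the log-determinant and the quadratic form.

```python
import numpy as np

def build_sigma(s, ell, tau2):
    """Covariance of Y at 1-D sites s: exponential kernel plus nugget."""
    d = np.abs(s[:, None] - s[None, :])
    return np.exp(-d / ell) + tau2 * np.eye(len(s))

def log_lik(y, Sigma):
    """-0.5 log|Sigma| - 0.5 y^T Sigma^{-1} y (additive constant dropped)."""
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    alpha = np.linalg.solve(L, y)              # alpha @ alpha = y^T Sigma^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * logdet - 0.5 * alpha @ alpha

def dlogdet_dell(s, ell, tau2):
    """tr(Sigma^{-1} dSigma/d ell), the gradient identity for log|Sigma|."""
    d = np.abs(s[:, None] - s[None, :])
    dSigma = np.exp(-d / ell) * d / ell ** 2   # derivative of exp(-d/ell) in ell
    return np.trace(np.linalg.solve(build_sigma(s, ell, tau2), dSigma))

s = np.array([0.0, 0.7, 1.5, 2.2])             # illustrative sites
y = np.array([0.4, -0.2, 0.9, 0.1])            # illustrative data
ell, tau2, h = 1.2, 0.1, 1e-5

# Central finite difference of log|Sigma| as an independent check.
fd = (np.linalg.slogdet(build_sigma(s, ell + h, tau2))[1]
      - np.linalg.slogdet(build_sigma(s, ell - h, tau2))[1]) / (2.0 * h)
print(log_lik(y, build_sigma(s, ell, tau2)))
print(fd, dlogdet_dell(s, ell, tau2))          # the two should agree closely
```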
10. Important Caveat: Consistency
- Not all covariance parameters are consistently estimable
- Some parameters do not affect prediction
- Flat likelihood surfaces indicate weak identifiability
- This motivated the move away from early least-squares variogram fitting
11. Stationarity and Isotropy
- Stationarity: the covariance depends only on the displacement, $ \operatorname{Cov}(Y(s), Y(s')) = C(s - s') $
- Isotropy: the covariance depends only on the distance $ \|s - s'\| $
Empirical estimation:
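- Compute squared differences $ (Y(s_i) - Y(s_j))^2 $ for all pairs of sites
- Bin the pairs by distance and average within each bin (the empirical variogram)
- Fit a theoretical curve to the binned values (early approach)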
Modern practice uses likelihood-based methods instead.
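The binning procedure above can be sketched as follows. This assumes one-dimensional sites and the usual half-squared-difference (semivariogram) convention, which the notes do not spell out. The function name and toy data are hypothetical.

```python
import numpy as np

def empirical_variogram(s, y, bin_edges):
    """Average of 0.5 * (y_i - y_j)^2 over pairs, grouped by distance bin."""
    i, j = np.triu_indices(len(s), k=1)          # all pairs with i < j
    dist = np.abs(s[i] - s[j])
    half_sq = 0.5 * (y[i] - y[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1     # bin index for each pair
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)              # NaN marks empty bins
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            gamma[b] = half_sq[mask].mean()
    return gamma

# Toy data: four equally spaced sites with alternating values.
s = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 1.0])
gamma = empirical_variogram(s, y, bin_edges=np.array([0.5, 1.5, 2.5, 3.5]))
print(gamma)   # one value per distance bin
```

Pairs whose distance falls outside the bin edges are ignored. The choice of bins is a tuning decision, which is one practical reason likelihood-based fitting is preferred.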
Closing Perspective
- Kriging is orthogonal projection in $ L^2 $
- Conditional expectation = projection
- Covariance kernels encode geometry
- Covariance kernels define linear operators on function spaces, and their eigenfunctions describe the process
(Functional analysis ideas will appear later in the course.)