STT 997 — Lecture 02

Gaussian Processes, Kriging, and Prediction

Gaussian Process Recap

Gaussian Process: $Y(s) \sim \text{GP}(\mu(s), k(s,s'))$
Prediction target: $Y_0 = Y(s_0)$ at a new location $s_0$, with mean $\mathbb{E}(Y_0) = \mu(s_0)$
Given observations $Y$ with:
- mean $\mathbb{E}(Y)$
- covariance matrix $V = \mathrm{Var}(Y)$
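Predict $Y_0$ using the mean squared error (MSE) optimal predictor: $\hat{Y}_0 = \mathbb{E}(Y_0) + k^T V^{-1}(Y - \mathbb{E}(Y))$, where $k = \mathrm{Cov}(Y, Y_0)$ is the vector of covariances between the observations and the target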
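
A minimal numerical sketch of this predictor, assuming 1-D sites, an exponential covariance, and a known constant mean. The kernel `exp_cov`, the helper `predict`, and all data values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def exp_cov(a, b, sigma2=1.0, rho=0.5):
    """Exponential covariance sigma2 * exp(-|s - s'| / rho) on 1-D sites (assumed kernel)."""
    return sigma2 * np.exp(-np.abs(a[:, None] - b[None, :]) / rho)

def predict(y, mean_y, mean_y0, V, k):
    """Evaluate E(Y0) + k^T V^{-1} (Y - E(Y))."""
    # Solve V w = (y - mean_y) instead of forming V^{-1} explicitly.
    return mean_y0 + k @ np.linalg.solve(V, y - mean_y)

s = np.array([0.0, 0.3, 0.7, 1.0])   # observation sites (made up)
y = np.array([1.2, 0.8, 1.5, 1.1])   # observed values (made up)
s0 = np.array([0.5])                 # prediction site

V = exp_cov(s, s)                    # Var(Y), n x n
k = exp_cov(s, s0)[:, 0]             # Cov(Y, Y0), length n
mu = 1.0                             # assumed known constant mean

y0_hat = predict(y, mu * np.ones_like(y), mu, V, k)
```

If `k` is replaced by a column of `V` (that is, the prediction site is an observed site), the predictor returns the observed value at that site.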

Transcript alignment:
The instructor emphasizes that many Gaussian processes exist that interpolate the data exactly, and that this flexibility is both a strength and a danger (overfitting).

Zero-Mean Special Case

If all means are zero: $\hat{Y}_0 = k^T V^{-1} Y$ (explicitly noted as useful for homework)

Motivation: Why Gaussian Processes?

GP is a very rich class:
- Can interpolate any finite set of points
- Nonparametric (not a fixed-degree polynomial)
Many functions pass through the same observed points
Overfitting concern:
- Zero training error ≠ good prediction
- Analogy to polynomial interpolation and double descent
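
The point that many processes pass through the same observations can be illustrated with a small sketch. It assumes a zero-mean model and an exponential covariance with two different range parameters (both values are arbitrary choices for illustration): both fits reproduce the data exactly, yet they disagree away from the data.

```python
import numpy as np

def exp_cov(a, b, rho):
    """Exponential covariance exp(-|s - s'| / rho) (assumed kernel)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / rho)

s = np.array([0.0, 0.25, 0.6, 1.0])   # observed sites (made up)
y = np.array([0.5, -0.3, 0.9, 0.1])   # observed values (made up)
s_new = np.array([0.4])               # unobserved site

fits = {}
for rho in (0.1, 1.0):
    V = exp_cov(s, s, rho)
    w = np.linalg.solve(V, y)                 # V^{-1} Y
    at_data = V @ w                           # zero-mean predictions at the observed sites
    at_new = exp_cov(s, s_new, rho).T @ w     # zero-mean prediction at the new site
    fits[rho] = (at_data, at_new[0])
```

Both entries of `fits` have zero training error, so training error alone cannot choose between the two covariance models.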

Transcript:
“Any surface, any curve… there are numerous processes that go through the same points.”

Kriging: Best Linear Unbiased Prediction (BLUP)

Kriging = BLUP
Originated with D. G. Krige (South African mining engineer)
Historically developed outside statistics
Uses covariance structure, but historically framed via:
- Covariogram
- Analogous to periodogram in time series

Transcript:
Terminology differs because the field developed in geostatistics, not statistics.

Linear Predictor Form

We restrict attention to linear predictors: $\hat{Y}(s_0) = a + \sum_{i=1}^n \lambda_i Y(s_i)$

Unbiasedness Condition

$\mathbb{E}[\hat{Y}(s_0)] = \mathbb{E}[Y(s_0)]$

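If the mean is a constant $\mu$, unbiasedness requires $a + \mu \sum_i \lambda_i = \mu$:

Known $\mu$: $a = \mu\left(1 - \sum_i \lambda_i\right)$
Unknown $\mu$ (the condition must hold for every $\mu$): forces $a = 0$ and $\sum_i \lambda_i = 1$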

Mean Squared Error Minimization

Define (for an unbiased predictor with $a = 0$, so the prediction error has mean zero): $\mathrm{MSE} = \mathbb{E}\left(\sum_i \lambda_i Y(s_i) - Y(s_0)\right)^2 = \mathrm{Var}(\lambda^T Y - Y_0)$

Expands to: $\lambda^T V \lambda - 2 \lambda^T k + \mathrm{Var}(Y_0)$

Setting the gradient with respect to $\lambda$ to zero: $2V\lambda - 2k = 0 \quad\Rightarrow\quad V\lambda = k$

Thus: $\lambda = V^{-1}k$
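
A quick numerical check of this minimizer, sketched under an assumed exponential covariance with unit variance (sites and range are made up). It evaluates the quadratic MSE at $\lambda = V^{-1}k$ and at random perturbations of it.

```python
import numpy as np

s = np.array([0.0, 0.3, 0.7, 1.0])   # observation sites (made up)
s0, rho, var_y0 = 0.5, 0.5, 1.0      # prediction site, assumed range, Var(Y0)

V = np.exp(-np.abs(s[:, None] - s[None, :]) / rho)   # Var(Y)
k = np.exp(-np.abs(s - s0) / rho)                    # Cov(Y, Y0)

def mse(lam):
    """lambda^T V lambda - 2 lambda^T k + Var(Y0)."""
    return lam @ V @ lam - 2 * lam @ k + var_y0

lam_star = np.linalg.solve(V, k)     # solves V lambda = k
mse_star = mse(lam_star)             # equals Var(Y0) - k^T V^{-1} k

# Any perturbation d adds d^T V d > 0 to the MSE because V is positive definite.
rng = np.random.default_rng(0)
perturbed = [mse(lam_star + 0.1 * rng.standard_normal(len(s))) for _ in range(100)]
```

Substituting $\lambda = V^{-1}k$ back into the expansion gives the minimum value $\mathrm{Var}(Y_0) - k^T V^{-1} k$, which is what `mse_star` computes.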

Simple Kriging vs Ordinary Kriging

Simple Kriging

Mean $\mu$ is known
Predictor: $\hat{Y}_0 = \mu + k^T V^{-1}(Y - \mu \mathbf{1})$

Ordinary Kriging

Mean $\mu$ unknown, but assumed constant
Impose unbiasedness for all $\mu$: $\sum_i \lambda_i = 1$
Solve constrained minimization problem
Leads to system: $\lambda = V^{-1}(k + m\mathbf{1})$ with Lagrange multiplier $m$

Final predictor: $\hat{Y}_0 = \hat{\mu} + \hat{b}^T (Y - \mathbf{1}\hat{\mu})$

where: $\hat{\mu} = \frac{\mathbf{1}^T V^{-1} Y}{\mathbf{1}^T V^{-1} \mathbf{1}} \quad \hat{b} = V^{-1}k$
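
The ordinary kriging system can be sketched in a few lines. The closed form for the multiplier, $m = (1 - \mathbf{1}^T V^{-1} k) / (\mathbf{1}^T V^{-1} \mathbf{1})$, follows from substituting $\lambda = V^{-1}(k + m\mathbf{1})$ into $\sum_i \lambda_i = 1$; the function name, kernel, and data below are illustrative assumptions.

```python
import numpy as np

def ordinary_kriging(y, V, k):
    """Return (prediction, weights, GLS mean estimate) for a constant unknown mean."""
    ones = np.ones(len(y))
    Vinv_k = np.linalg.solve(V, k)          # b_hat = V^{-1} k
    Vinv_1 = np.linalg.solve(V, ones)       # V^{-1} 1
    m = (1.0 - ones @ Vinv_k) / (ones @ Vinv_1)   # Lagrange multiplier
    lam = Vinv_k + m * Vinv_1               # lambda = V^{-1}(k + m 1)
    mu_hat = (Vinv_1 @ y) / (ones @ Vinv_1) # 1^T V^{-1} Y / 1^T V^{-1} 1 (V symmetric)
    y0_hat = mu_hat + Vinv_k @ (y - mu_hat * ones)
    return y0_hat, lam, mu_hat

s = np.array([0.0, 0.3, 0.7, 1.0])          # sites (made up)
y = np.array([1.2, 0.8, 1.5, 1.1])          # observations (made up)
V = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.5)   # assumed exponential covariance
k = np.exp(-np.abs(s - 0.5) / 0.5)

y0_hat, lam, mu_hat = ordinary_kriging(y, V, k)
```

The weights sum to one, and the weighted sum `lam @ y` agrees with the plug-in form $\hat{\mu} + \hat{b}^T (Y - \mathbf{1}\hat{\mu})$, so the two expressions describe the same predictor.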

Transcript emphasis:
Estimating the mean increases prediction variance, analogous to $n$ vs $n-1$ in variance estimation.
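
The variance increase can be made explicit. Plugging the ordinary kriging weights into $\lambda^T V \lambda - 2\lambda^T k + \mathrm{Var}(Y_0)$ gives the simple kriging variance $\mathrm{Var}(Y_0) - k^T V^{-1} k$ plus the nonnegative term $(1 - \mathbf{1}^T V^{-1} k)^2 / (\mathbf{1}^T V^{-1} \mathbf{1})$. The sketch below checks this numerically under an assumed exponential covariance with made-up sites.

```python
import numpy as np

s = np.array([0.0, 0.3, 0.7, 1.0])   # sites (made up)
s0, rho, var_y0 = 0.5, 0.5, 1.0      # prediction site, assumed range, Var(Y0)
V = np.exp(-np.abs(s[:, None] - s[None, :]) / rho)
k = np.exp(-np.abs(s - s0) / rho)
ones = np.ones(len(s))

Vinv_k = np.linalg.solve(V, k)
Vinv_1 = np.linalg.solve(V, ones)

sk_var = var_y0 - k @ Vinv_k                               # known mean
penalty = (1.0 - ones @ Vinv_k) ** 2 / (ones @ Vinv_1)     # cost of estimating the mean
ok_var = sk_var + penalty                                  # unknown constant mean

# Cross-check: evaluate the MSE directly at the ordinary kriging weights.
m = (1.0 - ones @ Vinv_k) / (ones @ Vinv_1)
lam = Vinv_k + m * Vinv_1
ok_var_direct = lam @ V @ lam - 2 * lam @ k + var_y0
```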

Universal Kriging

Assume mean varies spatially: $\mu(s) = X(s)^T \beta$

Model: $Y(s) = \sum_{j=1}^p \beta_j x_j(s) + \varepsilon(s)$

Mean is linear in known covariates (e.g. latitude)
Covariates must be observable at prediction location
Unbiasedness constraint: $X^T \lambda = X(s_0)$, where $X$ is the $n \times p$ design matrix with rows $X(s_i)^T$
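
A sketch of universal kriging with an intercept and one covariate (the site coordinate). Minimizing the MSE subject to $X^T \lambda = X(s_0)$ by Lagrange multipliers gives $\lambda = V^{-1}(k + Xm)$ with $m = (X^T V^{-1} X)^{-1}(X(s_0) - X^T V^{-1} k)$, which is equivalent to plugging the generalized least squares estimate of $\beta$ into the simple kriging form. The function name, kernel, and data are illustrative assumptions.

```python
import numpy as np

def universal_kriging(y, X, x0, V, k):
    """Return (prediction, weights, GLS coefficient estimate)."""
    Vinv_k = np.linalg.solve(V, k)
    Vinv_X = np.linalg.solve(V, X)
    A = X.T @ Vinv_X                               # X^T V^{-1} X
    beta_hat = np.linalg.solve(A, Vinv_X.T @ y)    # GLS estimate of beta
    m = np.linalg.solve(A, x0 - X.T @ Vinv_k)      # Lagrange multipliers
    lam = Vinv_k + Vinv_X @ m                      # lambda = V^{-1}(k + X m)
    y0_hat = x0 @ beta_hat + Vinv_k @ (y - X @ beta_hat)
    return y0_hat, lam, beta_hat

s = np.array([0.0, 0.2, 0.5, 0.8, 1.0])            # sites (made up)
y = np.array([0.9, 1.1, 1.6, 1.8, 2.3])            # observations (made up)
s0 = 0.6                                           # prediction site

X = np.column_stack([np.ones_like(s), s])          # intercept + coordinate covariate
x0 = np.array([1.0, s0])                           # covariates at the prediction site
V = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.5) # assumed exponential covariance
k = np.exp(-np.abs(s - s0) / 0.5)

y0_hat, lam, beta_hat = universal_kriging(y, X, x0, V, k)
```

Ordinary kriging is the special case where `X` is a single column of ones.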

Transcript:
If $X(s_0)$ is not in the row space of $X$ (the column space of $X^T$), the constraint $X^T \lambda = X(s_0)$ has no solution, so no linear unbiased predictor exists.
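
A small sketch of how this failure shows up numerically. The design below is a made-up example in which the second covariate is zero at every observed site, so the data carry no information about its coefficient; the helper `constraint_residual` is hypothetical.

```python
import numpy as np

def constraint_residual(X, x0):
    """Distance from x0 to the set {X^T lambda}; zero iff the constraint is solvable."""
    lam, *_ = np.linalg.lstsq(X.T, x0, rcond=None)   # least-squares solution of X^T lam = x0
    return np.linalg.norm(X.T @ lam - x0)

# Three observations: intercept column plus a covariate that is always 0.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0]])

feasible = constraint_residual(X, np.array([1.0, 0.0]))    # covariate also 0 at s0
infeasible = constraint_residual(X, np.array([1.0, 1.0]))  # covariate is 1 at s0
```

A residual near zero means unbiased weights exist; a clearly positive residual means the constraint cannot be satisfied by any choice of weights.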

Kriging vs Gaussian Process Prediction

Key insight:

Kriging uses only first two moments
Does not require Gaussianity
Gaussian assumption simplifies interpretation

Assume $(Y, Y_0)$ is jointly Gaussian, with $\mathrm{Var}(Y) = V$ and $\mathrm{Cov}(Y, Y_0) = k$

Then the optimal predictor over all measurable functions is: $g(Y) = \mathbb{E}(Y_0 \mid Y)$

For Gaussian variables: $\mathbb{E}(Y_0 \mid Y) = \mathbb{E}(Y_0) + k^T V^{-1}(Y - \mathbb{E}(Y))$, which reduces to $k^T V^{-1} Y$ in the zero-mean case

Thus:

BLUP = conditional expectation when Gaussian

Geometric Interpretation

Prediction is an orthogonal projection
Space:
- Hilbert space $L^2$
- Project $Y_0$ onto span of observed $Y$

Analogy:

Linear regression
Conditional expectation as projection
Error orthogonal to predictor space
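
The orthogonality of the prediction error can be checked directly: $\mathrm{Cov}(Y, Y_0 - \lambda^T Y) = k - V\lambda$, which vanishes when $V\lambda = k$. The sketch below verifies this exactly and by simulation, assuming a zero-mean Gaussian vector with an exponential covariance (sites, range, and sample size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)

sites = np.array([0.0, 0.3, 0.7, 1.0, 0.5])    # last entry plays the role of s0
C = np.exp(-np.abs(sites[:, None] - sites[None, :]) / 0.5)   # joint covariance of (Y, Y0)
V, k = C[:4, :4], C[:4, 4]

lam = np.linalg.solve(V, k)                    # projection coefficients
exact = k - V @ lam                            # Cov(Y, Y0 - lam^T Y), should be zero

# Monte Carlo version: the sample inner product <Y_i, error> should be near zero.
Z = rng.multivariate_normal(np.zeros(5), C, size=100_000)
Y, Y0 = Z[:, :4], Z[:, 4]
resid = Y0 - Y @ lam                           # prediction error for each draw
emp = Y.T @ resid / len(resid)                 # empirical covariances (zero-mean model)
```

In $L^2$ the inner product of zero-mean variables is their covariance, so `exact` being zero is exactly the statement that the error is orthogonal to the span of the observations.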

Transcript:
“First linear prediction is really a projection.”

Conditional Multivariate Normal

Partition: $\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim \mathcal{N} \left( 0, \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} \right)$

Then: $Y_1 \mid Y_2 \sim \mathcal{N} \left( V_{12} V_{22}^{-1} Y_2, \; V_{11} - V_{12}V_{22}^{-1}V_{21} \right)$

(board note ends here)
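
A minimal sketch of the conditional formulas for the zero-mean partitioned case, with $Y_1$ the values at two unobserved sites and $Y_2$ the observations. The helper name, kernel, and data are illustrative assumptions.

```python
import numpy as np

def conditional_mvn(V11, V12, V22, y2):
    """Mean and covariance of Y1 | Y2 = y2 for a zero-mean joint Gaussian."""
    mean = V12 @ np.linalg.solve(V22, y2)             # V12 V22^{-1} y2
    cov = V11 - V12 @ np.linalg.solve(V22, V12.T)     # V11 - V12 V22^{-1} V21
    return mean, cov

def exp_cov(a, b, rho=0.5):
    """Exponential covariance exp(-|s - s'| / rho) (assumed kernel)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / rho)

s_obs = np.array([0.0, 0.3, 0.7, 1.0])    # observed sites (made up)
s_new = np.array([0.5, 0.9])              # prediction sites (made up)
y2 = np.array([0.4, -0.2, 0.7, 0.1])      # observed values (made up)

V11 = exp_cov(s_new, s_new)
V12 = exp_cov(s_new, s_obs)
V22 = exp_cov(s_obs, s_obs)

cond_mean, cond_cov = conditional_mvn(V11, V12, V22, y2)
```

Each row of `V12` is the vector $k^T$ for one prediction site, so each entry of `cond_mean` is the zero-mean kriging predictor $k^T V^{-1} Y$ and each diagonal entry of `cond_cov` is the corresponding prediction variance.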

Big Picture Takeaways

Kriging = BLUP
Gaussian processes justify kriging via conditional expectation
Unknown mean increases prediction uncertainty
Universal kriging generalizes regression + spatial dependence
Everything reduces to projection in Hilbert space
