STT 997 — Lecture 02

Gaussian Processes, Kriging, and Prediction


Gaussian Process Recap

  • Gaussian Process: \(Y(s) \sim \text{GP}(\mu(s), k(s,s'))\)

  • Prediction target: $Y_0 = Y(s_0)$, the process at a new location $s_0$, with \(\mathbb{E}(Y_0) = \mathbb{E}(Y(s_0)) = \mu(s_0)\)

  • Given observations $Y$ with:
    • mean $\mathbb{E}(Y)$
    • covariance matrix $V = \mathrm{Var}(Y)$
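  • Predict $Y_0$ using the mean squared error (MSE) optimal predictor, where $k = \mathrm{Cov}(Y, Y_0)$ is the vector of covariances between the observations and the target: \(\hat{Y}_0 = \mathbb{E}(Y_0) + k^T V^{-1}(Y - \mathbb{E}(Y))\)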

Transcript alignment:
The instructor emphasizes that many Gaussian processes exist that interpolate the data exactly, and that this flexibility is both a strength and a danger (overfitting).


Zero-Mean Special Case

If all means are zero: \(\hat{Y}_0 = k^T V^{-1} Y\) (explicitly noted as useful for homework)
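
A minimal numerical sketch of the zero-mean predictor follows. The squared-exponential covariance, the function names `sq_exp_kernel` and `predict_zero_mean`, and the data values are all assumptions made for illustration; the lecture does not fix a covariance function.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D location arrays a and b
    (an assumed kernel, chosen only for illustration)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def predict_zero_mean(s_obs, y_obs, s_new, kernel=sq_exp_kernel):
    """Zero-mean predictor k^T V^{-1} Y at each location in s_new."""
    V = kernel(s_obs, s_obs)          # n x n covariance of the observations
    K = kernel(s_obs, s_new)          # n x m; column j is k for s_new[j]
    weights = np.linalg.solve(V, K)   # V^{-1} k without forming the inverse
    return weights.T @ y_obs

# Made-up observations at four locations
s_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([0.5, -0.3, 0.8, 0.1])

# Predict at an unobserved location (1.5) and at an observed one (2.0)
s_new = np.array([1.5, 2.0])
y_hat = predict_zero_mean(s_obs, y_obs, s_new)
```

Solving the linear system with `np.linalg.solve` rather than inverting $V$ is a common numerical choice. With no noise term in $V$, the prediction at the observed location 2.0 reproduces the observed value 0.8.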


Motivation: Why Gaussian Processes?

  • GP is a very rich class:
    • Can interpolate any finite set of points
    • Nonparametric (not a fixed-degree polynomial)
  • Many functions pass through the same observed points
  • Overfitting concern:
    • Zero training error ≠ good prediction
    • Analogy to polynomial interpolation and double descent

Transcript:
“Any surface, any curve… there are numerous processes that go through the same points.”
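
The point that many processes pass through the same observations can be checked numerically. The sketch below uses two assumed covariance functions (squared exponential and exponential) and made-up data; the names `predict`, `smooth_cov`, and `rough_cov` are hypothetical.

```python
import numpy as np

def predict(cov, s_obs, y_obs, s_new):
    """Zero-mean predictor k^T V^{-1} Y for a covariance function of the lag."""
    V = cov(s_obs[:, None] - s_obs[None, :])
    K = cov(s_obs[:, None] - s_new[None, :])
    return np.linalg.solve(V, K).T @ y_obs

def smooth_cov(d):
    """Squared-exponential covariance (assumed for illustration)."""
    return np.exp(-0.5 * d ** 2)

def rough_cov(d):
    """Exponential covariance (assumed for illustration)."""
    return np.exp(-np.abs(d))

s_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([0.5, -0.3, 0.8, 0.1])

# Both predictors reproduce the data at the observed locations
fit_smooth = predict(smooth_cov, s_obs, y_obs, s_obs)
fit_rough = predict(rough_cov, s_obs, y_obs, s_obs)

# Between the observations the two predictors differ
s_mid = np.array([0.5, 1.5, 2.5])
gap = np.abs(predict(smooth_cov, s_obs, y_obs, s_mid)
             - predict(rough_cov, s_obs, y_obs, s_mid))
```

Both fits have zero training error, so training error alone cannot distinguish the two covariance models.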


Kriging: Best Linear Unbiased Prediction (BLUP)

  • Kriging = BLUP
  • Named after D. G. Krige (South African mining engineer)
  • Historically developed outside statistics
  • Uses covariance structure, but historically framed via:
    • Covariogram
    • Analogous to periodogram in time series

Transcript:
Terminology differs because the field developed in geostatistics, not statistics.


Linear Predictor Form

We restrict attention to linear predictors: \(\hat{Y}(s_0) = a + \sum_{i=1}^n \lambda_i Y(s_i)\)

Unbiasedness Condition

\(\mathbb{E}[\hat{Y}(s_0)] = \mathbb{E}[Y(s_0)]\)

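If the mean is an unknown constant $\mu$, unbiasedness must hold for every $\mu$, i.e. \(a + \mu \sum_i \lambda_i = \mu\):

  • Forces $a = 0$
  • Coefficients must sum to one: \(\sum_i \lambda_i = 1\)

If $\mu$ is known, the weights are unconstrained and \(a = \mu\left(1 - \sum_i \lambda_i\right)\).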

Mean Squared Error Minimization

Take $a = 0$ (zero-mean case). The prediction error then has mean zero, so the MSE equals its variance: \(\mathrm{MSE} = \mathbb{E}\left(\sum_i \lambda_i Y(s_i) - Y(s_0)\right)^2 = \mathrm{Var}(\lambda^T Y - Y_0)\)

Expands to: \(\lambda^T V \lambda - 2 \lambda^T k + \mathrm{Var}(Y_0)\)

Taking derivatives: \(2V\lambda - 2k = 0 \quad\Rightarrow\quad V\lambda = k\)

Thus: \(\lambda = V^{-1}k\)
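
As a worked check of the steps above, the expansion and the resulting minimum can be written out term by term. Here $k = \mathrm{Cov}(Y, Y_0)$ is read as the vector of covariances between the observations and the target, and $\sigma_0^2 = \mathrm{Var}(Y_0)$ is shorthand introduced for this sketch.

\(\mathrm{Var}(\lambda^T Y - Y_0) = \lambda^T \mathrm{Var}(Y) \lambda - 2 \lambda^T \mathrm{Cov}(Y, Y_0) + \mathrm{Var}(Y_0) = \lambda^T V \lambda - 2 \lambda^T k + \sigma_0^2\)

Substituting $\lambda = V^{-1}k$: \(\mathrm{MSE}_{\min} = k^T V^{-1} k - 2 k^T V^{-1} k + \sigma_0^2 = \sigma_0^2 - k^T V^{-1} k\)

Because $V$ is positive definite, the Hessian $2V$ is positive definite and the stationary point is a minimum.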


Simple Kriging vs Ordinary Kriging

Simple Kriging

  • Mean $\mu$ is known
  • Predictor: \(\hat{Y}_0 = \mu + k^T V^{-1}(Y - \mu \mathbf{1})\)

Ordinary Kriging

  • Mean $\mu$ unknown, but assumed constant
  • Impose unbiasedness for all $\mu$: \(\sum_i \lambda_i = 1\)

  • Solve constrained minimization problem
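  • Leads to system: \(\lambda = V^{-1}(k + m\mathbf{1})\) with Lagrange multiplier $m$; the constraint \(\mathbf{1}^T \lambda = 1\) then gives \(m = \frac{1 - \mathbf{1}^T V^{-1} k}{\mathbf{1}^T V^{-1} \mathbf{1}}\)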

Final predictor: \(\hat{Y}_0 = \hat{\mu} + \hat{b}^T (Y - \mathbf{1}\hat{\mu})\)

where \(\hat{\mu} = \frac{\mathbf{1}^T V^{-1} Y}{\mathbf{1}^T V^{-1} \mathbf{1}}\) is the generalized least squares estimate of $\mu$, and \(\hat{b} = V^{-1}k\)
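
The two forms of the ordinary kriging predictor (weights from the constrained system, and the form using $\hat{\mu}$) can be compared numerically. This is a sketch under assumptions: the exponential covariance `exp_cov`, the locations, and the data are made up for illustration.

```python
import numpy as np

def exp_cov(a, b, sill=1.0, range_=1.0):
    """Exponential covariance between 1-D location arrays (assumed model)."""
    return sill * np.exp(-np.abs(a[:, None] - b[None, :]) / range_)

# Made-up observations and a single prediction location
s_obs = np.array([0.0, 1.0, 2.5, 4.0])
y_obs = np.array([1.2, 0.7, 1.9, 1.4])
s0 = np.array([1.8])

V = exp_cov(s_obs, s_obs)
k = exp_cov(s_obs, s0)[:, 0]
n = len(s_obs)
ones = np.ones(n)

# Form 1: solve for the weights and the multiplier m together:
#   V lambda - m 1 = k   and   1^T lambda = 1
A = np.zeros((n + 1, n + 1))
A[:n, :n] = V
A[:n, n] = -ones
A[n, :n] = ones
sol = np.linalg.solve(A, np.append(k, 1.0))
lam, m = sol[:n], sol[n]
pred_weights = lam @ y_obs

# Form 2: plug the GLS estimate of the mean into the simple kriging formula
Vinv_ones = np.linalg.solve(V, ones)
mu_hat = (Vinv_ones @ y_obs) / (Vinv_ones @ ones)
b_hat = np.linalg.solve(V, k)
pred_gls = mu_hat + b_hat @ (y_obs - mu_hat * ones)
```

The weights `lam` sum to one, and `pred_weights` and `pred_gls` agree up to rounding.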

Transcript emphasis:
Estimating the mean increases prediction variance, analogous to $n$ vs $n-1$ in variance estimation.
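
The increase can be made explicit by substituting the ordinary kriging weights into the MSE expression. The labels $\sigma^2_{SK}$ and $\sigma^2_{OK}$ (simple and ordinary kriging prediction variance) are notation introduced for this sketch, with $k = \mathrm{Cov}(Y, Y_0)$.

\(\sigma^2_{SK} = \mathrm{Var}(Y_0) - k^T V^{-1} k\)

\(\sigma^2_{OK} = \sigma^2_{SK} + \frac{(1 - \mathbf{1}^T V^{-1} k)^2}{\mathbf{1}^T V^{-1} \mathbf{1}}\)

The extra term is nonnegative, so estimating $\mu$ never reduces the prediction variance.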


Universal Kriging

Assume mean varies spatially: \(\mu(s) = X(s)^T \beta\)

Model: \(Y(s) = \sum_{j=1}^p \beta_j x_j(s) + \varepsilon(s)\)

  • Mean is linear in known covariates (e.g. latitude)
  • Covariates must be observable at prediction location
  • Unbiasedness constraint: \(X^T \lambda = X(s_0)\), where $X$ is the $n \times p$ design matrix whose $i$-th row is $X(s_i)^T$

Transcript:
If $X(s_0)$ is not in the row space of $X$ (equivalently, the column space of $X^T$), no $\lambda$ satisfies the constraint and an unbiased linear predictor does not exist.
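
A small numerical sketch of universal kriging with a linear trend in location follows. The exponential covariance, the data, and the variable names are assumptions for illustration. The constrained minimization is solved as one block linear system, then compared with the equivalent form that plugs in the generalized least squares estimate of $\beta$.

```python
import numpy as np

def exp_cov(a, b):
    """Exponential covariance between 1-D location arrays (assumed model)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]))

# Made-up observations with an upward trend, and one prediction location
s_obs = np.array([0.0, 1.0, 2.5, 4.0, 5.0])
y_obs = np.array([1.0, 1.6, 2.9, 4.3, 4.8])
s0 = 3.0

# Mean beta_1 + beta_2 * s: design matrix X (n x p) and covariates x0 at s0
X = np.column_stack([np.ones_like(s_obs), s_obs])
x0 = np.array([1.0, s0])

V = exp_cov(s_obs, s_obs)
k = exp_cov(s_obs, np.array([s0]))[:, 0]

# Stationarity and constraint as one system:
#   V lambda + X m = k   and   X^T lambda = x0
n, p = X.shape
A = np.block([[V, X], [X.T, np.zeros((p, p))]])
lam = np.linalg.solve(A, np.concatenate([k, x0]))[:n]
pred_weights = lam @ y_obs

# Equivalent form: GLS trend plus kriged residuals
Vinv_X = np.linalg.solve(V, X)
beta_hat = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y_obs)
pred_gls = x0 @ beta_hat + np.linalg.solve(V, k) @ (y_obs - X @ beta_hat)
```

The weights satisfy `X.T @ lam == x0` up to rounding, and the two predictions agree.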


Kriging vs Gaussian Process Prediction

Key insight:

  • Kriging uses only first two moments
  • Does not require Gaussianity
  • Gaussian assumption simplifies interpretation

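Assume $Y$ and $Y_0$ are jointly Gaussian: \((Y, Y_0) \sim \mathcal{N}\)

Then the optimal (minimum MSE) predictor over all measurable functions is: \(g(Y) = \mathbb{E}(Y_0 \mid Y)\)

For Gaussian variables: \(\mathbb{E}(Y_0 \mid Y) = \mathbb{E}(Y_0) + k^T V^{-1}(Y - \mathbb{E}(Y))\), which reduces to \(k^T V^{-1} Y\) when all means are zero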

Thus:

BLUP = conditional expectation when Gaussian


Geometric Interpretation

  • Prediction is an orthogonal projection
  • Space:
    • Hilbert space $L^2$
    • Project $Y_0$ onto span of observed $Y$

Analogy:

  • Linear regression
  • Conditional expectation as projection
  • Error orthogonal to predictor space

Transcript:
“First linear prediction is really a projection.”
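
The orthogonality condition can be written out directly for the zero-mean case, with $k = \mathrm{Cov}(Y, Y_0)$ and $V = \mathrm{Var}(Y)$. Requiring the prediction error to be uncorrelated with every observation gives:

\(\mathrm{Cov}(Y_0 - \lambda^T Y, \; Y) = k^T - \lambda^T V = 0 \quad\Longleftrightarrow\quad V\lambda = k\)

This is the same system obtained by minimizing the MSE, so the projection view and the calculus view give the same weights.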


Conditional Multivariate Normal

Partition: \(\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim \mathcal{N} \left( 0, \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} \right)\)

Then: \(Y_1 \mid Y_2 \sim \mathcal{N} \left( V_{12} V_{22}^{-1} Y_2, \; V_{11} - V_{12}V_{22}^{-1}V_{21} \right)\)

(board note ends here)
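
The conditional formulas translate directly into a few lines of linear algebra. The function name `conditional_normal` and the $2 \times 2$ example are assumptions chosen so the result can be checked by hand; the sketch covers the zero-mean case stated above.

```python
import numpy as np

def conditional_normal(V, idx1, idx2, y2):
    """Mean and covariance of Y1 | Y2 = y2 for a zero-mean joint normal.

    V is the joint covariance; idx1 and idx2 list the positions of
    Y1 and Y2 within the joint vector.
    """
    V11 = V[np.ix_(idx1, idx1)]
    V12 = V[np.ix_(idx1, idx2)]
    V22 = V[np.ix_(idx2, idx2)]
    W = np.linalg.solve(V22, V12.T).T   # V12 V22^{-1}, using symmetry of V22
    cond_mean = W @ y2
    cond_cov = V11 - W @ V12.T          # V11 - V12 V22^{-1} V21
    return cond_mean, cond_cov

# Hand-checkable example: Var(Y1) = 2, Var(Y2) = 4, Cov(Y1, Y2) = 1
V = np.array([[2.0, 1.0],
              [1.0, 4.0]])
mean, cov = conditional_normal(V, [0], [1], np.array([2.0]))
```

With $Y_2 = 2$ observed, the conditional mean is $(1/4) \cdot 2 = 0.5$ and the conditional variance is $2 - 1/4 = 1.75$. Taking $Y_1 = Y_0$ and $Y_2 = Y$ recovers the kriging predictor $k^T V^{-1} Y$ and its prediction variance.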


Big Picture Takeaways

  • Kriging = BLUP
  • Gaussian processes justify kriging via conditional expectation
  • Unknown mean increases prediction uncertainty
  • Universal kriging generalizes regression + spatial dependence
  • Everything reduces to projection in Hilbert space
