STT 997 — Lecture 02
Gaussian Processes, Kriging, and Prediction
Gaussian Process Recap
- Gaussian Process: \(Y(s) \sim \text{GP}(\mu(s), k(s,s'))\)
- Expectation: \(\mathbb{E}(Y_0) = \mathbb{E}(Y(s_0))\)
- Given observations $Y$ with:
  - mean $\mathbb{E}(Y)$
  - covariance matrix $V = \mathrm{Var}(Y)$
- Predict $Y_0$ using the mean squared error (MSE) optimal predictor: \(\hat{Y}_0 = \mathbb{E}(Y_0) + k^T V^{-1}(Y - \mathbb{E}(Y))\), where $k = \mathrm{Cov}(Y, Y_0)$ is the vector of covariances between the observations and $Y_0$
Transcript alignment:
The instructor emphasizes that many Gaussian processes exist that interpolate the data exactly, and that this flexibility is both a strength and a danger (overfitting).
Zero-Mean Special Case
If all means are zero: \(\hat{Y}_0 = k^T V^{-1} Y\) (explicitly noted as useful for homework)
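The zero-mean formula can be tried numerically. The sketch below is only an illustration: the squared-exponential kernel, the length scale, the data values, and the function names `sq_exp_kernel` and `predict_zero_mean` are assumptions, not part of the lecture.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D location arrays a and b (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def predict_zero_mean(s_obs, y_obs, s_new, length_scale=1.0):
    """Zero-mean predictor k^T V^{-1} Y, evaluated at each location in s_new."""
    V = sq_exp_kernel(s_obs, s_obs, length_scale)   # Var(Y), n x n
    K = sq_exp_kernel(s_obs, s_new, length_scale)   # column j is the vector k for s_new[j]
    # Solve V w = Y instead of forming V^{-1} explicitly.
    return K.T @ np.linalg.solve(V, y_obs)

# Made-up observations at four locations.
s_obs = np.array([0.0, 1.0, 2.5, 4.0])
y_obs = np.array([0.3, -0.5, 1.2, 0.1])

y_at_obs = predict_zero_mean(s_obs, y_obs, s_obs)                  # predictions at the data sites
y_between = predict_zero_mean(s_obs, y_obs, np.array([0.5, 3.0]))  # predictions elsewhere
```

With no noise term in $V$, the predictor reproduces the observed values at the observed locations, which is the exact-interpolation property mentioned above.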
Motivation: Why Gaussian Processes?
- GP is a very rich class:
  - Can interpolate any finite set of points
  - Nonparametric (not a fixed-degree polynomial)
  - Many functions pass through the same observed points
- Overfitting concern:
  - Zero training error ≠ good prediction
  - Analogy to polynomial interpolation and double descent
Transcript:
“Any surface, any curve… there are numerous processes that go through the same points.”
Kriging: Best Linear Unbiased Prediction (BLUP)
- Kriging = BLUP
- Named after D. G. Krige (mining engineer)
- Historically developed outside statistics
- Uses covariance structure, but historically framed via:
  - Covariogram
  - Analogous to periodogram in time series
Transcript:
Terminology differs because the field developed in geostatistics, not statistics.
Linear Predictor Form
We restrict attention to linear predictors: \(\hat{Y}(s_0) = a + \sum_{i=1}^n \lambda_i Y(s_i)\)
Unbiasedness Condition
\(\mathbb{E}[\hat{Y}(s_0)] = \mathbb{E}[Y(s_0)]\)
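If the mean $\mu$ is constant, unbiasedness requires \(a + \mu \sum_i \lambda_i = \mu\):
- Known $\mu$: \(a = \mu\,(1 - \sum_i \lambda_i)\)
- Unknown $\mu$ (the condition must hold for every $\mu$): forces $a = 0$ and \(\sum_i \lambda_i = 1\)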
Mean Squared Error Minimization
Define (taking zero means, so that $a = 0$ and the prediction error has mean zero): \(\mathrm{MSE} = \mathbb{E}\left(\sum_i \lambda_i Y(s_i) - Y(s_0)\right)^2 = \mathrm{Var}(\lambda^T Y - Y_0)\)
Expands to: \(\lambda^T V \lambda - 2 \lambda^T k + \mathrm{Var}(Y_0)\)
Taking derivatives: \(2V\lambda - 2k = 0 \quad\Rightarrow\quad V\lambda = k\)
Thus: \(\lambda = V^{-1}k\)
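For reference, the expansion step and the resulting minimum can be written out in full. This is a worked version of the algebra above, using only the bilinearity of covariance and the symmetry of $V$:

```latex
\begin{align*}
\mathrm{Var}(\lambda^T Y - Y_0)
  &= \mathrm{Var}(\lambda^T Y) - 2\,\mathrm{Cov}(\lambda^T Y, Y_0) + \mathrm{Var}(Y_0) \\
  &= \lambda^T V \lambda - 2\,\lambda^T \mathrm{Cov}(Y, Y_0) + \mathrm{Var}(Y_0) \\
  &= \lambda^T V \lambda - 2\,\lambda^T k + \mathrm{Var}(Y_0). \\[4pt]
\text{At } \lambda = V^{-1}k: \quad
\mathrm{MSE}_{\min}
  &= k^T V^{-1} k - 2\,k^T V^{-1} k + \mathrm{Var}(Y_0)
   = \mathrm{Var}(Y_0) - k^T V^{-1} k.
\end{align*}
```

Since $V$ is positive definite, the quadratic is convex in $\lambda$, so the stationary point is the minimizer, and the prediction variance never exceeds $\mathrm{Var}(Y_0)$.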
Simple Kriging vs Ordinary Kriging
Simple Kriging
- Mean $\mu$ is known
- Predictor: \(\hat{Y}_0 = \mu + k^T V^{-1}(Y - \mu \mathbf{1})\)
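A minimal numerical sketch of the simple kriging predictor follows. The exponential covariance, the site locations, the data, and the function name `simple_kriging` are assumptions chosen only for illustration.

```python
import numpy as np

def simple_kriging(V, k, y, mu):
    """Simple kriging with known mean mu: mu + k^T V^{-1} (Y - mu 1)."""
    lam = np.linalg.solve(V, k)       # weights lambda = V^{-1} k
    return mu + lam @ (y - mu), lam

# Hypothetical setup: covariance exp(-|s - s'|) on three sites, predicting at s0.
s = np.array([0.0, 1.0, 3.0])
s0 = 2.0
V = np.exp(-np.abs(s[:, None] - s[None, :]))   # Var(Y)
k = np.exp(-np.abs(s - s0))                    # Cov(Y, Y_0)
y = np.array([4.1, 5.3, 4.7])

pred, lam = simple_kriging(V, k, y, mu=5.0)
```

The weights are not constrained to sum to one here; the known mean absorbs the difference through the term $\mu(1 - \sum_i \lambda_i)$.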
Ordinary Kriging
- Mean $\mu$ unknown, but assumed constant
- Impose unbiasedness for all $\mu$: \(\sum_i \lambda_i = 1\)
- Solve constrained minimization problem
- Leads to system: \(\lambda = V^{-1}(k + m\mathbf{1})\) with Lagrange multiplier $m$
Final predictor: \(\hat{Y}_0 = \hat{\mu} + \hat{b}^T (Y - \mathbf{1}\hat{\mu})\)
where: \(\hat{\mu} = \frac{\mathbf{1}^T V^{-1} Y}{\mathbf{1}^T V^{-1} \mathbf{1}} \quad \hat{b} = V^{-1}k\)
Transcript emphasis:
Estimating the mean increases prediction variance, analogous to $n$ vs $n-1$ in variance estimation.
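The ordinary kriging formulas, and the variance penalty for estimating the mean, can be checked with a short sketch. The covariance model, data, and the name `ordinary_kriging` are illustrative assumptions; `var0` stands for $\mathrm{Var}(Y_0)$. The multiplier is obtained by substituting \(\lambda = V^{-1}(k + m\mathbf{1})\) into the constraint, giving \(m = (1 - \mathbf{1}^T V^{-1} k) / (\mathbf{1}^T V^{-1} \mathbf{1})\).

```python
import numpy as np

def ordinary_kriging(V, k, y, var0):
    """Ordinary kriging prediction, weights, and MSEs for an unknown constant mean."""
    ones = np.ones(len(y))
    Vinv_k = np.linalg.solve(V, k)
    Vinv_1 = np.linalg.solve(V, ones)
    denom = ones @ Vinv_1                     # 1^T V^{-1} 1
    mu_hat = (Vinv_1 @ y) / denom             # generalized least squares mean estimate
    m = (1.0 - ones @ Vinv_k) / denom         # Lagrange multiplier from sum(lambda) = 1
    lam = Vinv_k + m * Vinv_1                 # lambda = V^{-1}(k + m 1)
    pred = mu_hat + Vinv_k @ (y - mu_hat)     # mu_hat + b_hat^T (Y - 1 mu_hat)
    sk_mse = var0 - k @ Vinv_k                # MSE if the mean were known
    ok_mse = sk_mse + (1.0 - ones @ Vinv_k) ** 2 / denom   # extra term from estimating mu
    return pred, lam, sk_mse, ok_mse

# Hypothetical setup: covariance exp(-|s - s'|), predicting at location 2.0.
s = np.array([0.0, 1.0, 3.0])
V = np.exp(-np.abs(s[:, None] - s[None, :]))
k = np.exp(-np.abs(s - 2.0))
y = np.array([4.1, 5.3, 4.7])

pred, lam, sk_mse, ok_mse = ordinary_kriging(V, k, y, var0=1.0)
```

The extra term in `ok_mse` is nonnegative, so the ordinary kriging MSE is never smaller than the simple kriging MSE under the same covariance model.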
Universal Kriging
Assume mean varies spatially: \(\mu(s) = X(s)^T \beta\)
Model: \(Y(s) = \sum_{j=1}^p \beta_j x_j(s) + \varepsilon(s)\)
- Mean is linear in known covariates (e.g. latitude)
- Covariates must be observable at prediction location
- Unbiasedness constraint: \(X^T \lambda = X(s_0)\)
Transcript:
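If $X(s_0)$ is not in the row space of the $n \times p$ design matrix $X$ (equivalently, the column space of $X^T$), the constraint \(X^T \lambda = X(s_0)\) has no solution, so no unbiased linear predictor exists.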
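One way to compute universal kriging weights is to solve the constrained minimization as a single block linear system. The sketch below assumes $V$ is positive definite and $X$ has full column rank; the covariates (intercept and location), the covariance model, and the name `universal_kriging_weights` are illustrative choices, and the sign of the multiplier vector is a convention.

```python
import numpy as np

def universal_kriging_weights(V, X, k, x0):
    """Minimize lambda^T V lambda - 2 lambda^T k subject to X^T lambda = x0.

    Stationarity gives V lambda + X nu = k, which together with the
    constraint forms the block system solved below.
    """
    n, p = X.shape
    A = np.block([[V, X], [X.T, np.zeros((p, p))]])
    rhs = np.concatenate([k, x0])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n:]        # weights lambda, multipliers nu

# Hypothetical setup: mean linear in location, covariance exp(-|s - s'|).
s = np.array([0.0, 1.0, 3.0, 4.0])
s0 = 2.0
V = np.exp(-np.abs(s[:, None] - s[None, :]))
k = np.exp(-np.abs(s - s0))
X = np.column_stack([np.ones_like(s), s])   # rows are X(s_i)^T
x0 = np.array([1.0, s0])                    # X(s_0)

lam, nu = universal_kriging_weights(V, X, k, x0)
```

Because \(X^T \lambda = X(s_0)\), any data that are exactly linear in the covariates, $Y = X\beta$, are predicted as \(X(s_0)^T \beta\) regardless of $\beta$.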
Kriging vs Gaussian Process Prediction
Key insight:
- Kriging uses only first two moments
- Does not require Gaussianity
- Gaussian assumption simplifies interpretation
Assume \((Y, Y_0)\) is jointly Gaussian.
Then the MSE-optimal predictor over all measurable functions is: \(g(Y) = \mathbb{E}(Y_0 \mid Y)\)
For zero-mean jointly Gaussian variables: \(\mathbb{E}(Y_0 \mid Y) = k^T V^{-1} Y\)
Thus:
BLUP = conditional expectation when Gaussian
Geometric Interpretation
- Prediction is an orthogonal projection
- Space:
- Hilbert space $L^2$
- Project $Y_0$ onto span of observed $Y$
Analogy:
- Linear regression
- Conditional expectation as projection
- Error orthogonal to predictor space
Transcript:
“First linear prediction is really a projection.”
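The projection view can be made explicit for the zero-mean case. Taking \(\langle U, W \rangle = \mathbb{E}(UW)\) as the inner product on $L^2$, requiring the prediction error to be orthogonal to every observation recovers the same equations as the derivative argument:

```latex
\begin{align*}
\langle Y_0 - \lambda^T Y,\; Y(s_i) \rangle &= 0 \quad \text{for } i = 1, \dots, n \\
\iff\quad \mathbb{E}(Y\,Y_0) - \mathbb{E}(Y Y^T)\,\lambda &= 0 \\
\iff\quad k - V\lambda &= 0
\quad\iff\quad V\lambda = k .
\end{align*}
```

So \(\lambda = V^{-1}k\) is exactly the coefficient vector of the orthogonal projection of $Y_0$ onto the span of the observations.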
Conditional Multivariate Normal
Partition: \(\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim \mathcal{N} \left( 0, \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} \right)\)
Then: \(Y_1 \mid Y_2 \sim \mathcal{N} \left( V_{12} V_{22}^{-1} Y_2, \; V_{11} - V_{12}V_{22}^{-1}V_{21} \right)\)
(board note ends here)
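The conditional mean and covariance can be computed directly from the partitioned covariance matrix. The sketch below assumes a zero-mean Gaussian vector, as in the partition above; the matrix entries, the conditioning values, and the name `conditional_gaussian` are made up for illustration.

```python
import numpy as np

def conditional_gaussian(V, idx1, idx2, y2):
    """Mean and covariance of Y1 | Y2 = y2 for a zero-mean Gaussian vector."""
    V11 = V[np.ix_(idx1, idx1)]
    V12 = V[np.ix_(idx1, idx2)]
    V22 = V[np.ix_(idx2, idx2)]
    W = np.linalg.solve(V22, V12.T)    # V22^{-1} V21
    mean = W.T @ y2                    # V12 V22^{-1} y2
    cov = V11 - V12 @ W                # Schur complement V11 - V12 V22^{-1} V21
    return mean, cov

# Hypothetical 3 x 3 covariance; condition the first coordinate on the other two.
V = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.5]])
mean, cov = conditional_gaussian(V, [0], [1, 2], np.array([0.4, -1.0]))
```

With $Y_1 = Y_0$ and $Y_2 = Y$, the conditional mean is the zero-mean kriging predictor \(k^T V^{-1} Y\), and the conditional covariance is the prediction variance \(\mathrm{Var}(Y_0) - k^T V^{-1} k\).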
Big Picture Takeaways
- Kriging = BLUP
- Gaussian processes justify kriging via conditional expectation
- Unknown mean increases prediction uncertainty
- Universal kriging generalizes regression + spatial dependence
- Everything reduces to projection in Hilbert space