Lecture 06 — Efficient Leave-One-Out Prediction, Precision Matrices, and Simulation

Date: Wednesday, January 28, 2026
Course: STT997GP

Predictive Scores: CDF vs Indicator

Handwritten note

Score compares the predictive CDF $F$ vs the indicator $\mathbf{1}\{Y_i \le \hat{y}\}$

Lecture annotation Joe is contrasting proper scoring rules that use the entire predictive distribution (e.g. CDF-based scores, log predictive density) with crude indicators or pointwise losses. The distinction matters because RMSE scores only the point prediction and ignores the predictive uncertainty, which is unacceptable from a statistical perspective.
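To make the contrast concrete, here is a minimal sketch that scores three Gaussian predictive distributions sharing the same mean but with different spreads. The notes do not say which CDF-based score was meant; this sketch assumes the continuous ranked probability score (CRPS), $\int (F(t) - \mathbf{1}\{y \le t\})^2 \, dt$, and uses its standard closed form for a normal predictive distribution. The function names and the numbers are illustrative only.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def neg_log_score(y, mu, sigma):
    """Negative log predictive density of y under N(mu, sigma^2); lower is better."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * z * z

def crps_normal(y, mu, sigma):
    """Closed-form CRPS for a N(mu, sigma^2) forecast; lower is better."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0)
                    + 2.0 * normal_pdf(z)
                    - 1.0 / math.sqrt(math.pi))

y_obs, mu = 1.0, 0.0  # illustrative observation and predictive mean
for sigma in (0.1, 1.0, 10.0):
    sq_err = (y_obs - mu) ** 2  # identical for every sigma
    print(sigma, sq_err, neg_log_score(y_obs, mu, sigma), crps_normal(y_obs, mu, sigma))
```

The squared error is the same in all three cases, while both distributional scores penalize the overconfident forecast ($\sigma = 0.1$) and the overly vague one ($\sigma = 10$).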

Setup: Drop-one Prediction

We can integrate over uncertainty.

Given observations
$Y = (Y_1, \dots, Y_n)$

Define the leave-one-out predictor: $\hat{Y}_i = \mathbb{E}(Y_i \mid Y_{-i})$

“Drop the $i$th value”: $Y_{-i}$ denotes $Y$ with the $i$th entry removed.

Precision Matrix Formulation

Let
$Y \sim \mathcal{N}(0, V)$

Define the precision matrix $Q = V^{-1} = (Q_{ij})$

Lecture annotation Joe emphasizes that conditioning in multivariate normals is easiest in the precision matrix, not the covariance matrix. This is the key computational trick that avoids refitting $n$ times for LOOCV.

Conditional Mean via Precision Matrix

\[\hat{Y}_i = \mathbb{E}(Y_i \mid Y_j, j \neq i) = -\sum_{j \neq i} \frac{Q_{ij}}{Q_{ii}} Y_j\]

This gives the projection of $Y_i$ onto the span of the remaining variables.

Conditional Variance

\[\operatorname{Var}(Y_i \mid Y_{-i}) = \frac{1}{Q_{ii}}\]

Lecture annotation Joe notes this is a special and very useful property of the multivariate normal. You get the conditional variance for free once you know the precision matrix.
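The two formulas can be checked numerically. The sketch below assumes an exponential covariance on random 1-D locations (a hypothetical choice, not from the lecture), computes every leave-one-out mean and variance from a single inverse, and compares one of them with the usual covariance-based conditioning formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Hypothetical covariance: exponential kernel on random 1-D locations
x = rng.uniform(0.0, 1.0, size=n)
V = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)
y = np.linalg.cholesky(V) @ rng.standard_normal(n)  # one draw from N(0, V)

Q = np.linalg.inv(V)          # the only inverse we need
d = np.diag(Q)
loo_mean = y - (Q @ y) / d    # equals -sum_{j != i} Q_ij y_j / Q_ii for each i
loo_var = 1.0 / d             # Var(Y_i | Y_{-i})

# Check index i against the covariance-based conditioning formula
i = 2
rest = np.delete(np.arange(n), i)
w = np.linalg.solve(V[np.ix_(rest, rest)], V[rest, i])
direct_mean = w @ y[rest]
direct_var = V[i, i] - V[i, rest] @ w

print(loo_mean[i], direct_mean)
print(loo_var[i], direct_var)
```

The identity `y - (Q @ y) / d` works because $(QY)_i = Q_{ii} Y_i + \sum_{j \neq i} Q_{ij} Y_j$, so dividing by $Q_{ii}$ and subtracting from $Y_i$ leaves exactly the conditional mean.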

Orthogonality Argument (Why the Formula Works)

We want: $\langle Y_i - \hat{Y}_i, Y_k \rangle = 0 \quad (k \neq i)$

Plug in: $\left\langle Y_i + \sum_{j \neq i} \frac{Q_{ij}}{Q_{ii}} Y_j,\; Y_k \right\rangle$

Rewrite: $= \frac{1}{Q_{ii}} \left\langle \sum_{j=1}^n Q_{ij} Y_j,\; Y_k \right\rangle$

Inner Product Interpretation

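\[\langle \xi, \eta \rangle = \operatorname{Cov}(\xi, \eta) = \mathbb{E}[\xi \eta]\]

The second equality uses the zero-mean assumption $Y \sim \mathcal{N}(0, V)$, so in particular $\langle Y_j, Y_k \rangle = V_{jk}$.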

So: $\frac{1}{Q_{ii}} \sum_{j=1}^n Q_{ij} \langle Y_j, Y_k \rangle = \frac{1}{Q_{ii}} \sum_{j=1}^n Q_{ij} V_{jk}$

Matrix form: $= \frac{1}{Q_{ii}} (QV)_{ik}$

Since $QV = I$: $(QV)_{ik} = \delta_{ik}$

Thus for $k \neq i$: $\langle Y_i - \hat{Y}_i, Y_k \rangle = 0$

Lecture annotation Joe explicitly connects this to linear projection in Hilbert space. The residual is orthogonal to the span of the conditioning variables.
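The orthogonality argument can also be verified in matrix form. Stacking the residuals gives $Y - \hat{Y} = D^{-1} Q Y$ with $D = \operatorname{diag}(Q)$, so the matrix of inner products $\langle Y_i - \hat{Y}_i, Y_k \rangle$ is $D^{-1} Q V = D^{-1}$, which is diagonal. A small sketch, assuming an arbitrary positive definite $V$ built for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)     # arbitrary SPD covariance (illustrative assumption)
Q = np.linalg.inv(V)

R = Q / np.diag(Q)[:, None]     # row i of R maps Y to the residual Y_i - Yhat_i
C = R @ V                       # C[i, k] = <Y_i - Yhat_i, Y_k>
off_diag = C - np.diag(np.diag(C))

print(np.max(np.abs(off_diag)))         # numerically zero
print(np.diag(C) * np.diag(Q))          # each entry close to 1
```

The diagonal entries equal $1/Q_{ii}$, which is the conditional variance.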

Conditional Variance (Geometric View)

\[\operatorname{Var}(Y_i \mid Y_{-i}) = \|Y_i - \hat{Y}_i\|^2 = \|Y_i\|^2 - \|\hat{Y}_i\|^2 = \frac{1}{Q_{ii}}\]
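One way to fill in the final equality, as a sketch that reuses the computation from the orthogonality argument with $k = i$. The residual is orthogonal to $\hat{Y}_i$ because $\hat{Y}_i$ lies in the span of the $Y_k$, $k \neq i$:

\[\|Y_i - \hat{Y}_i\|^2 = \langle Y_i - \hat{Y}_i,\; Y_i - \hat{Y}_i \rangle = \langle Y_i - \hat{Y}_i,\; Y_i \rangle = \frac{1}{Q_{ii}} (QV)_{ii} = \frac{1}{Q_{ii}}\]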

Key Conclusion

K-fold cross-validation (including LOOCV) can be computed with ONE matrix inverse.

Lecture annotation This is the computational breakthrough:

Naive LOOCV requires $n$ inversions of $(n-1) \times (n-1)$ matrices, roughly $O(n^4)$ work
Using the precision matrix, you invert once, at $O(n^3)$ cost
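The K-fold claim follows from the block version of the same identity: for a held-out index set $A$, $\mathbb{E}(Y_A \mid Y_{-A}) = Y_A - Q_{AA}^{-1} (QY)_A$ and $\operatorname{Cov}(Y_A \mid Y_{-A}) = Q_{AA}^{-1}$. The sketch below assumes the covariance parameters are held fixed across folds (no re-estimation), and the helper name `kfold_predict` is hypothetical. Only small fold-sized systems are solved after the one full inverse.

```python
import numpy as np

def kfold_predict(Q, y, folds):
    """Held-out conditional means and covariances per fold, using only Q = V^{-1}."""
    Qy = Q @ y
    out = []
    for idx in folds:
        Q_AA = Q[np.ix_(idx, idx)]                     # small |A| x |A| block
        mean = y[idx] - np.linalg.solve(Q_AA, Qy[idx])
        cov = np.linalg.inv(Q_AA)
        out.append((mean, cov))
    return out

rng = np.random.default_rng(2)
n = 9
A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)                            # illustrative SPD covariance
y = np.linalg.cholesky(V) @ rng.standard_normal(n)
Q = np.linalg.inv(V)                                   # one full inverse
folds = np.array_split(rng.permutation(n), 3)
preds = kfold_predict(Q, y, folds)

# Naive check for the first fold: condition using the covariance of the rest
idx = folds[0]
rest = np.setdiff1d(np.arange(n), idx)
V_rr = V[np.ix_(rest, rest)]
V_ar = V[np.ix_(idx, rest)]
naive_mean = V_ar @ np.linalg.solve(V_rr, y[rest])
naive_cov = V[np.ix_(idx, idx)] - V_ar @ np.linalg.solve(V_rr, V_ar.T)
print(np.allclose(preds[0][0], naive_mean), np.allclose(preds[0][1], naive_cov))
```

With folds of size one this reduces to the LOOCV formulas, since $Q_{AA}$ is then the scalar $Q_{ii}$.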

Joe explicitly notes the analogy with:

Linear models
Hat matrices
Influence functions

Model Validation Philosophy

Handwritten

Validation: least as for any algorithm

Lecture annotation

AIC/BIC: estimation-focused, penalized likelihood
CV / predictive scores: prediction-focused
RMSE is popular in ML but ignores uncertainty
Log predictive density and CDF-based scores are preferred
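A small sketch of why RMSE alone can mislead: multiplying the covariance by a constant leaves every leave-one-out mean unchanged, so the LOO RMSE cannot tell the two models apart, while the LOO log predictive density can. The exponential covariance, the inflation factor of 100, and the helper name `loo_scores` are illustrative assumptions.

```python
import numpy as np

def loo_scores(V, y):
    """LOO RMSE and summed LOO log predictive density from one inverse of V."""
    Q = np.linalg.inv(V)
    d = np.diag(Q)
    resid = (Q @ y) / d                  # y_i - E(y_i | y_{-i})
    var = 1.0 / d                        # Var(y_i | y_{-i})
    rmse = np.sqrt(np.mean(resid ** 2))
    lpd = np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * resid ** 2 / var)
    return rmse, lpd

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0.0, 1.0, size=n))
V_true = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)   # hypothetical covariance
y = np.linalg.cholesky(V_true) @ rng.standard_normal(n)

rmse_a, lpd_a = loo_scores(V_true, y)
rmse_b, lpd_b = loo_scores(100.0 * V_true, y)   # same correlations, variance inflated
print(rmse_a, rmse_b)   # equal up to rounding
print(lpd_a, lpd_b)     # the inflated model has the lower log predictive density
```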

Simulation vs Emulation

Simulation

You know the model
Generate directly from it

Emulation

Model unknown or too expensive
Build a surrogate (often GP-based)

Joe emphasizes emulation is central in:

Computer experiments
Climate and weather modeling
Engineering design spaces
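As a rough illustration of emulation, the sketch below fits a zero-mean GP surrogate to a handful of runs of a stand-in "expensive" function and predicts at new inputs. Everything here is an assumption made for the example: the test function, the squared-exponential kernel, its fixed length scale, and the small nugget added for numerical stability. A real emulator would estimate these.

```python
import numpy as np

def expensive_simulator(x):
    # Stand-in for a costly computer model (hypothetical)
    return np.sin(3.0 * x) + 0.5 * x

def sq_exp_kernel(a, b, length=0.3, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

x_train = np.linspace(0.0, 2.0, 8)         # a few design points
y_train = expensive_simulator(x_train)     # the only "expensive" evaluations
x_new = np.linspace(0.0, 2.0, 50)

K = sq_exp_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))  # nugget
K_star = sq_exp_kernel(x_new, x_train)

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y via Cholesky
emu_mean = K_star @ alpha                                   # surrogate prediction
v = np.linalg.solve(L, K_star.T)
emu_var = sq_exp_kernel(x_new, x_new).diagonal() - np.sum(v ** 2, axis=0)

print(np.max(np.abs(emu_mean - expensive_simulator(x_new))))
```

The emulator reproduces the training runs almost exactly and reports larger predictive variance between design points than at them.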

Multivariate Normal Simulation

Assume: $Y \sim \mathcal{N}(0, V)$

You can always center the data so that the mean is zero.

Cholesky Decomposition

\[V = LL^\top\]

Where $L$ is lower triangular.

Generate: $Z \sim \mathcal{N}(0, I) \quad \Rightarrow \quad LZ \sim \mathcal{N}(0, V)$
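A minimal sketch of this recipe with NumPy, using a small covariance matrix chosen for illustration. The sample covariance of many draws should be close to $V$, since $\operatorname{Cov}(LZ) = L I L^\top = V$.

```python
import numpy as np

rng = np.random.default_rng(3)
V = np.array([[2.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 0.5]])     # illustrative positive definite covariance

L = np.linalg.cholesky(V)           # lower triangular, V = L @ L.T
Z = rng.standard_normal((3, 200_000))
Y = L @ Z                           # each column is one draw from N(0, V)

sample_cov = np.cov(Y)              # rows are variables, columns are draws
print(np.round(sample_cov, 2))
```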

Lecture annotation Joe stresses:

Cholesky is numerically stable
Much faster than general matrix inversion
Essential for large covariance matrices

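Packaged samplers such as mvrnorm() in MASS do this kind of factorization internally (mvrnorm() itself uses an eigendecomposition rather than Cholesky), but you need to know what is happening underneath when it fails.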

Practical Notes

Always set random seeds for reproducibility
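Random number streams can differ across software versions, operating systems, and compilers, so a seed guarantees reproducibility only within a fixed environment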
Large covariance matrices can be ill-conditioned
Precision matrices often behave better numerically
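Two of these notes can be sketched in code: seeding a generator, and a common workaround when Cholesky fails on a numerically singular covariance, namely adding a small "jitter" to the diagonal. The helper `chol_with_jitter` and its default values are hypothetical choices for this example, not something prescribed in the lecture.

```python
import numpy as np

def chol_with_jitter(V, start=1e-10, max_tries=8):
    """Try Cholesky; on failure, retry with increasing multiples of I added to V."""
    jitter = 0.0
    for k in range(max_tries):
        try:
            L = np.linalg.cholesky(V + jitter * np.eye(V.shape[0]))
            return L, jitter
        except np.linalg.LinAlgError:
            # Scale the jitter by the average variance and grow it tenfold each time
            jitter = start * 10.0 ** k * np.mean(np.diag(V))
    raise np.linalg.LinAlgError("not positive definite even with jitter")

# Same seed, same stream (within one environment)
a = np.random.default_rng(2026).standard_normal(5)
b = np.random.default_rng(2026).standard_normal(5)
print(np.array_equal(a, b))

# A rank-one covariance: plain Cholesky fails, the jittered version succeeds
V_singular = np.ones((3, 3))
L, used = chol_with_jitter(V_singular)
print(used)
```

Jitter changes the model slightly (it acts like a small nugget variance), so it is worth recording how much was needed.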

Big Picture Takeaway

Elegant theory (Hilbert space projections, precision matrices)
→ Computational efficiency
→ Practical scalable validation for spatial models

Joe explicitly frames this as an example of why theory matters for real applied work.
