Lecture 06 — Efficient Leave-One-Out Prediction, Precision Matrices, and Simulation

Date: Wednesday, January 28, 2026
Course: STT997GP


Predictive Scores: CDF vs Indicator

Handwritten note

Score compares CDF F vs indicator 1{Y_i ≤ ŷ}

Lecture annotation: Joe contrasts proper scoring rules that use the entire predictive distribution (e.g. CDF-based scores, log predictive density) with crude indicators or pointwise losses such as RMSE. The distinction matters because RMSE ignores predictive uncertainty, which is unacceptable from a statistical perspective.
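A minimal sketch of the contrast, assuming Gaussian predictive distributions. The continuous ranked probability score (CRPS) is used here as one example of a CDF-based score; the notes do not name a specific score, so that choice is an assumption. CRPS integrates the squared difference between the predictive CDF and the indicator step function of the observation, and it has a closed form for a normal forecast. Function names are illustrative.

```python
import math

def gaussian_log_score(y, mu, sigma):
    """Log predictive density of y under N(mu, sigma^2); larger is better."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of the forecast N(mu, sigma^2) at y; smaller is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# Two forecasts with the same mean but different stated uncertainty.
y_obs, mu = 1.0, 0.0
sq_err = (y_obs - mu) ** 2   # identical for both forecasts
log_overconfident = gaussian_log_score(y_obs, mu, 0.1)
log_calibrated = gaussian_log_score(y_obs, mu, 1.0)
crps_overconfident = gaussian_crps(y_obs, mu, 0.1)
crps_calibrated = gaussian_crps(y_obs, mu, 1.0)
```

Both forecasts have the same squared error, so RMSE cannot separate them, while the log score and CRPS both penalize the forecast that claims a standard deviation of 0.1.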


Setup: Drop-one Prediction

We can integrate over uncertainty.

Given observations
\(Y = (Y_1, \dots, Y_n)\)

Define the leave-one-out predictor: \(\hat{Y}_i = \mathbb{E}(Y_i \mid Y_{-i})\)

“Drop ith value”: \(Y_{-i}\) denotes \(Y\) with the \(i\)th entry removed.


Precision Matrix Formulation

Let
\(Y \sim \mathcal{N}(0, V)\)

Define the precision matrix \(Q = V^{-1} = (Q_{ij})\)

Lecture annotation: Joe emphasizes that conditioning in multivariate normals is easiest to express through the precision matrix, not the covariance matrix. This is the key computational trick that avoids refitting $n$ times for LOOCV.


Conditional Mean via Precision Matrix

\[\hat{Y}_i = \mathbb{E}(Y_i \mid Y_j, j \neq i) = -\sum_{j \neq i} \frac{Q_{ij}}{Q_{ii}} Y_j\]

This gives the projection of $Y_i$ onto the span of the remaining variables.


Conditional Variance

\[\operatorname{Var}(Y_i \mid Y_{-i}) = \frac{1}{Q_{ii}}\]

Lecture annotation: Joe notes this is a special and very useful property of the multivariate normal. You get the conditional variance for free once you know the precision matrix.
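The two formulas above can be checked numerically. The sketch below, assuming NumPy and an arbitrary positive definite test matrix (not from the lecture), computes all leave-one-out means and variances from a single inverse and compares them with conditioning on \(Y_{-i}\) separately for each \(i\). Function names are illustrative.

```python
import numpy as np

def loo_via_precision(V, y):
    """Leave-one-out conditional means and variances for y ~ N(0, V)."""
    Q = np.linalg.inv(V)               # the single inverse
    q_diag = np.diag(Q)
    loo_var = 1.0 / q_diag             # Var(Y_i | Y_{-i}) = 1 / Q_ii
    loo_mean = y - (Q @ y) / q_diag    # equals -sum_{j != i} (Q_ij / Q_ii) y_j
    return loo_mean, loo_var

def loo_naive(V, y):
    """Same quantities via a separate (n-1) x (n-1) solve for each i."""
    n = len(y)
    means, variances = np.empty(n), np.empty(n)
    for i in range(n):
        rest = np.delete(np.arange(n), i)
        w = np.linalg.solve(V[np.ix_(rest, rest)], V[rest, i])
        means[i] = w @ y[rest]
        variances[i] = V[i, i] - w @ V[rest, i]
    return means, variances

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
V = A @ A.T + 6.0 * np.eye(6)   # arbitrary symmetric positive definite matrix
y = rng.standard_normal(6)

fast_mean, fast_var = loo_via_precision(V, y)
slow_mean, slow_var = loo_naive(V, y)
```

The vectorized form uses the identity \(-\sum_{j \neq i} Q_{ij} y_j / Q_{ii} = y_i - (Qy)_i / Q_{ii}\), so no explicit loop over \(i\) is needed.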


Orthogonality Argument (Why the Formula Works)

We want: \(\langle Y_i - \hat{Y}_i, Y_k \rangle = 0 \quad (k \neq i)\)

Plug in: \(\left\langle Y_i + \sum_{j \neq i} \frac{Q_{ij}}{Q_{ii}} Y_j,\; Y_k \right\rangle\)

Rewrite: \(= \frac{1}{Q_{ii}} \left\langle \sum_{j=1}^n Q_{ij} Y_j,\; Y_k \right\rangle\)


Inner Product Interpretation

\[\langle \xi, \eta \rangle = \operatorname{Cov}(\xi, \eta) = \mathbb{E}[\xi \eta]\]

The last equality holds because all variables here have mean zero.

So: \(\frac{1}{Q_{ii}} \sum_{j=1}^n Q_{ij} \langle Y_j, Y_k \rangle = \frac{1}{Q_{ii}} \sum_{j=1}^n Q_{ij} V_{jk}\)

Matrix form: \(= \frac{1}{Q_{ii}} (QV)_{ik}\)

Since $QV = I$: \((QV)_{ik} = \delta_{ik}\)

Thus for $k \neq i$: \(\langle Y_i - \hat{Y}_i, Y_k \rangle = 0\)

Lecture annotation: Joe explicitly connects this to linear projection in Hilbert space. The residual is orthogonal to the span of the conditioning variables.
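A small numerical check of the orthogonality argument, assuming NumPy and an arbitrary positive definite \(V\) chosen only for illustration. The residual \(Y_i - \hat{Y}_i\) is the linear combination \(\sum_j (Q_{ij}/Q_{ii}) Y_j\), so its covariance with every \(Y_k\) is row \(i\) of \(QV\) divided by \(Q_{ii}\).

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
V = B @ B.T + 5.0 * np.eye(5)   # arbitrary symmetric positive definite matrix
Q = np.linalg.inv(V)

i = 2
coef = Q[i, :] / Q[i, i]   # residual Y_i - Yhat_i = sum_j coef[j] * Y_j
cov_resid = coef @ V       # entry k is Cov(Y_i - Yhat_i, Y_k)
```

Entries with \(k \neq i\) vanish up to rounding, and the entry at \(k = i\) equals \(1/Q_{ii}\).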


Conditional Variance (Geometric View)

\[\operatorname{Var}(Y_i \mid Y_{-i}) = \|Y_i - \hat{Y}_i\|^2 = \|Y_i\|^2 - \|\hat{Y}_i\|^2 = \frac{1}{Q_{ii}}\]
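One way to fill in the step to \(1/Q_{ii}\), using only the orthogonality argument from the previous sections: since \(\hat{Y}_i\) is a linear combination of the \(Y_k\) with \(k \neq i\), the residual is orthogonal to \(\hat{Y}_i\), and the \(k = i\) case of the same computation gives the remaining term.

\[\|Y_i - \hat{Y}_i\|^2 = \langle Y_i - \hat{Y}_i, Y_i \rangle - \langle Y_i - \hat{Y}_i, \hat{Y}_i \rangle = \langle Y_i - \hat{Y}_i, Y_i \rangle = \frac{1}{Q_{ii}} (QV)_{ii} = \frac{1}{Q_{ii}}\]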

Key Conclusion

K-fold cross-validation (including LOOCV) can be computed with ONE inverse of the full covariance matrix \(V\).

Lecture annotation This is the computational breakthrough:

  • Naive LOOCV requires $n$ inversions of $(n-1) \times (n-1)$ matrices
  • Using the precision matrix, you invert once

Joe explicitly notes the analogy with:

  • Linear models
  • Hat matrices
  • Influence functions
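A sketch of the K-fold version, assuming NumPy. It relies on the standard block form of the result above: for a held-out index set \(I\), \(\mathbb{E}(Y_I \mid Y_{-I}) = Y_I - Q_{II}^{-1} (QY)_I\) and \(\operatorname{Cov}(Y_I \mid Y_{-I}) = Q_{II}^{-1}\). The full \(n \times n\) inverse is computed once; each fold then needs only a small inverse of its own block. The function name and the fold layout are illustrative.

```python
import numpy as np

def kfold_via_precision(V, y, folds):
    """Held-out conditional mean and covariance per fold, from one inverse of V."""
    Q = np.linalg.inv(V)   # computed once, reused for every fold
    Qy = Q @ y
    out = []
    for idx in folds:
        cond_cov = np.linalg.inv(Q[np.ix_(idx, idx)])   # Cov(Y_I | Y_{-I})
        cond_mean = y[idx] - cond_cov @ Qy[idx]          # E(Y_I | Y_{-I})
        out.append((cond_mean, cond_cov))
    return out

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
V = A @ A.T + 8.0 * np.eye(8)   # arbitrary symmetric positive definite matrix
y = rng.standard_normal(8)
folds = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5, 6, 7])]
results = kfold_via_precision(V, y, folds)
```

With folds of size one this reduces to the LOOCV formulas, since the block \(Q_{II}\) is then the scalar \(Q_{ii}\).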

Model Validation Philosophy

Handwritten

Validation: least as for any algorithm

Lecture annotation

  • AIC/BIC: estimation-focused, penalized likelihood
  • CV / predictive scores: prediction-focused
  • RMSE is popular in ML but ignores uncertainty
  • Log predictive density and CDF-based scores are preferred
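To make the RMSE point concrete, here is a hedged sketch (NumPy assumed, test matrix arbitrary) that scores two models by leave-one-out prediction: the covariance that generated the data, and the same covariance scaled by 0.01. Scaling \(V\) leaves every ratio \(Q_{ij}/Q_{ii}\) unchanged, so the LOO point predictions and the RMSE are identical, while the LOO log predictive density separates the two.

```python
import numpy as np

def loo_scores(V, y):
    """LOO RMSE and mean LOO log predictive density under Y ~ N(0, V)."""
    Q = np.linalg.inv(V)
    loo_var = 1.0 / np.diag(Q)
    resid = (Q @ y) * loo_var   # y_i - E(Y_i | Y_{-i})
    rmse = np.sqrt(np.mean(resid ** 2))
    log_dens = -0.5 * (np.log(2.0 * np.pi * loo_var) + resid ** 2 / loo_var)
    return rmse, np.mean(log_dens)

rng = np.random.default_rng(3)
n = 40
A = rng.standard_normal((n, n))
V_true = A @ A.T / n + np.eye(n)   # arbitrary positive definite covariance
y = np.linalg.cholesky(V_true) @ rng.standard_normal(n)   # one draw from N(0, V_true)

rmse_true, lpd_true = loo_scores(V_true, y)
rmse_over, lpd_over = loo_scores(0.01 * V_true, y)   # overconfident model
```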

Simulation vs Emulation

Simulation

  • You know the model
  • Generate directly from it

Emulation

  • Model unknown or too expensive
  • Build a surrogate (often GP-based)

Joe emphasizes emulation is central in:

  • Computer experiments
  • Climate and weather modeling
  • Engineering design spaces
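A minimal sketch of a GP-based emulator, under assumptions that are not from the lecture: a zero-mean GP with a unit-variance squared-exponential kernel, a fixed length scale, and a cheap stand-in function playing the role of the expensive simulator. All names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential covariance with unit variance (assumed kernel)."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def expensive_simulator(x):
    """Hypothetical stand-in for a costly computer model."""
    return np.sin(x)

x_train = np.linspace(0.0, 6.0, 8)       # a handful of simulator runs
y_train = expensive_simulator(x_train)
x_new = np.array([1.5, 4.2])             # inputs where no run is available

jitter = 1e-8                            # small diagonal term for stability
K = sq_exp_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
K_star = sq_exp_kernel(x_new, x_train)

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
emu_mean = K_star @ alpha                                    # emulator prediction
v = np.linalg.solve(L, K_star.T)
emu_var = 1.0 - np.sum(v ** 2, axis=0)                       # emulator uncertainty
```

The emulator returns both a prediction and a variance at new inputs, which is what allows the predictive scores discussed earlier to be applied to it.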

Multivariate Normal Simulation

Assume: \(Y \sim \mathcal{N}(0, V)\)

You can always center data so mean is zero.


Cholesky Decomposition

\[V = LL^\top\]

Where $L$ is lower triangular.

Generate: \(Z \sim \mathcal{N}(0, I) \quad \Rightarrow \quad LZ \sim \mathcal{N}(0, V)\)

Lecture annotation Joe stresses:

  • Cholesky is numerically stable
  • Cheaper than general matrix inversion (the factorization costs about $n^3/3$ floating-point operations)
  • Essential for large covariance matrices

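mvrnorm() in MASS performs the same kind of matrix square-root step internally (via an eigendecomposition of the covariance rather than Cholesky), so you need to know this when it fails.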
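The recipe \(Y = LZ\) can be sketched as follows, assuming NumPy; the covariance matrix is an arbitrary example chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

V = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.5]])   # example covariance (positive definite)

L = np.linalg.cholesky(V)               # lower triangular, V = L @ L.T
Z = rng.standard_normal((3, 200_000))   # each column is a draw from N(0, I)
Y = L @ Z                               # each column is a draw from N(0, V)

sample_cov = np.cov(Y)                  # should be close to V
```

The covariance of \(LZ\) is \(L I L^\top = V\), which the sample covariance reproduces up to Monte Carlo error.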


Practical Notes

  • Always set random seeds for reproducibility
  • Random number streams can depend on the OS, compiler, and software version, so record the environment along with the seed
  • Large covariance matrices can be ill-conditioned
  • Precision matrices often behave better numerically
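Two of these points can be sketched in code, assuming NumPy. The first part shows that a fixed seed reproduces the same draws within one environment. The second shows one common workaround for an ill-conditioned covariance: adding a small multiple of the identity ("jitter") to the diagonal until the Cholesky factorization succeeds. The jitter schedule and function name are illustrative choices, not something prescribed in the lecture.

```python
import numpy as np

# Same seed, same stream (within one software environment).
draws_a = np.random.default_rng(123).standard_normal(3)
draws_b = np.random.default_rng(123).standard_normal(3)

def cholesky_with_jitter(V, max_tries=10):
    """Try Cholesky; on failure, retry with increasing jitter on the diagonal."""
    jitter = 0.0
    scale = np.mean(np.diag(V))
    for attempt in range(max_tries):
        try:
            return np.linalg.cholesky(V + jitter * np.eye(len(V))), jitter
        except np.linalg.LinAlgError:
            jitter = scale * 1e-10 * 10.0 ** attempt   # 1e-10, 1e-9, ...
    raise np.linalg.LinAlgError("not positive definite even with jitter")

V_singular = np.ones((3, 3))   # rank one: perfectly correlated variables
L, used_jitter = cholesky_with_jitter(V_singular)
```

Jitter changes the model slightly (it acts like a small independent noise term), so the amount used is worth reporting.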

Big Picture Takeaway

Elegant theory (Hilbert space projections, precision matrices)
→ Computational efficiency
→ Practical scalable validation for spatial models

Joe explicitly frames this as an example of why theory matters for real applied work.
