Lecture 06 — Efficient Leave-One-Out Prediction, Precision Matrices, and Simulation
Date: Wednesday, January 28, 2026
Course: STT997GP
Predictive Scores: CDF vs Indicator
Handwritten note
Score compares CDF F vs indicator 1{Y_i ≤ ŷ}
Lecture annotation
Joe is contrasting proper scoring rules that use the entire predictive distribution (e.g. CDF-based scores, log predictive density) with crude indicators or pointwise losses. This distinction matters because RMSE ignores uncertainty, which is unacceptable from a statistical perspective.
Setup: Drop-one Prediction
We can integrate over uncertainty.
Given observations
\(Y = (Y_1, \dots, Y_n)\)
Define the leave-one-out predictor: \(\hat{Y}_i = \mathbb{E}(Y_i \mid Y_{-i})\)
“Drop ith value”
Precision Matrix Formulation
Let
\(Y \sim \mathcal{N}(0, V)\)
Define the precision matrix \(Q = V^{-1} = (Q_{ij})\)
Lecture annotation
Joe emphasizes that conditioning in multivariate normals is easiest in terms of the precision matrix, not the covariance matrix. This is the key computational trick that avoids refitting $n$ times for LOOCV.
Conditional Mean via Precision Matrix
\[\hat{Y}_i = \mathbb{E}(Y_i \mid Y_j, j \neq i) = -\sum_{j \neq i} \frac{Q_{ij}}{Q_{ii}} Y_j\]
This gives the projection of $Y_i$ onto the span of the remaining variables.
Conditional Variance
\[\operatorname{Var}(Y_i \mid Y_{-i}) = \frac{1}{Q_{ii}}\]
Lecture annotation
Joe notes this is a special and very useful property of the multivariate normal. You get the conditional variance for free once you know the precision matrix.
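The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not code from the lecture: the function name `loo_from_precision`, the exponential covariance, and the grid `s` are assumptions chosen for the example. It uses the identity $-\sum_{j \neq i} (Q_{ij}/Q_{ii}) Y_j = Y_i - (QY)_i / Q_{ii}$, and checks one index against the usual covariance-block conditioning.

```python
import numpy as np

def loo_from_precision(V, y):
    """Leave-one-out conditional means and variances for y ~ N(0, V)."""
    Q = np.linalg.inv(V)        # precision matrix, computed once
    d = np.diag(Q)              # the diagonal entries Q_ii
    cond_var = 1.0 / d          # Var(Y_i | Y_{-i}) = 1 / Q_ii
    # -sum_{j != i} (Q_ij / Q_ii) y_j  equals  y_i - (Q y)_i / Q_ii
    cond_mean = y - (Q @ y) / d
    return cond_mean, cond_var

# Hypothetical example: exponential covariance on a small 1-D grid.
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 8)
V = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.3)
y = rng.multivariate_normal(np.zeros(len(s)), V)

mean_fast, var_fast = loo_from_precision(V, y)

# Naive check for a single index: drop it and condition via covariance blocks.
i = 3
rest = np.delete(np.arange(len(s)), i)
w = np.linalg.solve(V[np.ix_(rest, rest)], V[rest, i])
mean_naive = w @ y[rest]
var_naive = V[i, i] - V[i, rest] @ w
```

The naive route needs one $(n-1) \times (n-1)$ solve per left-out point, while the precision route reuses a single inverse for all $n$ points.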
Orthogonality Argument (Why the Formula Works)
We want: \(\langle Y_i - \hat{Y}_i, Y_k \rangle = 0 \quad (k \neq i)\)
Plug in: \(\left\langle Y_i + \sum_{j \neq i} \frac{Q_{ij}}{Q_{ii}} Y_j,\; Y_k \right\rangle\)
Rewrite: \(= \frac{1}{Q_{ii}} \left\langle \sum_{j=1}^n Q_{ij} Y_j,\; Y_k \right\rangle\)
Inner Product Interpretation
\[\langle \xi, \eta \rangle = \operatorname{Cov}(\xi, \eta) = \mathbb{E}[\xi \eta]\]
The last equality holds because the variables have mean zero. So: \(\frac{1}{Q_{ii}} \sum_{j=1}^n Q_{ij} \langle Y_j, Y_k \rangle = \frac{1}{Q_{ii}} \sum_{j=1}^n Q_{ij} V_{jk}\)
Matrix form: \(= \frac{1}{Q_{ii}} (QV)_{ik}\)
Since $QV = I$: \((QV)_{ik} = \delta_{ik}\)
Thus for $k \neq i$: \(\langle Y_i - \hat{Y}_i, Y_k \rangle = 0\)
Lecture annotation
Joe explicitly connects this to linear projection in Hilbert space. The residual is orthogonal to the span of the conditioning variables.
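The orthogonality argument can be checked numerically. In the sketch below (the 3 x 3 covariance matrix is an arbitrary assumption for illustration), the residual $Y_i - \hat{Y}_i$ is written as $a^\top Y$ with $a_j = Q_{ij}/Q_{ii}$, so its covariance with each $Y_k$ is $(a^\top V)_k = \delta_{ik}/Q_{ii}$.

```python
import numpy as np

# Arbitrary symmetric positive definite covariance, chosen for illustration.
V = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])
Q = np.linalg.inv(V)

i = 0
a = Q[i] / Q[i, i]       # residual Y_i - Yhat_i = sum_j a_j Y_j

cov_resid = a @ V        # Cov(residual, Y_k) for k = 0, 1, 2
resid_var = a @ V @ a    # squared norm of the residual
```

Under these assumptions, `cov_resid` should vanish at every index other than `i`, and `resid_var` should match `1 / Q[i, i]`, the conditional variance.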
Conditional Variance (Geometric View)
\[\operatorname{Var}(Y_i \mid Y_{-i}) = \|Y_i - \hat{Y}_i\|^2 = \|Y_i\|^2 - \|\hat{Y}_i\|^2 = \frac{1}{Q_{ii}}\]
Key Conclusion
K-fold cross-validation (including LOOCV) can be computed with ONE matrix inverse.
Lecture annotation
This is the computational breakthrough:
- Naive LOOCV requires $n$ inversions of $(n-1) \times (n-1)$ matrices
- Using the precision matrix, you invert once
Joe explicitly notes the analogy with:
- Linear models
- Hat matrices
- Influence functions
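For K-fold rather than leave-one-out, the same idea extends to blocks. For a held-out index set $I$, the standard Gaussian identities give $\mathbb{E}(Y_I \mid Y_{-I}) = Y_I - Q_{II}^{-1} (QY)_I$ and $\operatorname{Cov}(Y_I \mid Y_{-I}) = Q_{II}^{-1}$. The sketch below is an illustration under assumed names (`fold_prediction`, the exponential covariance, 3 folds of 3 points); it still solves a small $|I| \times |I|$ system per fold, but only one $n \times n$ inverse is needed.

```python
import numpy as np

def fold_prediction(Q, y, idx):
    """Conditional mean and covariance of y[idx] given the rest, using Q = V^{-1}."""
    Q_II = Q[np.ix_(idx, idx)]
    # E(Y_I | Y_-I) = y_I - Q_II^{-1} (Q y)_I
    mean = y[idx] - np.linalg.solve(Q_II, (Q @ y)[idx])
    cov = np.linalg.inv(Q_II)   # Cov(Y_I | Y_-I) = Q_II^{-1}
    return mean, cov

# Hypothetical example data.
rng = np.random.default_rng(1)
n = 9
s = np.linspace(0.0, 1.0, n)
V = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.3)
y = rng.multivariate_normal(np.zeros(n), V)

Q = np.linalg.inv(V)                          # the single n x n inverse
folds = np.array_split(rng.permutation(n), 3)
preds = [fold_prediction(Q, y, idx) for idx in folds]

# Naive check for the first fold, conditioning via covariance blocks.
idx = folds[0]
rest = np.setdiff1d(np.arange(n), idx)
W = np.linalg.solve(V[np.ix_(rest, rest)], V[np.ix_(rest, idx)])
mean_naive = W.T @ y[rest]
cov_naive = V[np.ix_(idx, idx)] - V[np.ix_(idx, rest)] @ W
```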
Model Validation Philosophy
Handwritten
Validation: least as for any algorithm
Lecture annotation
- AIC/BIC: estimation-focused, penalized likelihood
- CV / predictive scores: prediction-focused
- RMSE is popular in ML but ignores uncertainty
- Log predictive density and CDF-based scores are preferred
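To make the contrast with RMSE concrete, here is a small sketch of two distribution-based scores for a Gaussian predictive distribution, such as the leave-one-out mean and variance. The function names are assumptions for this example. The CRPS uses the standard closed form for a normal forecast; it equals the integral of $(F(x) - 1\{y \le x\})^2$ over $x$, which is the CDF-versus-indicator comparison.

```python
import math

def gaussian_log_score(y, mu, var):
    """Log predictive density of y under N(mu, var); larger is better."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)

def gaussian_crps(y, mu, var):
    """CRPS of the forecast N(mu, var) at observation y; smaller is better."""
    sigma = math.sqrt(var)
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# Two forecasts with the same mean, hence the same squared error,
# but very different stated uncertainty.
y_obs, mu = 1.0, 0.0
honest = (gaussian_log_score(y_obs, mu, 1.0), gaussian_crps(y_obs, mu, 1.0))
overconfident = (gaussian_log_score(y_obs, mu, 0.01), gaussian_crps(y_obs, mu, 0.01))
```

Both forecasts have identical squared error, yet the overconfident one receives a much lower log score and a higher CRPS, which is the information RMSE discards.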
Simulation vs Emulation
Simulation
- You know the model
- Generate directly from it
Emulation
- Model unknown or too expensive
- Build a surrogate (often GP-based)
Joe emphasizes emulation is central in:
- Computer experiments
- Climate and weather modeling
- Engineering design spaces
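As a rough illustration of emulation, the sketch below fits a zero-mean GP surrogate to a handful of runs of a stand-in "expensive" function. Everything here is an assumption for the example: the `simulator` function, the squared-exponential kernel with fixed length scale, the unit prior variance, and the small nugget added for numerical stability. In practice the kernel and its parameters would be estimated.

```python
import numpy as np

def sq_exp(a, b, length=0.2):
    """Squared-exponential kernel with unit variance (assumed for illustration)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def simulator(x):
    """Hypothetical stand-in for an expensive computer model."""
    return np.sin(2.0 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 8)     # a few design points
y_train = simulator(x_train)           # the only "expensive" runs
x_new = np.array([0.33, 0.71])         # where we want cheap predictions

K = sq_exp(x_train, x_train) + 1e-8 * np.eye(len(x_train))   # small nugget
K_star = sq_exp(x_new, x_train)

emu_mean = K_star @ np.linalg.solve(K, y_train)
# Predictive variance: k(x, x) - k_*^T K^{-1} k_*, with k(x, x) = 1 here.
emu_var = 1.0 - np.einsum("ij,ji->i", K_star, np.linalg.solve(K, K_star.T))
```

The emulator returns both a prediction and a variance at each new input, so the predictive scores discussed earlier apply to it directly.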
Multivariate Normal Simulation
Assume: \(Y \sim \mathcal{N}(0, V)\)
You can always center the data so that the mean is zero; for a nonzero mean \(\mu\), add \(\mu\) back after simulating.
Cholesky Decomposition
\[V = LL^\top\]
where $L$ is lower triangular.
Generate: \(Z \sim \mathcal{N}(0, I) \quad \Rightarrow \quad LZ \sim \mathcal{N}(0, V)\)
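The recipe above can be sketched in NumPy as follows. The 3 x 3 covariance matrix, the seed, and the number of draws are assumptions chosen for illustration. The check works because $\operatorname{Cov}(LZ) = L I L^\top = V$.

```python
import numpy as np

rng = np.random.default_rng(2026)     # fixed seed for reproducibility

# Arbitrary positive definite covariance, chosen for illustration.
V = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.6],
              [0.2, 0.6, 1.0]])

L = np.linalg.cholesky(V)             # lower triangular factor, V = L @ L.T
Z = rng.standard_normal((3, 100_000)) # independent N(0, 1) entries
Y = L @ Z                             # each column is one draw from N(0, V)

V_hat = np.cov(Y)                     # sample covariance (rows are variables)
```

With this many draws, `V_hat` should be close to `V` up to Monte Carlo error.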
Lecture annotation
Joe stresses:
- Cholesky is numerically stable
- Much faster than general matrix inversion
- Essential for large covariance matrices
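Note that mvrnorm() in MASS factorizes the covariance with an eigendecomposition rather than Cholesky. Either way, you need to understand the factorization when it fails, for example when $V$ is not numerically positive definite.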
Practical Notes
- Always set random seeds for reproducibility
- Even with a fixed seed, results can differ across operating systems, compilers, and software versions
- Large covariance matrices can be ill-conditioned
- Precision matrices often behave better numerically
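One common workaround when the Cholesky factorization fails on an ill-conditioned covariance is to add a small "jitter" to the diagonal. This technique was not stated in the lecture; the sketch below is an assumption-laden illustration (the function name, starting jitter, and retry schedule are all hypothetical choices), using a smooth squared-exponential covariance on a dense grid as an example of a matrix that may not be numerically positive definite.

```python
import numpy as np

def cholesky_with_jitter(V, start=1e-10, max_tries=8):
    """Try Cholesky; on failure, add increasing multiples of the identity."""
    scale = np.mean(np.diag(V))
    jitter = 0.0
    for attempt in range(max_tries + 1):
        try:
            L = np.linalg.cholesky(V + jitter * np.eye(len(V)))
            return L, jitter
        except np.linalg.LinAlgError:
            # Grow the jitter by a factor of 10 for the next attempt.
            jitter = start * scale * 10.0 ** attempt
    raise np.linalg.LinAlgError("not positive definite even with jitter")

# Hypothetical ill-conditioned example: smooth kernel on a dense grid.
s = np.linspace(0.0, 1.0, 200)
V = np.exp(-0.5 * ((s[:, None] - s[None, :]) / 0.5) ** 2)

cond = np.linalg.cond(V)               # very large for this kind of matrix
L, jitter = cholesky_with_jitter(V)    # jitter is 0.0 if no fix was needed
```

The returned `jitter` records how much the diagonal was perturbed, so the change to the model stays visible rather than silent.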
Big Picture Takeaway
Elegant theory (Hilbert space projections, precision matrices)
→ Computational efficiency
→ Practical scalable validation for spatial models
Joe explicitly frames this as an example of why theory matters for real applied work.