Bayesian View of Activation Patterns in ReLU Networks

Motivation

In standard neural networks, the activation matrix $A = \mathbf{1}(Z > 0)$ is treated as a deterministic byproduct of parameters $W, b$.

However, this hides an important structure:

The network behaves as a mixture of linear models indexed by activation patterns.

We instead treat $A$ as a latent (Bayesian) variable, allowing us to analyze the model as a probabilistic system.


Generative Formulation

We define a latent-variable model:

\[A \sim P(A \mid X, \theta)\] \[Y \sim P(Y \mid X, A, \theta)\]

For ReLU networks, we approximate:

\[P(A \mid X, \theta) \approx \text{Bernoulli}(\sigma(Z))\]

where:

\[Z = XW^\top + b\]

This yields:

A mixture of linear models with input-dependent gating.


Marginal Likelihood

The predictive distribution becomes:

\[P(Y \mid X, \theta) = \sum_A P(Y \mid X, A, \theta) P(A \mid X, \theta)\]

This smooths the hard partitioning induced by ReLU and gives a probabilistic interpretation of activation regions.


Covariance Decomposition

We define conditional covariance:

\[\Sigma_X^{(A)} = \operatorname{Cov}(X \mid A)\]

Then:

\[\operatorname{Cov}(X) = \mathbb{E}_A[\operatorname{Cov}(X \mid A)] + \operatorname{Cov}_A(\mathbb{E}[X \mid A])\]

Interpretation:

  • First term: within-region geometry
  • Second term: between-region structure

Gradient Structure

Let:

\[G = \nabla_\theta \log P(Y \mid X, \theta)\]

Then:

\[G = \mathbb{E}_A[\nabla_\theta \log P(Y \mid X, A, \theta)]\]

So gradients are expectations over latent activation regimes.

The covariance decomposes as:

\[\operatorname{Cov}(G) = \mathbb{E}_A[\operatorname{Cov}(G \mid A)] + \operatorname{Cov}_A(\mathbb{E}[G \mid A])\]

Interpretation

  • Neural networks implicitly average over activation regimes
  • Optimization noise arises from:
    • data variability
    • switching between regimes

Practical Approximations

Soft gating

Replace hard activations:

\[A = \mathbf{1}(Z > 0)\]

with:

\[p = \sigma(Z / \tau)\]

where $\tau$ controls sharpness.


Sampling activation patterns

\[A \sim \text{Bernoulli}(p)\]
A_sample = (torch.rand_like(p) < p).float()

Variational formulation

Learn an approximate posterior:

\[q(A \mid X)\]

Optimize:

\[\mathbb{E}_{q(A \mid X)}[\log P(Y \mid X, A)] - \mathrm{KL}(q(A \mid X) \| P(A \mid X))\]

New Analytical Objects

Treat covariance as random:

\[\Sigma_X(A)\]

Study:

  • $\mathbb{E}[\Sigma_X(A)]$$
  • $\operatorname{Var}(\Sigma_X(A))$$
  • eigenvector stability across $A$$

Research Directions

Stability of geometry

\[\text{Compare eigenvectors of } \Sigma_X^{(A)} \text{ across } A\]

Information in gating

\[I(A; X), \quad I(A; Y)\]

Gradient variance decomposition

\[\operatorname{Var}(G) = \mathbb{E}_A[\operatorname{Var}(G \mid A)] + \operatorname{Var}_A(\mathbb{E}[G \mid A])\]

Conceptual Summary

Treating $A$ as a latent variable turns a neural network into:

  • a mixture model over linear regimes
  • a stochastic system governing both geometry and optimization

One-line Insight

ReLU networks can be understood as latent-variable models where activation patterns define a distribution over local linear geometries and optimization dynamics.

Comments