Model

Edit this page on GitHub

Model

Generator of random vectors $X$ from unknown $P(X)$.
Superviser (oracle) - assigns $Y$ for each $X$ according to $P(y x)$
Learning machine (algorithm) $\mathcal{F}={f(x,\alpha):\alpha\in\Lambda}$

dataset ${(X_i,Y_i), i=1,…,n}$ from $P(x,y)=P(y

x)P(x)$.

loss function $L(y,f(x;\alpha))$

risk function $R(\alpha)=\int{L(y,f(x,\alpha))dP(x,y)}$

class $\mathcal{C}$ of subsets of sample space $S$.

classification error (a.k.a. generalization error) $R(C)$

$R(C)=P(Y\ne I_C(X))$

this comes from $R(C)=E(L(y,I_C(x)))$ where $L(y,I_C(x))= \begin{cases} 1, & \text{if } y=I_C(x) \\ 0, & \text{if } y\ne I_C(x) \end{cases}$

Empirical Risk (a.k.a. training error)

$R_n(\alpha)=\int{L(y,f(x,\alpha))dP_n(x,y)} \text{where } P_n(x,y)=\frac{1}{n} \sum_{i=1}^{n}I(X_i\le x, Y_i\le y)$

Empirical Measure $P_n$

random variable $X_1 ,…,X_n$ iid from measure space $(S,\mathcal{A},P)$.

$P_n(A)=\frac(1){n}\sum_{i=1}^n I(X_i\in A), A\in\mathcal{A}$

properties

$P_n$ is an unbiased estimator of $P$.
- e.g. $\mathbb{E}(P_n)=P$
$Var(P_n) = \frac{1}{n}P(A)(1-P(A))$
SLLN: $P_n \rightarrow P \text{ as } n\rightarrow \infty$
CLT: $\sqrt{n}(P_n(A)-P(A))\rightarrow N(0,P(A)(1-P(A)))$
$\mathcal{C}={A_1 ,…,A_n}, then \max{A\in\mathcal{C}} P_n(A)-P(A) \rightarrow 0 \text{ as } n\rightarrow\infty$

Useful Inequalities

Markov’s inequality $P(|X|\ge a)\le \frac{\mathbb{E}|X|^k}{a^k }$

Jenson’s inequality $\psi(\mathbb{E}X)\le\mathbb{E}\psi(X)$ where $\psi$ is convex.

convexity $\psi(\lambda x+(1-\lambda)y)\le \lambda\psi(x)+(1-\lambda)\psi(y)\quad\forall x,y \text{ and } \lambda\in[0,1]$

Chebyshev’s inequality $P(|\sum_{i=1}^n X_i |\ge t)\le \frac{1}{n^2} \sum_{i=1}^n \sigma_i^2$ for $t>0, X_1,…,X_n \text{ ind. r.v. with }\mathbb{E}X_i=0 \text{ and } Var(X_i)=\sigma^2_i$

Hoeffding’s inequality $P(|\sum_{i=1}^n X_i|\ge t)\le 2 exp\{-\frac{-t^2}{\sum_{i=1}^n(b_i-a_i)^2}\}$ for $X_1,…,X_n \text{ ind. r.v. with }\mathbb{E}X_i=0, a_i\le X_i\le b_i \text { and } t>0$.

Lemma: $\mathbb{E}e^{\lambda X}\le exp{\frac{\lambda^2}{8}(b-a)^2} \text{ for } a\le X\le b, \mathbb{E}x=0.$

Corollary: $P(

\sum_{i=1}^n a_i\epsilon_i

\ge t)\le 2 exp{-\frac{-t^2}{\sum_{i=1}^na_i^2}} \text{ for } \epsilon_1,…,\epsilon_n \overset{\text{ind.}}{\sim} \text{Rademacher}\left(\tfrac12\right), a_1,…,a_n\in\mathbb{R}$.

McDiarmid’s inequality $P(f(X_1,...,X_n)-\mathbb{E}f(X_1,...,X_n)\ge t)\le exp\{-\frac{2t^2}{\sum_{i=1}^n}\}$ for function $f : A^n->\mathbb{R}$ s.t. $\sup{x_1,…,x_n,x’_i\in A}|f(x_1,…,x_i,…,x_n)-f(x_1,…,x’_i,…,x_n)|\le c_i, i=1,…,n$

Berstein’s inequality $P(|\sum_{i=1}^n X_i|\ge t)\le 2 exp\{\frac{t^2}{2B^2_n+tM/3}\}$ for $X_1,…,X_n$ ind. r.v. with $\mathbb{E}X_i=0, |X_i|\le M, \text{ and } \mathbb{E}X_i^2=\sigma^2i$ where $B^2_n=\sum{i=1}^n\sigma^2_i$ and $t>0$.

lemma: $\mathbb{E}e^{\lambda X} \le exp{\sigma^2/M^2(e^{\lambda }-1-\lambda M)}$

Chernoff’s Bounded for $\lambda>0$ $P(\sum_{i=1}^n X_i\ge t)=P(exp\{\lambda\sum_{i=1}^n X_i\}\ge exp\{\lambda t\})$ then minimize w.r.t. $\lambda$.

Bennett’s inequality $P(|\sum_{i=1}^n X_i|\ge t)\le 2 exp\{\frac{B^2_n}{M^2}\psi(\frac{tM}{B^2_n})\}$ for $t>0$ where $\psi(u)=u-(1+u)ln(1+u)$

Definitions

Orlicz Norm $\Vert X\Vert_\psi = inf\{C>0: \mathbb{B}\psi(\frac{|X|}{C})\le 1\}, inf\emptyset =\infty$ where $\psi$ is non-decreasing real function with $\psi(0)=0$.

$\psi(u)=u^p, \Vert X\Vert_\psi=\Vert X\Vert_{L^p}$
$\psi(u)=e^{u^2}-1$ then for sub-gaussian r.v. $P( X \ge t)\le A exp{-t^2/\Delta^2}$ then $\Vert X\Vert_\psi\le\Delta\sqrt{1+A}$

Maximal inequality $\mathbb{E}\max\limits_{1\le i\le n}|X_i|\le\psi^{-1}(n)\max\limits_{1\le i \le n}\Vert X_i\Vert_\psi$ where $\psi$ is non-decreasing real function with $\psi(0)=0$.

Normal: $\mathbb{E}\max\limits_{1\le i\le n} X_i \le\sqrt{6ln(1+n)}\max\limits_{1\le i \le n}\sigma_i$
- $\text{ for }X_i\sim N(0,\sigma^2_i)$
Rademacher: $\mathbb{E}\max\limits_{1\le i\le n} X_i \le\sqrt{6ln(1+n)}\max\limits_{1\le i \le n}(\sum_{i=1}^m a^2_{ij})^{1/2}$
- $\text{ for }X_i=\sum_{i=1}^m a_{ij}\epsilon_j$

Symmetrization inequalites

$\mathbb{E}\Phi(\Vert\sum_{i=1}^n (X_i-\mathbb{E}X_i)\Vert_T)\le \mathbb{E}\Phi(2\Vert\sum_{i=1}^n\epsilon_i X_i\Vert_T)$ $\mathbb{E}\Phi(\Vert\sum_{i=1}^n (X_i-\mathbb{E}X_i)\Vert_T)\ge \mathbb{E}\Phi(\frac{1}{2}\Vert\sum_{i=1}^n\epsilon_i (X_i-\mathbb{E}X_i)\Vert_T)$

where ${X_i(r),t\in T}$ is ind. stochastic process with $\mathbb{E}X_i<\infty$. $\Phi$ is non-decreasing function on $\mathbb{R}_+$. $\epsilon_1,…,\epsilon_n\overset{\text{i.i.d.}}{\sim}\text{ Rademacher. }X_i\perp!!!\perp\epsilon_j \forall i,j.$

Ghost Copy process by which we create an independent copy of a random variable under the expectation.

Uniform Deviation $\Vert P_n-P\Vert_\mathcal{C}$

$\Vert P_n-P\Vert_\mathcal{C}=\sup\limits_{C\in\mathcal{C}}\Vert P_n(C)-P(C)\Vert$ where $P(C)$ is the true classification error (generalization error), $P_n(C)$ is the empiricial classification error (training error).

Complexity $\Delta^{\mathcal{C}}(F)$

class $\mathcal{C}$ of subsets of sample space $S$. finite subset $F={x_1,…,x_n}$ with $card(F)=n$.

$\Delta^{\mathcal{C}}(F)=card(\{C\cap F : C\in\mathcal{C}\})$ Alternately, $\Delta^{\mathcal{C}}(F)=card(\{(I_C(x_1),...,I_C(x_n)): C\in\mathcal{C}\})$ that is, the count of distinct binary vectors constructed from the inclusion/exclusion of each variable x_i in elements of class $\mathcal{C}$

$\Delta^{\mathcal{C}}(F)=2^{card(F)}$ if $\mathcal{C}$ contain all subsets of the sample space.

Uniform Diviation/Complexity Theorem $\mathbb{E}\Vert P_n-P\Vert_\mathcal{C}\le K\mathbb{E}\sqrt{\frac{1}{n}ln(1+\Delta^\mathcal{C}(X_1,...,X_n))}$ for $K>0$, $X_1,…,X_n$ i.i.d.r.v. on probablity space $(S,\mathcal{A},P)$ where class $\mathcal{C}\subset \mathcal{A}$ $\sigma$-algebra.

Empirical Log-Growth function $h_\mathcal{C}(n)$ $h_\mathcal{C}(n)=ln\Delta^\mathcal{C}(\{X_1,...,X_n\})$

Vapnik-Chervonenkis (VC) Theorem $\frac{h_\mathcal{C}(n)}{n}\underset{ n\rightarrow\infty}{\overset{P}{\rightarrow}} 0\ \Leftrightarrow \Vert P_n-P\Vert_\mathcal{C}\underset{ n\rightarrow\infty}{\overset{a.s.}{\rightarrow}} 0$

proposition 1: $\Vert P_n-P\Vert_\mathcal{C}\underset{ n\rightarrow\infty}{\overset{P}{\rightarrow}} 0\Leftrightarrow \Vert P_n-P\Vert_\mathcal{C}\underset{ n\rightarrow\infty}{\overset{a.s.}{\rightarrow}} 0 \Leftrightarrow \mathbb{E}\Vert P_n-P\Vert_\mathcal{C}\underset{ n\rightarrow\infty}{\rightarrow} 0$
proposition 2: $\frac{h_\mathcal{C}(n)}{n}\underset{ n\rightarrow\infty}{\overset{P}{\rightarrow}} 0\ \Leftrightarrow \frac{h_\mathcal{C}(n)}{n}\underset{ n\rightarrow\infty}{\overset{a.s.}{\rightarrow}} 0\ \Leftrightarrow \mathbb{E}\frac{h_\mathcal{C}(n)}{n}\underset{ n\rightarrow\infty}{\rightarrow} 0$

Projection operator $\pi_J$ $\pi_J(t_1,...,t_n)=(t_{j_1},...,t_{j_k})$ for index set $J={j_1,…,j_k)\subset {1,…,n}.

shatters - class $\mathcal{C}$ shatters set $F$ if complexity $\Delta^C(F)=2^{card(F)}$

VC Growth function $m^\mathcal{C}(n)$ (a.k.a. shattering function) $m^\mathcal{C}(n)=max\{\Delta^\mathcal{C}(F) : F\subset S, card(F)=n\}$

$\forall n\ge 1 m^\mathcal{C}(n)=2^n\Leftrightarrow \forall n\ge 1 \exists F\subset S: card(F)=n, \mathcal{C} \text{ shatters }F$.
$\exists n\ge 1 m^\mathcal{C}(n)<2^n\Rightarrow \forall n’>n m^\mathcal{C}(n’)< 2^{n’}$

VC Dimension $V(\mathcal{C})=min\{n : m^\mathcal{C}(n)<2^n\}$ That is, VC dimension is the last n in which the growth function is ^2^n^.

$S=\mathbb{R}, \mathcal{C}={(-\infty,t] :t\in\mathbb{R}}\Rightarrow V(\mathcal{C})=2$
$S=\mathbb{R}, \mathcal{C}={(a,b) : a<b, a,b\in\mathbb{R}}\Rightarrow V(\mathcal{C})=3$
$S\in\mathbb{R}^d$, All linear halfspaces $\mathcal{H}\Rightarrow V(\mathcal{H})=d+1$
$S\in\mathbb{R}^d$, All affine halfspaces $\mathcal{H}\Rightarrow V(\mathcal{H})=d+2$
$S\in\mathbb{R}^d$, All closed balls $\mathcal{B}\Rightarrow V(\mathcal{B}) \le d+2$
$S\in\mathbb{R}^d$, k or less sided Polyhedra $\mathcal{H}\Rightarrow V(\mathcal{H})<\infty$ (therefore VC class).

Vapnik-Chervonenkis (VC) Class if $V(\mathcal{C})<\infty$

Growth Function/VC Class Theorem

If $\mathcal{C}$ is a VC class, then $m^\mathcal{C}(n)\le\binom{n}{\le k-1}$.
If $\mathcal{C}$ is a VC class, then $m^\mathcal{C}(n)=2^n$.

Glivenko-Cantelli (GC) class if $\Vert P_n-P\Vert_\mathcal{C}\overset{a.s.}{\rightarrow}0$

$\mathcal{C}$ is universial GC class $\Leftrightarrow \forall P\in\mathcal{P}(S)$ probability measures on $(S,\mathcal{A}), $\mathcal{C}$ is GC class w.r.t. $P$.
$\mathcal{C}$ is uniform GC class $\Leftrightarrow \sup\limits_{P\in\mathcal{P}(S)}\mathbb{E}\Vert P_n-P\Vert_\mathcal{C}\underset{n\rightarrow \infty}{\rightarrow}0$.

GC/VC Theorems

$\mathcal{C}$ is VC class $\Leftrightarrow \mathcal{C}$ is uniform GC class.
$\mathcal{C}$ is VC class $\Rightarrow \mathcal{C}$ is universal GC class.

Polynomial VC Dimension $\mathcal{P}_(k,d) = \{\text{All degree k or less polynomials on }\mathbb{R}^d\}$

Neural Network VC Dimension $m^{N^{\sigma_t}_{k,x}}\le \binom{n}{\le d+1}^k\binom{n}{\le k+1}$ where $N^{\sigma_t}_{k,x} : \mathbb{R}^d->{0,1}$ is a neural network with $\sigma_t$ activiation functions with a single hidden layer with $k$-neurons.

Edit this page on GitHub

Model

Useful Inequalities

Definitions

Comments