Homework 05

Joseph M. Weaver 2026-02-27


For the first three problems in this homework, the kernel is assumed to be the linear kernel
K(x,x)=xx,  x,xRd.K(x, x′) = x^\top x′, \; x, x′ \in \mathbb{R}^d.


Problem 1: RKHS of the Linear Kernel

Let K:Rd×RdRK : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} be defined by
K(x,x)=xx.K(x, x′) = x^\top x′.

(a) Define explicitly the reproducing kernel Hilbert space (RKHS) HKH_K associated with KK and show that for every fHf \in H, there exists a βRd\beta \in \mathbb{R}^d such that f(x)=xβf(x) = x^\top \beta. In addition, for two functions in HH, f(x)=xβf(x) = x^\top \beta and g(x)=xαg(x) = x^\top \alpha, the inner product is
f,gH=βα.\langle f, g \rangle_H = \beta^\top \alpha.

Solution (1a)

Claim 1: For all fHKf \in H_K, there exists βRd\beta \in \mathbb{R}^d such that

f(x)=xβ.f(x) = x^\top \beta.

Write

f(x)=i=1nciK(xi,x)=i=1ncixix=(i=1ncixi)x.f(x) = \sum_{i=1}^n c_i K(x_i, x) = \sum_{i=1}^n c_i x_i^\top x = \left( \sum_{i=1}^n c_i x_i \right)^\top x.

Thus define

β=i=1ncixi,\beta = \sum_{i=1}^n c_i x_i,

so that

f(x)=xβ.f(x) = x^\top \beta.


Claim 2: If f(x)=xβf(x) = x^\top \beta and g(x)=xαg(x) = x^\top \alpha, then

f,gH=βα.\langle f, g \rangle_H = \beta^\top \alpha.

Using bilinearity and the reproducing property,

f,gH=iciK(xi,),jdjK(yj,)H=i,jcidjK(xi,yj).\langle f, g \rangle_H = \left\langle \sum_i c_i K(x_i, \cdot), \sum_j d_j K(y_j, \cdot) \right\rangle_H = \sum_{i,j} c_i d_j K(x_i, y_j).

Since K(xi,yj)=xiyjK(x_i, y_j) = x_i^\top y_j,

i,jcidjxiyj=(icixi)(jdjyj)=βα.\sum_{i,j} c_i d_j x_i^\top y_j = \left( \sum_i c_i x_i \right)^\top \left( \sum_j d_j y_j \right) = \beta^\top \alpha.


(b) Verify the reproducing property, namely that for any fHKf \in H_K and any xRdx \in \mathbb{R}^d,

f(x)=f,K(x,)HK.f(x) = \langle f, K(x, \cdot) \rangle_{H_K}.

Solution (1b)

Let f(x)=i=1nciK(xi,x)f(x) = \sum_{i=1}^n c_i K(x_i, x). Then

f,K(x,)HK=iciK(xi,),K(x,)HK=iciK(xi,),K(x,)HK.\langle f, K(x, \cdot) \rangle_{H_K} = \left\langle \sum_i c_i K(x_i, \cdot), K(x, \cdot) \right\rangle_{H_K} = \sum_i c_i \langle K(x_i, \cdot), K(x, \cdot) \rangle_{H_K}.

By the reproducing property,

K(xi,),K(x,)HK=K(xi,x).\langle K(x_i, \cdot), K(x, \cdot) \rangle_{H_K} = K(x_i, x).

Thus,

f,K(x,)HK=iciK(xi,x)=f(x).\langle f, K(x, \cdot) \rangle_{H_K} = \sum_i c_i K(x_i, x) = f(x).


Problem 2: Ridge Regression and Kernel Ridge Regression

Let {(xi,yi)}i=1n\{(x_i, y_i)\}_{i=1}^n be training data with xiRdx_i \in \mathbb{R}^d and yiRy_i \in \mathbb{R}, and let λ>0\lambda > 0 be a regularization parameter.

(a) (Ridge regression: primal form) Consider the optimization problem

minβRd{i=1n(yixiβ)2+λβ22}.\min_{\beta \in \mathbb{R}^d} \left\{ \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2 \right\}.

Derive the closed-form solution β^\hat{\beta}.

Solution (2a)

Rewrite in matrix form:

YXβ22+λβ22.\|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2.

Taking derivatives:

β=2X(XβY)+2λβ=0.\frac{\partial}{\partial \beta} = 2X^\top (X\beta - Y) + 2\lambda \beta = 0.

Thus,

XXβ+λβ=XY,X^\top X \beta + \lambda \beta = X^\top Y,

so

β^=(XX+λI)1XY.\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top Y.


(b) (Kernel ridge regression) Let K(x,x)=xxK(x, x′) = x^\top x′ and let KRn×nK \in \mathbb{R}^{n\times n} be the kernel matrix with entries Kij=K(xi,xj)K_{ij} = K(x_i, x_j). Consider

minfHK{i=1n(yif(xi))2+λfHK2}.\min_{f \in H_K} \left\{ \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{H_K}^2 \right\}.

By the Representer Theorem, the solution has the form

f^(x)=i=1nαiK(xi,x).\hat{f}(x) = \sum_{i=1}^n \alpha_i K(x_i, x).

Derive the expression for α^.\hat{\alpha}.

Solution (2b)

Using the representer form, write

f(xi)=(Kα)i.f(x_i) = (K\alpha)_i.

Thus the objective becomes

YKα22+λαKα.\|Y - K\alpha\|_2^2 + \lambda \alpha^\top K \alpha.

Taking derivatives:

2K(YKα)+2λKα=0.-2K(Y - K\alpha) + 2\lambda K\alpha = 0.

Rearranging,

(K+λI)α=Y.(K + \lambda I)\alpha = Y.

Hence,

α^=(K+λI)1Y.\hat{\alpha} = (K + \lambda I)^{-1} Y.


(c) Show that the kernel ridge regression predictor can be written as

f^(x)=xβ^,\hat{f}(x) = x^\top \hat{\beta},

where β^\hat{\beta} coincides with the ridge regression solution obtained in part (a).

Hint: Apply the SVD of an n×dn \times d matrix XX to derive

X(XX+λIn)1=(XX+λId)1X.X^\top (XX^\top + \lambda I_n)^{-1} = (X^\top X + \lambda I_d)^{-1} X^\top.

Question 2(c)

Claim: f^(x)=xβ^.\hat{f}(x) = x^\top \hat{\beta}.

Let X=UΣVX = U \Sigma V^\top. Then

XX=UΣVVΣU=UΣ2U,XX^\top = U \Sigma V^\top V \Sigma U^\top = U \Sigma^2 U^\top,

XX=VΣUUΣV=VΣ2V.X^\top X = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top.

Now compute

X(XX+λIn)1.X^\top (XX^\top + \lambda I_n)^{-1}.

Using the SVD,

X(XX+λIn)1=VΣU(UΣ2U+λUU)1.X^\top (XX^\top + \lambda I_n)^{-1} = V \Sigma U^\top \left( U \Sigma^2 U^\top + \lambda U U^\top \right)^{-1}.

Since U1=UU^{-1} = U^\top, we factor:

(UΣ2U+λUU)1=(U(Σ2+λI)U)1=U(Σ2+λI)1U.\left( U \Sigma^2 U^\top + \lambda U U^\top \right)^{-1} = \left( U (\Sigma^2 + \lambda I) U^\top \right)^{-1} = U (\Sigma^2 + \lambda I)^{-1} U^\top.

Thus,

X(XX+λIn)1=VΣUU(Σ2+λI)1U=VΣ(Σ2+λI)1U.X^\top (XX^\top + \lambda I_n)^{-1} = V \Sigma U^\top U (\Sigma^2 + \lambda I)^{-1} U^\top = V \Sigma (\Sigma^2 + \lambda I)^{-1} U^\top.

Similarly,

(XX+λIn)1X=(VΣ2V+λVV)1VΣU.(XX^\top + \lambda I_n)^{-1} X^\top = \left( V \Sigma^2 V^\top + \lambda V V^\top \right)^{-1} V \Sigma U^\top.

Factoring,

=(V(Σ2+λI)V)1VΣU=V(Σ2+λI)1ΣU.= \left( V (\Sigma^2 + \lambda I) V^\top \right)^{-1} V \Sigma U^\top = V (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top.

Since Σ\Sigma and (Σ2+λI)1(\Sigma^2 + \lambda I)^{-1} commute (both diagonal),

Σ(Σ2+λI)1=(Σ2+λI)1Σ.\Sigma (\Sigma^2 + \lambda I)^{-1} = (\Sigma^2 + \lambda I)^{-1} \Sigma.

Therefore,

X(XX+λIn)1=(XX+λId)1X.X^\top (XX^\top + \lambda I_n)^{-1} = (X^\top X + \lambda I_d)^{-1} X^\top.


Thus,

f^(x)=k(x)(K+λI)1Y=xX(XX+λI)1Y=x(XX+λI)1XY=xβ^.\hat{f}(x) = k(x)^\top (K + \lambda I)^{-1} Y = x^\top X^\top (XX^\top + \lambda I)^{-1} Y = x^\top (X^\top X + \lambda I)^{-1} X^\top Y = x^\top \hat{\beta}.


Problem 4

(a) Show for D=[0,1]2D = [0,1]^2 and the exponential or Gaussian kernel KK, the constant function f(x)=1f(x) = 1, xD\forall x \in D, is not in the RKHS associated with KK.

Solution (4a)

Assume for contradiction that

1=DK(x,x)dxfor all xD.1 = \int_D K(x, x') dx' \quad \text{for all } x \in D.

Let x0=(0,0)x_0 = (0,0) and x1/2=(1/2,1/2)x_{1/2} = (1/2,1/2).

Since Gaussian/exponential kernels depend on xx\|x - x'\|, points near the center have more nearby mass than points at the boundary.

Thus

DK(x0,x)dxDK(x1/2,x)dx.\int_D K(x_0, x') dx' \neq \int_D K(x_{1/2}, x') dx'.

Therefore the integral cannot be constant on DD, a contradiction.

Hence the constant function is not in the RKHS.


(b) Discuss how to generalize the Representer Theorem to better fit the data (i.e., to achieve smaller SSE). Derive explicit solutions if possible.

Question 4(b)

We generalize by introducing an intercept term bRb \in \mathbb{R} and solve

minfHK,bRi=1n(yif(xi)b)2+λfHK2.\min_{f \in H_K,\, b \in \mathbb{R}} \sum_{i=1}^n (y_i - f(x_i) - b)^2 + \lambda \|f\|_{H_K}^2.

By the Representer Theorem,

f(x)=j=1nαjK(x,xj).f(x) = \sum_{j=1}^n \alpha_j K(x, x_j).

Thus the objective becomes

YKαb122+λαKα.\|Y - K\alpha - b\mathbf{1}\|_2^2 + \lambda \alpha^\top K \alpha.


Gradient with respect to α\alpha

α=2(K)(YKαb1)+2λKα.\nabla_\alpha = 2(-K)(Y - K\alpha - b\mathbf{1}) + 2\lambda K\alpha.

Setting equal to zero,

2K(YKαb1)+2λKα=0.-2K(Y - K\alpha - b\mathbf{1}) + 2\lambda K\alpha = 0.

Rearranging,

KY=K(K+λI)α+bK1.KY = K(K+\lambda I)\alpha + bK\mathbf{1}.


Gradient with respect to bb

b=21(YKαb1).\nabla_b = 2\mathbf{1}^\top (Y - K\alpha - b\mathbf{1}).

Setting equal to zero,

1Y=1Kα+nb.\mathbf{1}^\top Y = \mathbf{1}^\top K\alpha + nb.


Combine the two equations

These conditions can be written as the linear system

[K+λI110][αb]=[Y0].\begin{bmatrix} K + \lambda I & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} Y \\ 0 \end{bmatrix}.

Thus,

y^(x)=j=1nα^jK(x,xj)+b^.\hat{y}(x) = \sum_{j=1}^n \hat{\alpha}_j K(x, x_j) + \hat{b}.