
Two sample problem (2): kernel function, feature space and reproducing kernel map

Find Two sample problem (1) here.

We will take a look at the RKHS (Reproducing Kernel Hilbert Space) in this post. It may sound like a purely statistical term, but it is remarkable for its wide range of applications. You will need to refresh your memory of some linear algebra computations. We start with some basic terms and definitions.

Statistical Distance

A statistical distance measures the distance between two statistical objects, e.g. two random variables, two probability distributions, or two samples.
It is simple to calculate the distance between two vectors; cosine similarity, for example, is a common way. But when it comes to the divergence between two probability distributions, things are more complicated. The Kullback-Leibler divergence (also called relative entropy) is one method: D(P||Q) denotes the amount of information lost when Q is used to approximate the true distribution P. Note that it is not symmetric: D(P||Q) \neq D(Q||P). Another notable measure is the Maximum Mean Discrepancy (MMD), which is defined in terms of kernel embeddings of distributions.

Kernel Function and Feature Space

[Figure: mapping data from a 2-D input space to a 3-D feature space]

A feature map is a function \phi that maps data from an original space X to another space H, usually called the feature space. Keep in mind that we have x \rightarrow \phi(x), and that x and x' refer to different samples or points.
In the picture above, we map data from a 2-D space to a 3-D feature space. Given this particular mapping, if we calculate the inner product of two points in the feature space, we find, perhaps surprisingly, that it is just the square of the inner product of the two points x and x' in the original space. We can therefore define a kernel function \kappa(x,x') = (<x,x'>)^2 and never compute the mapping function \phi explicitly.
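This identity is easy to verify numerically. The sketch below uses the standard explicit feature map \phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) for 2-D inputs; this particular \phi is an illustrative choice (the post does not specify one), but it is the usual map that realizes \kappa(x,x') = (<x,x'>)^2:

```python
import numpy as np

# Explicit feature map from 2-D to 3-D: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
# An illustrative choice realizing the kernel k(x, x') = <x, x'>^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, xp):
    # Squared inner product in the ORIGINAL 2-D space.
    return np.dot(x, xp) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

# The inner product in the 3-D feature space equals kernel(x, xp)
# (up to floating-point error), so phi never needs to be computed.
print(np.dot(phi(x), phi(xp)))  # same value as below, ~1.0
print(kernel(x, xp))            # 1.0
```

This "kernel trick" matters because the feature space can be very high- (or infinite-) dimensional while the kernel stays cheap to evaluate.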

Proposition

The Cauchy-Schwarz inequality for kernels:

|\kappa (x,z)|^{ 2 }\leq \kappa (x,x)\kappa (z,z)

Proof: the Gram matrix (kernel matrix) for x and z is a positive semi-definite matrix.

\kappa =\begin{bmatrix} \kappa (x,x) & \kappa (z,x) \\ \kappa (z,x) & \kappa (z,z) \end{bmatrix}=\begin{bmatrix} \kappa (x,x) & \kappa (x,z) \\ \kappa (x,z) & \kappa (z,z) \end{bmatrix}\geq 0

So its determinant is non-negative:

det(\kappa )=\begin{vmatrix} \kappa (x,x) & \kappa (z,x) \\ \kappa (z,x) & \kappa (z,z) \end{vmatrix} = \kappa(x,x)\kappa(z,z)-\kappa^2(x,z)\geq0
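The inequality can be checked numerically. A minimal sketch, assuming the squared inner-product kernel from the earlier example and a few random input pairs:

```python
import numpy as np

# Squared inner-product kernel (the example kernel from earlier).
def kernel(x, z):
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)

for _ in range(5):
    x = rng.normal(size=2)
    z = rng.normal(size=2)
    # Build the 2x2 Gram matrix for the pair (x, z).
    gram = np.array([[kernel(x, x), kernel(x, z)],
                     [kernel(x, z), kernel(z, z)]])
    # PSD Gram matrix => non-negative determinant => Cauchy-Schwarz holds.
    assert np.linalg.det(gram) >= -1e-9
    assert kernel(x, z) ** 2 <= kernel(x, x) * kernel(z, z) + 1e-9
```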

Distance and angle in feature space

In the feature space, we can define the distance between two samples:

||\phi (x)-\phi (x')||^{ 2 }=(\phi (x)-\phi (x'))^{ T }(\phi (x)-\phi (x'))\\ =<\phi (x),\phi (x)>-2<\phi (x),\phi (x')>+<\phi (x'),\phi (x')>\\ =\kappa (x,x)-2\kappa (x,x')+\kappa (x',x')

The inner product is defined as:

<\phi (x),\phi (x')>=||\phi(x)||\cdot||\phi(x')||\cos\theta

So we simply calculate:

\cos \theta =\frac { <\phi (x),\phi (x')> }{ ||\phi (x)||\cdot ||\phi (x')|| }\\ =\frac { <\phi (x),\phi (x')> }{ \sqrt { <\phi (x),\phi (x)> } \sqrt { <\phi (x'),\phi (x')> } }\\ =\frac{\kappa(x,x')}{\sqrt{\kappa(x,x)}\sqrt{\kappa(x',x')}}

So we notice that both the distance and the angle are expressed in terms of kernel functions only, without the mapping function \phi.
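Both formulas above can be evaluated without ever touching \phi. A minimal sketch, again assuming the squared inner-product kernel from the earlier example:

```python
import numpy as np

def kernel(x, xp):
    # Squared inner-product kernel, as in the earlier example.
    return np.dot(x, xp) ** 2

def feature_distance_sq(x, xp):
    # ||phi(x) - phi(x')||^2 = k(x,x) - 2 k(x,x') + k(x',x')
    return kernel(x, x) - 2 * kernel(x, xp) + kernel(xp, xp)

def feature_cosine(x, xp):
    # cos(theta) = k(x,x') / sqrt(k(x,x) * k(x',x'))
    return kernel(x, xp) / np.sqrt(kernel(x, x) * kernel(xp, xp))

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

print(feature_distance_sq(x, xp))  # 123.0
print(feature_cosine(x, xp))       # 0.02
```

Here k(x,x) = 25, k(x',x') = 100, and k(x,x') = 1, so the squared distance is 25 - 2 + 100 = 123 and the cosine is 1/50, computed purely from kernel evaluations.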

Reproducing Kernel Map

Consider the set of all functions from R^d to R:

R^{ R^{ d } }=\{ f:R^d\rightarrow R \}

The reproducing kernel map is a map \Phi: R^d \rightarrow R^{R^d} with \Phi(z) = \phi_z = \kappa(\cdot,z). More concretely, \Phi(z)(x)=\phi_z(x)=\kappa(x,z). So "reproducing" simply means fixing one argument of the kernel to get a new function, a process of producing new relations. We can think of it this way: given a function with several parameters, we fix some of them and keep the rest free. Every time we specify a group of parameters, we get a new function.
Here is an example of how it works:

[Figure: the Gaussian kernel \kappa(x,z)=\exp(-\frac{(x-z)^2}{2}) plotted as a function of x for several fixed values of z]

Let the kernel function \kappa(x,z) be a Gaussian, which has two parameters x and z. In practice x and z could be vectors, but in this example we take them to be scalars. If we set z=0, we get the function \exp(-\frac{x^2}{2}). If we set z=1, we get the function \exp(-\frac{(x-1)^2}{2}). If you pick a set of z values, you get a set of corresponding functions.


References

1. https://en.wikipedia.org/wiki/Statistical_distance
2. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
3. https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions
4. https://www.youtube.com/watch?v=MahJ9iwReAM&index=12&list=PLt0SBi1p7xrRKE2us8doqryRou6eDYEOy

