In this post we will take a look at the RKHS (Reproducing Kernel Hilbert Space). It may sound like a purely statistical term, but it is remarkable for its wide range of applications. You will need to refresh your memory of some linear algebra along the way. We start with some basic terms and definitions.
Statistical Distance
It measures the distance between two statistical objects, e.g., two random variables, two probability distributions, or two samples.
It is simple to calculate the distance between two vectors; cosine similarity, for example, is a common choice. But when it comes to the divergence between two probability distributions, things are more complicated. The Kullback-Leibler divergence (also called relative entropy) is one method. We use $D_{\mathrm{KL}}(P \,\|\, Q)$ to denote the amount of information lost when $Q$ is used to approximate the true distribution $P$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)},$$

and note that it is not symmetric: in general $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$. Another impressive term is the Maximum Mean Discrepancy (MMD), which is defined via the kernel embedding of distributions.
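To make this concrete, here is a minimal numerical sketch in Python (the distributions p and q, the bandwidth sigma, and the sample sizes are arbitrary illustrative choices, not from any particular source). It shows the asymmetry of the KL divergence on two small discrete distributions, and a standard biased sample estimate of MMD² with a Gaussian kernel:

```python
import numpy as np

# --- KL divergence between two small discrete distributions ---
p = np.array([0.5, 0.3, 0.2])   # "true" distribution P (illustrative values)
q = np.array([0.4, 0.4, 0.2])   # approximating distribution Q

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    return np.sum(p * np.log(p / q))

print(kl(p, q))   # D_KL(P || Q)
print(kl(q, p))   # D_KL(Q || P): a different number, KL is not symmetric

# --- Biased sample estimate of MMD^2 with a Gaussian kernel ---
def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D samples
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    # MMD^2 ~= mean k(x, x') - 2 mean k(x, y) + mean k(y, y')
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)   # samples from N(0, 1)
y = rng.normal(0.5, 1.0, size=200)   # samples from N(0.5, 1)
print(mmd2(x, y))                    # larger when the two distributions differ
```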
Kernel Function and Feature Space
A feature mapping is a function $\varphi$ that maps data from an original space $X$ to another space $H$, which we usually call the feature space:

$$\varphi: X \to H, \qquad x \mapsto \varphi(x).$$

Here $x$ and $x'$ refer to different samples, i.e., points in $X$.
Consider the classic example of mapping data from a 2-D space to a 3-D feature space with

$$\varphi(x) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right).$$

Given this mapping function, if we calculate the inner product of two points in the feature space, we find, perhaps surprisingly, that it is just the square of the inner product of the two points $x$ and $x'$ in the original space:

$$\langle \varphi(x), \varphi(x') \rangle = (x_1 x_1' + x_2 x_2')^2 = \langle x, x' \rangle^2.$$

So we can simply define a kernel function

$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle = \langle x, x' \rangle^2$$

and eliminate the mapping function altogether.
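We can check this equality numerically. Below is a small sketch using the feature map above and two arbitrary points:

```python
import numpy as np

def phi(x):
    # Explicit 2-D -> 3-D feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, xp):
    # Kernel that skips the mapping entirely: k(x, x') = <x, x'>^2
    return np.dot(x, xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(xp)))   # 121.0: inner product in feature space
print(k(x, xp))                  # 121.0: same value, no mapping needed
```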
Proposition
Also called the Cauchy-Schwarz inequality for kernels:

$$k(x, z)^2 \le k(x, x)\, k(z, z).$$

Proof: The Gram matrix (kernel matrix) for $x$ and $z$,

$$K = \begin{pmatrix} k(x, x) & k(x, z) \\ k(z, x) & k(z, z) \end{pmatrix},$$

is positive semi-definite, so its determinant is non-negative:

$$\det K = k(x, x)\, k(z, z) - k(x, z)^2 \ge 0.$$
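A quick numerical sanity check of the inequality, reusing the squared-inner-product kernel from the previous section on a few random points:

```python
import numpy as np

def k(x, z):
    return np.dot(x, z) ** 2   # the squared-inner-product kernel from above

rng = np.random.default_rng(1)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    # k(x, z)^2 <= k(x, x) * k(z, z) must hold for any valid kernel
    assert k(x, z) ** 2 <= k(x, x) * k(z, z) + 1e-12
print("Cauchy-Schwarz held for all random pairs")
```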
Distance and Angle in Feature Space
In the feature space, we can define the distance between two samples:

$$d(x, x') = \|\varphi(x) - \varphi(x')\|.$$

The inner product is defined as

$$\langle \varphi(x), \varphi(x') \rangle = k(x, x'),$$

so we can simply calculate

$$d(x, x')^2 = k(x, x) - 2\,k(x, x') + k(x', x'),$$

and likewise the angle between the two samples:

$$\cos\theta = \frac{\langle \varphi(x), \varphi(x') \rangle}{\|\varphi(x)\|\,\|\varphi(x')\|} = \frac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}}.$$
So we notice that both the distance and the angle are expressed in terms of the kernel function only, without the mapping function $\varphi$.
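Here is a short sketch computing both quantities from kernel evaluations alone, again with the polynomial kernel from above; $\varphi$ is never constructed:

```python
import numpy as np

def k(x, xp):
    return np.dot(x, xp) ** 2            # same polynomial kernel as above

def feature_distance(x, xp):
    # ||phi(x) - phi(x')|| = sqrt(k(x,x) - 2 k(x,x') + k(x',x'))
    return np.sqrt(k(x, x) - 2 * k(x, xp) + k(xp, xp))

def feature_cosine(x, xp):
    # cos(theta) = k(x, x') / sqrt(k(x, x) * k(x', x'))
    return k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

x  = np.array([1.0, 2.0])
xp = np.array([3.0, 4.0])
print(feature_distance(x, xp))   # distance in feature space, phi never computed
print(feature_cosine(x, xp))     # cosine of the angle in feature space
```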
Reproducing Kernel Map
Let $\mathbb{R}^X$ denote the set of functions from the original space $X$ to $\mathbb{R}$:

$$\mathbb{R}^X = \{\, f : X \to \mathbb{R} \,\}.$$

The reproducing kernel map is the map

$$\Phi: X \to \mathbb{R}^X, \qquad z \mapsto k(\cdot, z),$$

where $\Phi(z) = k(\cdot, z)$ is itself a function of one argument. More concretely,

$$\Phi(z)(x) = k(x, z).$$

So "reproducing" simply means we "replace" one argument with a fixed vector to get a new function, a process of producing new functions from the kernel. We can think of it this way: given a function with several parameters, we fix some of them and keep the rest free. Every time we specify a value of $z$, we get a new function.
Here is an example of how it works:
Let the kernel function be a Gaussian with two arguments $x$ and $z$:

$$k(x, z) = \exp\!\left(-\frac{(x - z)^2}{2\sigma^2}\right).$$

In practice, $x$ and $z$ could be vectors, but in this example we take them to be scalars. If we let $z = 0$, say, we get the function

$$\Phi(0)(x) = k(x, 0) = \exp\!\left(-\frac{x^2}{2\sigma^2}\right).$$

If we let $z = 1$, we get the function

$$\Phi(1)(x) = k(x, 1) = \exp\!\left(-\frac{(x - 1)^2}{2\sigma^2}\right).$$

If you pick a set of $z$ values, you get a set of corresponding functions, each centered at its own $z$.
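In code, the reproducing kernel map is just a function factory: fixing $z$ returns a new function of $x$. A minimal sketch, with arbitrary choices of $\sigma$ and the $z$ values:

```python
import numpy as np

def reproducing_map(z, sigma=1.0):
    # Phi(z) = k(., z): fixing z turns the kernel into a function of x alone
    return lambda x: np.exp(-((x - z) ** 2) / (2 * sigma**2))

f0 = reproducing_map(0.0)       # a Gaussian bump centered at z = 0
f1 = reproducing_map(1.0)       # a Gaussian bump centered at z = 1

xs = np.linspace(-3.0, 3.0, 7)
print(f0(xs))                   # peaks at x = 0
print(f1(xs))                   # peaks at x = 1
```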