There is a nice tutorial from Alex. I expanded the math part to show more of the details. I wrote the equations in LaTeX and then posted screenshots.
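We have two sets of samples, one drawn from a distribution $p$ and the other from a distribution $q$. How do we know whether the two distributions differ?
Now we want to test whether $q = p$. We can either estimate $p$ and $q$ from the observations, or measure a distance between the two sets of samples directly and see how large it is.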
Same, the other two terms can be simplified in the same way:
So, finally we will reach:
Maximum Mean Discrepancy (MMD)
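It is defined as the supremum of a set S, where S contains the differences between expectations over all functions $f$ in a function class $\mathcal{F}$:

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( E_{x \sim p}[f(x)] - E_{y \sim q}[f(y)] \right)$$

The supremum is the least upper bound: the smallest number $a$ such that for every element $m$ in S we always have $m \le a$. For example, $\sup\{1, 2, 3\} = 3$.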
In a Reproducing Kernel Hilbert Space $\mathcal{H}$ whose kernel $k$ is universal, we have the theorem that $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ iff $p = q$, when $\mathcal{F}$ is the unit ball in $\mathcal{H}$.
I am not going to prove it here (well, because I don’t know how to 😦 ). But for simplicity, let us understand it this way: if $p = q$, the two sets of samples come from exactly the same distribution, and there is no doubt that the discrepancy is zero. If $p \neq q$, we map the data into a feature space (the RKHS), and there we can always find a function $f$ whose mean differs between the two distributions, which gives a nonzero mean discrepancy.
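To make the "we can always find a function $f$" idea concrete, here is a small sketch in Python. NumPy, the two Gaussians, and the two candidate functions are my own illustrative choices, not something from the tutorial. Both samples have mean zero, so $f(x) = x$ sees almost no discrepancy, while $f(x) = x^2$ picks up the difference in variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=20000)  # samples from p = N(0, 1)
y = rng.normal(loc=0.0, scale=2.0, size=20000)  # samples from q = N(0, 4)

def mean_discrepancy(f, x, y):
    """Empirical version of E_p[f(x)] - E_q[f(y)] for one fixed function f."""
    return np.mean(f(x)) - np.mean(f(y))

gap_identity = mean_discrepancy(lambda t: t, x, y)     # compares the means
gap_square = mean_discrepancy(lambda t: t ** 2, x, y)  # compares second moments

print(f"f(x) = x   : {gap_identity:+.3f}")
print(f"f(x) = x^2 : {gap_square:+.3f}")
```

These two functions are only illustrations. In MMD the supremum is taken over the unit ball of the RKHS, so the search over $f$ is done for us.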
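In other words, given the two distributions, we can get a squared distance between their embeddings in the RKHS $\mathcal{H}$. Then the goal for us is to estimate:

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \left[ \sup_{\|f\|_{\mathcal{H}} \le 1} \left( E_{x \sim p}[f(x)] - E_{y \sim q}[f(y)] \right) \right]^2$$

Replacing the expectations with the mean embeddings $\mu_p$ and $\mu_q$:

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \| \mu_p - \mu_q \|_{\mathcal{H}}^2$$

Now I will explain $\mu_p$ and $\mu_q$.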
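We have a function $\phi$ that maps the data into a feature space. For example, for $x = (x_1, x_2)$, the quadratic features $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. An advantage of kernels is that there is no need to compute $\phi$ explicitly, where we have eq2: $k(x, x') = \langle \phi(x), \phi(x') \rangle$.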
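A quick numerical check of eq2 may help here. This sketch assumes the quadratic feature map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ on 2-D inputs, whose matching kernel is $k(x, x') = \langle x, x' \rangle^2$. The function names and the two test points are mine, chosen only for illustration.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D point x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def k(x, x_prime):
    """The matching kernel: squared dot product, no feature map needed."""
    return np.dot(x, x_prime) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(x_prime))  # inner product in feature space
implicit = k(x, x_prime)                 # same quantity via the kernel
print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, but the kernel version never builds the 3-D feature vectors. For kernels such as the Gaussian kernel the feature space is infinite-dimensional, so the kernel form is the only practical one.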
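In the RKHS with kernel $k$, function evaluation can be written as eq3: $f(x) = \langle f, k(x, \cdot) \rangle$. We also call this the reproducing property. If you are familiar with kernels, this means we can define $\phi(x)$ as the kernel function $k(x, \cdot)$. We take it into eq2, then we finally have eq4:

$$k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle$$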
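The expectation of $f$ can then be moved inside the inner product:

$$E_{x \sim p}[f(x)] = E_{x \sim p}[\langle f, \phi(x) \rangle] = \langle f, E_{x \sim p}[\phi(x)] \rangle$$

(take eq3 in)
And we mark $E_{x \sim p}[\phi(x)]$ as $\mu_p$, and likewise $E_{y \sim q}[\phi(y)]$ as $\mu_q$.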
Then let’s take a deep breath and start optimizing.
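We take $\mu_p$ and $\mu_q$:

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{\|f\|_{\mathcal{H}} \le 1} \left( \langle f, \mu_p \rangle - \langle f, \mu_q \rangle \right) = \sup_{\|f\|_{\mathcal{H}} \le 1} \langle f, \mu_p - \mu_q \rangle = \| \mu_p - \mu_q \|_{\mathcal{H}}$$

The last step is the Cauchy-Schwarz inequality: the inner product is largest when $f$ points in the direction of $\mu_p - \mu_q$.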
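Add kernels and calculate the squared distance:

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle = \langle \mu_p, \mu_p \rangle - 2 \langle \mu_p, \mu_q \rangle + \langle \mu_q, \mu_q \rangle$$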
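We take the first term (same as the third), where $x$ and $x'$ are independent draws from $p$:

$$\langle \mu_p, \mu_p \rangle = \langle E_{x \sim p}[\phi(x)], E_{x' \sim p}[\phi(x')] \rangle = E_{x, x' \sim p}[\langle \phi(x), \phi(x') \rangle] = E_{x, x' \sim p}[k(x, x')]$$

The second term:

$$\langle \mu_p, \mu_q \rangle = E_{x \sim p,\, y \sim q}[\langle \phi(x), \phi(y) \rangle] = E_{x \sim p,\, y \sim q}[k(x, y)]$$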
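Finally, we have $\mathrm{MMD}^2[\mathcal{F}, p, q] = E_{x, x' \sim p}[k(x, x')] - 2 E_{x \sim p,\, y \sim q}[k(x, y)] + E_{y, y' \sim q}[k(y, y')]$. The goal is to estimate it.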
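The identity between the embedding form and the kernel form can be checked numerically when the feature map is simple enough to write down. The sketch below assumes the quadratic feature map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ with kernel $k(x, x') = \langle x, x' \rangle^2$, and replaces the expectations with sample averages. The samples `X` and `Y` are synthetic stand-ins I made up. Note that this kernel is not universal; it is used here only because $\phi$ is explicit.

```python
import numpy as np

def phi(points):
    """Quadratic feature map applied row-wise to an (n, 2) array."""
    x1, x2 = points[:, 0], points[:, 1]
    return np.stack([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2], axis=1)

def kernel_matrix(a, b):
    """k(a_i, b_j) = <a_i, b_j>^2 for all pairs, shape (len(a), len(b))."""
    return (a @ b.T) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # stand-in sample from p
Y = rng.normal(size=(300, 2)) * 1.5  # stand-in sample from q

# Embedding form: squared distance between the empirical mean embeddings.
mu_X = phi(X).mean(axis=0)
mu_Y = phi(Y).mean(axis=0)
embedding_form = np.sum((mu_X - mu_Y) ** 2)

# Kernel form: the three kernel terms, each averaged over all pairs.
kernel_form = (kernel_matrix(X, X).mean()
               - 2.0 * kernel_matrix(X, Y).mean()
               + kernel_matrix(Y, Y).mean())

print(embedding_form, kernel_form)
```

Both numbers agree up to rounding, which is the sample version of the derivation: the squared distance between mean embeddings only ever needs kernel evaluations.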
That means we do not need to know the real distributions $p$ and $q$ (usually we do not have them!). We can estimate the MMD from i.i.d. samples from the two datasets by replacing each expectation with an average over samples. If we wanted the densities $p$ and $q$ themselves, we could estimate them with Parzen windows.
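As a minimal sketch of such an estimate, the code below replaces the three expectations with sample averages, using a Gaussian kernel. The kernel choice, the bandwidth `sigma=1.0`, the function names, and the synthetic data are all my assumptions for illustration. The within-sample averages skip the $i = j$ pairs, which gives the usual unbiased form of the estimate.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) for all pairs."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Estimate MMD^2 from samples, leaving out the i == j kernel terms."""
    m, n = len(X), len(Y)
    K_xx = gaussian_kernel(X, X, sigma)
    K_yy = gaussian_kernel(Y, Y, sigma)
    K_xy = gaussian_kernel(X, Y, sigma)
    term_xx = (K_xx.sum() - np.trace(K_xx)) / (m * (m - 1))
    term_yy = (K_yy.sum() - np.trace(K_yy)) / (n * (n - 1))
    return term_xx - 2.0 * K_xy.mean() + term_yy

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
shifted = mmd2_unbiased(rng.normal(size=(500, 2)),
                        rng.normal(loc=1.0, size=(500, 2)))
print(f"same distribution : {same:.4f}")
print(f"shifted mean      : {shifted:.4f}")
```

When both samples come from the same distribution the estimate stays close to zero (it can even be slightly negative, since it is unbiased rather than a true squared norm). When the mean is shifted it is clearly positive. In practice the bandwidth matters a lot, and a fixed `sigma=1.0` is only a placeholder.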