# Two-Sample Problem (1): Parzen Windows, Maximum Mean Discrepancy

There is a nice tutorial from Alex. I expanded the math part to show you more details. I typeset it in LaTeX and posted screenshots.

#### Parzen Windows

Suppose we have two sets of samples drawn from two distributions p and q, and we want to test whether q = p. We can either estimate densities $\hat { p }$ and $\hat { q }$ from the observations, or directly measure a distance between the two sample sets and see the difference.
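As a concrete sketch of the density-estimation route — a minimal 1-D example with Gaussian windows; the bandwidth `h = 0.3`, the sample sizes, and the function name `parzen_estimate` are all my own illustrative choices:

```python
import numpy as np

def parzen_estimate(data, grid, h):
    # Parzen-window density estimate with a Gaussian window:
    # p_hat(t) = (1/m) * sum_i N(t; x_i, h^2)
    data = np.asarray(data)[:, None]   # shape (m, 1)
    grid = np.asarray(grid)[None, :]   # shape (1, g)
    w = np.exp(-0.5 * ((grid - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return w.mean(axis=0)              # average the windows over the m samples

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)     # samples from p
y = rng.normal(0.5, 1.0, size=500)     # samples from q (mean shifted)

grid = np.linspace(-5.0, 6.0, 400)
p_hat = parzen_estimate(x, grid, h=0.3)
q_hat = parzen_estimate(y, grid, h=0.3)

# Squared L2 distance between the two estimated densities (Riemann sum)
dist = ((p_hat - q_hat) ** 2).sum() * (grid[1] - grid[0])
print(dist)
```

The distance comes out clearly positive here because q is shifted; for two sample sets from the same distribution it would shrink toward zero as the sample sizes grow.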

With Parzen windows, each density is estimated by averaging a smoothing kernel over the samples, e.g. $\hat{p}(x)=\frac{1}{m}\sum_{i=1}^{m}\kappa(x_i,x)$ and $\hat{q}(x)=\frac{1}{n}\sum_{j=1}^{n}\kappa(y_j,x)$, and the two estimates can be compared through the squared $L_2$ distance $\|\hat{p}-\hat{q}\|^2=\langle\hat{p},\hat{p}\rangle-2\langle\hat{p},\hat{q}\rangle+\langle\hat{q},\hat{q}\rangle$. Expanding the first term gives a double sum of kernel evaluations over the samples from p; the other two terms can be simplified in the same way. So, finally, we reach a statistic built entirely from pairwise kernel evaluations between the observed samples.

#### Maximum Mean Discrepancy (MMD)

##### Definition

$D(p,q,\mathcal{F}) =\sup_{f\in \mathcal{F}} E_p[f(x)]-E_q[f(y)]$

It is defined as the supremum of the set S of differences between expectations. The supremum of S is its least upper bound: an element $a$ such that $m \le a$ for every element m in S, and no smaller number has this property. For example, $\sup\{1,2,3\}$ = 3.
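A quick illustration in plain Python (the sets here are just toy examples of mine): for a finite set the supremum is attained and equals the maximum, while an infinite set can approach its supremum without ever containing it.

```python
# Finite set: the supremum is attained, so it equals the maximum.
S = {1, 2, 3}
print(max(S))          # 3

# The set {1 - 1/n : n >= 1} has supremum 1, but no element reaches it:
# every element stays strictly below 1.
approx = [1 - 1 / n for n in range(1, 10_000)]
print(max(approx))     # close to, but strictly less than, 1
```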

In a Reproducing Kernel Hilbert Space $\mathcal{H}$ that is universal, we have the theorem that $D(p,q,\mathcal{F}) = 0$ iff $p=q$, when $\mathcal{F}$ is the unit ball in the space.
I am not going to prove it here (because I don’t know how to 😦 ). But for intuition, think of it this way: if $p=q$, the two sets of samples come from exactly the same distribution, and there is no doubt that the discrepancy is zero. If $p \neq q$, we map the data into a feature space (the RKHS); if that space is rich enough, we can always find a mapping function $f$ whose expectations under $p$ and $q$ differ, which yields a nonzero mean discrepancy.

In other words, given the two distributions we obtain a squared distance between their embeddings in the RKHS. The goal for us is then to estimate: $MMD(P,Q) = ||\mu_p-\mu_q||^2_\mathcal{H}$

##### Replacing with $\mu_p$ and $\mu_q$

Now I will explain $\mu_p$ and $\mu_q$.
We have a function $\phi(x)$ that maps the data into a feature space $\mathcal{F}$; for example, the quadratic features $\phi(x)= (1,x,x^2)$. An advantage of kernels is that there is no need to compute $\phi(x)$ explicitly, because of eq2: $\langle\phi(x),\phi(x')\rangle=k(x,x')$.
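To see eq2 concretely, here is a small check (my own toy example) using the slightly rescaled quadratic features $\phi(x)=(1,\sqrt{2}x,x^2)$, whose inner product matches the degree-2 polynomial kernel $k(x,x')=(1+xx')^2$ exactly:

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map; the sqrt(2) weight on the linear
    # term makes <phi(x), phi(x')> equal (1 + x*x')^2 exactly.
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def k(x, xp):
    # Degree-2 polynomial kernel: no explicit features needed.
    return (1.0 + x * xp) ** 2

x, xp = 1.5, -0.7
lhs = phi(x) @ phi(xp)   # explicit inner product in feature space
rhs = k(x, xp)           # kernel trick
print(lhs, rhs)          # the two agree
```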

In the RKHS with kernel $k$, function evaluation takes the form of eq3: $f(x)= \langle k(x,\cdot ),f \rangle$. This is called the reproducing property. Taking the feature map to be the kernel function itself, $\phi(x) = k(x,\cdot)$, and substituting into eq2, we get eq4: $\langle k(x,\cdot),k(x',\cdot) \rangle=k(x,x')$.

Applying eq3 inside an expectation then gives $E_p[f(x)]=E_p[\langle k(x,\cdot),f \rangle]= \langle E_p[k(x,\cdot)],f \rangle = \langle \mu_p,f \rangle$, where we write $\mu_p := E_p[k(x,\cdot)]$ (and similarly $\mu_q := E_q[k(y,\cdot)]$).
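An empirical version of $\mu_p$ is easy to form: replace the expectation by a sample average, $\hat{\mu}_p(\cdot)=\frac{1}{m}\sum_i k(x_i,\cdot)$, which is itself a function we can evaluate anywhere. A small sketch with an RBF kernel — the bandwidth, sample size, and the name `mu_p_hat` are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=300)   # samples from p = N(0, 1)

def rbf(a, b, sigma=1.0):
    # Gaussian (RBF) kernel k(a, b)
    return np.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def mu_p_hat(t):
    # Empirical mean embedding evaluated at t:
    # mu_p(t) = E_p[k(x, t)]  ~  (1/m) * sum_i k(x_i, t)
    return rbf(x, t).mean()

# The embedding is large where p puts mass, and decays far away from it.
print(mu_p_hat(0.0), mu_p_hat(5.0))
```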

##### Optimization Problem

Then let’s take a deep breath and start optimizing.

We take $\mu_p$ and $\mu_q$, and note that over the unit ball $\|f\|_\mathcal{H} \le 1$ the inner product $\langle \mu_p-\mu_q,f \rangle$ is maximized by $f = (\mu_p-\mu_q)/\|\mu_p-\mu_q\|_\mathcal{H}$: $\sup_{f\in \mathcal{F}} E_p[f(x)]-E_q[f(y)] = \sup_{\|f\|_\mathcal{H}\le 1}\langle \mu_p-\mu_q,f \rangle = ||\mu_p-\mu_q||_\mathcal{H}$

Add kernels and calculate the squared distance:

$$\begin{aligned} MMD(P,Q) = ||\mu_p-\mu_q||^2_\mathcal{H} &= \langle \mu_p-\mu_q,\mu_p-\mu_q \rangle \\ &=\langle \mu_p,\mu_p -\mu_q \rangle -\langle \mu_q,\mu_p-\mu_q \rangle \\ &=\langle \mu_p,\mu_p \rangle-2\langle \mu_p,\mu_q \rangle + \langle \mu_q,\mu_q \rangle \end{aligned}$$
(Remember $\langle a,b \rangle = \langle b,a \rangle$)

We take the first term (same as the third): $\langle \mu_p,\mu_p \rangle$ $=\langle E_p[k(x,\cdot)], E_p[k(x',\cdot)]\rangle$ $=E_{pp}k(x,x')$
(Use eq4)

The second term: $\langle \mu_p,\mu_q \rangle$ $=\langle E_p[k(x,\cdot)], E_q[k(y,\cdot)]\rangle$ $=E_{pq}k(x,y)$
(Use eq4)

Finally, $||\mu_p-\mu_q||^2_\mathcal{H} = E_{pp}k(x,x')-2E_{pq}k(x,y)+E_{qq}k(y,y')$, which is exactly the square of $D(p,q,\mathcal{F})$. The goal is to estimate this quantity.
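The final expression suggests a plug-in estimator: replace each expectation with an average over the corresponding kernel (Gram) matrix. A minimal sketch of this biased (V-statistic) estimate with an RBF kernel — the bandwidth and sample sizes are arbitrary illustrative choices:

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix with entries k(a_i, b_j)
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased (V-statistic) estimate of
    # E_pp k(x,x') - 2 E_pq k(x,y) + E_qq k(y,y')
    kxx = rbf_gram(x, x, sigma).mean()
    kxy = rbf_gram(x, y, sigma).mean()
    kyy = rbf_gram(y, y, sigma).mean()
    return kxx - 2.0 * kxy + kyy

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 500)
y_same = rng.normal(0.0, 1.0, 500)   # same distribution as x
y_diff = rng.normal(1.0, 1.0, 500)   # shifted mean

print(mmd2(x, y_same))  # close to 0
print(mmd2(x, y_diff))  # clearly positive
```

An unbiased variant (the U-statistic) simply drops the diagonal terms of the two within-sample Gram matrices.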

That means we do not need to know the true distributions p and q (usually we do not have them!). We can estimate the MMD directly from i.i.d. samples drawn from each of the two datasets. Alternatively, if we want explicit density estimates $\hat{p}$ and $\hat{q}$, we could use Parzen windows as in the first section.