However, the post is mainly about Graph Partitioning.
Two mins before I was ready for publish, I added Gibbs Sampling part.
What is Graph Partitioning?
Another important topic from Advanced Machine Learning. I will give a brief summary here.
When you are facing a huge graph, like relations in Facebook, or phone call records, what you will do with a limited memory? Distributed dbs and new algorithms nowadays have the ability to process the huge datasets. But first it is better to separate the dataset, and allocate them to different nodes(or VMs), a kind of “Divide and Conquer” method.
From application level, if you want to talk about machine learning, graph partitioning is also a community finding algorithm. Say, you want to divide all the Facebook users and make group recommendations to different communities.
Many definitions on partitioning, but the basic idea is to put similar vertices into a group, and the groups are supposed to be dissimilar. Which means we want high level of internal connections and low level of external connections. A way to measure similarity is to calculate coherence based on ratio of edges, internal and external. (Lancichinetti, 2009)
It provides you the similarity of a community, higher is better.
Well, it is not easy to separate a graph, here I mean to completely separate it. If you are given two arrays, and I ask for the dot product. we assume that both of them has 100 elements, and 10 cores can be used. Naturally, you can assign each core to calculate a dot product separately for 10 elements of each array, and at last sum up all the dot products together. For each dot product, it has no impact on others. It is safe to do so.
But when it comes to the graph, because of the relations, there are more things that you need to consider.
Cutting the edges means, technically, you are going to lose the edges. Trying to minimising the cut size helps to separate a graph, and at the same time, keeping the internal similarity and external dissimilarity.
Applied to Gibbs Sampling (large-scale models)
Suppose the Gibbs sampling job is to be complete on a large-scale dataset, say the undirected graphs.
Shown in the following images. For the single node mode, here I mean single core, mixed with the parallel graph colouring idea, the time is estimated as to be about sampling time for only two nodes (assume only two groups).
Here is my idea of a partitioning on Gibbs Sampling.
By splitting dataset into smaller ones, it is easier to upload the data into a distributed DB (HDFS, Titan, etc). Each node will do part of sampling, but must be careful with the edge nodes. Conflicts might happen to these vertices.
Unlike the descriptions before, we have to keep the border vertices, since they contains information that the ajacent vertices care about. So, apply grouping method first, then split dataset, do sampling at last.
Enjoy my poor English descriptions. 🙂