Tinkerpop3 GraphComputer: VertexPrograms

GraphComputer
TP3 provides OLTP and OLAP means of interacting with a graph.
OLTP-based graph system provides query in real-time, with a limited set of data and respond on the order of milliseconds or seconds. (Only a part). The graph is walked by moving from vertex to another, via incident edges.

OLAP-based graph system, the entire graph is processed, every vertex and edge is analysed. The results are not returned in real-time, if with large cluster of machines, results can take hours. A VertexProgram provides in every vertex. Sending messages with the topological structure. All messages are passed independently, in parallel. Once a vertex program is finished computing, OLAP engine supports any number MapReduce jobs over the resultant graph.

VertexProgram
GraphComputer takes a VertexGraph. So if we focus on each vertex, in logically parallel manner until some terminal condition.
A submitted VertexProgram is copied to all vertices. (Since all vetices are executing the same program.) Then GraphComputer orchestrates VertexProgram.execute( ) on all vertices in an BSP(bulk synchronous parallel) fashion. Communicating by messages.
In Gremlin OLAP: LocalMessage and GlobalMessage.

MapReduce
To aggregate resultant properties into a single result set.Answer some statistical questions (like vertex numbers, degree distribution in each cluster).
Following is the interface provided by API.

public interface MapReduce<MK, MV, RK, RV, R> {
  public void map(final Vertex vertex, final MapEmitter<MK, MV> emitter);
  public void reduce(final MK key, final Iterator<MV> values, final ReduceEmitter<RK, RV> emitter);
  // there are more methods
}

PageRankVertexProgram
A popular OLAP-oriented graph algorithm.
It converges to a steady state distribution, with iteratively running.

More examples please find here.

Conclusion
For large graph data sets, it is more important to focus on the level of vertices, to think from here, instead of take the whole graph into consideration in a time. In the real world, it might be very complecated even to calculate a single node.
So when the entire graph is computed in a sequential order, well, I cannot image how long it will be. We want to do parallel computing on top of it. If we can choose a subset each time, to make sure that they are totally independent (conditional), then they can be computed simultaniously, with each ot them run the code of VertexProgram.
The difficulty will be the synchronization without latencies.


About BFS, DFS in BluePrints, find here, thanks to contributors of open source:
https://github.com/Yufan-l/database-benchmark/tree/master/GraphdatabaseBenchmarks/src/benchmarks

References:
http://www.tinkerpop.com/docs/3.0.0.M1/#graphcomputer

Published by Irene

Keep calm and update blog.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: