Deep Learning 20: graph batching in PyTorch Geometric

PyG (PyTorch Geometric) is a deep learning library for graph neural networks (GNNs). To speed up computation on graphs, we want to compute in batches, even though the individual graphs have different "shapes" (different numbers of nodes and edges).

In PyG, it is possible to pack the data into batches. According to the documentation: "Adjacency matrices are stacked in a diagonal fashion (creating a giant graph that holds multiple isolated subgraphs), and node and target features are simply concatenated in the node dimension."
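To see what this diagonal stacking means, here is a minimal sketch in plain PyTorch (no PyG needed) that merges the adjacency matrices of two toy graphs into one block-diagonal matrix; the graph sizes here are arbitrary:

```python
import torch

# adjacency matrices of two small toy graphs (2 nodes and 3 nodes)
A1 = torch.tensor([[0., 1.],
                   [1., 0.]])
A2 = torch.tensor([[0., 1., 0.],
                   [1., 0., 1.],
                   [0., 1., 0.]])

# stacking them diagonally yields one 5x5 "giant graph"
# whose two components are completely isolated from each other
A = torch.block_diag(A1, A2)
print(A)
```

Every off-diagonal block is zero, so no message passing ever crosses graph boundaries; this is why one GNN forward pass over the giant graph is equivalent to separate passes over each graph.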

Using our own data

So to run in small batches, we need a list of node feature matrices (X) and a list of edge indices describing each graph's connectivity. In PyG, we first pack each graph into a torch_geometric.data.Data object, then put the list of Data objects into a torch_geometric.loader.DataLoader.

import torch

from torch_geometric.loader import DataLoader
from torch_geometric.data import Batch
from torch_geometric.data import Data

# make some toy data
x1 = torch.Tensor([[1], [2], [3]])
edge_index1 = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])

x2 = torch.Tensor([[1], [2]])
edge_index2 = torch.tensor([[0, 1, 1], [1, 0, 1]])

# make Data object
data1 = Data(x=x1, edge_index=edge_index1)
data2 = Data(x=x2, edge_index=edge_index2)

# dataloader
my_loader = DataLoader([data1, data2], batch_size=1, shuffle=False)
for batch in my_loader:
    print(batch)


'''outputs:
DataBatch(x=[3, 1], edge_index=[2, 4], batch=[3], ptr=[2])
DataBatch(x=[2, 1], edge_index=[2, 3], batch=[2], ptr=[2])
'''

In the previous example, we create two graphs, defined by x1, x2 and edge_index1, edge_index2. Note that the node feature matrix X has shape (num_nodes, num_features), and the edge index has shape (2, num_edges): the first row holds the source node ids and the second row the target node ids. Despite often being called an adjacency matrix, this is really a sparse edge list in COO format.
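Since the edge index is a COO-style edge list, we can sanity-check it by expanding it into a dense adjacency matrix with plain PyTorch (PyG also provides torch_geometric.utils.to_dense_adj for this):

```python
import torch

# edge index of the first toy graph: edges (0,1), (1,0), (1,2), (2,1)
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
num_nodes = 3

# scatter a 1 at every (source, target) pair
A = torch.zeros(num_nodes, num_nodes)
A[edge_index[0], edge_index[1]] = 1.0
print(A)
# tensor([[0., 1., 0.],
#         [1., 0., 1.],
#         [0., 1., 0.]])
```

Because both directions of each edge appear in the list, the resulting matrix is symmetric, i.e. the toy graph is undirected.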

To build each Data object, we pass in the node features and the edge index. In the DataLoader, we set the batch size to 1, so each batch holds a single graph.

Pack in a single batch

If you do not want to use the for-loop and only need a single batch, here is the correct way to pack all the data together:

# set batch_size to the number of graphs so one batch holds them all
my_loader = DataLoader([data1, data2], batch_size=2, shuffle=False)
batch = next(iter(my_loader))

# equivalently, collate the list directly without a loader:
# batch = Batch.from_data_list([data1, data2])

# access the batch
batch.x
batch.edge_index

This is very useful when you wish to plug a graph component into a larger model (e.g., together with a Transformer) and only need one batch at a time.
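One detail worth knowing: the batch attribute of a DataBatch maps every node to the graph it came from, and this is what per-graph readout layers such as PyG's global_mean_pool rely on. Here is a sketch of that pooling in plain PyTorch, hard-coding the node features and batch vector that our two toy graphs (3 nodes + 2 nodes) would produce:

```python
import torch

# concatenated node features and graph-assignment vector,
# as a Batch over the two toy graphs would hold them
x = torch.tensor([[1.], [2.], [3.], [1.], [2.]])
batch = torch.tensor([0, 0, 0, 1, 1])

# per-graph mean pooling: average node features within each graph
num_graphs = int(batch.max()) + 1
sums = torch.zeros(num_graphs, x.size(1)).index_add_(0, batch, x)
counts = torch.bincount(batch).unsqueeze(1).float()
means = sums / counts
print(means)  # graph 0 mean = 2.0, graph 1 mean = 1.5
```

The result has one row per graph, which is exactly the shape you need to feed graph-level embeddings into a downstream classifier or another model.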

References

https://www.programcreek.com/python/example/126447/torch_geometric.data.DataLoader
https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/loader/dataloader.html#DataLoader
https://github.com/pyg-team/pytorch_geometric/issues/965

Published by Irene
