PyG (PyTorch Geometric) is a deep learning framework for graph neural networks (GNNs). To speed up computation on graphs, we want to process several of them in one batch, even though the graphs have different “shapes” (different numbers of nodes and edges).
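In PyG, it is possible to pack the data in batches. According to the documentation, “Adjacency matrices are stacked in a diagonal fashion (creating a giant graph that holds multiple isolated subgraphs), and node and target features are simply concatenated in the node dimension.” In other words, for two graphs with adjacency matrices A1 and A2, the batched adjacency matrix is block-diagonal: A1 and A2 sit on the diagonal and every other entry is zero, so no edge connects one graph to another.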
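To make the diagonal stacking concrete, here is a minimal plain-Python sketch of the idea. It is only an illustration, not PyG's actual implementation, and the helper name `stack_graphs` is made up for this example. Each graph's node ids are shifted by the number of nodes that came before it, which is exactly what placing its adjacency matrix further down the diagonal means.

```python
def stack_graphs(graphs):
    """Merge (x, edge_index) pairs into one big disconnected graph.

    graphs: list of (x, edge_index), where x is a list of node feature rows
    and edge_index is [sources, targets] with node ids local to that graph.
    """
    x_all, src_all, dst_all, batch, ptr = [], [], [], [], [0]
    for graph_id, (x, (src, dst)) in enumerate(graphs):
        offset = ptr[-1]                         # nodes already placed in the big graph
        x_all.extend(x)                          # concatenate node features
        src_all.extend(s + offset for s in src)  # shift local node ids
        dst_all.extend(d + offset for d in dst)
        batch.extend([graph_id] * len(x))        # which graph each node belongs to
        ptr.append(offset + len(x))              # node boundaries between graphs
    return x_all, [src_all, dst_all], batch, ptr


# two toy graphs: one with 3 nodes, one with 2 nodes
g1 = ([[1.0], [2.0], [3.0]], [[0, 1, 1, 2], [1, 0, 2, 1]])
g2 = ([[1.0], [2.0]], [[0, 1, 1], [1, 0, 1]])

x, edge_index, batch, ptr = stack_graphs([g1, g2])
print(edge_index)  # [[0, 1, 1, 2, 3, 4, 4], [1, 0, 2, 1, 4, 3, 4]]
print(batch)       # [0, 0, 0, 1, 1]
print(ptr)         # [0, 3, 5]
```

The `batch` and `ptr` lists here play the same role as the `batch` and `ptr` fields that PyG attaches to a batched object.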
Use our own data
So, to run in small batches, we need a list of node feature matrices (X) and a list of adjacency matrices (A). In PyG, we first pack each (X, A) pair into a torch_geometric.data.Data object, and then put the list of Data objects into a torch_geometric.loader.DataLoader.
import torch
from torch_geometric.loader import DataLoader
from torch_geometric.data import Batch
from torch_geometric.data import Data
# make some toy data
x1 = torch.Tensor([[1], [2], [3]])
edge_index1 = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
x2 = torch.Tensor([[1], [2]])
edge_index2 = torch.tensor([[0, 1, 1], [1, 0, 1]])
# make Data object
data1 = Data(x=x1, edge_index=edge_index1)
data2 = Data(x=x2, edge_index=edge_index2)
# dataloader
my_loader = DataLoader([data1, data2], batch_size=1, shuffle=False)
for batch in my_loader:
    print(batch)
'''outputs:
DataBatch(x=[3, 1], edge_index=[2, 4], batch=[3], ptr=[2])
DataBatch(x=[2, 1], edge_index=[2, 3], batch=[2], ptr=[2])
'''
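In the previous example, we create two graphs, defined by (x1, edge_index1) and (x2, edge_index2). Note that the node feature matrix X has shape (num_nodes, dim). PyG does not take a dense adjacency matrix A; instead, it takes edge_index, a tensor of shape (2, num_edges) in which every column is one edge: the first row holds the source node ids and the second row holds the target node ids. An undirected edge is therefore listed twice, once in each direction.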
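If your graphs start out as dense adjacency matrices, they have to be converted into this (2, num_edges) layout first. The sketch below shows one way to do it with plain PyTorch, assuming an unweighted graph where any nonzero entry means "there is an edge". PyG also ships a `torch_geometric.utils.dense_to_sparse` helper for the same job.

```python
import torch

# Dense adjacency matrix of the 3-node path graph 0 - 1 - 2 (symmetric, so undirected).
A = torch.tensor([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])

# nonzero() returns one (row, col) pair per nonzero entry;
# transposing turns that into the (2, num_edges) layout.
edge_index = A.nonzero().t().contiguous()

print(edge_index)
# tensor([[0, 1, 1, 2],
#         [1, 0, 2, 1]])
```

The result is the same tensor as edge_index1 in the toy data.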
Then, to make each Data object, we pass in the node features and the edge index. In the DataLoader, we set batch_size to 1, so every batch contains exactly one graph.
Pack in a single batch
If you do not want to write a for-loop and only need a single batch, you can pull it straight from the DataLoader:
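# first get a dataloader; batch_size covers all graphs so that they land in one batch
# (with batch_size=1, the batch below would contain only data1)
my_loader = DataLoader([data1, data2], batch_size=2, shuffle=False)
batch = next(iter(my_loader))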
# access the batch
batch.x
batch.edge_index
This is very useful when you wish to plug a graph component into another model (e.g., together with a Transformer) and only need one batch at a time.
References
https://www.programcreek.com/python/example/126447/torch_geometric.data.DataLoader
https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/loader/dataloader.html#DataLoader
https://github.com/pyg-team/pytorch_geometric/issues/965