
    Faster distributed graph neural network training with GraphStorm v0.4

    February 11, 2025

    GraphStorm is a low-code enterprise graph machine learning (ML) framework that provides ML practitioners a simple way of building, training, and deploying graph ML solutions on industry-scale graph data. Although GraphStorm can run efficiently on single instances for small graphs, it truly shines when scaling to enterprise-level graphs in distributed mode using a cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances or Amazon SageMaker.

Today, AWS AI released GraphStorm v0.4. This release introduces integration with DGL-GraphBolt, a new graph storage and sampling framework that uses a compact graph representation and pipelined sampling to reduce memory requirements and speed up graph neural network (GNN) training and inference. For the large-scale dataset examined in this post, inference is 3.6 times faster and per-epoch training is 1.4 times faster, with even larger speedups possible.

    To achieve this, GraphStorm v0.4 with DGL-GraphBolt addresses two crucial challenges of graph learning:

    • Memory constraints – GraphStorm v0.4 provides compact and distributed storage of graph structure and features, which may grow into the multi-TB range. For example, a graph with 1 billion nodes, 512 features per node, and 10 billion edges will require more than 4 TB of memory to store, which necessitates distributed computation (see the sketch after this list for a rough estimate).
    • Graph sampling – In multi-layer GNNs, you need to sample neighbors of each node to propagate their representations. This can lead to exponential growth in the number of nodes sampled, potentially visiting the entire graph for a single node’s representation. GraphStorm v0.4 provides efficient, pipelined graph sampling.
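
To put the memory constraint in perspective, the following back-of-the-envelope estimate reproduces the multi-TB figure above. It is a rough sketch that assumes 64-bit floating-point node features and 64-bit integer node IDs for the edge list, and it ignores edge features and training-time buffers.

    # Back-of-the-envelope memory estimate for the example graph above.
    # Assumptions (not from the GraphStorm docs): float64 node features,
    # int64 node IDs in the edge list, no edge features.
    NUM_NODES = 1_000_000_000       # 1 billion nodes
    NUM_EDGES = 10_000_000_000      # 10 billion edges
    FEATURE_DIM = 512
    BYTES_PER_FEATURE = 8           # 64-bit floats (assumption)
    BYTES_PER_NODE_ID = 8           # 64-bit integers (assumption)

    feature_bytes = NUM_NODES * FEATURE_DIM * BYTES_PER_FEATURE
    edge_bytes = NUM_EDGES * 2 * BYTES_PER_NODE_ID   # source and destination per edge

    print(f"Node features: {feature_bytes / 1e12:.2f} TB")   # ~4.10 TB
    print(f"Edge list:     {edge_bytes / 1e12:.2f} TB")      # ~0.16 TB
    print(f"Total:         {(feature_bytes + edge_bytes) / 1e12:.2f} TB")  # more than 4 TB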

    In this post, we demonstrate how GraphBolt enhances GraphStorm’s performance in distributed settings. We provide a hands-on example of using GraphStorm with GraphBolt on SageMaker for distributed training. Lastly, we share how to use Amazon SageMaker Pipelines with GraphStorm.

    GraphBolt: Pipeline-driven graph sampling

    GraphBolt is a new data loading and graph sampling framework developed by the DGL team. It streamlines the operations needed to sample efficiently from a heterogeneous graph and fetch the corresponding features. GraphBolt introduces a new, more compact graph structure representation for heterogeneous graphs, called fused Compressed Sparse Column (fCSC). This can reduce the memory cost of storing a heterogeneous graph by up to 56%, allowing users to fit larger graphs in memory and potentially use smaller, more cost-efficient instances for GNN model training.
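The fCSC format itself is internal to DGL-GraphBolt, but the compression idea is familiar from sparse matrix storage: a compressed sparse column layout replaces per-edge source/destination pairs with per-node offsets plus one index per edge. The following sketch illustrates that general idea on a toy homogeneous graph with SciPy; it shows plain CSC, not fCSC, and the exact savings for a heterogeneous graph will differ.

    # Illustrate the general CSC idea (plain CSC, not DGL-GraphBolt's fused CSC):
    # compare memory for a raw edge list (COO) versus a CSC layout.
    import numpy as np
    from scipy.sparse import coo_matrix

    num_nodes, num_edges = 1_000, 10_000
    rng = np.random.default_rng(0)
    src = rng.integers(0, num_nodes, size=num_edges)
    dst = rng.integers(0, num_nodes, size=num_edges)

    adj_coo = coo_matrix((np.ones(num_edges, dtype=np.int8), (src, dst)),
                         shape=(num_nodes, num_nodes))
    adj_csc = adj_coo.tocsc()

    coo_bytes = adj_coo.row.nbytes + adj_coo.col.nbytes          # two indices per edge
    csc_bytes = adj_csc.indptr.nbytes + adj_csc.indices.nbytes   # offsets + one index per edge
    print(f"COO edge list structure: {coo_bytes} bytes")
    print(f"CSC structure:           {csc_bytes} bytes")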

    GraphStorm v0.4 seamlessly integrates with GraphBolt, allowing users to take advantage of its performance improvements in their GNN workflows. The user just needs to provide the additional argument --use-graphbolt true when launching graph construction and training jobs.

    Solution overview

A common model development process is to perform model exploration locally on a subset of your full data, and when you’re satisfied with the results, train the full-scale model. This setup allows for cheaper exploration before training on the full dataset. GraphStorm and SageMaker Pipelines allow you to do that by creating a model pipeline that you can run locally to retrieve model metrics and, when you’re ready, run on the full data on SageMaker to produce models, predictions, and graph embeddings for use in downstream tasks. In the next section, we show how to set up such pipelines for GraphStorm.

    We demonstrate such a setup in the following diagram, where a user can perform model development and initial training on a single EC2 instance, and when they’re ready to train on their full data, hand off the heavy lifting to SageMaker for distributed training. Using SageMaker Pipelines to train models provides several benefits, like reduced costs, auditability, and lineage tracking.

GraphStorm SageMaker architecture diagram

    Prerequisites

To run this example, you will need an AWS account, an Amazon SageMaker Studio domain, and the necessary permissions to run bring-your-own-container (BYOC) SageMaker jobs.

    Set up the environment for SageMaker distributed training

    You will use the example code available in the GraphStorm repository to run through this example.

    Setting up your environment should take around 10 minutes. First, set up your Python environment to run the examples:

    conda init
    eval $SHELL
    # Create a new env for the post
    conda create --name gsf python=3.10
    conda activate gsf
    
    # Install dependencies for local scripts
    pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu
    pip install sagemaker boto3 ogb pyarrow
    # Verify installation, might take a few minutes for first run
    python -c "import sagemaker; import torch"
    
    # Clone the GraphStorm repository to access the example code
    git clone https://github.com/awslabs/graphstorm.git ~/graphstorm

    Build a GraphStorm SageMaker CPU image

    Next, build and push the GraphStorm PyTorch Docker image that you will use to run the graph construction, training, and inference jobs for smaller-scale data. Your role will need to be able to pull images from the Amazon ECR Public Gallery and create Amazon Elastic Container Registry (Amazon ECR) repositories and push images to your private ECR registry.

# Enter your account ID here
    ACCOUNT_ID=<aws-account-id>
    REGION=us-east-1
    
    cd ~/graphstorm
    bash docker/build_graphstorm_image.sh --environment sagemaker --device cpu
    bash docker/push_graphstorm_image.sh -e sagemaker -r $REGION -a $ACCOUNT_ID -d cpu
    # This will create an ECR repository and push an image to
    # ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-cpu

    Download and prepare datasets

    In this post, we use two citation datasets to demonstrate the scalability of GraphStorm. The Open Graph Benchmark (OGB) project hosts a number of graph datasets that can be used to benchmark the performance of graph learning systems. For a small-scale demo, we use the ogbn-arxiv dataset, and for a demonstration of GraphStorm’s large-scale learning capabilities, we use the ogbn-papers100M dataset.

    Prepare the ogbn-arxiv dataset

    Download the smaller-scale ogbn-arxiv dataset to run a local test before launching larger-scale SageMaker jobs on AWS. This dataset has approximately 170,000 nodes and 1.2 million edges. Use the following code to download the data and prepare it for GraphStorm:

    # Provide the S3 bucket to use for output
    BUCKET_NAME=<your-s3-bucket>

    You use the following script to directly download, transform and upload the data to Amazon Simple Storage Service (Amazon S3):

    cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
python convert_arxiv_to_gconstruct.py \
    --output-s3-prefix s3://$BUCKET_NAME/ogb-arxiv-input

    This will create the tabular graph data in Amazon S3, which you can verify by running the following code:

    aws s3 ls s3://$BUCKET_NAME/ogb-arxiv-input/ 
    edges/
    nodes/
    splits/
    gconstruct_config_arxiv.json

    Finally, upload GraphStorm training configuration files for arxiv to use for training and inference:

# Upload the training configurations to S3
aws s3 cp ~/graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
    s3://$BUCKET_NAME/yaml/arxiv_nc_train.yaml
aws s3 cp ~/graphstorm/inference_scripts/np_infer/arxiv_nc.yaml \
    s3://$BUCKET_NAME/yaml/arxiv_nc_inference.yaml

    Prepare the ogbn-papers100M dataset on SageMaker

    The papers-100M dataset is a large-scale graph dataset, with 111 million nodes and 3.2 billion edges after adding reverse edges.

To download and preprocess the data as an Amazon SageMaker Processing step, use the following code. You can launch the job and let it run in the background while you proceed through the rest of the post, returning to this dataset later. The job should take approximately 45 minutes to run.

    # Navigate to the example code
    cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
    
    # Build and push a Docker image to download and process the papers100M data
    bash build_and_push_papers100M_image.sh -a $ACCOUNT_ID -r $REGION
    
    # This creates an ECR repository and pushes an image to
    # $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/papers100m-processor
    
    # Run a SageMaker job to do the processing and upload the output to S3
    SAGEMAKER_EXECUTION_ROLE_ARN=<your-sagemaker-execution-role-arn>
    aws configure set region $REGION
python sagemaker_convert_papers100m.py \
    --output-bucket $BUCKET_NAME \
    --execution-role-arn $SAGEMAKER_EXECUTION_ROLE_ARN \
    --region $REGION \
    --instance-type ml.m5.4xlarge \
    --image-uri $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/papers100m-processor

    This will produce the processed data in s3://$BUCKET_NAME/ogb-papers100M-input, which can then be used as input to GraphStorm. While this job is running, you can create the GraphStorm pipelines.

    Create a SageMaker pipeline

    Run the following command to create a SageMaker pipeline:

    # Navigate to the example code
    cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
    
    PIPELINE_NAME="ogbn-arxiv-gs-pipeline"
    
bash deploy_arxiv_pipeline.sh \
    --account $ACCOUNT_ID \
    --bucket-name $BUCKET_NAME --execution-role $SAGEMAKER_EXECUTION_ROLE_ARN \
    --pipeline-name $PIPELINE_NAME \
    --use-graphbolt false

    Inspect the pipeline

    Running the preceding code will create a SageMaker pipeline configured to run three SageMaker jobs in sequence:

    • A GConstruct job that converts the tabular file input to a binary partitioned graph on Amazon S3
    • A GraphStorm training job that trains a node classification model and saves the model to Amazon S3
    • A GraphStorm inference job that produces predictions for all nodes in the test set, and creates embeddings for all nodes

    To review the pipeline, navigate to SageMaker AI Studio, choose the domain and user profile you used to create the pipeline, then choose Open Studio.

    In the navigation pane, choose Pipelines. There should be a pipeline named ogbn-arxiv-gs-pipeline. Choose the pipeline, which will take you to the Executions tab for the pipeline. Choose Graph to view the pipeline steps.

    GraphStorm SageMaker Pipeline on SageMaker Studio
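
If you prefer to inspect the pipeline programmatically instead of through the Studio UI, a short boto3 snippet such as the following can print the step names from the deployed pipeline definition. This is a sketch; it assumes your credentials resolve to the same account and Region used above.

    # Print the step names of the deployed pipeline using boto3.
    import json
    import boto3

    sm = boto3.client("sagemaker", region_name="us-east-1")

    # The pipeline definition is returned as a JSON string
    pipeline = sm.describe_pipeline(PipelineName="ogbn-arxiv-gs-pipeline")
    definition = json.loads(pipeline["PipelineDefinition"])

    for step in definition["Steps"]:
        print(step["Name"], "-", step["Type"])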

    Run the SageMaker pipeline locally for ogbn-arxiv

    The ogbn-arxiv dataset is small enough that you can run the pipeline locally. Run the following command to start a local execution of the pipeline:

    # Allow the local containers to inherit AWS credentials
    export USE_SHORT_LIVED_CREDENTIALS=1
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
    --pipeline-name ogbn-arxiv-gs-pipeline \
    --region us-east-1 \
    --local-execution | tee arxiv-local-logs.txt

The log output is saved to arxiv-local-logs.txt; you will use it later to analyze the training speed.

    Running the pipeline should take approximately 5 minutes. When the pipeline is complete, it will print a message like the following:

    Pipeline execution 655b9357-xxx-xxx-xxx-4fc691fcce94 SUCCEEDED

    You can inspect the mean epoch and evaluation time using the provided analyze_training_time.py script and the log file you created:

    python analyze_training_time.py --log-file arxiv-local-logs.txt
    
    Reading logs from file: arxiv-local-logs.txt
    
    === Training Epochs Summary ===
    Total epochs completed: 10
    Average epoch time: 4.70 seconds
    
    === Evaluation Summary ===
    Total evaluations: 11
    Average evaluation time: 1.90 seconds

    These numbers will vary depending on your instance type; in this case, these are values reported on an m6in.4xlarge instance.

    Create a GraphBolt pipeline

Now that you have established a performance baseline, you can create another pipeline that uses the GraphBolt graph representation and compare the two.

You can use the same pipeline creation script, changing only two values: provide a new pipeline name and set --use-graphbolt to "true":

# Deploy a GraphBolt-enabled pipeline
PIPELINE_NAME_GB="ogbn-arxiv-gs-graphbolt-pipeline"
bash deploy_arxiv_pipeline.sh \
    --account $ACCOUNT_ID \
    --bucket-name $BUCKET_NAME --execution-role $SAGEMAKER_EXECUTION_ROLE_ARN \
    --pipeline-name $PIPELINE_NAME_GB \
    --use-graphbolt true

# Execute the pipeline locally
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
    --pipeline-name $PIPELINE_NAME_GB \
    --region us-east-1 \
    --local-execution | tee arxiv-local-gb-logs.txt

    Analyzing the training logs, you can see the per-epoch time has dropped somewhat:

    python analyze_training_time.py --log-file arxiv-local-gb-logs.txt
    
    Reading logs from file: arxiv-local-gb-logs.txt
    
    === Training Epochs Summary ===
    Total epochs completed: 10
    Average epoch time: 4.21 seconds
    
    === Evaluation Summary ===
    Total evaluations: 11
    Average evaluation time: 1.63 seconds

    For such a small graph, the performance gains are modest, around 13% per epoch time. With large data, the potential gains are much greater. In the next section, you will create a pipeline and train a model for papers-100M, a citation graph with 111 million nodes and 3.2 billion edges.

    Create a SageMaker pipeline for distributed training

    After the SageMaker processing job that prepares the papers-100M data has finished processing and the data is stored in Amazon S3, you can set up a pipeline to train a model on that dataset.

    Build the GraphStorm GPU image

    For this job, you will use large GPU instances, so you will build and push the GPU image this time:

    cd ~/graphstorm
    
    bash ./docker/build_graphstorm_image.sh --environment sagemaker --device gpu
    
    bash docker/push_graphstorm_image.sh -e sagemaker -r $REGION -a $ACCOUNT_ID -d gpu

    Deploy and run pipelines for papers-100M

    Before you deploy your new pipeline, upload the training YAML configuration for papers-100M to Amazon S3:

aws s3 cp \
    ~/graphstorm/training_scripts/gsgnn_np/papers100M_nc.yaml \
    s3://$BUCKET_NAME/yaml/

    Now you are ready to deploy your initial pipeline for papers-100M:

# Navigate to the example code
cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
PIPELINE_NAME="ogb-papers100M-pipeline"
bash deploy_papers100M_pipeline.sh \
    --account $ACCOUNT_ID \
    --bucket-name $BUCKET_NAME --execution-role $SAGEMAKER_EXECUTION_ROLE_ARN \
    --pipeline-name $PIPELINE_NAME \
    --use-graphbolt false

    Run the pipeline on SageMaker and let it run in the background:

# Execute the pipeline on SageMaker asynchronously
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
    --pipeline-name $PIPELINE_NAME \
    --region us-east-1 \
    --async-execution

Your account needs to meet the required quotas for the requested instances. For this post, the defaults are set to four ml.g5.48xlarge instances for training jobs and one ml.r5.24xlarge instance for a processing job. To adjust your SageMaker service quotas, you can use the Service Quotas console. To run both pipelines in parallel (one without GraphBolt and one with it), you will need 8 x $TRAIN_GPU_INSTANCE and 2 x $GCONSTRUCT_INSTANCE.

    Next, you can deploy and run another pipeline, with GraphBolt enabled:

# Deploy the GraphBolt-enabled pipeline
PIPELINE_NAME_GB="ogb-papers100M-graphbolt-pipeline"
bash deploy_papers100M_pipeline.sh \
    --account $ACCOUNT_ID \
    --bucket-name $BUCKET_NAME --execution-role $SAGEMAKER_EXECUTION_ROLE_ARN \
    --pipeline-name $PIPELINE_NAME_GB \
    --use-graphbolt true

# Execute the GraphBolt pipeline on SageMaker
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
    --pipeline-name $PIPELINE_NAME_GB \
    --region us-east-1 \
    --async-execution
    

    Compare performance for GraphBolt-enabled training

    After both pipelines are complete, which should take approximately 4 hours, you can compare the training times for both cases.

    On the Pipelines page of the SageMaker console, there should be two new pipelines named ogb-papers100M-pipeline and ogb-papers100M-graphbolt-pipeline. Choose ogb-papers100M-pipeline, which will take you to the Executions tab for the pipeline. Copy the name of the latest successful execution and use that to run the training analysis script:

python analyze_training_time.py \
    --pipeline-name $PIPELINE_NAME \
    --execution-name execution-1734404366941

    Your output will look like the following code:

=== Training Epochs Summary ===
    Total epochs completed: 15
    Average epoch time: 73.95 seconds
    
    === Evaluation Summary ===
    Total evaluations: 15
    Average evaluation time: 15.07 seconds
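
If you would rather not copy the execution name from the console, a boto3 call along these lines can list recent executions and pick the newest successful one. Treat it as a convenience sketch; the analysis script expects the execution display name shown in the console.

    # Find the display name of the most recent successful pipeline execution.
    import boto3

    sm = boto3.client("sagemaker", region_name="us-east-1")

    def latest_successful_execution(pipeline_name: str) -> str:
        summaries = sm.list_pipeline_executions(
            PipelineName=pipeline_name, SortOrder="Descending"
        )["PipelineExecutionSummaries"]
        for summary in summaries:
            if summary["PipelineExecutionStatus"] == "Succeeded":
                return summary["PipelineExecutionDisplayName"]
        raise RuntimeError(f"No successful executions found for {pipeline_name}")

    print(latest_successful_execution("ogb-papers100M-pipeline"))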

    Now do the same for the GraphBolt-enabled pipeline:

python analyze_training_time.py \
    --pipeline-name $PIPELINE_NAME_GB \
    --execution-name execution-1734463209078

    You will see the improved per-epoch and evaluation times:

=== Training Epochs Summary ===
    Total epochs completed: 15
    Average epoch time: 54.54 seconds
    
    === Evaluation Summary ===
    Total evaluations: 15
    Average evaluation time: 4.13 seconds

Without loss in accuracy, the latest version of GraphStorm trained approximately 1.4 times faster per epoch and evaluated 3.6 times faster. Depending on the dataset, the speedups can be even greater, as shown by the DGL team's benchmarking.
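
These figures follow directly from the measured averages above; the quick calculation below reproduces them.

    # Reproduce the reported speedups from the measured averages above.
    baseline_epoch_s, graphbolt_epoch_s = 73.95, 54.54   # average seconds per training epoch
    baseline_eval_s, graphbolt_eval_s = 15.07, 4.13      # average seconds per evaluation

    print(f"Training speedup:   {baseline_epoch_s / graphbolt_epoch_s:.1f}x")   # ~1.4x
    print(f"Evaluation speedup: {baseline_eval_s / graphbolt_eval_s:.1f}x")     # ~3.6x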

    Conclusion

This post showcased how GraphStorm 0.4, integrated with DGL-GraphBolt, significantly speeds up large-scale GNN workloads: training is 1.4 times faster and inference 3.6 times faster, as measured on the papers-100M dataset. As shown in the DGL benchmarks, even larger speedups are possible depending on the dataset.

    We encourage ML practitioners working with large graph data to try GraphStorm. Its low-code interface simplifies building, training, and deploying graph ML solutions on AWS, allowing you to focus on modeling rather than infrastructure.

    To get started, visit the GraphStorm documentation and GraphStorm GitHub repository.


About the authors

    Theodore Vasiloudis is a Senior Applied Scientist at Amazon Web Services, where he works on distributed machine learning systems and algorithms. He led the development of GraphStorm Processing, the distributed graph processing library for GraphStorm and is a core developer for GraphStorm. He received his PhD in Computer Science from the KTH Royal Institute of Technology, Stockholm, in 2019.

Xiang Song is a Senior Applied Scientist at Amazon Web Services, where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in a Neptune graph database. He is now leading the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in computer systems and architecture at Fudan University, Shanghai, in 2014.

Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research, supporting science teams like the graph machine learning group and ML systems teams working on large-scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist, a field in which he holds a PhD.
