Amazon Neptune Analytics is a high-performance analytics engine designed to extract insights and detect patterns within vast amounts of graph data stored in Amazon Simple Storage Service (Amazon S3) or Amazon Neptune Database. Using built-in algorithms, vector search, and powerful in-memory processing, Neptune Analytics efficiently handles queries on datasets with tens of billions of relationships, delivering results in seconds.
Graphs allow for in-depth analysis of complex relationships and patterns across various data types. They are widely used in social networks to uncover community structures, suggest new connections, and study information flow. In supply chain management, graphs support route optimization and bottleneck analysis, whereas in cybersecurity, they highlight network vulnerabilities and track potential threats. Beyond these areas, graph data plays a crucial role in knowledge management, financial services, digital advertising, and more—helping identify money laundering schemes in financial networks and forecasting potential security risks.
With Neptune Analytics, you gain a fully managed experience that offloads infrastructure tasks, allowing you to focus on building insights and solving complex problems. Neptune Analytics automatically provisions the compute resources necessary to run analytics workloads based on the size of the graph, so you can focus on your queries and workflows. Initial benchmarks show that Neptune Analytics can load data from Amazon S3 up to 80 times faster than other AWS solutions, enhancing speed and efficiency across analytics workflows.
Neptune Analytics now supports new functionality that allows you to import Parquet data (in addition to CSV) into your Neptune Analytics graph and export graph data as Parquet or CSV. The new Parquet import capability makes it simple to get Parquet data into Neptune Analytics for graph queries, analysis, and exploration of the results. Many data pipelines already output files in Parquet, and this new functionality makes it straightforward to quickly create a graph from them. Additionally, Parquet and CSV data can be loaded into the graph using the new neptune.read() procedure, which allows you to read Parquet and CSV files from Amazon S3 (without requiring the Neptune bulk load format) and subsequently insert or update the graph using that data. Similarly, graph data can be exported from Neptune Analytics as Parquet or CSV files, allowing you to move data from Neptune Analytics to many data lakes and to Neptune Database. The export functionality also supports use cases that require exporting graph data into other downstream data and machine learning (ML) platforms for exploring and analyzing the results.
In this two-part series, we show how you can import and export using Parquet and CSV to quickly gather insights from your existing graph data. Part 1 introduces the import and export functionalities, and walks you through how to quickly get started with them. In Part 2, we show how you can use the new data mobility improvements in Neptune Analytics to enhance fraud detection.
Solution overview
Exporting and importing graph data into Neptune Analytics using Parquet or CSV makes it faster to analyze relationships in your Parquet and CSV data and easier to use graph analytics with your data in Neptune Database.
A common scenario is using Neptune Database to serve graph transactional queries, or online transactional processing (OLTP), but periodically certain insights need to be calculated with graph analytical queries, or online analytical processing (OLAP). For example, a social networking use case may be using Neptune Database to serve real-time friend-of-friends recommendations. You can then use Neptune Analytics to generate additional insights such as identification of social communities (using clustering algorithms) or identification of the top influencers within a community (using centrality algorithms). These calculated insights can then be written back into Neptune Database to further enrich OLTP graph queries.
The new import/export feature for Neptune Analytics makes implementing these end-to-end pipelines straightforward and efficient. The general architecture pattern is depicted in the following diagram.
The workflow consists of the following steps:
- Create a Neptune Analytics graph with your data. Because Neptune Analytics is built on top of a memory-optimized graph database engine, even graphs containing billions of nodes and edges take only minutes to load. You can create an empty Neptune Analytics graph and perform a batch load or fresh import task, or you can create the graph directly using an import task. The import task can be sourced from bulk load formatted files in Amazon S3 or from Neptune Database clusters and snapshots.
- Enrich your graph with additional data. You might have additional data that isn’t required for graph OLTP queries, but would be useful for graph OLAP queries. Using the read() procedure, you can read any data in Parquet or CSV format and subsequently use that data to insert or update your Neptune Analytics graph.
- Run your graph algorithms in Neptune Analytics. You can select from several popular, natively implemented, and optimized categories of graph algorithms such as PageRank, Degree, Label Propagation, and more, or you can run your own openCypher queries. You can write insights back into the Neptune Analytics graph through the mutate variation of native algorithms, or through mutation clauses in queries (such as openCypher’s SET or CREATE).
- Use the export feature of Neptune Analytics to export the newly derived insights to Amazon S3. You can define which format to use for the export (Parquet or CSV), and optionally adjust the scope of the export to only export certain data.
- Insights can be consumed directly from Amazon S3 by end-users.
- Alternatively, you can export your graph data to Amazon S3, and re-ingest (using bulk load) into Neptune Database for enrichment of graph transactional data.
After the insights have been calculated and consumed, you can delete the Neptune Analytics graph or keep it online as needed. The preceding architecture pattern is useful for many different use cases, for example fraud detection, where graphs are used to identify and locate specific patterns of behavior that can be indicative of fraud or illegal activities. In Part 2 of this series, we show how you can use this architecture pattern to enhance fraud detection workflows.
Prerequisites
Before you can use the import and export functionalities, you need to create an AWS Identity and Access Management (IAM) role and a corresponding policy that grants the import and export jobs permission to read from and write to Amazon S3. We use the AWS managed AWS Key Management Service (AWS KMS) key for encryption at rest in Amazon S3. If you prefer to use your own customer managed key, create the key, associate it with your S3 bucket, and then modify the IAM policy to grant key permissions to the IAM role you create. For more details, refer to Create and configure IAM role and AWS KMS key.
Create an inline IAM policy
Create an inline IAM policy with the following template:
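The following is a minimal sketch of such a policy, assuming the job only needs S3 read/write access on your bucket and decrypt/encrypt permissions on the AWS managed aws/s3 key; refer to Create and configure IAM role and AWS KMS key for the authoritative template:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadWriteAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ]
    },
    {
      "Sid": "KmsAccess",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey"
      ],
      "Resource": "<ARN of the AWS managed aws/s3 key>"
    }
  ]
}
```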
Replace the placeholder values with your S3 bucket name and the Amazon Resource Name (ARN) of the AWS managed key for Amazon S3. You can find the key on the AWS KMS console, on the AWS managed keys page, under aws/s3.
Create an IAM role
Create an IAM role with the following custom trust policy:
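A sketch of the trust policy, assuming the Neptune Analytics service principal (neptune-graph.amazonaws.com) is the service that needs to assume the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "neptune-graph.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```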
Associate the IAM policy you created with this IAM role.
Import data into Neptune Analytics
As mentioned earlier in this post, there are multiple ways to create a Neptune Analytics graph with data.
For our example, we create a Neptune Analytics graph using an import task sourced from an S3 bucket. This method of loading now supports Parquet-formatted data in addition to CSV and RDF data (ntriples). Whichever format you use, make sure that it follows the appropriate header formats, as described in Data formats.
For our example, we use the air routes dataset in Parquet format.
- Using the AWS Command Line Interface (AWS CLI), copy the provided data into your own S3 bucket:
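A sketch of the copy command; the source prefix shown is a placeholder for the location where the air routes Parquet files are published:

```bash
# Recursively copy the air routes Parquet files into your own bucket
aws s3 cp s3://<source-bucket>/air-routes/parquet/ \
  s3://<your-bucket-name>/air-routes/parquet/ --recursive
```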
- Create a Neptune Analytics graph using an import task. You can use the following AWS CLI command:
Because the graph data we’re working with is small, we can deploy a graph with a capacity of 16 m-NCU. Additionally, because this isn’t a production workload, we deploy the graph with zero standbys for cost optimization purposes. We also enable public connectivity on the graph for ease of experimentation. Be sure to replace the placeholder values with your S3 bucket and the ARN of the IAM role created previously.
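A sketch of the command; the graph name is arbitrary, and the exact provisioned-memory flag names should be confirmed against the create-graph-using-import-task CLI reference:

```bash
aws neptune-graph create-graph-using-import-task \
  --graph-name air-routes-analytics \
  --source "s3://<your-bucket-name>/air-routes/parquet/" \
  --format PARQUET \
  --role-arn "<ARN of the IAM role created earlier>" \
  --min-provisioned-memory 16 \
  --max-provisioned-memory 16 \
  --replica-count 0 \
  --public-connectivity
```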
- While the graph is being created, deploy a Neptune notebook, which will be used to run queries. Refer to Creating a new Neptune Analytics notebook using an AWS CloudFormation template for more details.
Add data with neptune.read()
When the graph is available, open your Neptune notebook. You can collect information about the schema of your graph using the neptune.graph.pg_schema() function. Run the following query in a new cell in the Neptune notebook to collect information about the different nodes, edges, and properties in the graph:
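A minimal example of the call; the exact YIELD fields are described in the Neptune Analytics documentation:

```
CALL neptune.graph.pg_schema()
YIELD schema
RETURN schema
```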
From the output, you can observe that the graph currently contains data on airports and countries, but nothing on continents. Let’s enrich the graph by creating a node for each continent and an edge between each country and the continent it’s in. Complete the following steps:
- Copy the CSV file containing the continent information to your S3 bucket:
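A sketch of the copy command, assuming a continents.csv file (hypothetical name) provided alongside this post:

```bash
aws s3 cp s3://<source-bucket>/continents.csv s3://<your-bucket-name>/continents.csv
```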
- Run the following query, which reads the file from Amazon S3 using read(), then creates continent nodes and corresponding edges between the continent and the countries it contains:
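A sketch of such a query, assuming neptune.read() accepts a map with source and format options and that the CSV contains columns like country_code, continent_code, and continent_desc (hypothetical column names; adjust them to match your file):

```
CALL neptune.read({
  source: "s3://<your-bucket-name>/continents.csv",
  format: "csv"
})
YIELD row
// Create (or reuse) the continent node for this row
MERGE (cont:continent {code: row.continent_code})
SET cont.desc = row.continent_desc
WITH row, cont
// Link the matching country to its continent
MATCH (c:country {code: row.country_code})
MERGE (c)-[:in_continent]->(cont)
```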
The neptune.read() procedure reads the given file and outputs each row as an intermediary result within the query (as a YIELD row). Each result is then fed into subsequent openCypher clauses, where you can perform additional validations, inserts, and upserts.

You can also use the neptune.read() procedure to read Parquet files in addition to CSV files. Although the input files don’t need to adhere to the Neptune bulk load file format, they still need to have properly formatted column headers. Refer to neptune.read() for additional details and examples.
Export data from Neptune Analytics
Now let’s export the data. Let’s say that we want to export the graph in a format that can be easily re-consumed into Neptune Database. We want to export in a CSV format, because the files will automatically adhere to the bulk load format required by Neptune Database bulk loads. You can initiate an export of the entire graph with the following AWS CLI command:
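A sketch of the export command; the exact parameter requirements (such as the KMS key) should be confirmed against the start-export-task CLI reference:

```bash
aws neptune-graph start-export-task \
  --graph-identifier <your-graph-identifier> \
  --role-arn "<ARN of the IAM role created earlier>" \
  --format CSV \
  --destination "s3://<your-bucket-name>/export/csv/" \
  --kms-key-identifier "<ARN of the KMS key>"
```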
To check on the status of the export task, you can use the get-export-task API:
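For example, replacing the placeholder with the task identifier returned by start-export-task:

```bash
aws neptune-graph get-export-task \
  --task-identifier <export-task-identifier>
```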
The response includes details about the export task, such as its identifier, status, format, and destination.
In addition to exporting the data in CSV format, you can also export the data as Parquet. Parquet is a compressed columnar format that consumes less storage space and is more efficient and performant for analytics use cases in Amazon Redshift or Amazon Athena. To export the data as Parquet, use the same start-export-task API, and specify the format parameter to be PARQUET and the parquet-type parameter to be COLUMNAR:
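A sketch of the Parquet export command, using the same placeholders as the CSV example above:

```bash
aws neptune-graph start-export-task \
  --graph-identifier <your-graph-identifier> \
  --role-arn "<ARN of the IAM role created earlier>" \
  --format PARQUET \
  --parquet-type COLUMNAR \
  --destination "s3://<your-bucket-name>/export/parquet/" \
  --kms-key-identifier "<ARN of the KMS key>"
```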
You can also import Parquet data into Neptune Analytics using the start-import-task API, specifying the format parameter to be PARQUET. Make sure the Parquet data follows the appropriate header formats, as described in Using Parquet data.
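A sketch of such an import into an existing graph, assuming the Parquet files are staged under a prefix in your bucket:

```bash
aws neptune-graph start-import-task \
  --graph-identifier <your-graph-identifier> \
  --source "s3://<your-bucket-name>/parquet-data/" \
  --format PARQUET \
  --role-arn "<ARN of the IAM role created earlier>"
```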
Clean up
To avoid incurring future costs, make sure to delete the resources that are no longer needed. The resources to remove include:
- Neptune Analytics graph
- Neptune notebook
- S3 bucket for import/export
Conclusion
In this post, we showed how to get started with the new Parquet import capability, how to read data within queries using neptune.read(), and how to export graph data as Parquet or CSV. In Part 2 of this series, we show how you can use these new data mobility improvements in Neptune Analytics to enhance fraud detection.
Neptune Database is optimized for high-throughput graph transactional queries. In many cases, the majority of your workload is made up of graph OLTP queries. But perhaps you want to be able to calculate insights using graph algorithms, or extract all or part of your graph data to be used in downstream data and ML pipelines, such as training a graph neural network model. In these cases, combining the ease of ingesting data into Neptune Analytics with the power of its built-in graph algorithms and export functionality means you can provide the necessary data to these additional use cases quickly and efficiently.
Get started with Neptune Analytics today by launching a graph and deploying a Neptune notebook. Sample queries and sample data can be found within the notebook or the accompanying GitHub repo.
About the Authors
Melissa Kwok is a Senior Neptune Specialist Solutions Architect at AWS, where she helps customers of all sizes and verticals build cloud solutions according to best practices. When she’s not at her desk you can find her in the kitchen experimenting with new recipes or reading a cookbook.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge Generative AI and Graph Analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.