We often want to take an inventory of the types of data in our database. What is our schema? This is most useful in DEV or TEST databases whose content is created by several users or teams, is often experimental, or has multiple versions. Even in controlled environments like PROD, where the application validates data on ingest so that it obeys the schema, we may observe data with a different structure than expected. It is important for us to compare the actual schema with the intended schema. Additionally, having a picture of the schema helps onboard new users, and it accelerates development of new applications and dynamic UIs.
In this post, we demonstrate how to discover and visualize the actual graph schema directly from the data that resides within an Amazon Neptune database.
Neptune is a managed graph database service supporting two common graph representations: labeled property graph (LPG) and Resource Description Framework (RDF). In each case, we show how to introspect, or reverse engineer, the schema using queries and database statistics. We then use a diagram-as-code tool called PlantUML to visualize that schema. We demonstrate the solution in the Neptune Workbench, a Jupyter notebook, which runs on Amazon SageMaker.
Solution overview
The following diagram demonstrates the approach. As a data architect, you use the notebook instance to first discover the schema in the Neptune database. Then you visualize that schema using a PlantUML notebook plugin.
Prerequisites
To run this example, you need a Neptune cluster and notebook instance. If you don’t have these already in place, you can create them, as long as you have an AWS account with permission to create them.
Running this example incurs charges. For more details, refer to Amazon Neptune Pricing.
Provision resources
Complete the following steps to create your resources:
Create a Neptune cluster and Neptune Workbench notebook instance. If you have an existing cluster and notebook instance, you can use them instead.
Make a local clone of the GitHub repo containing the discovery tool.
Open the notebook instance Jupyter folder view.
Create a folder called discovery_draw.
Upload all files from the notebook/discovery_draw folder of your clone except README.md to the discovery_draw folder in Jupyter.
Discover and visualize an LPG schema
From Jupyter, open the notebook VisualizeModel-LPG.ipynb. Then follow the steps in that notebook.
Set up PlantUML
Install the PlantUML plugin by running the Python installer. After installing, restart the kernel using the Kernel, Restart menu item. Then import iplantuml for later use.
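The installation can also be scripted. The following is a minimal sketch, assuming the plugin is the iplantuml package on PyPI and that it should be installed into the interpreter that runs the notebook kernel. The helper names pip_install_command and install are hypothetical and are not part of the notebook.

```python
import subprocess
import sys


def pip_install_command(package: str) -> list[str]:
    """Build a pip command that targets the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package: str) -> None:
    """Run the pip command; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


# Print the command without running it. Call install("iplantuml") to run it,
# then restart the kernel before importing iplantuml.
print(" ".join(pip_install_command("iplantuml")))
```

Using sys.executable rather than a bare pip call keeps the package in the same environment as the kernel, which is a common source of confusion in notebooks.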
Set up the Neptune connection
Set up a connection to Neptune for discovery. Your notebook instance has environment variables already set for the Neptune endpoint host, port, AWS Region, and authentication mode. Pass these as input to lpg_discovery.set_neptune_env().
This function is defined in lpg_discovery.py. It initializes the Neptune data plane SDK client that you use to introspect the Neptune database.
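As an illustration of what reading these settings might look like, here is a minimal sketch. The environment variable names (GRAPH_NOTEBOOK_HOST, GRAPH_NOTEBOOK_PORT, GRAPH_NOTEBOOK_AUTH_MODE, AWS_REGION) are assumptions about a typical Neptune notebook instance, and NeptuneEnv and read_neptune_env are hypothetical helpers. The sketch stops short of calling lpg_discovery.set_neptune_env() because its exact signature is defined in the repo, not here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NeptuneEnv:
    host: str
    port: int
    region: str
    auth_mode: str

    @property
    def endpoint_url(self) -> str:
        """HTTPS endpoint of the Neptune data plane."""
        return f"https://{self.host}:{self.port}"


def read_neptune_env(env: Mapping[str, str] = os.environ) -> NeptuneEnv:
    """Read connection settings; the variable names are assumptions."""
    return NeptuneEnv(
        host=env["GRAPH_NOTEBOOK_HOST"],
        port=int(env.get("GRAPH_NOTEBOOK_PORT", "8182")),
        region=env["AWS_REGION"],
        auth_mode=env.get("GRAPH_NOTEBOOK_AUTH_MODE", "DEFAULT"),
    )


# Made-up values standing in for a real notebook environment.
sample = read_neptune_env({
    "GRAPH_NOTEBOOK_HOST": "example-cluster.cluster-abc.us-east-1.neptune.amazonaws.com",
    "AWS_REGION": "us-east-1",
})
print(sample.endpoint_url)
```

An endpoint URL and Region in this form are the kind of values a boto3 neptunedata client accepts as endpoint_url and region_name.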
Add sample data
If you have a new Neptune database set up, or if you want to experiment with new types of data in an existing instance, run the %seed steps to add sample airports, fraud, and knowledge graph data to your database. We visualize the airports dataset below. The others are included for variety and can be skipped. You can skip this step if you don’t require this data.
Get a summary
Begin your discovery by calling the Neptune Summary API. Using engine statistics, it reports node labels, edge labels, and properties in your graph.
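To show how the summary can feed discovery, here is a minimal sketch that pulls labels and property names out of a summary response. The response shape (payload, graphSummary, nodeLabels, edgeLabels, nodeProperties, edgeProperties) is an assumption based on the property graph summary output, and the sample values are made up. The function name summarize is hypothetical.

```python
from typing import Any


def summarize(response: dict[str, Any]) -> dict[str, list[str]]:
    """Extract label and property names from a property graph summary."""
    graph = response["payload"]["graphSummary"]

    def names(entries: list[dict[str, int]]) -> list[str]:
        # Each entry is assumed to be a single-key mapping of name -> count.
        return sorted(name for entry in entries for name in entry)

    return {
        "node_labels": sorted(graph.get("nodeLabels", [])),
        "edge_labels": sorted(graph.get("edgeLabels", [])),
        "node_props": names(graph.get("nodeProperties", [])),
        "edge_props": names(graph.get("edgeProperties", [])),
    }


# Made-up response fragment for illustration only.
SAMPLE_RESPONSE = {
    "status": "200 OK",
    "payload": {
        "graphSummary": {
            "nodeLabels": ["country", "airport"],
            "edgeLabels": ["route", "contains"],
            "nodeProperties": [{"code": 10}, {"country": 7}],
            "edgeProperties": [{"dist": 25}],
        }
    },
}

inventory = summarize(SAMPLE_RESPONSE)
print(inventory["node_labels"], inventory["edge_props"])
```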
Discover the schema
With the summary as input, reverse engineer the schema.
In this step, you invoke the discover() function in lpg_discovery.py. That function runs a series of openCypher queries against the Neptune database. Its result is a dictionary of classes, each of which has properties and relationships. A class represents a node label. Properties are node properties. Relationships are edges whose source node has that label; relationships can have properties too. The logic is similar to the discovery logic in the Amazon Neptune utility for GraphQL schemas and resolvers.
The following is an excerpt from the dictionary highlighting the airports dataset:
The excerpt shows a class called airport with properties (under props) such as country, longest, and code. Each property has types and a multival flag indicating whether the property can have multiple values. The country property, for example, is a string and is not multivalued.
The airport class also has a relationship (under rels) called route, which connects to airport. Elsewhere in the result, we see the properties of that relationship:
The route relationship has a property dist that is an integer and is not multivalued.
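The way property types and the multival flag could be derived from sampled nodes can be sketched as follows. This is an illustration, not the code in lpg_discovery.py: the props, types, and multival keys follow the description above, while the function names, the sampling query, and the sample rows are hypothetical.

```python
from typing import Any


def sample_query(label: str, limit: int = 100) -> str:
    """openCypher query that samples nodes carrying one label (assumed approach)."""
    return f"MATCH (n:`{label}`) RETURN n LIMIT {limit}"


def type_name(value: Any) -> str:
    """Map a Python value to a coarse type name."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def fold_props(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Generalize per-node property maps into types and a multival flag."""
    seen: dict[str, dict[str, Any]] = {}
    for row in rows:
        for name, value in row.items():
            entry = seen.setdefault(name, {"types": set(), "multival": False})
            values = value if isinstance(value, list) else [value]
            if len(values) > 1:
                entry["multival"] = True
            entry["types"].update(type_name(v) for v in values)
    return {
        name: {"types": sorted(e["types"]), "multival": e["multival"]}
        for name, e in seen.items()
    }


# Made-up property maps standing in for query results.
rows = [
    {"code": "AAA", "country": "US", "longest": 9000},
    {"code": "BBB", "country": "CA", "longest": 11000},
]
airport = {"props": fold_props(rows)}
print(airport["props"]["country"])
```

Sampling with a LIMIT keeps discovery cheap on large graphs, at the cost of possibly missing rare properties or types.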
Build a PlantUML representation
The next step is to map the discovery result to a form that can be rendered by PlantUML. PlantUML is a diagram-as-code tool that can visualize diagrams from relatively simple text markup. It supports several types of diagrams, including the UML class diagram. Because we have already organized our discovery as classes, a UML class diagram is a suitable visualization.
Run the cell to generate PlantUML text, which maps the dictionary to PlantUML in the to_plant_uml() function of lpg_discovery.py.
The following is an excerpt of the text, showing airport and its route relationship. We represent the relationship as an association. Edge properties are shown in a note on the association. Another modeling approach would be to use an association class, but for simplicity we use a note.
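The mapping from a class dictionary to PlantUML text can be sketched as follows. This is a simplified illustration rather than the to_plant_uml() function in lpg_discovery.py. It assumes a dictionary shape with props and rels keys, where each relationship has a hypothetical to list of target classes, and it assumes class names are plain identifiers that need no quoting in PlantUML.

```python
from typing import Any


def type_text(info: dict[str, Any]) -> str:
    """Render a property's types, marking multivalued properties with [*]."""
    suffix = "[*]" if info.get("multival") else ""
    return ", ".join(info["types"]) + suffix


def to_plantuml(classes: dict[str, dict[str, Any]]) -> str:
    """Render classes as a PlantUML class diagram with notes on associations."""
    lines = ["@startuml"]
    for name, cls in classes.items():
        lines.append(f"class {name} {{")
        for prop, info in cls.get("props", {}).items():
            lines.append(f"  {prop} : {type_text(info)}")
        lines.append("}")
    for name, cls in classes.items():
        for rel, info in cls.get("rels", {}).items():
            for target in info["to"]:
                lines.append(f"{name} --> {target} : {rel}")
                if info.get("props"):
                    # Edge properties go in a note attached to the association.
                    lines.append("note on link")
                    for prop, pinfo in info["props"].items():
                        lines.append(f"  {prop} : {type_text(pinfo)}")
                    lines.append("end note")
    lines.append("@enduml")
    return "\n".join(lines)


# Made-up dictionary fragment in the assumed shape.
classes = {
    "airport": {
        "props": {"code": {"types": ["string"], "multival": False}},
        "rels": {
            "route": {
                "to": ["airport"],
                "props": {"dist": {"types": ["int"], "multival": False}},
            }
        },
    }
}
uml_text = to_plantuml(classes)
print(uml_text)
```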
Render PlantUML
Render this as a diagram using PlantUML.
The diagram renders directly in the notebook. PlantUML also saves a file called lpg_all.svg in the notebook folder. You can download this for later use.
The following is a snippet of our diagram. We have numerous types of data; the diagram is large.
Show a subset of the schema
To generate a simpler diagram, filter PlantUML down to a specific set of classes. The following visualizes the four classes in the airports dataset: continent, country, airport, version.
PlantUML saves the image as lpg_airport.svg in the current folder.
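Filtering to a subset can be sketched as a small dictionary transformation. The following is an illustration under assumptions: it presumes a class dictionary with props and rels keys, where each relationship carries a hypothetical to list of target classes. The function name filter_classes and the sample data are made up.

```python
from typing import Any, Iterable


def filter_classes(
    classes: dict[str, dict[str, Any]], keep: Iterable[str]
) -> dict[str, dict[str, Any]]:
    """Keep only the named classes and the relationships among them."""
    keep_set = set(keep)
    result: dict[str, dict[str, Any]] = {}
    for name, cls in classes.items():
        if name not in keep_set:
            continue
        rels = {}
        for rel, info in cls.get("rels", {}).items():
            # Drop targets outside the subset so the diagram has no dangling links.
            targets = [t for t in info["to"] if t in keep_set]
            if targets:
                rels[rel] = {**info, "to": targets}
        result[name] = {**cls, "rels": rels}
    return result


# Made-up classes: two from an airports-style dataset and one unrelated class.
all_classes = {
    "airport": {"props": {}, "rels": {"route": {"to": ["airport"]}}},
    "country": {"props": {}, "rels": {"contains": {"to": ["airport", "account"]}}},
    "account": {"props": {}, "rels": {}},
}
subset = filter_classes(all_classes, ["airport", "country"])
print(sorted(subset))
```

The function returns new dictionaries rather than mutating the input, so the full schema remains available for rendering other subsets.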
Discover and visualize the RDF schema
Discovery and visualization of an RDF schema is similar to that of LPG. From Jupyter, open the notebook VisualizeModel-RDF.ipynb. Then follow the steps in that notebook.
Set up PlantUML
Setting up PlantUML is the same as for LPG. Setting up the connection to Neptune is similar but uses rdf_discovery.py, which uses the Python requests library to send queries to Neptune’s SPARQL endpoint.
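A query against the SPARQL endpoint with requests might look like the following minimal sketch. It assumes the endpoint path is /sparql on the cluster host and port, that results are requested as SPARQL JSON, and that IAM authentication is not enabled (request signing is not shown). The function names are hypothetical and are not taken from rdf_discovery.py.

```python
from typing import Any


def sparql_request(host: str, port: int, query: str) -> dict[str, Any]:
    """Build keyword arguments for an HTTP POST to the SPARQL endpoint."""
    return {
        "url": f"https://{host}:{port}/sparql",
        "data": {"query": query},
        "headers": {"Accept": "application/sparql-results+json"},
    }


def run_sparql(host: str, port: int, query: str) -> dict[str, Any]:
    """Send the query and return the parsed JSON results."""
    # Imported here so that building a request needs no third-party package.
    import requests

    response = requests.post(**sparql_request(host, port, query), timeout=30)
    response.raise_for_status()
    return response.json()


# Build (but do not send) a request for a made-up host.
request_args = sparql_request("example-host", 8182, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request_args["url"])
```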
Add sample data
The notebook loads various types of RDF data to test conventional modeling approaches in RDF. For this post, we demonstrate using the airports dataset. To see this data as RDF, run the following cell.
Optionally, add data under the following headings:
Edge properties data – There are several modeling strategies in RDF to model properties of a relationship. We demonstrate singleton, reification, n-ary relation, and named graph approaches. The airports dataset uses named graphs. The cell under Edge properties data adds examples of the other three approaches. See What is RDF-star for a good summary of these approaches plus RDF-star, currently being drafted by a W3C working group.
Lots of reification and singleton instances – More reification and singleton, this time in greater numbers. This is mainly to stress-test discovery to find classes and properties defined more unconventionally than the other dataset.
Lists, multival, mulitype props – Examples of properties and relations with multiple values, possibly of different types. Some are in conventional structures such as lists and bags.
Same class name, different namespace – Adds an instance of a class with the same local name but a different prefix than another.
Ontology – Sample data that follows the W3C organization ontology. Our goal is not to reverse engineer an ontology from data in Neptune. For RDF, as for LPG, we visualize a UML class diagram as a helpful data inventory visualization. As we explain in Model-driven graphs using OWL in Amazon Neptune, ontology and UML have important differences. Still, as we demonstrate in this post, our visualization provides classes, properties, and relationships with ontological detail. To load the organization ontology and sample data, follow the steps in the notebook accompanying this post.
Get a summary
Begin your discovery by calling the Neptune Summary API.
Discover the schema
Run the next cell to query the Neptune database and build a class dictionary.
The cell calls the discover_observerational() function in rdf_discovery.py. That function issues a series of SPARQL queries to find RDF types, properties, and relationships. The following is an excerpt of airport and route:
The following are some important points about this data and how it is derived:
Classes and properties are named using Uniform Resource Identifiers (URIs) rather than simple names. For example, the airport class is named http://kelvinlawrence.net/air-routes/class/Airport. In RDF, URIs are used to name classes, properties, and individuals.
Airport is included as a class because we found individual airports whose type is airport.
Properties of airport can be data type or object properties. Data type properties include http://kelvinlawrence.net/air-routes/datatypeProperty/runways (integer) and http://kelvinlawrence.net/air-routes/datatypeProperty/icao (string). To arrive at this, the discovery algorithm simply queries the properties that individual airports have, and what their type is.
http://kelvinlawrence.net/air-routes/objectProperty/route is an object property that links an airport to another. The discovery algorithm determines this by querying individual airport resources.
As the excerpt shows, there is metadata associated with a route relationship. In particular, the route has an integer-valued distance property: http://kelvinlawrence.net/air-routes/datatypeProperty/dist. In LPG, route distance is modeled as an edge property. In RDF, it is modeled using named graphs; What is RDF-star covers the mechanics of this approach. The discovery algorithm finds this pattern again by examining individual airport resources.
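The bottom-up approach described in these points can be sketched as follows. The queries and the classify function are illustrative assumptions about how such discovery could work, not the queries in rdf_discovery.py. The bindings follow the standard SPARQL JSON results layout, and the sample bindings are made up, using URIs mentioned above.

```python
from typing import Any

# Find every class that has at least one individual.
CLASS_QUERY = "SELECT DISTINCT ?class WHERE { ?s a ?class }"


def property_query(class_uri: str) -> str:
    """Query the properties of a class's individuals and what they point to."""
    return (
        "SELECT DISTINCT ?p ?dt ?otype WHERE { "
        f"?s a <{class_uri}> ; ?p ?o . "
        "OPTIONAL { ?o a ?otype } "
        "BIND(DATATYPE(?o) AS ?dt) }"
    )


def classify(bindings: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Split properties into datatype and object properties."""
    props: dict[str, dict[str, Any]] = {}
    for row in bindings:
        if "dt" in row:  # the value is a literal with a datatype
            kind, rng = "datatype", row["dt"]["value"]
        elif "otype" in row:  # the value is a resource with a known type
            kind, rng = "object", row["otype"]["value"]
        else:
            continue  # untyped resource: not enough evidence to classify
        entry = props.setdefault(row["p"]["value"], {"kind": kind, "ranges": set()})
        entry["ranges"].add(rng)
    return {p: {"kind": e["kind"], "ranges": sorted(e["ranges"])} for p, e in props.items()}


AIR = "http://kelvinlawrence.net/air-routes/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Made-up bindings in SPARQL JSON results form.
bindings = [
    {"p": {"type": "uri", "value": AIR + "datatypeProperty/runways"},
     "dt": {"type": "uri", "value": XSD_INTEGER}},
    {"p": {"type": "uri", "value": AIR + "objectProperty/route"},
     "otype": {"type": "uri", "value": AIR + "class/Airport"}},
]
airport_props = classify(bindings)
print(airport_props[AIR + "objectProperty/route"]["kind"])
```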
Discover and merge ontology
The discovery from the previous step built a schema from the bottom up by generalizing from individual resources. In the RDF family, a top-down approach is also possible. An ontology is a semantic model for RDF data that specifies classes and properties. Web Ontology Language (OWL) and RDF Schema (RDFS) are W3C specifications covering classes, properties, and their semantics. Many RDF practitioners use OWL and RDFS. Significantly, OWL and RDFS are expressed as RDF data and can be ingested into Neptune along with individuals. OWL and RDFS are data too!
Our tool does not attempt to reverse engineer OWL or RDFS, nor does it attempt to visualize ontology. RDF data architects can choose from several third-party tools, including Protégé, to design and visualize ontologies. But if you have ingested an ontology into your Neptune database, the discovery tool seeks to find it and add ontological information to the PlantUML diagram. In particular, if a class or property previously observed is also defined in an ontology, we indicate that in the diagram.
Run the next cell to merge ontological information with the observed schema.
If you ingested ontology into your Neptune database, you will notice some classes and properties returned from discover_and_merge_ontological() in this cell have the isOntology flag set. For example, the discovered Organization class comes from an ontology:
However, the airport class does not. The airport class definition returned is the same as it was previously. There is no ontology behind the airports dataset.
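The merge step can be pictured with a short sketch. This is an assumption about how such a merge could work, not the code behind discover_and_merge_ontological(): it presumes a query has already collected the URIs of terms declared as OWL or RDFS classes and properties, and it sets the isOntology flag on observed classes and properties that appear in that set. The query text, the function name, and the sample data are hypothetical.

```python
from typing import Any

# Illustrative query for terms that an ontology declares.
ONTOLOGY_TERMS_QUERY = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?term WHERE {
  VALUES ?kind { owl:Class rdfs:Class owl:ObjectProperty owl:DatatypeProperty rdf:Property }
  ?term a ?kind
}
"""


def merge_ontology(
    observed: dict[str, dict[str, Any]], ontology_terms: set[str]
) -> dict[str, dict[str, Any]]:
    """Flag observed classes and properties that an ontology also defines."""
    merged: dict[str, dict[str, Any]] = {}
    for class_uri, cls in observed.items():
        props = {
            prop_uri: {**info, "isOntology": prop_uri in ontology_terms}
            for prop_uri, info in cls.get("props", {}).items()
        }
        merged[class_uri] = {
            **cls,
            "props": props,
            "isOntology": class_uri in ontology_terms,
        }
    return merged


ORG = "http://www.w3.org/ns/org#"
AIRPORT = "http://kelvinlawrence.net/air-routes/class/Airport"

# Made-up observed schema and ontology term set.
observed = {
    ORG + "Organization": {"props": {ORG + "identifier": {}}},
    AIRPORT: {"props": {}},
}
merged = merge_ontology(observed, {ORG + "Organization", ORG + "identifier"})
print(merged[ORG + "Organization"]["isOntology"], merged[AIRPORT]["isOntology"])
```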
Build and render PlantUML
Run the next cell to create PlantUML text for the merged schema.
There are two calls to rdf_discovery:
load_prefixes() – Because RDF resources are identified by verbose URIs, it is conventional to use short-form prefixes. For example, the URI for an RDFS label is http://www.w3.org/2000/01/rdf-schema#label, but the conventional short form is rdfs:label, where rdfs is a prefix referring to http://www.w3.org/2000/01/rdf-schema#. To keep our PlantUML diagram tidy, we use prefixes rather than full URIs. This function sets up a map of several hundred common prefixes.
to_plant_uml() – This renders PlantUML text.
Run the cell that follows to render the schema.
The diagram renders directly in the notebook. PlantUML also saves a file called rdf_all.svg in the notebook folder. Depending on what’s in your graph, the diagram might be large. We’ll look at one small section: Organization.
Notice the following:
Class and property names are of the form prefix_localName. For example, org_Organization has the prefix org (short form for http://www.w3.org/ns/org#). We write org_Organization rather than the more conventional org:Organization because the colon is already used as a separator in the UML class definition.
The class, as well as properties org_identifier and org_purpose, were found in an ontology loaded in Neptune. We use the <<ontology>> stereotype to indicate this.
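The prefix_localName convention can be illustrated with a minimal sketch. The shorten function and the two-entry prefix map are hypothetical; the actual tool loads a much larger map and may resolve names differently.

```python
# Small prefix map for illustration; both namespaces are standard W3C URIs.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "org": "http://www.w3.org/ns/org#",
}


def shorten(uri: str, prefixes: dict[str, str]) -> str:
    """Return prefix_localName for a known namespace, else the URI unchanged."""
    matches = [(p, ns) for p, ns in prefixes.items() if uri.startswith(ns)]
    if not matches:
        return uri
    # Prefer the longest matching namespace when namespaces overlap.
    prefix, namespace = max(matches, key=lambda item: len(item[1]))
    # Underscore instead of colon, because the colon separates a member
    # name from its type in a UML class definition.
    return f"{prefix}_{uri[len(namespace):]}"


print(shorten("http://www.w3.org/ns/org#Organization", PREFIXES))
print(shorten("http://example.com/unknown#Thing", PREFIXES))
```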
The airports schema is also in the diagram. Let’s examine that more closely.
Show a subset
To show only the airport classes, run the next cells.
As we did with the LPG schema, we filter down to airport, continent, country, and version classes. We call add_prefix() to specify URI prefixes airclass, airobj, and airdata. We see these prefixes used in the diagram. The airobj_route relationship between airports is drawn as an association. It has airdata_dist as metadata, the route distance in miles. This is shown as a note on the association. As the note indicates, this metadata is determined using the namedgraph edge property strategy.
Clean up
If you’re done with the solution and want to avoid future charges, delete the Neptune cluster and notebook instance.
Conclusion
In this post, we discussed why it is important to take an inventory of the types of data in your graph database. DEV and TEST databases are likely to have a variety of data; a picture of their contents brings much needed clarity. Even for a PROD database, whose contents are carefully controlled, this inventory may reveal surprises; the actual schema may differ from the intended schema. Consumers of this inventory include data architects and the application team (to view the model they are building against), new team members who are onboarding, and other stakeholders.
We reverse engineered the schema of a Neptune database by combining database statistics available from the Neptune Summary API with results of queries introspecting the structure of nodes, edges, and resources. We then visualized it as a class diagram using the diagram-as-code tool PlantUML.
To further investigate schema discovery and visualization, check out the following resources:
The post Diagram-as-code using generative AI to build a data model for Amazon Neptune discusses the use of generative artificial intelligence (AI) to create a PlantUML diagram as the basis for a graph database model.
The open source Graph Explorer visualizes data in your graph and presents a schema view of the data.
The Amazon Neptune utility for GraphQL schemas and resolvers GitHub repo includes a tool to reverse engineer property graph schemas. It creates a GraphQL API that enforces the schema.
Our RDF discovery tool uses ideas from the post Model-driven graphs using OWL in Amazon Neptune.
About the Author
Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. His Amazon author page is https://www.amazon.com/Michael-Havey/e/B001IO9JBI.