Chapter 19. Introduction to Neo4j

19.1. What is a graph database?

A graph database is a storage engine that is specialized in storing and retrieving vast networks of data. It efficiently stores nodes and relationships and allows high performance traversal of those structures. Properties can be added to nodes and relationships.

Graph databases are well suited for storing most kinds of domain models. In almost all domain models, certain things are connected to other things. Most other modeling approaches reduce the relationships between things to a single link without identity or attributes. A graph database keeps the rich relationships of the domain as first-class elements, so they do not have to be modeled as additional "things". There is very little "impedance mismatch" when putting real-life domains into a graph database.
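
To make the idea of relationships with identity and attributes concrete, the following is a minimal sketch of a toy in-memory property graph in plain Java. The class and method names (ToyGraph, relate) are hypothetical and only serve as illustration; this is not Neo4j's API or storage model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory property graph; illustrative only, not Neo4j's API.
class ToyGraph {

    static class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> relationships = new ArrayList<>();
    }

    // A relationship is an element in its own right: it has a type,
    // a start and end node, and its own properties.
    static class Relationship {
        final Node start;
        final Node end;
        final String type;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(Node start, Node end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Creates a typed relationship and registers it on both end points,
    // so it can be reached from either node.
    static Relationship relate(Node start, Node end, String type) {
        Relationship relationship = new Relationship(start, end, type);
        start.relationships.add(relationship);
        end.relationships.add(relationship);
        return relationship;
    }
}
```

With this model, a rating can be stored directly on the relationship, for example relate(user, movie, "RATED").properties.put("stars", 5), instead of introducing a separate "Rating" entity.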

19.2. About Neo4j

Neo4j is a graph database. It is a fully transactional database (ACID) that stores data structured as graphs. A graph consists of nodes, connected by relationships. Inspired by the structure of the human brain, it allows for high query performance on complex data, while remaining intuitive and simple for the developer.

Neo4j has been in commercial development for 10 years and in production for over 7 years. Most importantly it has a helpful and contributing community surrounding it, but it also:

  • has an intuitive graph-oriented model for data representation. Instead of tables, rows, and columns, you work with a graph consisting of nodes, relationships, and properties.
  • has a disk-based, native storage manager optimized for storing graph structures with maximum performance and scalability.
  • is scalable. Neo4j can handle graphs with many billions of nodes/relationships/properties on a single machine, but can also be scaled out across multiple machines for high availability.
  • has a powerful traversal framework for traversing in the node space.
  • can be deployed as a standalone server or an embedded database with a very small distribution footprint (~700k jar).
  • has a Java API.

In addition, Neo4j has ACID transactions, durable persistence, concurrency control, transaction recovery, high availability, and more. Neo4j is released under a dual free software/commercial license model.

19.3. GraphDatabaseService

The interface org.neo4j.graphdb.GraphDatabaseService provides access to the storage engine. Its features include creating and retrieving nodes and relationships, managing indexes (via the IndexManager), database life cycle callbacks, transaction management, and more.

The EmbeddedGraphDatabase is an implementation of GraphDatabaseService that is used to embed Neo4j in a Java application. It provides the tightest integration with the database. Besides the embedded mode, the Neo4j server provides access to the graph database via an HTTP-based REST API.

19.4. Creating nodes and relationships

Using the API of GraphDatabaseService, it is easy to create nodes and relate them to each other. Relationships are typed. Both nodes and relationships can have properties. Property values can be primitive Java types and Strings, or arrays of Java primitives or Strings. Creating and modifying nodes and relationships has to happen within a transaction, while reading from the graph store can be done with or without a transaction.

Example 19.1. Neo4j usage

GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "helloworld" );
Transaction tx = graphDb.beginTx();
try {
	Node firstNode = graphDb.createNode();
	Node secondNode = graphDb.createNode();
	firstNode.setProperty( "message", "Hello, " );
	secondNode.setProperty( "message", "world!" );

	// Relationship types can be created on the fly by name.
	Relationship relationship = firstNode.createRelationshipTo( secondNode,
		DynamicRelationshipType.withName( "KNOWS" ) );
	relationship.setProperty( "message", "brave Neo4j " );
	tx.success();
} finally {
	tx.finish();
}
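
The restriction on property value types described in this section can be sketched as a plain Java check. The class PropertyValues and its method are hypothetical helpers, not part of Neo4j. Rejecting null is an assumption of this sketch, since the text above does not say how null values are treated.

```java
// Hypothetical helper that mirrors the rule stated above:
// primitives (boxed here), Strings, or arrays of primitives or Strings.
class PropertyValues {

    static boolean isSupported(Object value) {
        if (value == null) {
            return false; // assumption of this sketch: null is not a valid value
        }
        Class<?> type = value.getClass();
        if (type.isArray()) {
            // Only one-dimensional arrays of primitives or Strings qualify.
            Class<?> component = type.getComponentType();
            return component.isPrimitive() || component == String.class;
        }
        return type == String.class || isPrimitiveWrapper(type);
    }

    private static boolean isPrimitiveWrapper(Class<?> type) {
        return type == Boolean.class || type == Character.class
            || type == Byte.class || type == Short.class
            || type == Integer.class || type == Long.class
            || type == Float.class || type == Double.class;
    }
}
```

A check like this could run before calling setProperty, so that an unsupported value such as a java.util.Date is converted (for example to a long timestamp) before it is stored.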

19.5. Graph traversal

Getting a single node or relationship and examining it is not the main use case of a graph database. Fast graph traversal and application of graph algorithms are. Neo4j provides a DSL for defining TraversalDescriptions that can then be applied to a start node and will produce a lazy java.lang.Iterable result of nodes and/or relationships.

Example 19.2. Traversal usage

TraversalDescription traversalDescription = Traversal.description()
        .depthFirst()
        .relationships(KNOWS)
        .relationships(LIKES, Direction.INCOMING)
        .evaluator(Evaluators.toDepth(5));
for (Path position : traversalDescription.traverse(myStartNode)) {
    System.out.println("Path from start node to current position is " + position);
}
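
To illustrate what a depth-limited, depth-first traversal produces, here is a minimal sketch over a toy adjacency map in plain Java. It is not the Neo4j traversal framework: the class ToyTraversal is hypothetical, the result is built eagerly rather than lazily, relationship types and directions are ignored, and visiting each node at most once is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy depth-first traversal; illustrative only, not Neo4j's API.
class ToyTraversal {

    // Returns the path to every node reachable from start within maxDepth
    // hops, in depth-first pre-order. The start node itself is at depth 0.
    static List<List<String>> depthFirst(Map<String, List<String>> outgoing,
                                         String start, int maxDepth) {
        List<List<String>> paths = new ArrayList<>();
        visit(outgoing, start, maxDepth, new ArrayList<>(), new HashSet<>(), paths);
        return paths;
    }

    private static void visit(Map<String, List<String>> outgoing, String node,
                              int remainingDepth, List<String> prefix,
                              Set<String> visited, List<List<String>> paths) {
        if (!visited.add(node)) {
            return; // already seen: each node is reported once
        }
        List<String> path = new ArrayList<>(prefix);
        path.add(node);
        paths.add(path);
        if (remainingDepth == 0) {
            return; // depth limit reached, do not expand further
        }
        for (String next : outgoing.getOrDefault(node, Collections.emptyList())) {
            visit(outgoing, next, remainingDepth - 1, path, visited, paths);
        }
    }
}
```

Each element of the result plays the role of the Path in Example 19.2: it records how the current position was reached from the start node.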

19.6. Indexing

The best way to retrieve start nodes for traversals is to use Neo4j's integrated index facilities. The GraphDatabaseService provides access to the IndexManager, which in turn provides named indexes for nodes and relationships. Both can be indexed by property names and values. Retrieval is done with query methods on the indexes, which return an IndexHits iterator.

Spring Data Neo4j provides automatic indexing via the @Indexed annotation, eliminating the need for manual index management.

Note

Modifying Neo4j indexes also requires transactions.

Example 19.3. Index usage

IndexManager indexManager = graphDb.index();
Index<Node> nodeIndex = indexManager.forNodes("a-node-index");
Node node = ...;
Transaction tx = graphDb.beginTx();
try {
    nodeIndex.add(node, "property","value");
    tx.success();
} finally {
    tx.finish();
}
for (Node foundNode : nodeIndex.get("property","value")) {
    // found node
}
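
Conceptually, an exact-match index maps a (key, value) pair to the set of entities that were added under it. The following plain Java sketch shows that idea only. The class ToyIndex is hypothetical and says nothing about how Neo4j implements its indexes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy exact-match index; illustrative only, not Neo4j's implementation.
class ToyIndex<T> {

    // key -> value -> entities indexed under that key/value pair
    private final Map<String, Map<Object, Set<T>>> entries = new HashMap<>();

    // Loosely corresponds to nodeIndex.add(node, key, value).
    void add(T entity, String key, Object value) {
        entries.computeIfAbsent(key, k -> new HashMap<>())
               .computeIfAbsent(value, v -> new LinkedHashSet<>())
               .add(entity);
    }

    // Loosely corresponds to nodeIndex.get(key, value);
    // returns an empty set when nothing matches.
    Set<T> get(String key, Object value) {
        Map<Object, Set<T>> byValue =
            entries.getOrDefault(key, Collections.emptyMap());
        return byValue.getOrDefault(value, Collections.emptySet());
    }
}
```

Looking up start nodes this way avoids scanning the whole graph: the traversal begins at the few nodes returned by the index and only follows relationships from there.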

19.7. Querying with Cypher

With version 1.4.M04, Neo4j introduced a textual query language called "Cypher", which draws from many sources: graph matching as in SPARQL, keywords and a query structure reminiscent of SQL, and an iconic representation of graph patterns. A screencast presenting Cypher queries on the cineasts.net dataset is available at video.neo4j.org. Cypher was written in Scala to leverage the language's expressiveness for lazy sequence operations and its parser combinator library.

Cypher queries always begin with a start set of nodes. These can be expressed either by their ids or by an index lookup expression. In the match clause, the start nodes are then related to other nodes. Both the start and the match clause can introduce new identifiers for nodes and relationships. The where clause applies additional filtering to the result set by evaluating boolean expressions. The return clause defines which parts of the query result will be available. Aggregation also happens in the return clause, by applying aggregation functions to some of the values. Sorting happens in the order by clause, and the skip and limit parts restrict the result set to a certain window.
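
The role of the individual clauses can be sketched as a pipeline over toy data in plain Java. This is only an analogy for how the clauses narrow and shape a result set, not how Cypher is implemented or executed. The class ClausePipeline, the Rating type, and the sample data used with it are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Analogy for the Cypher clauses using Java streams; illustrative only.
class ClausePipeline {

    // One (user)-[r:RATED]->(movie) match, flattened into a row.
    static class Rating {
        final String login;
        final String title;
        final int stars;

        Rating(String login, String title, int stars) {
            this.login = login;
            this.title = title;
            this.stars = stars;
        }
    }

    static List<String> topRatedTitles(List<Rating> ratings, String login,
                                       int minStars, int skip, int limit) {
        return ratings.stream()
            // start + match: keep the ratings that belong to the start user
            .filter(r -> r.login.equals(login))
            // where: boolean filter on a relationship property
            .filter(r -> r.stars > minStars)
            // order by: stars descending, then title ascending
            .sorted(Comparator.comparingInt((Rating r) -> r.stars).reversed()
                .thenComparing(r -> r.title))
            // skip and limit: restrict the result to a window
            .skip(skip)
            .limit(limit)
            // return: project the values that should be visible
            .map(r -> r.title + " (" + r.stars + ")")
            .collect(Collectors.toList());
    }
}
```

Called with a login of "micha" and a minimum of 3 stars, the method corresponds roughly to the "User-Ratings" query: start at one user, follow the RATED relationships, filter on r.stars, and return the movie titles.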

Cypher can be executed on an embedded graph database using ExecutionEngine and CypherParser. Spring Data Neo4j encapsulates this in CypherQueryEngine. The Neo4j REST server comes with a Cypher plugin that is accessible remotely and is available in the Spring Data Neo4j REST binding.

Example 19.4. Cypher Examples on the Cineasts.net Dataset

// Actors of Forrest Gump:
start movie=(Movie,id,'13') match (movie)<-[:ACTS_IN]-(actor)
    return actor.name, actor.birthplace?

// User-Ratings:
start user=(User,login,'micha') match (user)-[r,:RATED]->(movie) where r.stars > 3
    return movie.title, r.stars, r.comment

// Mutual Friend recommendations:
start user=(User,login,'micha') match (user)-[:FRIEND]-(friend)-[r,:RATED]->(movie) where r.stars > 3
    return friend.name, movie.title, r.stars, r.comment?

// Movie suggestions based on a movie:
start movie=(Movie,id,'13') match (movie)<-[:ACTS_IN]-()-[:ACTS_IN]->(suggestion)
    return suggestion.title, count(*) order by count(*) desc limit 5

// Co-actors of Lucy Liu, sorted by count and name
start lucy=(1000) match (lucy)-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(co_actor)
    return count(*), co_actor.name order by count(*) desc,co_actor.name limit 20

// recommendations including counts, grouping and sorting
start user=(User,login,'micha') match (user)-[:FRIEND]-(friend)-[r,:RATED]->(movie)
    return movie.title, AVG(r.stars), count(*) order by AVG(r.stars) desc, count(*) desc

19.8. Gremlin, a Graph Traversal DSL

Gremlin is an expressive Groovy DSL developed by Marko Rodriguez as part of the TinkerPop stack. It builds on top of a pipe implementation (Blueprints Pipes) that uses connected operations to traverse a graph. Gremlin has a concise syntax but is Turing complete.

Gremlin can be executed by including the TinkerPop and Blueprints dependencies and then requesting a ScriptEngine of type "gremlin" from the javax.script facilities. In Spring Data Neo4j this is encapsulated in GremlinQueryEngine. The Neo4j REST server also comes with a Gremlin plugin that is accessible remotely and is available in the Spring Data Neo4j REST binding.
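
Requesting a script engine by name uses the standard javax.script API. The sketch below shows only that lookup. The class GremlinEngineLookup is a hypothetical helper, and whether an engine named "gremlin" is found depends entirely on the Gremlin dependencies being on the classpath.

```java
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical helper around the standard javax.script lookup.
class GremlinEngineLookup {

    // getEngineByName returns null when no engine is registered under
    // the given name, so the result is wrapped in an Optional.
    static Optional<ScriptEngine> find(String name) {
        ScriptEngineManager manager = new ScriptEngineManager();
        return Optional.ofNullable(manager.getEngineByName(name));
    }
}
```

A caller would use GremlinEngineLookup.find("gremlin") and, if an engine is present, bind the graph to a script variable and pass the Gremlin script to the engine's eval method. How the graph is bound is specific to the Gremlin distribution in use and is not shown here.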

Example 19.5. Sample Gremlin Queries

// Vertex with id 1
v = g.v(1)

// determine the name of the vertices that vertex 1 knows and that are older than 30 years of age
v.outE{it.label=='knows'}.inV{it.age > 30}.name

// calculate basic collaborative filtering for vertex 1
m = [:]
g.v(1).out('likes').in('likes').out('likes').groupCount(m)
m.sort{a,b -> a.value <=> b.value}
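
For readers unfamiliar with Gremlin, the collaborative filtering query can be read as: start at one vertex, go to what it likes, go back to everyone who likes those things, go to what they like, and count how often each result is reached. The following plain Java sketch computes the same counts over a toy map. The class ToyCollaborativeFilter and its data layout are hypothetical and do not use the Gremlin or Blueprints APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy version of out('likes').in('likes').out('likes').groupCount(m).
class ToyCollaborativeFilter {

    // likes maps each person to the set of items that person likes.
    // The result maps each reachable item to the number of paths that
    // lead to it, which is what groupCount accumulates.
    static Map<String, Integer> recommend(Map<String, Set<String>> likes,
                                          String person) {
        Map<String, Integer> counts = new HashMap<>();
        Set<String> ownItems = likes.getOrDefault(person, Collections.emptySet());
        for (String item : ownItems) {                        // out('likes')
            for (Set<String> otherItems : likes.values()) {   // in('likes')
                if (!otherItems.contains(item)) {
                    continue; // this person does not like the item
                }
                for (String candidate : otherItems) {         // out('likes')
                    counts.merge(candidate, 1, Integer::sum); // groupCount(m)
                }
            }
        }
        return counts;
    }
}
```

As in the Gremlin query, the starting person is also reached by the in('likes') step, so items the person already likes appear in the counts. A real recommendation would typically filter those out and sort the remaining entries by count.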