Continuing the introduction of the Neo4j Graph data science library graph catalog feature from the last blog post, where we took a deep dive in Native projections, we will take a look at Cypher projection in this blog post. It is the more expressive variant of the graph projection and allows us to project virtual graphs as well.

## Graph import

We will be using some procedures from the APOC library in our import statement. I have gotten so used to always having APOC around, that I rarely even mention it individually. It contains so many awesome procedures, it is hard to imagine life without it.

### Import nodes

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/ontologies/ontology.csv" as row FIELDTERMINATOR "\t" WITH row, CASE row.type WHEN 'per' THEN 'Person' WHEN 'gro' THEN 'Group' WHEN 'thin' THEN 'Thing' WHEN 'pla' THEN 'Place' END as label CALL apoc.create.nodes(['Node',label], [apoc.map.clean(row,['type','subtype'],[null,""])]) YIELD node WITH node, row.subtype as class MERGE (c:Class{id:class}) MERGE (node)-[:PART_OF]->(c);

### Import relationships

UNWIND ['1','2','3'] as book LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/tables/networks-id-volume" + book + ".csv" AS row MATCH (source:Node{id:coalesce(row.IdSource,row.Source)}) MATCH (target:Node{id:coalesce(row.IdTarget,row.Target)}) CALL apoc.create.relationship(source, "INTERACTS_" + book, {weight:toInteger(row.Weight)}, target) YIELD rel RETURN distinct true;

## Cypher projection

The general syntax to use cypher projection is:

```
CALL gds.graph.create.cypher(
graphName: String,
nodeQuery: String,
relationshipQuery: String,
configuration: Map
)
```

Node query is a cypher statement used to describe the nodes we want to project. It must return the internal ids of the nodes and optionally any of their properties. Relationship query, on the other hand, describes the relationships we want to project. The cypher statement should return internal ids of source and target nodes of relationships, and optionally their type and any of their properties.

### Whole graph

We will begin with a simple scenario and project the entire graph in memory. Adding the column **type** in the relationship query allows the data science engine to distinguish between relationship types, which in turn gives us an option to filter relationships when executing algorithms.

CALL gds.graph.create.cypher( 'whole-graph', // nodeQuery 'MATCH (n) RETURN id(n) AS id', // relationshipQuery 'MATCH (n)-[r]->(m) RETURN id(n) AS source, id(m) AS target, type(r) as type'

As in the previous blog post, we will start with the weakly connected components algorithm. It is used to examine how many islands or disconnected components are there in our graph, which will help us better understand results from other graph algorithms. Also, sometimes we might want to run other graph algorithms only on the largest connected component.

CALL gds.wcc.stream('whole-graph') YIELD nodeId, componentId RETURN componentId, count(*) as size ORDER BY size DESC limit 10

Results

componentId | size |
---|---|

0 | 86 |

As there is only one connected component in our graph, we don’t have to worry about skewed results from other graph algorithms.

#### Drop the projected graph from the catalog

After each analysis, we will release the projected graph from memory.

CALL gds.graph.drop('whole-graph');

### Undirected weighted relationships graph

Next, we are going to project an undirected weighted graph. Let’s take a look at how does the native projection handle undirected relationships:

`UNDIRECTED`

: each relationship is projected in both natural and reverse orientation

To produce an undirected relationship with the cypher projection, we project a single connection in both directions, effectively allowing the graph data science engine to traverse the relationship in both directions. Let’s take a look at the following example to gain a better understanding. We create two nodes with a single connection between them.

CREATE (:Test)-[:REL]->(:Test);

To project the relationship in both directions, we only have to omit the direction of the relationship in our **MATCH** statement, and that’s it. A tiny but crucial detail!

MATCH (n:Test)-[r]-(m:Test) RETURN id(n) as source, id(m) as target;

Results

source | target |
---|---|

1565 | 1566 |

1566 | 1565 |

We will also demonstrate a favorable way to project multiple node labels using an **UNION** statement in the node query.

CALL gds.graph.create.cypher( 'undirected_interactions', // nodeQuery 'MATCH (n:Person) RETURN id(n) AS id UNION MATCH (n:Thing) RETURN id(n) as id', // relationshipQuery (notice no direction on relationship) 'MATCH (n)-[r:INTERACTS_1|:INTERACTS_2|:INTERACTS_3]-(m) RETURN id(n) AS source, id(m) AS target, type(r) as type, r.weight as weight')

#### Random walk algorithm

To stray away from the common graph algorithms like PageRank or Louvain, let’s use the Random Walk algorithm. In essence, it mimics how a drunk person would traverse our graph. It is commonly used in the node2vec algorithms. We define Frodo as the start node and then walk five random steps twice.

MATCH (n:Node{Label:'Frodo'}) CALL gds.alpha.randomWalk.stream('undirected_interactions', {start:id(n), steps:5, walks:2}) YIELD nodeIds RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).Label] as result

Results

result |
---|

[“Frodo”, “Galadriel”, “Treebeard”, “Legolas”, “Shadowfax”, “Faramir”] |

[“Frodo”, “Faramir”, “Sam”, “Isildur”, “Elendil”, “Sauron”] |

You will get different results as it is a random walk algorithm after all.

#### Triangle count and clustering coefficient

Another useful algorithm for analyzing social networks is Triangle Count and Clustering Coefficient algorithm. A triangle composes of three nodes, where each node has a relationship to the other two. The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. This allows us to estimate how tightly-knit nodes in our graph are.

CALL gds.alpha.triangleCount.write('undirected_interactions', {relationshipTypes:['INTERACTS_1'], writeProperty:'triangles', clusteringCoefficientProperty:'clustering'}) YIELD nodeCount, triangleCount, averageClusteringCoefficient

Results

nodeCount | triangleCount | averageClusteringCoefficient |
---|---|---|

44 | 803 | 0.5420570114601906 |

The global or average clustering coefficient is 0.54, which means that the persons in our graph are quite tightly-knit. We can also look at individuals and their local clustering coefficients.

MATCH (p:Person) RETURN p.Label as person, p.clustering as coefficient, p.triangles as triangles ORDER BY coefficient DESC LIMIT 5

Results

person | coefficient | triangles |
---|---|---|

“Thráin” | 1.0 | 6 |

“Treebeard” | 1.0 | 1 |

“Goldberry” | 1.0 | 3 |

“Gildor” | 0.9333333333333333 | 14 |

“Bill” | 0.8571428571428571 | 24 |

Let’s visualize Thráin and his connections to observe how they form a tight community.

All of Thrain’s contacts have also interacted with each other, hence the 1.0 value of the clustering coefficient.

#### Drop the projected graph from the catalog

CALL gds.graph.drop('undirected_interactions');

## Categorical PageRank

Up until this point, all the above graph analysis could be done with the native projection. Cypher projection can project graphs that exist only at query time or when we want to use more advanced filtering than just node labels or relationship types.

Categorical PageRank is a concept first introduced by Kenny Bastani in his blog post. I have also written a blog post about it using the Graph algorithms library. Now it is time to demonstrate it with the Graph Data Science library as well.

The idea behind it is pretty simple. As shown in the example above, we have a graph of pages that have links between each other and might also belong to one or more categories. To better understand the global pagerank score of nodes in a network, we can breakdown our graph into several subgraphs, one for each category and execute the pagerank algorithm on each of that subgraphs. We store results as a relationship property between category and pages. This way, we can break down which are the contributing categories to page’s global pagerank score.

We will start by assuming that each interaction is a positive endorsement(I know it’s actually not, but let’s pretend). We will breakdown our graph into several subgraphs by the class(men, elves) that the characters belong to. For example, when calculating the pagerank for the category of men, all nodes will be considered. Still, only relationships that come from characters that belong to the class of men will be considered.

CALL gds.graph.create.cypher( 'categorical_men', // nodeQuery 'MATCH (n:Person) RETURN id(n) AS id UNION MATCH (n:Thing) RETURN id(n) as id', // relationshipQuery 'MATCH (c:Class)<-[:PART_OF]-(n)-[r:INTERACTS_1|:INTERACTS_2|:INTERACTS_3]-(m) // Use the parameter WHERE c.id = $class RETURN id(n) AS source, id(m) AS target, r.weight as weight,type(r) as type', {parameters: { class: 'men' }})

Let us now run the weighted pageRank on this graph.

CALL gds.pageRank.stream('categorical_men', {relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).Label as name, score ORDER BY score DESC LIMIT 5

Results

name | score |
---|---|

“Gandalf” | 0.49039578306834236 |

“Aragorn” | 0.4722053224572498 |

“Frodo” | 0.3216374519151103 |

“Pippin” | 0.31153302141635497 |

“Théoden” | 0.30782923403613094 |

Gandalf comes out on top, with Aragorn following very closely in the second place. I am wondering if Aragorn has the most support from men in the third book, as he becomes their king at the end.

CALL gds.pageRank.stream('categorical_men', {relationshipTypes:['INTERACTS_3'], relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).Label as name, score ORDER BY score DESC LIMIT 5

Results

name | score |
---|---|

“Aragorn” | 0.5174521665039286 |

“Gandalf” | 0.462588360118827 |

“Pippin” | 0.42312021984967696 |

“Faramir” | 0.39774636760597715 |

“Éomer” | 0.3670254192234562 |

As predicted, Aragorn takes the lead. Frodo is no longer on the list as he is quite isolated from everybody in the third book and walks alone with Sam to Mount Doom. To be honest, if you have Sam by your side, you are never lonely though.

#### Drop the projected graph from the catalog

CALL gds.graph.drop('categorical');

### Virtual categorical graph

We have only looked at the class of men and calculated the categorical pagerank for that specific subgraph. It would be very time consuming if we projected a subgraph for each class separately. That is why we will project a subgraph for each class in a single named graph using virtual relationship types. In the relationship query, we have the option to return whatever we feel like as the column **type**. We will return the original relationship type combined with the class of the person to create virtual relationship types. This will allow us to calculate the categorical pagerank for each class.

CALL gds.graph.create.cypher( 'categorical_virtual', 'MATCH (n:Person) RETURN id(n) AS id UNION MATCH (n:Thing) RETURN id(n) as id', 'MATCH (c:Class)<-[:PART_OF]-(n)-[r:INTERACTS_1|:INTERACTS_2|:INTERACTS_3]-(m) RETURN id(n) AS source, id(m) AS target, type(r) + c.id as type, r.weight as weight')

We can now calculate the categorical pagerank for the class of elves.

CALL gds.pageRank.stream('categorical_virtual', {relationshipTypes:['INTERACTS_1elves','INTERACTS_2elves','INTERACTS_3elves'], relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).Label as name, score ORDER BY score DESC LIMIT 5

Results

name | score |
---|---|

“Frodo” | 0.35386480254496355 |

“Aragorn” | 0.31640726328765784 |

“Gandalf” | 0.2517195241898796 |

“Elrond” | 0.2502563451744132 |

“Galadriel” | 0.24039259304369945 |

If we want to calculate categorical pagerank for each class and store the results, we can use the same approach as in the original blog post, where we saved the results in the relationship between a class and a person.

MATCH (c:Class) CALL gds.pageRank.stream('categorical_virtual', {relationshipTypes:['INTERACTS_1'+c.id,'INTERACTS_2'+c.id,'INTERACTS_3'+c.id], relationshipWeightProperty:'weight'}) YIELD nodeId, score WITH c, gds.util.asNode(nodeId) as node, score WHERE score > 0.151 CREATE (c)-[:PAGERANK{score:score}]->(node)

We can now get the top three members for each class based on their categorical pagerank score.

MATCH (c:Class)-[s:PAGERANK]->(p:Person) WITH c, p, s.score as pagerank ORDER BY pagerank DESC RETURN c.id as class, collect(p.Label)[..3] as top_3_members

Result

class | top_3_members |
---|---|

“hobbit” | [“Frodo”, “Sam”, “Gandalf”] |

“men” | [“Gandalf”, “Aragorn”, “Frodo”] |

“elves” | [“Frodo”, “Aragorn”, “Gandalf”] |

“dwarf” | [“Gandalf”, “Gimli”, “Legolas”] |

“ainur” | [“Frodo”, “Bombadil”, “Gandalf”] |

“animal” | [“Sam”, “Gandalf”, “Frodo”] |

“orcs” | [“Sam”, “Frodo”, “Shelob”] |

“thing” | [“Frodo”, “Gandalf”, “Sam”] |

“ents” | [“Gandalf”, “Merry”, “Pippin”] |

Frodo leads with four first places, Gandalf has three first places, and Sam has two. Most of the results are self-explanatory, like the support from dwarf class, where Gandalf leads, with Gimli himself being a dwarf in second place, so he probably has interactions with all the other dwarves. Because Legolas hangs out with Gimli so much, he gets to be in the third place. On the other hand, Sam and Frodo venture into the heart of Orc land by themselves, and have the most interactions with Orcs (not really endorsement).

#### Drop the projected graph from the catalog

CALL gds.graph.drop('categorical_virtual');

## Conclusion

Cypher projection is very expressive and allows us to project any subgraph of the stored graph as well as any virtual graph that exists only at query time. We are mostly limited only by our imagination. Try out the Graph Data Science library for yourselves and let us know if you have any suggestions or feedback.

As always, the code is available on GitHub.