Finding alternative routes in California road network with Neo4j

The focus of this blog post is to introduce Yen’s k-shortest path algorithm that was recently added to Neo4j graph algorithms. I will use Californian road network dataset made available by Feifei Li.

Next we will enrich our graph using Google’s reverse geocoding API and then proceed to find the shortest path with Dijkstra algorithm and the second shortest path using Yen’s k-shortest algorithm.

Graph schema

Our graph has a simple schema of nodes labeled Intersection connected with relationship CONNECTION to other intersections.

cas.png

Import

Lets first define the constraint in our graph schema.

CREATE CONSTRAINT ON (i:Intersection) ASSERT i.id IS UNIQUE;

Dataset is split into nodes and relationship files. Lets import them both to get all the data we need.

Import nodes

USING PERIODIC COMMIT
LOAD CSV FROM
"https://www.cs.utah.edu/~lifeifei/research/tpq/cal.cnode" 
as row fieldterminator " "
MERGE (i:Intersection{id:row[0]})
ON CREATE SET i.longitude = toFloat(row[1]),
              i.latitude = toFloat(row[2])

Import relationships

USING PERIODIC COMMIT
LOAD CSV FROM
"https://www.cs.utah.edu/~lifeifei/research/tpq/cal.cedge" 
as row fieldterminator " "
MATCH (start:Intersection{id:row[1]})
MATCH (end:Intersection{id:row[2]})
MERGE (start)-[c:CONNECTION{id:row[0]}]->(end)
ON CREATE SET c.length = toFloat(row[3])

Reverse geocode API

For every intersection in our graph we can get the address based on the GPS location with the help of Google’s reverse geocoding API . I used apoc.util.sleep(50) to throttle and wait 50 ms between each API call. It cost me around €7 to get this data as I couldn’t find a free version of the API :/.

MATCH (i:Intersection)
CALL apoc.util.sleep(50)
WITH "https://maps.googleapis.com/maps/api/geocode/json?latlng=" + 
toString(i.latitude) + "," + toString(i.longitude) + "&key={google_api_key}" as json,i
CALL apoc.load.json(json) yield value
SET i.name = value.results[0].formatted_address

Analysis

Lets start with visualizing Santa Barbara’s part of the road network with neovis.js.

 

santa_barbara.png

Neovis configuration
var config = {
   container_id: "viz",
   server_url: "bolt://localhost:7687",
   server_user: "neo4j",
   server_password: "neo4j",
   labels: {
     "Intersection": {
      "caption": "title"
      }
    },
   relationships: {
     "CONNECTION": {
      "caption": false
      }
    },
   initial_cypher: "MATCH p=(i1:Intersection)-[:CONNECTION]->(i2:Intersection)" +
     "WHERE i1.name contains 'Santa Barbara' AND i2.name contains 'Santa Barbara'" +
     "RETURN p limit 500"
 };

Shortest path

We use algo.shortestPath, that uses Dijkstra algorithm,  to find the shortest path between “Cathedral Peak Trail” and “5750 Stagecoach Rd”. We set direction:BOTH to treat our graph as undirected.

MATCH (start:Intersection{title:"Cathedral Peak Trail"}),
      (end:Intersection{title:"5750 Stagecoach Rd"})
CALL algo.shortestPath.stream(start,end,'length',{direction:'BOTH'}) YIELD nodeId,cost
MATCH (n) where id(n)= nodeId
RETURN n.title,cost

Visualization made with neovis.js.

shortest_path.png

Neovis configuration
var config = {
   container_id: "viz",
   server_url: "bolt://localhost:7687",
   server_user: "neo4j",
   server_password: "neo4j",
   labels: {
     "Intersection": {
       "caption": "title"
     }
   },
   relationships: {
     "CONNECTION": {
       "thickness": "length",
       "caption": false
     }
   },
   initial_cypher: "MATCH (start:Intersection{title:'Cathedral Peak Trail'}),(end:Intersection{title:'5750 Stagecoach Rd'}) " +
     "CALL algo.shortestPath.stream(start,end,'length',{direction:'BOTH',nodeQuery:'Intersection',relationshipQuery:'CONNECTION'}) YIELD nodeId,cost " +
     "MATCH (n) where id(n)=nodeId " + 
     "WITH collect(n) as nodes " +
     "UNWIND range(0, length(nodes)-2) as index " +
     "WITH nodes[index] as from, nodes[index + 1] as to " + 
     "MATCH p=(from)-[:CONNECTION]-(to) " +
     "RETURN p"
 };

Yen’s k-shortest paths

Yen’s k-shortest paths algorithm computes single-source K-shortest loopless paths for a graph with non-negative relationship weights.

It uses Dijkstra algorithm to find the shortest path and then proceeds to find k-1 deviations of the shortest paths. If we want to find the second shortest path we use k=2 as shown below.

MATCH (start:Intersection{title:"Cathedral Peak Trail"}),
(end:Intersection{title:"5750 Stagecoach Rd"})
CALL algo.kShortestPaths(start, end, 2, 'length' ,{})
YIELD resultCount
RETURN resultCount

Results are stored as paths in our graph.

MATCH p=()-[:PATH_0|:PATH_1]->()
RETURN p

Shortest path is visualized in red and second shortest path in yellow. We can easily observe that the paths diverge right at the start and join together 2 hops before the end.

yens-k2.png

Conclusion

With the addition of Yen’s k-shortest algorithm to the Neo4j graph algorithms library we can now search for alternative shortest routes in our graph. This can come in handy in various domains.

Advertisements

Neo4j Graph Visualizations using GoT dataset

Once again I will use the data made available by Andrew Beveridge to first demonstrate the use of categorical pageRank and breakdown pageRank by the sequence of the books, that will help us find the winners of game of thrones and secondly to show some visualizations options Neo4j community has to offer.

To find more details about the dataset check Analyzing the Graph of Thrones by William Lyon or my Neo4j GoT social graph analysis.

Import

Lets first define the schema of our graph.

We only need one constraint on Person label. This will speed up our import and later queries.

CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

As the data of all five books is available on github we can import all 5 books using a single cypher query and APOC‘s load.json.

We will differentiate networks from different books using separate relationship types. We need to use apoc.merge.relationship  as Cypher does not allow using parameters for relationship types. Network from the first book will be stored as relationship type INTERACTS_1, second INTERACTS_2 and so on.

UNWIND ['1','2','3','45'] as book
LOAD CSV WITH HEADERS FROM 
'https://raw.githubusercontent.com/mathbeveridge/asoiaf/master/data/asoiaf-book' + book + '-edges.csv' as value
MERGE (source:Person{id:value.Source})
MERGE (target:Person{id:value.Target})
WITH source,target,value.weight as weight,book
CALL apoc.merge.relationship(source,'INTERACTS_' + book, {}, {weight:toFloat(weight)}, target) YIELD rel
RETURN distinct 'done'

Categorical pagerank

As described in my previous blog post, categorical pageRank is a concept where we break down the global pageRank into categories and run pageRank on each category subset of the graph separately to get a better understanding of the global pageRank.

Here we will use books as categories, so that we get character’s importance breakdown by the sequence of books.

UNWIND ['1','2','3','45'] as sequence
MERGE (book:Book{sequence:sequence})
WITH book,sequence
CALL algo.pageRank.stream(
 'MATCH (p:Person) WHERE (p)-[:INTERACTS_' + sequence + ']-() RETURN id(p) as id',
 'MATCH (p1:Person)-[INTERACTS_' + sequence + ']-(p2:Person) RETURN id(p1) as source,id(p2) as target',
 {graph:'cypher'})
YIELD node,score
// filter out nodes with default pagerank 
// for nodes with no incoming rels
WITH node,score,book where score > 0.16
MERGE (node)<-[p:PAGERANK]-(book)
SET p.score = score

 

Biggest winner of the game of thrones by books so far

Basically we will order pageRank values by the sequence of the books and return top ten characters with the highest positive changes in pageRank.

MATCH (person:Person)<-[pagerank:PAGERANK]-(book:Book)
// order by book sequence
WITH person,pagerank,book order by book asc
WITH person,collect(pagerank) as scores
RETURN person.id as person,
       scores[0].score as first_score,
       scores[-1].score as last_score,
       length(scores) as number_of_books 
ORDER BY last_score - first_score DESC LIMIT 10

While Jon Snow leads by absolute positive difference in pageRank, Victarion Greyjoy is very interesting. He had pageRank score 0.59 in the second book, was missing in third, and jumped to 4.43 in fourth and fifth book.

Stannis Baratheon is probably at the peak of his career judging by the show and is surprisingly in second place. Other than that the list is made out of the usual suspects.

goty.png

I also checked the characters with the biggest negative change, but it turns out that they are mostly dead so it’s not all that interesting.

Spoonjs

Thanks to Michael Hunger spoonJS is back. With it we can visualize charts directly in Neo4j browser.

Within a few clicks you can get it set up following the guide.

:play spoon.html

In our example we will visualize characters sorted by pageRank in the last two books combined.

MATCH (p:Person)<-[r:PAGERANK]-(:Book{sequence:'45'})
RETURN p.id as person,r.score as score
ORDER BY score DESC LIMIT 15

 

pagerank_book45

Three out of the first four places belong to the Lannisters, with the most important being Tyrion. If you think about it from this perspective what GoT is really about, you might think it’s just a family crisis of the Lannisters with huge collateral damage 🙂

3d force graph

Another cool visualization project by Michael is called 3d-force-graph. It lets us visualize and explore graphs.

We will use pageRank to define the size of the nodes, so that the most important nodes will be the biggest. To represent communities in the graph we use the color of the nodes.

We need to run label propagation or Louvain algorithm to find communities within our graph and store them as a property of the nodes.

We run label propagation using only the network of characters from the last two books.

CALL algo.labelPropagation('Person','INTERACTS_45','BOTH',{partitionProperty:'lpa',iterations:10})

I like this visualization because it is 3d and you can approach from different angles and zoom in and out of the graph while exploring it.

got_gif.gif

Neovisjs

We can also use neovis.js, developed by William Lyon, to visualize graphs. Similarly as before we use label propagation results to color the nodes. To mix it up a bit we will use betweenness centrality of the nodes, instead of pageRank, to represent the size of the nodes in the graph.

Run betweenness centrality algorithm.

CALL algo.betweenness('Person','INTERACTS_45',{direction:'BOTH',writeProperty:'betweenness'})

In the visualization we also defined the size of the relationships based on the weight and colored them according to the community of the pair of nodes they are connecting.

got_2.gif

 

Code for Neovis and 3d-force-graph visualization used in the post can be found on github. Have fun!

Neo4j GoT social graph analysis

Lately I have been showing how to project a bi-partite graph to mono-partite and run algorithms on it. To mix it up a little I will demonstrate how to project a network between nodes using cosine similarity of certain features that the node possess.  We will be using the Network of thrones dataset based on A Storm of Swords book. Check out also their analysis on the same network.

To find communities of people, that are in a similar position of power in the network, we will first calculate pagerank, harmonic centrality, triangles count and cluster coefficient for each node and use those values as features from which we will infer a similarity network using cosine similarity. This will allow us to find communities of people who are in a similar position of power.

Requirements:

Graph model:

got-datamodel

William Lyon has also wrote a blog post how to import and analyse this network. I have stolen both the graph model picture aswell as the cypher query for importing the data into Neo4j.

LOAD CSV WITH HEADERS FROM 
"https://www.macalester.edu/~abeverid/data/stormofswords.csv" AS row
MERGE (src:Character {name: row.Source})
MERGE (tgt:Character {name: row.Target})
MERGE (src)-[r:INTERACTS]->(tgt)
ON CREATE SET r.weight = toInt(row.Weight)

Centrality

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph.

Given this assumption we will use centrality indicators as features for the k-nearest neighbors algorithm (k-NN) that will infer a similarity network. As the centralities are not the focus of this blog post I will skip the theory and just run the write back versions of algorithms.

Calculate and write back triangles count and clustering coefficient.

CALL algo.triangleCount('Character', 'INTERACTS',
{write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'}) 
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Calculate and write back pageRank score

CALL algo.pageRank(
 'MATCH (c:Character) RETURN id(c) as id',
 'MATCH (p1:Character)-[:INTERACTS]-(p2:Character) 
  RETURN id(p1) as source, id(p2) as target',
{graph:'cypher'});

Calculate and write back harmonic centrality

CALL algo.closeness.harmonic(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c1:Character)-[:INTERACTS]-(c2:Character) 
   RETURN id(c1) as source, id(c2) as target',
{graph:'cypher', writeProperty: 'harmonic'});

k-NearestNeighbour

First we will normalize our features using the min-max method. Cluster coefficient is already normalized so there is no need to do it again. If you want to do something similar on bigger graphs I would suggest you use apoc.periodic.iterate  for batching.

WITH ["pagerank","harmonic","triangles"] as keys
UNWIND keys as key
MATCH (c:Character)
WITH max(c[key]) as max,min(c[key]) as min,key
MATCH (c1:Character)
WITH c1, key + "normalized" AS newKey, 
    (toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) 
YIELD node
RETURN COUNT(*)

First we match all pairs of nodes and compare their cosine similarity. To avoid a complete graph where all nodes are connected between each other, we will set a similarity threshold, meaning that all relationships with cosine similarity less than 0.9 will be ignored and not stored. I am slightly cheating as with the kNN algorithm you define to how many nearest neighbours should the node be connected to and here we are defining how similar or close should the pair of nodes be to store a relationship.

Again for bigger graphs you should use APOC for batching. I wrote a blog post with a very similar example.

MATCH (c1:Character),(c2:Character) where id(c1) < id(c2)
WITH c1,c2,apoc.algo.cosineSimilarity([c1.pageranknormalized,
                                       c1.coefficient,
                                       c1.harmonicnormalized,
                                       c1.trianglesnormalized],
                                      [c2.pageranknormalized,
                                       c2.coefficient,
                                       c2.harmonicnormalized,
                                       c2.trianglesnormalized]) 
                                         as cosine_similarity
WHERE cosine_similarity > 0.9
MERGE (c1)-[s:SIMILAR_POSITION]-(c2)
SET s.cosine = cosine_similarity

Community Detection

A network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into communities of nodes with dense connections internally and sparser connections between communities.

 

Let’s check what community structure will Louvain algorithm find in our network.

CALL algo.louvain.stream('Character', 'SIMILAR_POSITION',
{}) 
YIELD nodeId, community
MATCH (c:Character) where id(c) = nodeId
RETURN community,
       count(*) as communitySize,
       collect(c.name) as members 
ORDER BY communitySize ASC LIMIT 20;

Illyrio Mopatis is all alone in Pentos and probably has no network power at all. The most interesting group is community 8, where all the cream of the book is collected ranging from Starks, Targaryens and Lannisters to interestingly also Stannis, Mance and Davos.

Community 106 looks like a community of captains and maesters and differs from largest community in that the largest community has a higher average cluster coefficient.

GOTlouvain.png

 

Let’s try the Label Propagation algorithm and check what it finds.

CALL algo.labelPropagation(
'MATCH (c:Character) RETURN id(c) as id, 1 as weight, id(c) as value',
'MATCH (c1:Character)-[f:INTERACTS]->(c2:Character) 
RETURN id(c1) as source, id(c2) as target,f.weight as weight',
"OUT",{graph:'cypher',partitionProperty:'lpa',iterations:10});

gotlpa.png

We can immediately observe that LPA returns much more granular communities than Louvain in this network. The biggest community consists of Starks and Lannisters with an addition of Varys. It’s safe to say he deserves this spot.  On the other hand Jon is in the community with all members from Night Watch. Daenerys is also left out of the “strongest” community and shares community with Jorah and ser Barristan. She just wasn’t such a badass in season 3 as she has became in season 7 🙂

 

Neo4j Graph Algorithms projecting a virtual graph

Last time I started to talk about projecting virtual graphs with cypher queries and showed how to project a bidirected graph between nodes of only the largest component (island) in the graph. For now we learned how to project bidirected graphs and apply similarity threshold with “cypher loading”. Another common use that I haven’t talked about yet is projecting virtual relationships (not stored) to algorithms engine.

Requirements:

Dataset:

We will use the actors database that we get following :play movies guide in Neo4j browser. We have persons and movies in our graph and various relationships between them. I like to use this dataset as a example of what Neo4j is capable of.

matrix1.png

Projecting a virtual graph:

I will show how to project an undirected mono-partite graph from a bi-partite graph via cypher queries. This is pretty useful when you want to analyse different networks of your graph, but don’t want to actually save those networks to your database.

Louvain:

Projecting a mono-partite graph from a bi-partite graph can be easily undestood as:

(:Movie)(:Person)(:Movie)

to

(:Movie)←[{weight=Nb of Common persons}]→(:Movie)

We will project a virtual undirected graph with a similarity threshold of 2 and run Louvain algorithm on it. With it we will try to find groups of movies that share more than 2 common persons.

CALL algo.louvain.stream(
 'MATCH (m:Movie) RETURN id(m) as id',
 'MATCH (m1:Movie)<--(:Person)-->(m2:Movie)
 WITH m1,m2,count(*) as common_actors
 WHERE common_actors > 2
 RETURN id(m1) as source,id(m2) as target, common_actors as weight'
,{graph:'cypher'}) yield nodeId,community
MATCH (movie:Movie) where id(movie)=nodeId
RETURN  community,
        count(*) as communitySize,
        collect(movie.title) as movies
ORDER BY communitySize desc limit 5

kajapa.png

There aren’t many communities as this dataset is small. The largest community are movies that are directed by Lana Wachowski and Lilly Wachowski.

Pagerank

We can also project a person to person network from our bi-partite graph.

(:Person)(:Movie)(:Person)

to

(:Person)←[:{weight=Nb of common movies}]→(:Person)

Similarly like before we will project an undirected graph with a similarity threshold of 2 and run pageRank on it.

call algo.pageRank.stream(
  'MATCH (p:Person) return id(p) as id',
  'MATCH (p1:Person)-->(:Movie)<--(p2:Person)
   WITH p1,p2,count(*) as commonMovies        
   WHERE commonMovies > 2
   RETURN id(p1) as source, id(p2) as target',
{graph:'cypher'})
YIELD node,score WITH node,score
ORDER BY score desc limit 10
RETURN node.name as person, score

matrix.png

Camerow Crowe leads in first place by far margin followed by the Wachowski brothers. We have most of the Matrix crew in the first 10 spots as they collaborated in 3 movies and have stronger ties between each others.

Conclusion:

I skipped theory behind graph algorithms on purpose as I have explained it a couple of times now and don’t want to repeat myself too much. The mission of this blog post is to demonstrate the abilities cypher loading offers us in terms of projecting virtual graphs. If you want to learn more check previous posts and the documentation.

Have some feedback or questions?
Join the #neo4j-graph-algorithms channel on  Neo4j slack community

Neo4j Marvel Social Graph Algorithms Centralities

To top off the Marvel Social Graph series we will look at how to use centralities on a projected graph via cypher queries to find influencers or otherwise important nodes in our network using Neo4j and neo4j-graph-algorithms library.

To recap the series:

Graph projections via cypher queries:

As we noticed in the previous part using graph projections via cypher queries or for short “cypher loading”  is really great as it lets us filter and/or project virtual graphs easily and quickly. To let you fully take advantage of this awesome tool we need to get to know exactly how it works.

Unlike default label and relationship type of loading subsets of graphs, where we can in some cases define direction of the relationship to be either “incoming”,”outgoing” or “both”(birected/undirected) , cypher loading does not support loading single relationship as undirected.

While this may seem bad it’s actually not as cypher loading allows us to get creative with trying out graph algorithms on different virtual networks that we can project using cypher queries. We already did this in the previous post, but I didn’t describe it in detail yet.

Imagine that we have two hero nodes and a single directed relationship between them.
Only difference between loading this graph as undirected or directed is specifying the direction of the relationship in the cypher query. When we do not specify the direction of the relationship in the cypher query, cypher engine returns two patterns in both direction for each relationship in our graph. That in turn projects our network bidirected or undirected.

projecting directed network:

MATCH (u1:Hero)-[rel:KNOWS]->(u2:Hero)
RETURN id(u1) as source, id(u2) as target

projecting undirected network:

MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero)
RETURN id(u1) as source, id(u2) as target

Centralities

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.[1]

Pagerank

PageRank is Google’s popular search algorithm. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the node is. The underlying assumption is that more important nodes are likely to receive more links from other websites.

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

call algo.pageRank.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source, id(u2) as target
',{graph:'cypher'}
) yield node,score 
WITH node,score ORDER BY score DESC limit 10 
return node.name as name, score;

Captain America has the highest pagerank score. He is located in the middle of the network with a total of 24 relations and also relations to most of the other important heroes in the network like Thor, Spiderman and Iron Man. If we check all heroes related to Captain America, we can notice, that they have on average higher Pagerank score just because of this relation to Captain America.

marvel_pagerank.png

* Node color from white (less) to red (more): Pagerank

Closeness centrality

Closeness centrality is defined as the total number of links separating a node from all others along the shortest possible paths. In other words, to calculate closeness, one begins by calculating, for each pair of nodes in the network, the length of the shortest path from one to the other (aka the geodesic distance). Then for each node, one sums up the total distance from the node to all other nodes.[2]

Closeness can be interpreted as an index of time-until-arrival of something flowing through the network. The greater the raw closeness score, the greater the time it takes on average for information originating at random points in the network to arrive at the node. Equally, one can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.[2]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships. With closeness centrality it is especially important that we load only a single component.

Unfortunately, when the graph is unconnected, closeness centrality appears to be useless because the distance between two nodes belonging to different components is infinite by convention, which makes the sum in 2 infinite too and therefore its inverse equal to zero. For every node of such a graph, there is always another node belonging to another component: indices of all vertices of the graph are therefore useless and the calculation of the index is limited to the largest component, omitting the roles played by individuals of other components.[3]

CALL algo.closeness.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

Captain America is in such a privileged position, that he will be leading in all categories of centralities. We can observe that nodes in more tight communities have higher closeness centrality indexes while those on the brinks and less connected have smaller values. Second thing we can notice is that also the overall position of nodes in the graph matter as the middle community has on average higher closeness centrality as others. As an example both Iron Man and Vision have higher closeness centrality than Spiderman, while Spiderman has higher Pagerank index than them.

marvel_closeness.png

* Node color from white (less) to red (more): Closeness centrality

Harmonic Centrality

The harmonic mean has been known since the time of Pythagoras and Plato as the mean expressing “harmonious and tuneful ratios”, and later has been employed by musicians to formalize the diatonic scale, and by architects as paradigm for beautiful proportions.[4]

Social network analysis is a rapid expanding interdisciplinary field, growing from work of sociologists, physicists, historians, mathematicians, political scientists, etc. Some methods have been commonly accepted in spite of defects, perhaps because of the rareness of synthetic work like (Freeman, 1978; Faust & Wasserman, 1992). Harmonic centrality was proposed as an alternative index of closeness centrality defined on undirected networks. Results show its computation on real cases are identical to those of the closeness centrality index, with same computational complexity and we give some interpretations. An important property is its use in the case of unconnected networks.[3]

CALL algo.closeness.harmonic.stream( 
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id 
',' 
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100 
RETURN id(u1) as source,id(u2) as target '
,{graph:'cypher'}) YIELD nodeId, centrality 
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10 
MATCH (h:Hero) where id(h)=nodeId 
RETURN h.name as hero, centrality

Harmonic centrality was proposed as an alternative for closeness centrality to help solve the problem of disconnected components. Because of this we get back very similar results, given that we also have a single connected component.

marvel_harmonic

Betweenness Centrality

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.[6]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

CALL algo.betweenness.stream(
// Match only the biggest component
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

As always Captain America is in first place and this time Beast being in the second place. This comes as no surprise as we can observe that he is the sole bridge between middle and right community. Spiderman and Incredible Hulk play a similar role as Beast, but have smaller communities behind them and hence also smaller betweenness scores.

marvel_betweeness.png

* Node color from white (less) to red (more): Betweenness centrality

References:

[1] https://en.wikipedia.org/wiki/Centrality

[2] http://qualquant.org/wp-content/uploads/networks/2008%201-7-3.pdf

[3] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf?

[4] https://arxiv.org/pdf/cond-mat/0008357.pdf

[6] https://en.wikipedia.org/wiki/Betweenness_centrality

Neo4j Marvel Social Graph Analysis

This is part 2 of Marvel series. In  previous post  I imported the data from kaggle competition and showed how to project a mono-partite graph from a bipartite graph. We used a topological similarity measure, that takes into account how many common comics any pair of heroes have.

For easier understanding we can represent this as the following function:

(:Hero)(:Comic)(:Hero)

to

(:Hero)←[:KNOWS{weight=Nb of Common Comics}]→(:Hero)

To find out more check this Gephi tutorial. We could have also projected mono-partite graph looking like:

(:Comic)←[{weight=Nb of Common Heroes}]→(:Comic)

You will need to have the graph imported to proceed with the following steps in analysis.

Requirements:

Graph model:

g.png

We end up with a simple graph model. We started with a bi-partite(two types/labels of nodes) network of Heroes and Comics and then inferred a mono-partite (single type/label of nodes) graph amongst Heroes based on the number of common Comics.

Analysis

We will analyse the inferred network of Heroes. I usually start with some global statistics to get a sense of the whole graph and then dive deeper.

Weight distribution

First we check the distribution of similarity weights between pairs of heroes. Weight value translates to the number of common Comics Heroes have appeared in.

MATCH ()-[k:KNOWS]->()
RETURN (k.weight / 10) * 10 as weight,count(*) as count 
ORDER BY count DESC LIMIT 20

You might wonder why we use (k.weight / 10) * 10 as it looks silly at first glance. If we divide an integer by an integer in Neo4j we get back integer. I use it as an bucketizing function,that groups numeric data into bins of 10, so that it is easier to read results.

weig

162489 out of total 171644 relationships (94%) in our network has weight 9 or less. This means that most of our Heroes have only briefly met.

Largest weight is between “THING/BENJAMIN J. GR” and “HUMAN TORCH/JOHNNY S” at value 724. I would assume they are good buddies.

Even though we have a well connected graph, most of relationships are “weak” judging by weight. I would assume that most comics have standard teams of heroes, where not necessarily all of the team appear in every comic.

Second assumption I would make is that there are occasinal comics where different “teams” of heroes appear in together, hence so many weak relationships.

To check my assumptions I start with this query to get a basic feel.

MATCH (u:Comic)
RETURN avg(apoc.node.degree(u,'APPEARED_IN')) as average_heroes,
stdev(apoc.node.degree(u,'APPEARED_IN')) as stdev_heroes,
max(apoc.node.degree(u,'APPEARED_IN')) as max_heroes,
min(apoc.node.degree(u,'APPEARED_IN')) as min_heroes

local

I personally prefer distribution over average. We use the “bucketizing” function as before:

MATCH (u:Comic)
RETURN (size((u)-[:APPEARED_IN]-()) / 10) * 10 as heroCount,
count(*) as times
ORDER BY times DESC limit 20

heroCount
Looks like my assumptions are plausible and quite possible. 8999 (71%) of comics have less than 10 heroes playing, with most of them probably at around 5 or less as the total average is 7,5. There are a few comics where we have “family gatherings” with more than 30 heroes. One of them is a comic named COC1, probably Contest of Champions, where 110 heroes play.

Normalize weight:

If we need to, we can normalize the weight using min-max method.
Notice how this time we use (toFloat(k1.weight) - min) / (max - min). By declaring k.weight a float, we are dividing a float by an integer, that returns a float and does not bucketize/group number into bins

MATCH (:Hero)-[k:KNOWS]->(:Hero) 
//get the the max and min value
WITH max(k.weight) as max,min(k.weight) as min
MATCH (:Hero)-[k1:KNOWS]->(:Hero) 
//normalize
SET k1.weight_minmax = (toFloat(k1.weight) - min) / (max - min)

Triangle count / Clustering coefficient

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971;[1] Watts and Strogatz, 1998[2]).

Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local gives an indication of the embeddedness of single nodes.[1]

CALL algo.triangleCount('Hero', 'KNOWS',
{write:true, writeProperty:'triangles',
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Running this algorithms writes back local triangle count and clustering coefficient, while providing total triangle count and average clustering coefficient of the graph as return.
Find out more in documentation.

av

Average clustering is very high at 0.77, considering that 1 would mean we have a complete graph, where everybody knows each other. This comes as no surprise as we observed before that our network is very connected with most of relationships being “weak”.

Connected Components

Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect bugs can result when your algorithm on a disconnected graph.

 

Connected components have other practical use cases, for example, if we are analysing a social network and we want to find all the disconnected groups of people that exist in our graph.

More in documentation.

CALL algo.unionFind.stream('Hero', 'KNOWS', {}) 
YIELD nodeId,setId
WITH setId,count(*) as communitySize
// filter our single node communities
WHERE communitySize > 1 
RETURN setId,communitySize
ORDER BY communitySize DESC LIMIT 20

com.png

Our graph consists of 22 total components with the largest covering almost all the graph(99,5%). There is 18 single node communities, and 3 very small ones with 9, 7 and 2 members.

Lets check who are the members of these small components.

CALL algo.unionFind.stream('Hero', 'KNOWS', {}) 
YIELD nodeId,setId
WITH setId,collect(nodeId) as membersId,count(*) as communitySize 
// skip the largest component
ORDER BY communitySize DESC SKIP 1 LIMIT 20
MATCH (h:Hero) WHERE id(h) in membersId
RETURN setId,collect(h.name) as members 
ORDER BY length(members) DESC

We need to match nodes by their nodeId from result so that we can get back the names of heroes.

othe

18 heroes never appeared in a comic with any other heroes. Some of the are RED WOLF II, RUNE, DEATHCHARGE etc. Second largest component with 9 members seems to be the ASHER family and some of their friends.

Weighted Connected Components

What if decided that two heroes co-appearing in a single comic is not enough of interaction for us to tag them as colleagues and raise the bar to 10 common comics.Here is where weighted Connected Component algorithm can help us.

If we define the property that holds the weight(weightProperty) and the threshold,it means the nodes are only connected, if the threshold on the weight of the relationship is high enough otherwise the relationship is thrown away.

In our case it means that two heroes need at least 10 common comics to be considered connected.

CALL algo.unionFind.stream('Hero', 'KNOWS', 
{weightProperty:'weight', threshold:10.0}) 
YIELD nodeId,setId
RETURN setId,count(*) as communitySize
ORDER BY communitySize DESC LIMIT 20;

Define weightProperty and threshold.

com10.png

As we could expect the biggest component drops significantly in size from 99,5% to 18,6% of total number of nodes. Using threshold can be used to find potentially resilient teams of heroes as more interactions between heroes (we take any co-appearance as positive) means longer history and more bonding, which can translate to resilience.

We can play around with different threshold settings.

Not only community sizes are interesting, but also their members. We check the members of the largest communities skipping the first one as showing 1200 names in a browser can be messy.

CALL algo.unionFind.stream('Hero', 'KNOWS', 
{weightProperty:'weight', defaultValue:0.0, threshold:10.0}) 
YIELD nodeId,setId
WITH setId,collect(nodeId) as membersId,count(*) as communitySize 
// skip the largest component
ORDER BY communitySize DESC SKIP 1 LIMIT 20
MATCH (h:Hero) WHERE id(h) in membersId
RETURN setId,collect(h.name) as members 
ORDER BY length(members) DESC

com10skip.png

Second largest community has really cool names like Bodybag,China Doll and Scatter Brain. As I have no clue of the domain I can’t really give any other comment.

 

Conclusion:

While this post is long, it does not contain everything I want to show on Marvel graph, so there will be one more post using centralities with cypher loading and maybe some community detection with Louvain and Label Propagation Algorithm.

Hope you learned something today!

Reference:

[1] https://en.wikipedia.org/wiki/Clustering_coefficient

Neo4j Facebook ego network analysis

Graph algorithms plugin for Neo4j is getting bigger and bigger by the day, so here are some new algorithms, that have been added to the collection. I will use Facebook’s ego network of friends to show triangle count algorithms and betweenness centrality algorithms on an undirected graph in Neo4j.

Requirements:

Graph model:

We will import a friends social network of Facebook’s users. There are 4039 nodes labeled User and 88234 unweighted relationships indicating friendship between users as shown below. You can always get the schema of your graph using CALL dbms.schema().

user

We create a constraint on User labels, so that our queries will run faster.

CREATE CONSTRAINT ON (u:User) ASSERT u.id is unique

Import:

Direct link to data is here. We will model this network as undirected. There is a twist here, that Neo4j supports storing only directed relationships. That is no problem as we can still treat it as undirected in our queries and algorithms.

USING PERIODIC COMMIT 5000
LOAD CSV FROM 
"file:///facebook_combined.txt" as row fieldterminator ' '
MERGE (u:User{id:row[0]})
MERGE (u1:User{id:row[1]})
MERGE (u)-[:FRIEND]-(u1)

We are using MERGE without specifying direction of the relationship as it does not matter in an undirected graph. Using MERGE like this will check for any existing relationships in both direction and if none exists it will choose a random direction to store the new relationship.

Graph Analysis:

Find (weakly) connected components in the graph ?

I always start with checking how many connected components are there in the graph. Most graph algorithms run best on an connected component, so I usually choose the largest component to run further algorithms on. Check documentation for more.

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
return setId as component, count(*) as size

un

As we can see there exists only one connected component in this graph. This is very good as there are no disconnected components. Graph algorithms library also support finding strongly connected components with algo.scc. In undirected graph every weakly connected component is also strongly connected so we can skip it.

 Get average number of friends and standard deviation.

MATCH (u:User)
return avg(size((u)--())) as average_degree,
       stdev(size((u)--())) as stdev

st

A user has on average 43 friends. Standard deviation is quite high so the number of friends is spread out over a wider range of values. We can check now for users with most friends to start looking for influencers in our network.

Degree:

MATCH (u:User)
RETURN u.id as id,size((u)--()) as degree 
order by degree desc limit 10

deg

Most basic metric is degree, that we can get using plain cypher. This directly translates to the number of friends a user has. It can be used as a good starting ground to finding influencers in our network. Top 4 users have more than 10x the average friends with user 107 being really high at 1045 friends (26% of all users).

Calculate triangle count and clustering coefficient for each node and return total triangle count and average clustering coefficient of the network.

CALL algo.triangleCount('User', 'FRIEND',
{concurrency:4, write:true, writeProperty:'triangles', 
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient,
loadMillis, computeMillis, writeMillis

global

There are 4039 node with 1612010 triangles in our network. Average clustering is quite high at 0.6055. This is expected as evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of relationships. [1]

Check user local triangle count and cluster coefficient.

The local clustering coefficient of a node in a graph quantifies how close its neighbours are to being a complete graph, where complete graph is a simple undirected graph in which every pair of distinct nodes is connected by a unique relationship.[1]

If put simple the clustering coeffient would be 1 when all of my friends knew eachother.
Above algorithms also saved values of individual nodes as properties of nodes, so we can now query the results.

MATCH (u:User)
return u.id,u.triangles as triangles,
            u.coefficient as cluster_coefficient 
order triangles desc limit 10

tri

User 1912 is a member of more triangles then user 107 , who has the most friends. They both have a small clustering coefficient, which indicates, that their friends don’t really know eachother and they have a great potencial for introducing people to one another. User 1912,107 were third and first by the number of friends, so combined with this results they are solidifying their importance in the network. There is quite a drop in triangle count to third place, while on the other hand their neighbour communities are more tightly knit.

Betweenness centrality:

Betweenness centrality is useful in finding nodes that serve as a bridge from one group of users to another in  a graph. Consequently, betweenness is a rudimentary measure of the control, that a specific node exerts over the information flow throughout the full graph.

We use direction:'BOTH'for algorithms on an undirected graph.

Check documentation for more.

CALL algo.betweenness.stream('User', 'FRIEND', 
{direction:'BOTH'}) 
YIELD nodeId,centrality
MATCH (u:User) where id(u) = nodeId
RETURN u.id, centrality order by centrality desc limit 15

betwennness

User 107 tops the betweenness centrality results, which is not surprising as he has the most amount of friends with a very low clustering coeffient. So he in the position to act as a broker of informations between different groups of users. He sure does appear as the main influencer in this network based on the metrics we checked. Based on the combined view of centrality metrics we can create a list with users, who are likely to be important in our network and do some further research on them.

Conclusion:

As you can see graph algorithms plugin greatly improves our ability to analyse the network in Neo4j. You should check them out! If you think some algorithms are still missing or find some bugs please let us know on github or the Neo4j slack community.

 

[1] https://en.wikipedia.org/wiki/Clustering_coefficient

Neo4j London tube system analysis

I recently came across London tube dataset, that was uploaded by Nicola Greco. I thought it would a cool example to show some algorithms from the new Neo4j graph algorithms plugin. They are really useful as they allow network analysis without using external services to run the algorithms on.

Requirements:

Graph model:

We will create a very simple graph model, with stations and connections between them as shown below. You can check the always check the schema of your graph with db.schema() procedure.

london

Lets define the schema for our graph model with a constraint and an index, so that all queries will run faster.

CREATE CONSTRAINT ON (s:Station) ASSERT s.id is unique;

CREATE INDEX ON :Station(name);

Import:

LOAD CSV can import data from local files or from the internet, which is really cool, so that we can access data on github easily without the need of any intermediate steps.

Import stations

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row
MERGE (s:Station{id:row.id})
ON CREATE SET s.name = row.name,
              s.latitude=row.latitude,
              s.longitude=row.longitude,
              s.zone=row.zone,
              s.total_lines=row.total_lines

Import connections between stations.
We create relationships in both directions as trains also travel in both directions. We save the time spent traveling between stations as a relationship property, so that we can use it as a weight in our algorithms.

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row
MATCH (s1:Station{id:row.station1})
MATCH (s2:Station{id:row.station2})
MERGE (s1)-[:CONNECTION{time:row.time,line:row.line}]->(s2)
MERGE (s1)<-[:CONNECTION{time:row.time,line:row.line}]-(s2)

Analysis:

Which stations have the most connections to other stations?

MATCH (n:Station)--(n1)
RETURN n.name as station,
       count(distinct(n1)) as connections 
order by connections desc LIMIT 15

Results:

connections

Find the fastest path based between two stations.

algo.shortestPath is a dijsktra algorithm, that returns shortest path between start and end node. For more info check the documentation.

transfers are not taken into account

MATCH (start:Station{name:"Baker Street"}),(end:Station{name:"Beckton Park"})
CALL algo.shortestPath.stream(start, end, 'time')
YIELD nodeId, cost
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,cost as time

Results:

single_connection

Find out which stations take the longest to get to from Baker Street station.

algo.shortestPaths is a dijsktra algorithm, that return shortest paths from start node to all the other nodes in the network.

transfers are not taken into account

MATCH (n:Station {name:'Baker Street'})
CALL algo.shortestPaths.stream(n, 'time')
YIELD nodeId, distance
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,distance as time 
ORDER BY time desc LIMIT 15

Results:

single_many

Find the longest shortest paths between stations in the network.

algo.allShortestPaths returns shortest paths between all pairs of nodes in the network

transfers are not taken into account

CALL algo.allShortestPaths.stream('time',
{nodeQuery:'Station',relationshipQuery:'CONNECTION',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance 
ORDER BY distance desc LIMIT 20
// We filter out duplicates
WHERE sourceNodeId > targetNodeId
MATCH (s1:Station),(s2:Station) 
WHERE id(s1)=sourceNodeId AND id(s2)=targetNodeId
RETURN s1.name as station1,s2.name as station2,distance as time

Results:

all_paths.png

Find bridge stations in network.

We can use betweenness centrality to help us identify, which are the stations, where people will be passing by the most, based on the structure of the network.

algo.betweenness is a betweenness algorithm, that helps us find bridges in the network. For more info check documentation.

CALL algo.betweenness.exp1.stream('Station','CONNECTION')
YIELD nodeId, centrality
MATCH (n:Station) where id(n)=nodeId
RETURN n.name,centrality 
ORDER BY centrality desc
LIMIT 15

Results:

bett

Conclusion:

It is very easy to enhance Neo4j with graph algorithms. You just copy the .jar file to /plugins folder, restart Neo4j and you are good to go. They offer lots of cool algorithms, that can help you analyse your graph in Neo4j. You should try them out!

Neo4j apoc.load.jdbc

I recently made a simple query, that I used for syncing postgreSQL to Neo4j on an hourly basis with a cron job. My job was to make a simple scripts, that would automate the syncing of Neo4j with a SQL database. This is a very simple task thanks to APOC functions, which has loads of very useful functions, we can call with cypher.

I had to import a couple of SQL tables and join them together in a graph to have a better overview of the data. This is an example of query, that was used to import a table with sales data.

Graph Model:

jdbc2.png

As you can see we are importing some sales data from the postgreSQL database in my case. I made a very simple graph model to capture the data we get about our sales. Main focus is the invoice. We use domain instead of company name as an index as this works out better in our case.

You will need:

Just download them and put them into neo4j /plugins directory and that’s it!

Authorization:

Sample call looks like:

CALL apoc.load.jdbc('jdbc:mysql://localhost:3306/proddb1?user=root&password=football',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

with jdbc:mysql://localhost:3306/proddb1?user=root&password=football' being the connection string, that contains all the sensitive information such as ip, username and password of your SQL database.
If you want to hide/alias the connection string, this can be accomplished by adding to the conf/neo4j.conf a parameter similar to:

apoc.jdbc.myalias.url=jdbc:mysql://localhost:3306/proddb1?user=root&password=football

and now the above sample call can be rewritten as:

CALL apoc.load.jdbc('myalias',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

For more info check knowledge base post.

Query:

I made some logic with cypher, where I first get the date of the latest invoice in my Neo4j and then only select those that are were updated on the same day or later. I imagine there is a better way of doing this, but it serves our purpose.

// match existing invoices
MATCH (w:Invoice)
// get the latest date in the desired format
WITH apoc.date.format(max(w.unix),'s','yyyy-MM-dd') as last_invoice
// prepare SQL select statement
WITH 
'SELECT * FROM public.order WHERE updated_at::date >=?::date' as order_query,
last_invoice
// call load.jdbc with the latest date of existing data as a param
CALL apoc.load.jdbcParams(postgresurl, order_query, [last_invoice]) yield row
// we need to parse out domains from the emails
WITH row,split(row.customer_email,"@")[1] as domain
// if the email is in below list, then take whole email as a code
// so that those with gmail are not treated as one customer
WITH *,case when domain in ["gmail.com","yahoo.com","outlook.com","hotmail.com","msn.com","hotmail.co.uk"] 
  then row.customer_email
  else domain end as code
MERGE (i:Invoice{order_id:row.order_number})
// we round all the numbers to 2 digits
ON CREATE SET i.status = row.status,
              i.total_amount = apoc.math.round(row.total,2,"UP"),
              i.quantity = toINT(row.total_line_items_quantity),
              i.total_tax = apoc.math.round(row.total_tax,2,"UP"),
// save a human readable date format
              i.date = apoc.date.format(row.completed_at,"ms","yyyy-MM-dd'T'HH:mm:ss.SSS"),
              i.postcode = row.customer_billing_address_postcode,
              i.address = row.billing_address_address_1,
              i.total_discount = apoc.math.round(row.total_discount,2,"UP"),
              i.total_shipping = apoc.math.round(row.total_shipping,2,"UP"),
              i.unix = (row.completed_at / 1000),
              i.invoice_number = row.order_meta_tzs_invoice_number,
              i.total_refunded = apoc.math.round(row.refunded_total,2,"UP")
// columns, that can change in our invoices in time, we add to ON MATCH statement	
ON MATCH SET i.status = row.status,
             i.total_refunded = apoc.math.round(row.refunded_total,2,"UP"), 
             i.total_tax = apoc.math.round(row.total_tax,2,"UP"), 
             i.total_amount = apoc.math.round(row.total,2,"UP"),
// merge domain and person nodes and connect them 
MERGE (domain:Domain{name:code})
MERGE (person:Person{email:row.customer_email})
MERGE (domain)-[:MEMBER]->(person)
ON CREATE SET 
    person.name = row.billing_address_first_name + " " + row.customer_billing_address_last_name,
              person.ip = row.customer_ip ,
              person.phone = row.customer_billing_address_phone
// merge payment menthod node
MERGE (pay:PaymentMethod{name:row.payment_details_method_id})
MERGE (i)-[:PAYMENT_METHOD]->(pay)
// merge location tree
MERGE (country:Country{code:row.billing_address_country})
MERGE (country)(city)
RETURN distinct 'orders imported'

This is so easy I had to share with you, so you can maybe do a pilot project on your own to see how does Neo4j fit to your needs. Just wrap around it your favourite scripting language and you are good to go.

Neo4j Game of Thrones part 1

We are only a month away from the seventh Game of Thrones season, so the time is right to import GoT data into our favourite graph database Neo4j. Data is from the books, so some things that happened on the TV show, haven’t happened yet in the books. The data that is available on kaggle is a good material to get started as I can show you all the queries you will need, when importing and linking different sources of data together. This will probably end up 3 to 4 part series, so that we can cover all the material available.

Data:

We will be using kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. I cleaned up the names of the persons a bit, so that they match better. There are 3 CSVs available.

  • battles.csv — contains information about all of the battles in game of thrones.
  • character-deaths.csv — contains information about characters and when they died.
  • character-predictions.csv –This csv takes an expanded view on character deaths, including predictions of how likely they are to die.

In this first part we will import battles.csv.

Graph model:

Screen Shot 2017-06-23 at 21.35.53

We will start with a simple graph model containing Person,House,Battle nodes and a very simple location tree.Unfortunately we cannot link Person to House nodes as they are not explicity marked. We will get this connections from the other two CSVs.

Import:

First we will create constraints and indexes, so that our queries will be faster.

CALL apoc.schema.assert(
{Location:['name'],Region:['name']},
{Battle:['name'],Person:['name'],House:['name']});

We will split the import into 4 different queries for easier understanding.

Battle:

First we create :Battle nodes using MERGE.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
//merge node labeled Battle 
MERGE (b:Battle{name:row.name})
ON CREATE SET b.year = toINT(row.year),
              b.summer = row.summer,
              b.major_death = row.major_death,
              b.major_capture = row.major_capture,
              b.note = row.note,
              b.battle_type = row.battle_type,
              b.attacker_size = toINT(row.attacker_size),
              b.defender_size = toINT(row.defender_size)

Foreach trick:

If we check the data, we can see that some battles have one attacker and some have four. They are stored in separate columns, so when there is only one attacker, there are 3 columns with a null value. When there are null values in a column and we want to skip them (ignore them) when using MERGE on that property, we can use the following foreach trick.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// there is only attacker_outcome in the data, 
// so we do a CASE statement for defender_outcome
WITH row,
case when row.attacker_outcome = "win" THEN "loss" ELSE "win" END as defender_outcome
// match the battle
MATCH (b:Battle{name:row.name})
// all battles have atleast one attacker so we don't have to use foreach trick
MERGE (attacker1:House{name:row.attacker_1}) 
MERGE (attacker1)-[a1:ATTACKER]->(b) 
ON CREATE SET a1.outcome = row.attacker_outcome

// When we want to skip null values we can use foreach trick
FOREACH
  (ignoreMe IN CASE WHEN row.defender_1 is not null THEN [1] ELSE [] END | 
    MERGE (defender1:House{name:row.defender_1})
    MERGE (defender1)-[d1:DEFENDER]->(b)
    ON CREATE SET d1.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.defender_2 is not null THEN [1] ELSE [] END | 
    MERGE (defender2:House{name:row.defender_2})
    MERGE (defender2)-[d2:DEFENDER]->(b)
    ON CREATE SET d2.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_2 is not null THEN [1] ELSE [] END | 
    MERGE (attacker2:House{name:row.attacker_2})
    MERGE (attacker2)-[a2:ATTACKER]->(b)
    ON CREATE SET a2.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_3 is not null THEN [1] ELSE [] END | 
    MERGE (attacker2:House{name:row.attacker_3})
    MERGE (attacker3)-[a3:ATTACKER]->(b)
    ON CREATE SET a3.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_4 is not null THEN [1] ELSE [] END | 
    MERGE (attacker4:House{name:row.attacker_4})
    MERGE (attacker4)-[a4:ATTACKER]->(b)
    ON CREATE SET a4.outcome = row.attacker_outcome)

Coalesce:

What if we didn’t want to skip null values, but mark them with a default value “Unknown”. In this case we can use coalesce, to set a default value to all null values. Because there is only one location with null value, we do not have to use local merges, but you can get more information in Neo4j Location Trees.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
MATCH (b:Battle{name:row.name})
// We use coalesce, so that null values are replaced with "Unknown" 
MERGE (location:Location{name:coalesce(row.location,"Unknown")})
MERGE (b)-[:IS_IN]->(location)
MERGE (region:Region{name:row.region})
MERGE (location)-[:IS_IN]->(region)

Multiple unwinds:

Using multiple unwinds in a single query, we need to take of the cardinality (number of rows), as UNWIND transforms any list back into individual rows. We need to separate them using any aggregation function, so they do not become nested (like a nested .map function in JS).

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// We split the columns that may contain more than one person
WITH row,
     split(row.attacker_commander,",") as att_commanders,
     split(row.defender_commander,",") as def_commanders,
     split(row.attacker_king,"/") as att_kings,
     split(row.defender_king,"/") as def_kings,
     row.attacker_outcome as att_outcome,
     CASE when row.attacker_outcome = "win" THEN "loss" 
     ELSE "win" END as def_outcome
MATCH (b:Battle{name:row.name})
// we unwind a list
UNWIND att_commanders as att_commander
MERGE (p:Person{name:trim(att_commander)})
MERGE (p)-[ac:ATTACKER_COMMANDER]->(b)
ON CREATE SET ac.outcome=att_outcome
// to end the unwind and correct cardinality(number of rows)
// we use any aggregation function ( e.g. count(*))
WITH b,def_commanders,def_kings,att_kings,att_outcome,def_outcome,count(*) as c1
UNWIND def_commanders as def_commander
MERGE (p:Person{name:trim(def_commander)})
MERGE (p)-[dc:DEFENDER_COMMANDER]->(b)
ON CREATE SET dc.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,def_kings,att_kings,att_outcome,def_outcome,count(*) as c2
UNWIND def_kings as def_king
MERGE (p:Person{name:trim(def_king)})
MERGE (p)-[dk:DEFENDER_KING]->(b)
ON CREATE SET dk.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,att_kings,att_outcome,count(*) as c3
UNWIND att_kings as att_king
MERGE (p:Person{name:trim(att_king)})
MERGE (p)-[ak:ATTACKER_KING]->(b)
ON CREATE SET ak.outcome = att_outcome

Conclusion:

As we have to import two more CSVs I will not do any queries on the data yet, but focus first on the importing and linking part of the process and then do an analysis on the data from all three CSVs. Second part should be coming in a few days so stay tuned!