Neo4j Graph Visualizations using GoT dataset

Once again I will use the data made available by Andrew Beveridge, first to demonstrate categorical pageRank and break down pageRank by the sequence of the books, which will help us find the winners of Game of Thrones, and secondly to show some of the visualization options the Neo4j community has to offer.

To find more details about the dataset check Analyzing the Graph of Thrones by William Lyon or my Neo4j GoT social graph analysis.

Import

Let's first define the schema of our graph.

We only need one constraint, on the Person label. This will speed up our import and later queries.

CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

As the data for all five books is available on GitHub, we can import all of them with a single Cypher query that combines LOAD CSV and APOC.

We will differentiate the networks from different books using separate relationship types. We need to use apoc.merge.relationship as Cypher does not allow parameters for relationship types. The network from the first book will be stored with the relationship type INTERACTS_1, the second with INTERACTS_2 and so on.

UNWIND ['1','2','3','45'] as book
LOAD CSV WITH HEADERS FROM 
'https://raw.githubusercontent.com/mathbeveridge/asoiaf/master/data/asoiaf-book' + book + '-edges.csv' as value
MERGE (source:Person{id:value.Source})
MERGE (target:Person{id:value.Target})
WITH source,target,value.weight as weight,book
CALL apoc.merge.relationship(source,'INTERACTS_' + book, {}, {weight:toFloat(weight)}, target) YIELD rel
RETURN distinct 'done'

Categorical pagerank

As described in my previous blog post, categorical pageRank is a concept where we break down the global pageRank into categories and run pageRank on each category subset of the graph separately to get a better understanding of the global pageRank.

Here we will use the books as categories, so that we get a breakdown of each character's importance by the sequence of the books.

UNWIND ['1','2','3','45'] as sequence
MERGE (book:Book{sequence:sequence})
WITH book,sequence
CALL algo.pageRank.stream(
 'MATCH (p:Person) WHERE (p)-[:INTERACTS_' + sequence + ']-() RETURN id(p) as id',
 'MATCH (p1:Person)-[:INTERACTS_' + sequence + ']-(p2:Person) RETURN id(p1) as source,id(p2) as target',
 {graph:'cypher'})
YIELD node,score
// filter out nodes with default pagerank 
// for nodes with no incoming rels
WITH node,score,book where score > 0.16
MERGE (node)<-[p:PAGERANK]-(book)
SET p.score = score

 

Biggest winner of the game of thrones by books so far

We will order the pageRank values by the sequence of the books and return the top ten characters with the highest positive change in pageRank.

MATCH (person:Person)<-[pagerank:PAGERANK]-(book:Book)
// order by book sequence
WITH person,pagerank,book ORDER BY book.sequence ASC
WITH person,collect(pagerank) as scores
RETURN person.id as person,
       scores[0].score as first_score,
       scores[-1].score as last_score,
       size(scores) as number_of_books
ORDER BY last_score - first_score DESC LIMIT 10

While Jon Snow leads by absolute positive difference in pageRank, Victarion Greyjoy is very interesting. He had a pageRank score of 0.59 in the second book, was missing from the third, and jumped to 4.43 in the fourth and fifth books.

Stannis Baratheon is probably at the peak of his career judging by the show and comes in a surprising second place. Other than that, the list is made up of the usual suspects.

goty.png

I also checked the characters with the biggest negative change, but it turns out that they are mostly dead so it’s not all that interesting.
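For reference, the biggest losers can be found with the same query as above, simply ordered the other way around (ascending difference puts the largest drops first):

MATCH (person:Person)<-[pagerank:PAGERANK]-(book:Book)
// order by book sequence
WITH person,pagerank,book ORDER BY book.sequence ASC
WITH person,collect(pagerank) as scores
RETURN person.id as person,
       scores[0].score as first_score,
       scores[-1].score as last_score,
       size(scores) as number_of_books
ORDER BY last_score - first_score ASC LIMIT 10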

Spoon.js

Thanks to Michael Hunger, spoon.js is back. With it we can visualize charts directly in the Neo4j Browser.

Within a few clicks you can get it set up following the guide.

:play spoon.html

In our example we will visualize characters sorted by pageRank in the last two books combined.

MATCH (p:Person)<-[r:PAGERANK]-(:Book{sequence:'45'})
RETURN p.id as person,r.score as score
ORDER BY score DESC LIMIT 15

 

pagerank_book45

Three out of the first four places belong to the Lannisters, with the most important being Tyrion. Looking at GoT from this perspective, you might think it's really just a family crisis of the Lannisters with huge collateral damage 🙂

3d force graph

Another cool visualization project by Michael is called 3d-force-graph. It lets us visualize and explore graphs.

We will use pageRank to define the size of the nodes, so that the most important nodes will be the biggest. To represent communities in the graph we use the color of the nodes.
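Sizing nodes by pageRank assumes the score is stored as a node property. A minimal write-back sketch over the network of the last two books (the writeProperty name pagerank is my assumption; the actual visualization code is linked at the end of the post):

CALL algo.pageRank(
 'MATCH (p:Person) WHERE (p)-[:INTERACTS_45]-() RETURN id(p) as id',
 'MATCH (p1:Person)-[:INTERACTS_45]-(p2:Person) RETURN id(p1) as source,id(p2) as target',
 {graph:'cypher',write:true,writeProperty:'pagerank'})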

We need to run label propagation or Louvain algorithm to find communities within our graph and store them as a property of the nodes.

We run label propagation using only the network of characters from the last two books.

CALL algo.labelPropagation('Person','INTERACTS_45','BOTH',{partitionProperty:'lpa',iterations:10})

I like this visualization because it is 3d and you can approach from different angles and zoom in and out of the graph while exploring it.

got_gif.gif

Neovis.js

We can also use neovis.js, developed by William Lyon, to visualize graphs. As before, we use the label propagation results to color the nodes. To mix it up a bit, we will use betweenness centrality instead of pageRank to define the size of the nodes in the graph.

Run betweenness centrality algorithm.

CALL algo.betweenness('Person','INTERACTS_45',{direction:'BOTH',writeProperty:'betweenness'})

In the visualization we also defined the width of the relationships based on their weight and colored them according to the community of the pair of nodes they connect.

got_2.gif

 

Code for Neovis and 3d-force-graph visualization used in the post can be found on github. Have fun!


Paradise papers analysis with Neo4j

I haven’t used a real-world dataset yet in my blog so I decided it’s time to try out some real-world networks. I will be using the Paradise papers dataset that we can explore thanks to ICIJ.

Paradise papers are a leak of millions of documents containing information about companies and trusts in offshore tax havens that have revealed information on tens of thousands of companies and people, including some high-profile figures like the Queen of England.

We can get this dataset, and any of the other leaks, from the official ICIJ site in CSV or Neo4j Desktop form. If you are lazy like me, you can just use the Neo4j online sandbox, where you get your own Neo4j instance with the APOC and graph algorithms libraries set up within a matter of seconds.

Graph model:

We have a graph of officers, who are persons or banks in the real world, and entities, which are companies. Entities can also have intermediaries, and they all have one or more registered addresses that we know of.

I will focus more on the use of algorithms today, but if you want to learn more about the dataset itself and Neo4j, you should check out Analysing the Paradise Papers and the in-depth analysis of the Paradise Papers by Michael Hunger and William Lyon.


Michael Hunger & William Lyon, An In-Depth Graph Analysis of the Paradise Papers, https://neo4j.com/blog/depth-graph-analysis-paradise-papers/

 

Infer the network

As mentioned before, we will focus on graph algorithm use cases for the Paradise Papers.

We will assume that officers who are related to the same entity might know each other, or at least have some contact with one another. With this assumption we will create a social network of officers who are related to the same entities and have a registered address in Switzerland. I filtered for Switzerland only so we can get a better understanding of their local investment network.

MATCH (o1:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(o2:Officer)
WHERE id(o1) > id(o2) AND o1.countries contains "Switzerland" 
AND o2.countries contains "Switzerland"
WITH o1,o2,count(*) as common_investments
MERGE (o1)-[c:COMMON_INVESTMENTS]-(o2)
ON CREATE SET c.weight = common_investments

 

Analysis

We start by analyzing the degree of the nodes in our network. There are 1,130 officers with a registered address in Switzerland. Each officer has on average 6 contacts to other officers through their entities.

MATCH (o:Officer)
WHERE (o)-[:COMMON_INVESTMENTS]-()
WITH o,size((o)-[:COMMON_INVESTMENTS]-()) as degree
RETURN count(o) as officers,
       avg(degree) as average_degree,
       stdev(degree) as stdev_degree,
       max(degree) as max_degree,
       min(degree) as min_degree

paradise_sat.png

We can search for the pairs of officers with the most common investments, as we stored this value as a property of the relationship.

MATCH (o1:Officer)-[w:COMMON_INVESTMENTS]->(o2)
RETURN o1.name as officer1,
       o2.name as officer2,
       w.weight as common_investments
ORDER BY common_investments DESC LIMIT 10

Barnett – Kevin Alan David seems to be very intertwined with the Mackies, as he has 322 common investments with Thomas Louis and 233 with Jacqueline Anne. Actually, eight out of the first ten places belong to Barnett – Kevin Alan David, Hartland – Georgina Louise and the Mackies. This would indicate that they cooperate on a large scale.

paradise_a.png

Weakly connected components

With the weakly connected components algorithm we search for so-called "islands" in our graph. An island, or connected component, is a connected subgraph where all nodes are reachable from each other, and any disconnected part of the global graph is its own component.

In our scenario it is a useful algorithm for finding groups of people who have common investments in companies and so might know each other, or at least have an easier channel of communication.

CALL algo.unionFind.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
    RETURN id(o) as id
   ','
    MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
    RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,setId
RETURN setId as component,count(*) as componentSize
ORDER BY componentSize desc limit 10

As with most real-world graphs I have encountered so far, we get one large component and some smaller ones. If we wanted, we could dig deeper into the smaller components, check out their members and see if something interesting comes up.

paradise_component

Let's visualize component 14 for example.
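A minimal sketch for listing the members of a single component (the setId value comes from the streaming output above and is not stable between runs, so 14 is just an example):

CALL algo.unionFind.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,setId
// keep only the component we are interested in
WITH nodeId,setId WHERE setId = 14
MATCH (o:Officer) WHERE id(o) = nodeId
RETURN o.name as officer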

paradise_compt.png

Studhalter – Alexander Walter seems to be quite intertwined with Gadzhiev – Nariman, as they have 60 common investments. To complete the triangle there is Studhalter – Philipp Rudolf, with 15 common investments with Alexander Walter and 12 with Nariman. Alexander Walter is positioned at the center of this graph with connections to 8 different officers, and we could assume that he holds some influence over this network.

Pagerank

PageRank was first used to measure the importance of websites, to help users find better results when searching the internet. In the domain of websites and links, each link is treated as a vote from one website to another, indicating that there is some quality content over there. PageRank also takes into account how important the voting website is, as a link from amazon.com means something completely different than a link from tbgraph.wordpress.com.

In the Paradise Papers domain we can use it to find potential influencers in our inferred "common investments" network, as officers who have common investments with other important officers will come out on top.

CALL algo.pageRank.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD node,score
WITH node,score order by score desc limit 10
RETURN node.name as officer,score

Cabral – Warren Wilton comes out on top by a large margin. I checked him out, and it turns out he is an officer of 430 different entities and has connections to 116 other officers from Switzerland through his entities. Find out more about Cabral – Warren Wilton. Next is the Swiss Reinsurance Company, which is a shareholder of 19 different entities. You can get the same detailed look for Swiss Reinsurance thanks to the ICIJ.
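Those counts can be verified with a direct query; a sketch, assuming the officer's name can be matched with CONTAINS (the exact name string in the dataset is an assumption):

MATCH (o:Officer)
WHERE o.name CONTAINS 'Cabral' AND o.name CONTAINS 'Warren'
RETURN o.name as officer,
       size((o)-[:OFFICER_OF]->()) as entities,
       size((o)-[:COMMON_INVESTMENTS]-()) as swiss_contacts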

paradise_pagerank

Harmonic closeness centrality

We can interpret closeness centrality as the potential ability of a node to reach all other nodes as quickly as possible. In our example this works both ways, as other nodes can also reach a specific node quickly through the shortest paths between them. Harmonic centrality is a variation of closeness centrality that deals nicely with disconnected graphs.

In our domain we could interpret it as the potential for "insider trading", as having quick access to other nodes in the network could lead to an advantage, such as getting hold of (confidential) information before others.

CALL algo.closeness.harmonic.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,centrality
WITH nodeId,centrality order by centrality desc limit 10
MATCH (n) where id(n)=nodeId
RETURN n.name as officer,centrality

Cabral – Warren Wilton also leads by harmonic centrality. He seems to be a big player in Switzerland. The Swiss Reinsurance Company and PricewaterhouseCoopers are the only two that were also in the pagerank top 10 leaderboard. All the others are new candidates we haven't seen before. We can take a closer look at Schröder – Stefan and observe that he has connections to SwissRe.

paradise_harmonic.png

Betweenness centrality

Betweenness centrality is useful in finding nodes that serve as a bridge from one group of users to another in a graph. Betweenness centrality in a social network can be interpreted as a rudimentary measure of the control that a specific node exerts over the information flow throughout the graph.

CALL algo.betweenness.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,centrality
WITH nodeId,centrality order by centrality desc limit 10
MATCH (n) where id(n)=nodeId
RETURN n.name as officer,centrality

The usual players occupy the top 3 spots. We can also spot Schröder – Stefan in fifth place, and the rest are officers we haven't come across yet. It's interesting to see Zulauf – Hans-Kaspar up there, as he is an officer of only two entities, but it looks like his network position is what makes him interesting.

paradise_betweenness

Label propagation algorithm

The Label Propagation algorithm is a community detection algorithm. It divides the network into communities of nodes with dense connections internally and sparser connections between communities.

CALL algo.labelPropagation(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[q:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    'OUT',
    {graph:'cypher',partitionProperty:'community',iterations:10})

To help us with analyzing communities we will use Gephi to visualize our network.

Visualize with Gephi:

I like to use Gephi for visualizing networks. It is a really cool tool that lets you draw nice network visualizations based on centrality and community values.

Check out my previous blog post Neo4j to Gephi for more information.

We need to save the centrality and label propagation results to the nodes if we want to export them to Gephi. The label propagation call above already wrote the community property; the centralities still need a write-back, sketched below, before we run the export query.
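A minimal sketch of the centrality write-backs, using the same cypher projection as the streaming calls above (the writeProperty names pagerank and betweenness are my choice, picked to match the export query below):

CALL algo.pageRank(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher',write:true,writeProperty:'pagerank'});

CALL algo.betweenness(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher',write:true,writeProperty:'betweenness'});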

MATCH path = (o:Officer)-[:COMMON_INVESTMENTS]->(:Officer)
CALL apoc.gephi.add(null,'workspace1',path,'weight',['pagerank','community','betweenness'])
YIELD nodes
RETURN distinct "done"

Here we have a visualization of only the biggest component in the graph with 344 members. There are 10+ communities that we can easily identify just by looking at this picture. I used label propagation results for color of nodes, betweenness centrality results for node size and pageRank results for node title.

We can’t really see much except that Cabral – Warren Wilton is very important in our network and positioned at the center of it.

paradise

 

Let's zoom in on the center of the network to get a better understanding of the graph.

As we noticed at the start, Barnett – Kevin Alan David is deeply connected with the Mackies and the Hartlands. I have also noticed Hartland Mackie – Thomas Alan at the bottom left, which might explain why the Hartlands and Mackies are so deeply connected. We can also find Barnett – Emma Louise in this community, which makes this (red) community primarily one of the Barnetts, Hartlands and Mackies.

On the bottom right we can find Schröder – Stefan very near the Swiss Reinsurance Company.

 

middle.png

Conclusion:

I think that understanding the graph, combined with proper visualization tools, is powerful in the hands of a data explorer. With Neo4j and Gephi we are able to understand the graph and find insights even when we have little prior knowledge about the data and what exactly we are looking for in the first place.

Neo4j Categorical Pagerank

I found this cool Neo4j blog post written by Kenny Bastani, that describes a concept called categorical pagerank.
I will try to recreate it using neo4j-graph-algorithms library and GoT dataset.

 

categorical_pagerank

Kenny Bastani, Categorical PageRank Using Neo4j and Apache Spark, https://neo4j.com/blog/categorical-pagerank-using-neo4j-apache-spark/

The idea behind it is pretty simple. As shown in the example above, we have a graph of pages that have links between each other and may also belong to one or more categories. To better understand the global pagerank score of the nodes in the network, we can break our graph down into several subgraphs, one for each category, and execute the pagerank algorithm on each of those subgraphs. We store the results as a relationship property between the category and the pages.
This way we can break down which categories contribute to a page's global pagerank score.

 

Requirements:

Graph Model:

We will use the dataset made available by Joakim Skoog through his API of ice and fire.

I first encountered this dataset when Michael Hunger showed how to import the data in his Game of Data blog post. I thought the dataset was pretty nice, and as all I had to do was copy/paste the import queries, I decided to play around with it and wrote a Neo4j GoT Graph Analysis post.

Michael’s import query of house data:

// create Houses and their relationships
call apoc.load.jsonArray('https://raw.githubusercontent.com/joakimskoog/AnApiOfIceAndFire/master/data/houses.json') yield value
// cleanup
with apoc.map.clean(apoc.convert.toMap(value), [],['',[''],[],null]) as data
// lowercase keys
with apoc.map.fromPairs([k in keys(data) | [toLower(substring(k,0,1))+substring(k,1,length(k)), data[k]]]) as data

// create House
MERGE (h:House {id:data.id}) 
// set attributes
SET 
h += apoc.map.clean(data, ['overlord','swornMembers','currentLord','heir','founder','cadetBranches'],[])

// create relationships to people or other houses
FOREACH (id in data.swornMembers | MERGE (o:Person {id:id}) MERGE (o)-[:ALLIED_WITH]->(h))
FOREACH (s in data.seats | MERGE (seat:Seat {name:s}) MERGE (seat)-[:SEAT_OF]->(h))
FOREACH (id in data.cadetBranches | MERGE (b:House {id:id}) MERGE (b)-[:BRANCH_OF]->(h))
FOREACH (id in case data.overlord when null then [] else [data.overlord] end | MERGE (o:House {id:id}) MERGE (h)-[:SWORN_TO]->(o))
FOREACH (id in case data.currentLord when null then [] else [data.currentLord] end | MERGE (o:Person {id:id}) MERGE (h)-[:LED_BY]->(o))
FOREACH (id in case data.founder when null then [] else [data.founder] end | MERGE (o:Person {id:id}) MERGE (h)-[:FOUNDED_BY]->(o))
FOREACH (id in case data.heir when null then [] else [data.heir] end | MERGE (o:Person {id:id}) MERGE (o)-[:HEIR_TO]->(h))
FOREACH (r in case data.region when null then [] else [data.region] end | MERGE (o:Region {name:r}) MERGE (h)-[:IN_REGION]->(o));

After we have imported the dataset our graph will have a schema as shown below. You can always check the schema of your graph using CALL db.schema

meta

Categorical pagerank:

As in my previous blog post, we will use the SWORN_TO network of houses to demonstrate categorical pagerank, this time using regions as categories. This way we will try to understand and break down from which regions the houses get their power and support.

We first match all regions so that we can iterate the algorithm over each region. In the node-statement of the cypher projection we project only the nodes belonging to a specific region. As we already filtered the nodes by region, we don't have to filter any relationships: only relationships where both the source and the target node appear in the node-statement are projected, and all others are ignored.

We will then save the results as a relationship property between region and house.

MATCH (r:Region)
CALL algo.pageRank.stream('
    MATCH (h:House)-[:IN_REGION]->(r:Region)
    WHERE r.name ="' + r.name +
    '" RETURN id(h) as id
    ','
    MATCH (h1:House)-[:SWORN_TO]->(h2:House)
    RETURN id(h1) as source,id(h2) as target',
    {graph:'cypher'}) 
YIELD nodeId,score
MATCH (h:House) where id(h) = nodeId
MERGE (r)-[p:PAGERANK]->(h)
ON CREATE SET p.score = score

Let's first examine the North.

MATCH (r:Region{name:"The North"})-[p:PAGERANK]->(h)
RETURN h.name as house,p.score as pagerank 
ORDER BY pagerank DESC LIMIT 5

House Bolton leads, with House Stark following in second place. This might be disheartening to some fans, as the Starks are more lovable than the Boltons, but we all know how things ended for the Boltons in the TV series.

north

The Westerlands region is the home of House Lannister. Let's see how well they do in their home region.

MATCH (r:Region{name:"The Westerlands"})-[p:PAGERANK]->(h)
RETURN h.name as house,p.score as pagerank
ORDER BY pagerank DESC LIMIT 5

The Lannisters have very strong direct support in their home region. This shows in the fact that House Farman is the only other house in the Westerlands that has the support of at least one house.

west

 

Second version:

As I was writing this blog post and running the above algorithm, it occurred to me that even though a house might not be based in a specific region, it might still have support from a house in that region, and hence support from that region.

For that reason I turned the projected graph around a bit: we now project all the houses and keep only the SWORN_TO relationships whose source node is based in a specific region. This translates directly to support from a house in that region.

We filter out pagerank scores below 0.151 as 0.15 is the default value for a node with no inbound relationships and save results as a relationship between a region and a house. This way we keep our graph tidy.

MATCH (r:Region)
CALL algo.pageRank.stream('
    MATCH (h:House)
    RETURN id(h) as id
    ','
    MATCH (r:Region)<-[:IN_REGION]-(h1:House)-[:SWORN_TO]->(h2:House)
    WHERE r.name ="' + r.name +
    '" RETURN id(h1) as source,id(h2) as target',
    {graph:'cypher'})
YIELD nodeId,score
WITH nodeId,score,r where score > 0.151
MATCH (h:House) where id(h) = nodeId
MERGE (r)-[p:SUPPORT]->(h)
ON CREATE SET p.score = score

As we get back only 51 created relationships, we can easily visualize this network in the Neo4j Browser. It is pretty obvious that House Baratheon of King's Landing has support from most regions, lacking only The Neck and Beyond the Wall.
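A simple path query is enough to pull the whole SUPPORT network into the browser:

MATCH p = (:Region)-[:SUPPORT]->(:House)
RETURN p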

categ.png

Check top 20 individual regional pagerank scores.

MATCH (h:House)<-[s:SUPPORT]-(r:Region)
RETURN r.name as region,h.name as house,s.score as score 
ORDER BY score DESC LIMIT 20

Both Baratheon houses are very dominant in the Crownlands region. House Tyrell comes in third with its regional pagerank score from The Reach. House Tyrell is sworn to House Baratheon of King's Landing, and solely because of this relationship House Baratheon comes in immediately after House Tyrell in support from The Reach. This pattern occurs through most of the graph, except for The North region, where House Baratheon comes in before the Starks and the Boltons, having support from both of them.

cate_table.png

Conclusion:

With cypher projections we get all the freedom the Cypher query language provides. We can even parametrize graph algorithms to run only on specific subgraphs, as shown in this blog post. Cypher projections are a very powerful tool for extracting useful insights from our graphs, and if you are familiar with cypher, they are also quite easy to use.

Neo4j GoT social graph analysis

Lately I have been showing how to project a bi-partite graph to a mono-partite one and run algorithms on it. To mix it up a little, I will demonstrate how to project a network between nodes using the cosine similarity of certain features that the nodes possess. We will be using the Network of Thrones dataset based on the book A Storm of Swords. Check out their analysis of the same network as well.

To find communities of people that are in a similar position of power in the network, we will first calculate pagerank, harmonic centrality, triangle count and clustering coefficient for each node, and then use those values as features from which we infer a similarity network using cosine similarity.

Requirements:

Graph model:

got-datamodel

William Lyon has also written a blog post on how to import and analyse this network. I have borrowed both the graph model picture as well as the cypher query for importing the data into Neo4j.

LOAD CSV WITH HEADERS FROM 
"https://www.macalester.edu/~abeverid/data/stormofswords.csv" AS row
MERGE (src:Character {name: row.Source})
MERGE (tgt:Character {name: row.Target})
MERGE (src)-[r:INTERACTS]->(tgt)
ON CREATE SET r.weight = toInt(row.Weight)

Centrality

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph.

Given this assumption, we will use centrality indicators as features for a k-nearest neighbors (k-NN) style approach that infers a similarity network. As the centralities are not the focus of this blog post, I will skip the theory and just run the write-back versions of the algorithms.

Calculate and write back triangles count and clustering coefficient.

CALL algo.triangleCount('Character', 'INTERACTS',
{write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'}) 
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Calculate and write back pageRank score

CALL algo.pageRank(
 'MATCH (c:Character) RETURN id(c) as id',
 'MATCH (p1:Character)-[:INTERACTS]-(p2:Character) 
  RETURN id(p1) as source, id(p2) as target',
{graph:'cypher'});

Calculate and write back harmonic centrality

CALL algo.closeness.harmonic(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c1:Character)-[:INTERACTS]-(c2:Character) 
   RETURN id(c1) as source, id(c2) as target',
{graph:'cypher', writeProperty: 'harmonic'});

k-NearestNeighbour

First we will normalize our features using the min-max method. The clustering coefficient is already normalized, so there is no need to do it again. If you want to do something similar on bigger graphs, I would suggest using apoc.periodic.iterate for batching.

WITH ["pagerank","harmonic","triangles"] as keys
UNWIND keys as key
MATCH (c:Character)
WITH max(c[key]) as max,min(c[key]) as min,key
MATCH (c1:Character)
WITH c1, key + "normalized" AS newKey, 
    (toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) 
YIELD node
RETURN COUNT(*)

Next we match all pairs of nodes and compute their cosine similarity. To avoid a complete graph where all nodes are connected to each other, we set a similarity threshold, meaning that all relationships with a cosine similarity of less than 0.9 are ignored and not stored. I am slightly cheating here: with the kNN algorithm you define how many nearest neighbours each node should be connected to, while here we define how similar a pair of nodes must be for the relationship to be stored.

Again for bigger graphs you should use APOC for batching. I wrote a blog post with a very similar example.

MATCH (c1:Character),(c2:Character) where id(c1) < id(c2)
WITH c1,c2,apoc.algo.cosineSimilarity([c1.pageranknormalized,
                                       c1.coefficient,
                                       c1.harmonicnormalized,
                                       c1.trianglesnormalized],
                                      [c2.pageranknormalized,
                                       c2.coefficient,
                                       c2.harmonicnormalized,
                                       c2.trianglesnormalized]) 
                                         as cosine_similarity
WHERE cosine_similarity > 0.9
MERGE (c1)-[s:SIMILAR_POSITION]-(c2)
SET s.cosine = cosine_similarity

Community Detection

A network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into communities of nodes with dense connections internally and sparser connections between communities.

 

Let's check what community structure the Louvain algorithm will find in our network.

CALL algo.louvain.stream('Character', 'SIMILAR_POSITION',
{}) 
YIELD nodeId, community
MATCH (c:Character) where id(c) = nodeId
RETURN community,
       count(*) as communitySize,
       collect(c.name) as members 
ORDER BY communitySize ASC LIMIT 20;

Illyrio Mopatis is all alone in Pentos and probably has no network power at all. The most interesting group is community 8, where all the cream of the book is collected ranging from Starks, Targaryens and Lannisters to interestingly also Stannis, Mance and Davos.

Community 106 looks like a community of captains and maesters, and it differs from the largest community in that the largest community has a higher average clustering coefficient.
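To back up the clustering coefficient observation, we can aggregate the coefficient property (written by the triangle count call earlier) per Louvain community; a minimal sketch:

CALL algo.louvain.stream('Character', 'SIMILAR_POSITION', {})
YIELD nodeId, community
MATCH (c:Character) WHERE id(c) = nodeId
RETURN community,
       count(*) as communitySize,
       avg(c.coefficient) as averageCoefficient
ORDER BY communitySize DESC LIMIT 10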

GOTlouvain.png

 

Let’s try the Label Propagation algorithm and check what it finds.

CALL algo.labelPropagation(
'MATCH (c:Character) RETURN id(c) as id, 1 as weight, id(c) as value',
'MATCH (c1:Character)-[f:INTERACTS]->(c2:Character) 
RETURN id(c1) as source, id(c2) as target,f.weight as weight',
"OUT",{graph:'cypher',partitionProperty:'lpa',iterations:10});

gotlpa.png

We can immediately observe that LPA returns much more granular communities than Louvain in this network. The biggest community consists of the Starks and Lannisters with the addition of Varys. It's safe to say he deserves this spot. On the other hand, Jon is in a community with all the members of the Night's Watch. Daenerys is also left out of the "strongest" community and shares a community with Jorah and ser Barristan. She just wasn't such a badass in season 3 as she became by season 7 🙂

 

Neo4j Marvel Social Graph Algorithms Centralities

To top off the Marvel social graph series, we will look at how to use centralities on a graph projected via cypher queries to find influencers or otherwise important nodes in our network, using Neo4j and the neo4j-graph-algorithms library.

To recap the series:

Graph projections via cypher queries:

As we noticed in the previous part, using graph projections via cypher queries, or "cypher loading" for short, is really great as it lets us filter and/or project virtual graphs easily and quickly. To take full advantage of this awesome tool we need to know exactly how it works.

Unlike the default label and relationship-type loading of subsets of graphs, where we can in some cases define the direction of the relationship to be either "incoming", "outgoing" or "both" (bidirected/undirected), cypher loading does not support loading a single relationship as undirected.

While this may seem bad it’s actually not as cypher loading allows us to get creative with trying out graph algorithms on different virtual networks that we can project using cypher queries. We already did this in the previous post, but I didn’t describe it in detail yet.

Imagine that we have two hero nodes and a single directed relationship between them.
The only difference between loading this graph as directed or undirected is whether we specify the direction of the relationship in the cypher query. When we do not specify the direction, the cypher engine returns two patterns, one in each direction for every relationship in our graph, which in turn projects our network as bidirected (undirected).

projecting directed network:

MATCH (u1:Hero)-[rel:KNOWS]->(u2:Hero)
RETURN id(u1) as source, id(u2) as target

projecting undirected network:

MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero)
RETURN id(u1) as source, id(u2) as target

Centralities

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.[1]

Pagerank

PageRank is Google’s popular search algorithm. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the node is. The underlying assumption is that more important nodes are likely to receive more links from other websites.

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

call algo.pageRank.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source, id(u2) as target
',{graph:'cypher'}
) yield node,score 
WITH node,score ORDER BY score DESC limit 10 
return node.name as name, score;

Captain America has the highest pagerank score. He is located in the middle of the network with a total of 24 relationships, including relationships to most of the other important heroes in the network like Thor, Spiderman and Iron Man. If we check all the heroes related to Captain America, we notice that they have on average a higher pagerank score, simply because of this relation to Captain America.
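One way to sanity-check this, assuming the pageRank scores have also been written back to a pagerank node property (the query above only streams them) and that the hero name matches the dataset's naming:

MATCH (c:Hero)-[k:KNOWS]-(other:Hero)
WHERE c.name CONTAINS 'CAPTAIN AMERICA' AND k.weight >= 100
RETURN avg(other.pagerank) as neighbour_average_pagerank,
       count(other) as neighbours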

marvel_pagerank.png

* Node color from white (less) to red (more): Pagerank

Closeness centrality

Closeness centrality is defined as the total number of links separating a node from all others along the shortest possible paths. In other words, to calculate closeness, one begins by calculating, for each pair of nodes in the network, the length of the shortest path from one to the other (aka the geodesic distance). Then for each node, one sums up the total distance from the node to all other nodes.[2]

Closeness can be interpreted as an index of time-until-arrival of something flowing through the network. The greater the raw closeness score, the greater the time it takes on average for information originating at random points in the network to arrive at the node. Equally, one can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.[2]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships. With closeness centrality it is especially important that we load only a single component.

Unfortunately, when the graph is unconnected, closeness centrality appears to be useless because the distance between two nodes belonging to different components is infinite by convention, which makes the sum in 2 infinite too and therefore its inverse equal to zero. For every node of such a graph, there is always another node belonging to another component: indices of all vertices of the graph are therefore useless and the calculation of the index is limited to the largest component, omitting the roles played by individuals of other components.[3]

CALL algo.closeness.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

Captain America is in such a privileged position that he leads in all categories of centralities. We can observe that nodes in tighter communities have higher closeness centrality, while those on the fringes with fewer connections have smaller values. The second thing we can notice is that the overall position of a node in the graph also matters, as the middle community has on average a higher closeness centrality than the others. As an example, both Iron Man and Vision have a higher closeness centrality than Spiderman, while Spiderman has a higher pagerank than them.

marvel_closeness.png

* Node color from white (less) to red (more): Closeness centrality

Harmonic Centrality

The harmonic mean has been known since the time of Pythagoras and Plato as the mean expressing "harmonious and tuneful ratios", and later has been employed by musicians to formalize the diatonic scale, and by architects as paradigm for beautiful proportions.[4]

Social network analysis is a rapid expanding interdisciplinary field, growing from work of sociologists, physicists, historians, mathematicians, political scientists, etc. Some methods have been commonly accepted in spite of defects, perhaps because of the rareness of synthetic work like (Freeman, 1978; Faust & Wasserman, 1992). Harmonic centrality was proposed as an alternative index of closeness centrality defined on undirected networks. Results show its computation on real cases are identical to those of the closeness centrality index, with same computational complexity and we give some interpretations. An important property is its use in the case of unconnected networks.[3]

CALL algo.closeness.harmonic.stream( 
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id 
',' 
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100 
RETURN id(u1) as source,id(u2) as target '
,{graph:'cypher'}) YIELD nodeId, centrality 
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10 
MATCH (h:Hero) where id(h)=nodeId 
RETURN h.name as hero, centrality

Harmonic centrality was proposed as an alternative for closeness centrality to help solve the problem of disconnected components. Because of this we get back very similar results, given that we also have a single connected component.

marvel_harmonic

Betweenness Centrality

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.[6]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

CALL algo.betweenness.stream(
// Match only the biggest component
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

As always, Captain America is in first place, this time with Beast in second. This comes as no surprise, as we can observe that he is the sole bridge between the middle and the right community. Spiderman and Incredible Hulk play a similar role to Beast, but have smaller communities behind them and hence also smaller betweenness scores.

marvel_betweeness.png

* Node color from white (less) to red (more): Betweenness centrality

References:

[1] https://en.wikipedia.org/wiki/Centrality

[2] http://qualquant.org/wp-content/uploads/networks/2008%201-7-3.pdf

[3] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf?

[4] https://arxiv.org/pdf/cond-mat/0008357.pdf

[6] https://en.wikipedia.org/wiki/Betweenness_centrality

Neo4j Marvel Social Graph Algorithms Community Detection

In the first part we inferred a Hero to Hero network from a bi-partite graph of heroes and comics. In the second part we got to know some basic network information, to help us get a sense of what kind of network we are dealing with.

In this part I will continue on finding resilient communities within our network using Louvain method and Label Propagation algorithm.

* Visualizations are made with Gephi. Check my previous post Neo4j to Gephi for more information. Another option is to use neovis.js to visualize communities.

Graph projection

Neo4j graph algorithms support two ways of loading subset of the graph, as a virtual graph to quickly run the algorithms on.  First one is known as label and relationship-type loading, where we load nodes by labels and relationships by their types.

What if we wanted to run algorithms on very specific subsets of graphs, but labels and relationship types are not descriptive enough or we do not want to update our actual graph?

Luckily we can load or project subsets of our graph using Cypher statements. Use a cypher query to fetch nodes instead of the label parameter and a second cypher query for fetching relationships instead of the relationship-type parameter.

Use graph:'cypher' in the config.

Example:

CALL algo.unionFind(
//First cypher query is used to fetch nodes
    'MATCH (p:User)
     WHERE p.property = "import"
    RETURN id(p) as id',
//Second cypher query is used to fetch relationships
    'MATCH (p1:User)-[f:FRIEND]->(p2:User) 
     RETURN id(p1) as source, id(p2) as target,f.weight as weight',
{graph:'cypher',write:true});

Projecting and loading graphs via cypher queries allows us to describe the graph we want to run algorithms on in great detail. Not only that, but we can also use it to project virtual graphs from indirect patterns or omit some relationships to be loaded without actually deleting them.

Cypher projection use cases:
  • Filtering nodes and relationships.
  • Loading indirect relationships.
  • Projecting a bidirected graph. Example
  • Similarity threshold. (discussed here)

Community detection

In the study of networks, such as computer and information networks, social networks and biological networks, a number of different characteristics have been found to occur commonly, including the small-world property, heavy-tailed degree distributions, and clustering, among others. Another common characteristic of networks is community structure, which is the pattern of connections and groupings. The connections within real-world networks are not homogenous or random which suggests certain natural divisions exist.[1]

Community detection algorithms do not perform well in a very connected graph, as most of the nodes are densely connected and hence belong to the same community. This usually leads to poor results, where we end up with one big community that stretches over most of the graph and a few small ones.

We introduce the concept of a similarity threshold, where the weight of the relationship has to be above a certain value or the relationship is ignored. We can easily exclude these relationships using graph projections via cypher queries.

In this post we will set the weight threshold to 100 so the resulting communities should be tightly-knit and resilient.

Connected components

Connected Components or UnionFind algorithm basically finds sets of connected nodes also known as islands where each node is reachable from any other node in the same set.
In graph theory, a connected component of an undirected graph is a subgraph in which any two nodes are connected to each other by paths, and which is connected to no additional nodes in the graph. More in documentation.

As with any new network I come across, I first want to know how many connected components exist in the network and what their sizes are. Because we set the weight threshold to 100, we will get back a much less connected (sparser) graph than in the previous post, where a threshold of 10 was used.

We use the default label and relationship type loading in this example, where we load all nodes labeled “Hero” and relationships of type “KNOWS”. We also set threshold to 100, which means that only relationships with weight greater than 100 are considered by the algorithm.

CALL algo.unionFind.exp3.stream('Hero', 'KNOWS',
{weightProperty:'weight', defaultValue:0.0, threshold:100.0}) 
YIELD nodeId,setId
RETURN setId as component,count(*) as componentSize
ORDER BY componentSize DESC LIMIT 10;

m_cc

As expected, our graph is sparse, with one big component that has 101 members and 6 small components with 2-4 members. 116 out of the total 6,439 heroes have at least one relationship with a weight greater than 100.

If we visualize the largest component of 101 nodes in the Neo4j Browser, we can easily observe that there are some intuitive communities hidden here, and some bridge nodes between those communities. We will try to define the community structure of this subgraph with the help of the Louvain method and the Label Propagation algorithm.
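The subgraph itself can be pulled up with a simple pattern match on the thresholded relationships (a sketch; it returns all seven components, but the small ones add little noise):

MATCH p = (:Hero)-[k:KNOWS]->(:Hero)
WHERE k.weight > 100
RETURN p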

marvel_cc.png

Louvain

Communities are groups of nodes within a network that are more densely connected to one another than to other nodes. Modularity is a metric that quantifies the quality of an assignment of nodes to communities by evaluating how much more densely connected the nodes within a community are compared to how connected they would be, on average, in a suitably defined random network. The Louvain method of community detection is an algorithm for detecting communities in networks that relies upon a heuristic for maximizing the modularity.[2]

As mentioned, we will use graph projection via cypher queries to load only relationships with a weight greater than 100.

CALL algo.louvain.stream(
// load nodes
    'MATCH (u:Hero) RETURN id(u) as id', 
// load relationships
    'MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) 
// similarity threshold 
    WHERE rel.weight > 100
    RETURN id(u1) as source,id(u2) as target',
{graph:"cypher"}) 
YIELD nodeId,community
MATCH (n:Hero) WHERE id(n)=nodeId
RETURN community,
       count(*) as communitySize, 
       collect(n.name) as members 
order by communitySize desc limit 5

I use Gephi for visualizing communities as it is more pleasant and insightful to look at good visualizations instead of tables.

*Node color: Louvain community, node size: Pagerank, name size: Betweenness centrality

marvel_social_louvain.png

I am not an expert in the Marvel domain, so I will just give a brief explanation of the results. We get a total of 8 communities. The largest and also best positioned community is the purple one. It consists mostly of the Avengers and S.H.I.E.L.D., with Captain America as the leader. On the left we can find Mr. Fantastic and the Fantastic Four team in the same purple community. Light blue is the Spiderman team, with Spiderman being their only connection to the outside world. They keep to themselves and don't mingle with others. Dark blue are the Asgardians, who also keep to themselves and communicate with the outside world only through Thor. The Incredible Hulk series has its own community, with Hulk again being the only connection to the outside world. We observe that Beast is in a unique position as a bridge between the purple community and the green community, which is the X-Men community.

Label propagation algorithm

Label Propagation algorithm which was first proposed by Raghavan et al. (2007) uses unique identifiers of nodes as labels and propagates the labels based on an agreement with the majority of the neighbour nodes and each node selects a label from its neighbourhood to adopt it as its label. LPA works as follows: Node x has neighbours and each neighbour carries a label denoting the community to which they belong to. Each node in the network chooses to join the community to which the maximum number of its neighbours belongs to, with ties broken uniformly and randomly. At the beginning, every node is initialized with unique label (called as identifier) and the labels propagate through the network. At every step of propagation, each node updates its label based on the labels of its neighbours. As labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label.[3]

More in documentation.

Similar to the Louvain algorithm, we will use graph projection via cypher queries to load only relationships that have a weight greater than 100. I used the write-back version, so that I could export the results to Gephi for visualization.

CALL algo.labelPropagation(
// supports node-weights and defining 
// initial communities using parameter value
    'MATCH (u:Hero) RETURN id(u) as id, 1 as weight,id(u) as value',
// load relationships 
    'MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) 
// Similarity threshold
    WHERE rel.weight > 100
    RETURN id(u1) as source,id(u2) as target, rel.weight as weight',
'OUT',{graph:"cypher",partitionProperty:"lpa" }) 
YIELD computeMillis

We get back 21 communities, including some single-node communities.
Team Avengers (purple) and the Fantastic Four (light blue) get split up into two separate communities. The Spiderman (green), Incredible Hulk (turquoise) and Asgardian (red) communities are the same as in the Louvain results. We observe that the X-Men faction also gets split into two communities, and the Cannonball group is slightly bigger this time and not so isolated.
marvel_social_lpa.png

*Node color: LPA community, node size: Pagerank, name size: Betweenness centrality

Conclusion

Hopefully you have noticed by now that the neo4j graph algorithms plugin is really awesome and easy to use. Combine that with the ability to project virtual graphs via cypher queries and you get an easy and efficient way to analyse and understand your graph.

I was originally going to show how to run centrality algorithms in this blog post as well, but decided not to as it would be way too long, so I will shortly post another blog post with examples of centralities and cypher projections.

Stay tuned!

References:

[1] https://en.wikipedia.org/wiki/Community_structure

[2] https://www.quora.com/Is-there-a-simple-explanation-of-the-Louvain-Method-of-community-detection

[3] http://shodhganga.inflibnet.ac.in/bitstream/10603/36003/4/chapter3.pdf

Neo4j Marvel Social Graph Analysis

This is part 2 of the Marvel series. In the previous post I imported the data from a Kaggle competition and showed how to project a mono-partite graph from a bipartite graph. We used a topological similarity measure that takes into account how many common comics any pair of heroes have.

For easier understanding we can represent this as the following function:

(:Hero)←(:Comic)→(:Hero)

to

(:Hero)←[:KNOWS{weight=Nb of Common Comics}]→(:Hero)

To find out more check this Gephi tutorial. We could have also projected mono-partite graph looking like:

(:Comic)←[{weight=Nb of Common Heroes}]→(:Comic)
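For reference, a minimal sketch of the Hero-to-Hero projection described above (the exact query lives in the first part of the series; the direction of APPEARED_IN is deliberately left unspecified here):

MATCH (h1:Hero)-[:APPEARED_IN]-(c:Comic)-[:APPEARED_IN]-(h2:Hero)
WHERE id(h1) < id(h2)
WITH h1, h2, count(distinct c) as common_comics
MERGE (h1)-[k:KNOWS]-(h2)
SET k.weight = common_comics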

You will need to have the graph imported to proceed with the following steps in analysis.

Requirements:

Graph model:

g.png

We end up with a simple graph model. We started with a bi-partite(two types/labels of nodes) network of Heroes and Comics and then inferred a mono-partite (single type/label of nodes) graph amongst Heroes based on the number of common Comics.

Analysis

We will analyse the inferred network of Heroes. I usually start with some global statistics to get a sense of the whole graph and then dive deeper.

Weight distribution

First we check the distribution of similarity weights between pairs of heroes. Weight value translates to the number of common Comics Heroes have appeared in.

MATCH ()-[k:KNOWS]->()
RETURN (k.weight / 10) * 10 as weight,count(*) as count 
ORDER BY count DESC LIMIT 20

You might wonder why we use (k.weight / 10) * 10, as it looks silly at first glance. If we divide an integer by an integer in Neo4j we get back an integer, so I use this as a bucketizing function that groups numeric data into bins of 10, which makes the results easier to read.

weig

162,489 out of a total of 171,644 relationships (94%) in our network have a weight of 9 or less. This means that most of our heroes have only briefly met.

The largest weight is between "THING/BENJAMIN J. GR" and "HUMAN TORCH/JOHNNY S" at a value of 724. I would assume they are good buddies.
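The pair with the largest weight can be found directly:

MATCH (h1:Hero)-[k:KNOWS]->(h2:Hero)
RETURN h1.name as hero1, h2.name as hero2, k.weight as weight
ORDER BY k.weight DESC LIMIT 5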

Even though we have a well connected graph, most relationships are "weak" judging by their weight. I would assume that most comics have standard teams of heroes, where not necessarily all of the team appear in every comic.

My second assumption would be that there are occasional comics where different "teams" of heroes appear together, hence so many weak relationships.

To check my assumptions I start with this query to get a basic feel.

MATCH (u:Comic)
RETURN avg(apoc.node.degree(u,'APPEARED_IN')) as average_heroes,
stdev(apoc.node.degree(u,'APPEARED_IN')) as stdev_heroes,
max(apoc.node.degree(u,'APPEARED_IN')) as max_heroes,
min(apoc.node.degree(u,'APPEARED_IN')) as min_heroes

local

I personally prefer a distribution over an average. We use the same "bucketizing" function as before:

MATCH (u:Comic)
RETURN (size((u)-[:APPEARED_IN]-()) / 10) * 10 as heroCount,
count(*) as times
ORDER BY times DESC limit 20

heroCount
Looks like my assumptions are plausible. 8,999 (71%) of the comics have less than 10 heroes appearing, with most of them probably at around 5 or less, as the overall average is 7.5. There are a few comics where we have "family gatherings" with more than 30 heroes. One of them is a comic named COC1, probably Contest of Champions, where 110 heroes appear.

Normalize weight:

If we need to, we can normalize the weight using the min-max method.
Notice how this time we use (toFloat(k1.weight) - min) / (max - min). By declaring k1.weight a float, we are dividing a float by an integer, which returns a float and does not bucketize/group the numbers into bins.

MATCH (:Hero)-[k:KNOWS]->(:Hero) 
//get the the max and min value
WITH max(k.weight) as max,min(k.weight) as min
MATCH (:Hero)-[k1:KNOWS]->(:Hero) 
//normalize
SET k1.weight_minmax = (toFloat(k1.weight) - min) / (max - min)

Triangle count / Clustering coefficient

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971;[1] Watts and Strogatz, 1998[2]).

Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local gives an indication of the embeddedness of single nodes.[1]

CALL algo.triangleCount('Hero', 'KNOWS',
{write:true, writeProperty:'triangles',
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Running this algorithm writes back the local triangle count and clustering coefficient, while returning the total triangle count and the average clustering coefficient of the graph.
Find out more in documentation.

av

The average clustering coefficient is very high at 0.77, considering that 1 would mean we have a complete graph where everybody knows each other. This comes as no surprise, as we observed before that our network is very connected, with most relationships being "weak".

Connected Components

Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect bugs can result when your algorithm is run on a disconnected graph.

 

Connected components have other practical use cases; for example, if we are analysing a social network, we may want to find all the disconnected groups of people that exist in our graph.

More in the documentation.

CALL algo.unionFind.stream('Hero', 'KNOWS', {}) 
YIELD nodeId,setId
WITH setId,count(*) as communitySize
// filter out single node communities
WHERE communitySize > 1 
RETURN setId,communitySize
ORDER BY communitySize DESC LIMIT 20

com.png

Our graph consists of 22 components in total, with the largest covering almost the entire graph (99.5%). There are 18 single-node components and 3 very small ones with 9, 7 and 2 members.

Let's check who the members of these small components are.

CALL algo.unionFind.stream('Hero', 'KNOWS', {}) 
YIELD nodeId,setId
WITH setId,collect(nodeId) as membersId,count(*) as communitySize 
// skip the largest component
ORDER BY communitySize DESC SKIP 1 LIMIT 20
MATCH (h:Hero) WHERE id(h) in membersId
RETURN setId,collect(h.name) as members 
ORDER BY length(members) DESC

We need to match the nodes by their nodeId from the result so that we can get back the names of the heroes.

othe

18 heroes never appeared in a comic with any other hero. Some of them are RED WOLF II, RUNE, DEATHCHARGE etc. The second largest component, with 9 members, seems to be the ASHER family and some of their friends.

Weighted Connected Components

What if we decided that two heroes co-appearing in a single comic is not enough interaction to tag them as colleagues, and raised the bar to 10 common comics? This is where the weighted Connected Components algorithm can help us.

If we define the property that holds the weight (weightProperty) and a threshold, two nodes are only considered connected if the weight of the relationship between them exceeds the threshold; otherwise the relationship is ignored.

In our case it means that two heroes need at least 10 common comics to be considered connected.

CALL algo.unionFind.stream('Hero', 'KNOWS', 
{weightProperty:'weight', threshold:10.0}) 
YIELD nodeId,setId
RETURN setId,count(*) as communitySize
ORDER BY communitySize DESC LIMIT 20;

Define weightProperty and threshold.

com10.png

As we could expect, the biggest component drops significantly in size, from 99.5% to 18.6% of the total number of nodes. The threshold can be used to find potentially resilient teams of heroes: more interactions between heroes (we count any co-appearance as positive) mean a longer history and more bonding, which can translate to resilience.

We can play around with different threshold settings.

Not only the community sizes are interesting, but also their members. We check the members of the largest communities, skipping the first one, as showing 1200 names in a browser can be messy.

CALL algo.unionFind.stream('Hero', 'KNOWS', 
{weightProperty:'weight', defaultValue:0.0, threshold:10.0}) 
YIELD nodeId,setId
WITH setId,collect(nodeId) as membersId,count(*) as communitySize 
// skip the largest component
ORDER BY communitySize DESC SKIP 1 LIMIT 20
MATCH (h:Hero) WHERE id(h) in membersId
RETURN setId,collect(h.name) as members 
ORDER BY length(members) DESC

com10skip.png

The second largest community has really cool names like Bodybag, China Doll and Scatter Brain. As I have no clue about the domain, I can't really comment further.

 

Conclusion:

While this post is long, it does not contain everything I want to show on the Marvel graph, so there will be one more post using centralities with Cypher loading and maybe some community detection with the Louvain and Label Propagation algorithms.

Hope you learned something today!

Reference:

[1] https://en.wikipedia.org/wiki/Clustering_coefficient

Neo4j London tube system analysis

I recently came across the London tube dataset uploaded by Nicola Greco. I thought it would be a cool example to show some algorithms from the new Neo4j graph algorithms plugin. They are really useful, as they allow network analysis without using external services to run the algorithms.

Requirements:

Graph model:

We will create a very simple graph model, with stations and connections between them, as shown below. You can always check the schema of your graph with the db.schema() procedure.
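
For example, once the import below has run:

// show the node labels and relationship types of the current graph
CALL db.schema()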

london

Let's define the schema for our graph model with a constraint and an index, so that our queries will run faster.

CREATE CONSTRAINT ON (s:Station) ASSERT s.id IS UNIQUE;

CREATE INDEX ON :Station(name);

Import:

LOAD CSV can import data from local files or from the internet, which is really cool, as we can access data on GitHub easily without the need for any intermediate steps.

Import stations

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row
MERGE (s:Station{id:row.id})
ON CREATE SET s.name = row.name,
              s.latitude=row.latitude,
              s.longitude=row.longitude,
              s.zone=row.zone,
              s.total_lines=row.total_lines

Import the connections between stations.
We create relationships in both directions, as trains travel in both directions. We save the time spent travelling between stations as a relationship property (cast to a number), so that we can use it as a weight in our algorithms.

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row
MATCH (s1:Station{id:row.station1})
MATCH (s2:Station{id:row.station2})
// cast time to a number so it can be used as a weight by the algorithms
MERGE (s1)-[:CONNECTION{time:toFloat(row.time),line:row.line}]->(s2)
MERGE (s1)<-[:CONNECTION{time:toFloat(row.time),line:row.line}]-(s2)

Analysis:

Which stations have the most connections to other stations?

MATCH (n:Station)--(n1)
RETURN n.name as station,
       count(distinct(n1)) as connections 
order by connections desc LIMIT 15

Results:

connections

Find the fastest path between two stations.

algo.shortestPath uses Dijkstra's algorithm to return the shortest path between a start and an end node. For more info check the documentation.

transfers are not taken into account

MATCH (start:Station{name:"Baker Street"}),(end:Station{name:"Beckton Park"})
CALL algo.shortestPath.stream(start, end, 'time')
YIELD nodeId, cost
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,cost as time

Results:

single_connection

Find out which stations take the longest to get to from Baker Street station.

algo.shortestPaths uses Dijkstra's algorithm to return the shortest paths from a start node to all the other nodes in the network.

transfers are not taken into account

MATCH (n:Station {name:'Baker Street'})
CALL algo.shortestPaths.stream(n, 'time')
YIELD nodeId, distance
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,distance as time 
ORDER BY time desc LIMIT 15

Results:

single_many

Find the longest shortest paths between stations in the network.

algo.allShortestPaths returns the shortest paths between all pairs of nodes in the network.

transfers are not taken into account

CALL algo.allShortestPaths.stream('time',
{nodeQuery:'Station',relationshipQuery:'CONNECTION',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance 
ORDER BY distance desc LIMIT 20
// We filter out duplicates
WHERE sourceNodeId > targetNodeId
MATCH (s1:Station),(s2:Station) 
WHERE id(s1)=sourceNodeId AND id(s2)=targetNodeId
RETURN s1.name as station1,s2.name as station2,distance as time

Results:

all_paths.png

Find bridge stations in network.

We can use betweenness centrality to help us identify the stations that people will pass through the most, based on the structure of the network.

algo.betweenness is a betweenness centrality algorithm that helps us find bridges in the network. For more info check the documentation.

CALL algo.betweenness.exp1.stream('Station','CONNECTION')
YIELD nodeId, centrality
MATCH (n:Station) where id(n)=nodeId
RETURN n.name,centrality 
ORDER BY centrality desc
LIMIT 15

Results:

bett

Conclusion:

It is very easy to enhance Neo4j with graph algorithms. You just copy the .jar file to the /plugins folder, restart Neo4j and you are good to go. The plugin offers lots of cool algorithms that can help you analyse your graph in Neo4j. You should try them out!

Neo4j GoT Graph Analysis

Winter has come!

Thanks to Joakim Skoog, who built this cool API of ice and fire. It offers data about the GoT universe, which we will examine and analyse further.

I will show you how to use the just-released Neo4j graph algorithms plugin on the data we got from the above API. As the name suggests, they are user-defined procedures of graph algorithms that you can call with Cypher. For most algorithms there are two procedures: one that writes results back to the graph as node properties, and another (named algo.<algorithm>.stream) that returns a stream of data, e.g. node ids and computed values.
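
For example, a hedged sketch of the stream variant, using the unionFind procedure that appears later in this post:

// the stream variant returns nodeId/setId rows instead of writing properties back
CALL algo.unionFind.stream('House', 'SWORN_TO', {})
YIELD nodeId, setId
RETURN setId, count(*) as size
ORDER BY size DESC LIMIT 5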

Requirements:

Import:

Our cool community caretaker Michael Hunger spotted this data and posted a blog post showing how to easily import it into Neo4j using cypher-shell. I followed his script, imported the data, and am now sharing the graph.db database, which you can simply copy into your neo4j/data/databases folder to get the GoT universe in your Neo4j.

Graph Model:

got.png
We have imported data that describes a network of persons and houses, with some additional metadata. There are a couple of relationships between persons and houses, like SWORN_TO and LED_BY. We can also find family tree hierarchies in the PARENT_OF relationship. I will use the SWORN_TO relationship between houses, which describes the house hierarchy of allegiances. It is a directed, unweighted network.

Analysis:

Sworn to network

sworn_to.png

At first glance we can observe that we have some bridges between different parts of communities and that they all link towards the center, where the power resides. As you will see, this network is from peaceful times, when the seven kingdoms were at peace.

Connected Components:

As a part of preprocessing we will check how many disconnected components exist within our network with algo.unionFind.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Query:

CALL algo.unionFind('House', 'SWORN_TO', {write:true, partitionProperty:"partition"})

Result:
updateconect.png
We see that there is one big component with 393 members and a small one with 5 members. All the rest are disconnected, with only one member each. That means either they are loner houses or we are missing some data. We will use the centrality algorithms only on the biggest component with 393 members, as graph algorithms work better on connected graphs.
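
The partition id used in the centrality queries below (151 in my run; it will differ between runs) can be looked up with a query along these lines:

// find the partition with the most members
MATCH (h:House)
RETURN h.partition as partition, count(*) as members
ORDER BY members DESC LIMIT 5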

Centralities:

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

Outgoing degree is always one as each house is sworn to only one house.
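
A quick sanity check of that claim, as a sketch:

// every house in the sworn-to network should have exactly one outgoing SWORN_TO relationship
MATCH (h:House)
WITH h, size((h)-[:SWORN_TO]->()) as out_degree
WHERE out_degree > 0
RETURN out_degree, count(*) as houses
ORDER BY out_degree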

Incoming degree:

We simply count how many incoming relationships each house has. In this example it directly translates to how many houses are sworn to each house.

MATCH (h:House)<-[:SWORN_TO]-(:House)
RETURN h.name as house,count(*) as in_degree order by in_degree desc limit 10

updatedsworn

House Tyrell leads by a decent margin in front of the Lannisters. The main force is house Baratheon, which consists of two branches and has a combined 77 sworn houses. Others slowly follow; it's funny (sad?) to see that even house Baelish of Harrenhal, ruled by Petyr Baelish, has more support than house Stark of Winterfell. The books and the show diverge a lot for Littlefinger's character, as he is not the ruler of Harrenhal in the show.

Pagerank centrality:

Pagerank is a version of weighted degree centrality with a feedback loop. You only get your “fair share” of your neighbor's importance. That is, your neighbor's importance is split between their neighbors, proportional to the number of interactions with that neighbor. Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures where the power resides. Houses that have the support of other big houses will come out on top.

CALL algo.pageRank(
'MATCH (h:House) where h.partition = 151 return id(h) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:30, write: true});
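
To read the written scores back, a sketch like this should work, assuming the plugin's default writeProperty of pagerank, since none is specified above:

// top houses by pagerank score
MATCH (h:House)
WHERE exists(h.pagerank)
RETURN h.name as house, h.pagerank as pagerank
ORDER BY pagerank DESC LIMIT 10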

pagerank.png

I was surprised with the results to say the least, but after investigating a bit I understood why. Remember the picture of the sworn-to network, where I told you that all the different communities point at the center of the graph, where the power resides. At that center are the two Baratheon houses, which are sworn to each other, hence the score is so big. Where incoming degree takes into account only the number of your neighbours, pagerank also takes into account their importance. You can notice that house Stark has a better position than house Baelish, because Stark's supporters are more important (have more sworn houses).

Betweenness centrality:

Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through that person. Such an intermediary position gives that person power and influence. Betweenness centrality is a raw count of the number of short paths that go through a given node. For example, if a node is located on a bottleneck between two large communities, then it will have high betweenness.

CALL algo.betweenness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty: 'betweenness'});

betweenness.png
House Baratheon is uniquely positioned at the center of the graph, so it comes first in all of the centrality measures. A strong second are the Tyrells, which is a bit surprising, as they beat the Lannisters at both pagerank and betweenness centrality. In our case betweenness centrality identifies the bridging houses that have the big support of other houses behind them.

Closeness centrality:

The most central nodes according to closeness centrality can quickly interact to all others because they are close to all others. This measure is preferable to degree centrality, because it does not take into account only direct connections among nodes, but also indirect connections.

CALL algo.closeness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty:'closeness'});

closeness.png
Closeness centrality is different from the others in that the lower the closeness value, the better. As always, house Baratheon leads with a decent margin, and the Tyrells stand firmly in second place. The Starks beat the Lannisters at closeness but lose in all other measures, so they get fourth place and the Lannisters third.
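
A sketch for listing the houses by closeness, ordered ascending since lower values are better here, using the writeProperty defined in the call above:

MATCH (h:House)
WHERE exists(h.closeness)
RETURN h.name as house, h.closeness as closeness
ORDER BY closeness ASC LIMIT 10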

Conclusion:

This is a network from peaceful times and is therefore less interesting than having two strong disconnected components that fight each other. What would be cool is to get a network based on the current show, or even better for G.R.R.M. to publish The Winds of Winter, where alliances will change and people will die, so we can see how the graph evolves through time.

Check also https://github.com/neo4j-examples/game-of-graphs, where you can see how to expose this data as a GraphQL endpoint.

Thanks for reading!

Game of Thrones part 4 Analysis

If you have followed my GoT series so far, you have seen how to import a GoT dataset that I found on kaggle.com into Neo4j. There are only a couple of days left before the new season starts, so let's do some analysis on the dataset. You need to know that this data was derived from the books, so it does not contain some battles and events that happened on the show, as the show has already surpassed the books.

Requirements:

Data:

We have already imported all the data from the CSVs into Neo4j in the previous posts. If we look closely, we can find that some persons belong to two houses, one of which is None, the default unknown value in characters_death.csv.

Screen Shot 2017-07-14 at 13.48.42.png

To get rid of this anomaly, we can simply MATCH all persons that belong to House "None" and to another house as well. Then we just delete the relationship pointing to House "None" and the problem is solved.

match (h1:House{name:"None"})-[r]-(p:Person)--(h2:House)
delete r
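
A quick sanity check afterwards, as a sketch; it should return 0:

// no person should still belong to both House "None" and another house
MATCH (:House{name:"None"})--(p:Person)--(h2:House)
WHERE h2.name <> "None"
RETURN count(p) as remaining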

Distributions:

We can start by visualizing a few distributions with Spoon.js directly in the Neo4j Browser. This does not work on 3.2.x, as the Neo4j Browser was rewritten.

Which house won/lost the most battles?

match (house:House)
OPTIONAL MATCH (house)-[fight:ATTACKER|:DEFENDER]->()
return house.name as house, 
       sum(case fight.outcome when "win" then 1 else 0 end) as wins,
       sum(case fight.outcome when "loss" then 1 else 0 end) as losses
order by wins + losses desc limit 20

House_battle_results.png

We can easily see that the Lannisters are currently on the rise, with 11 wins and 7 losses. Second are the Starks, on the other hand in decline, with 6 wins and 10 losses. The Red Wedding and what came afterward were really devastating for the Starks. The Freys and Boltons look to be in a good position, as they were on the winning side of the Red Wedding, but those who watch the show know that a bitter end is waiting for them.

Distribution of alive/dead people by house:

match (p:Person)--(h:House)
return h.name,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

house_dead.png

It's a hard life being in the Night's Watch and we can see that mortality is high. Almost half the characters from the Night's Watch have died. It is also not a good time for the Targaryens (yet!), as they were mostly killed off during Robert's Rebellion, AKA the War of the Usurper. We all know that Starks are dying like flies, but it came as a surprise to see so many dead Lannisters.

Distribution of alive/dead people by culture:

match (p:Person)--(c:Culture)
return c.name as culture,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

culture_dead

We can observe that the most represented cultures are the Northmen and the Ironborn. Seeing so many Northmen in the books does not come as a surprise, as the whole story starts with House Stark living in Winterfell. What is surprising is that there are so many Ironborn in the books. I think this has to do with the Kingsmoot at the Iron Islands, where we get to know lots of different Ironborn characters.

Valyrians are almost extinct, but they have the hidden joker Daenerys Targaryen, who has a big role yet to play.

Distribution of alive/dead people by books:

match (p:Person)-[r]-(h:Book)
with h.name as book,
     h.sequence as sequence,
     sum(case when type(r)="APPEARED_IN" then 1 else 0 end) as total,
     sum(case when type(r)="DIED_IN" then 1 else 0 end) as dead 
     order by sequence asc
return book,total,dead

book_dead.png

These figures of dead people seem small-ish for GoT, so I guess not all deaths are recorded. We can see how the author slowly built up the cast of characters in the books. It started at around 400 characters in the first book and rose to almost 1200 characters in the fourth book. That is a lot of characters to keep track of. A Storm of Swords was the bloodiest book with 97 deaths, while A Feast for Crows has unusually few deaths, only 27.

Analysis:

We will use algo.unionFind, which is a (weakly) connected components algorithm. Taken from the docs:

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Which houses fought on the same side?

We can create a network of houses from the indirect patterns of the battles they fought together. We will also assume that an ally of my ally is also my ally. In this case we can use algo.unionFind to solve the problem.

Create a network of houses that fought on the same side

match (house1:House)-[r1]-(battle:Battle)-[r2]-(house2:House)
WHERE type(r1)=type(r2)
MERGE (house1)-[:ALLY]-(house2)

Run the algorithm and write back the results as the community property on each node.

CALL algo.unionFind('House', 'ALLY', {write:true, partitionProperty:"community"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (h:House)
return distinct(h.community) as community,
       collect(h.name) as members,
       count(*) as count 
order by count desc limit 4

House_community.png

We observe that 4 different communities (partitions) came back as a result. We can see that the Lannisters teamed up with the Boltons and Freys to defeat the Starks. The Night's Watch and the Baratheons fought on the same side when they defended the Wall from the Wildlings (Free folk, Thenns). The Starks and Tullys have an alliance through marriage (Catelyn Tully and Ned Stark), so it is no wonder to see them fight battles on the same side.

Which persons fought on the same side in the year 299?

We will do something similar for persons, but we will focus only on the year 299, as the most battles were fought that year. The reason I chose a specific year is also that alliances change over time; if we looked at all battles, we would get most persons in one community, as they fought on the same side at least once.

Create a network of persons that fought on the same side in the year 299.

MATCH (p1:Person)-[r1]-(b:Battle)-[r2]-(p2:Person) 
WHERE b.year = 299 and type(r1)=type(r2)
MERGE (p1)-[:FRIEND]-(p2)

Run the algorithm and write back the results as the community property on each node.

CALL algo.unionFind('Person', 'FRIEND', {write:true, partitionProperty:"community"}) 
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (p:Person)
return distinct(p.community) as community,
       collect(p.name) as members,
       count(*) as count 
order by count desc limit 10

Screen Shot 2017-07-14 at 16.11.52

We observe that 10 different communities (partitions) exist. It is a good representation of the state of the alliances that fought together in that specific year. We could do this analysis for a couple of different years and compare how the alliances change over time.

Conclusion:

We have only scratched the surface of what is possible to do with this dataset so stay tuned for more!

You can grab the Neo4j database and just copy it into the data/databases folder:

GoT for Neo4j 3.2.x

GoT for Neo4j 3.1.x