Neo4j GoT social graph analysis

Lately I have been showing how to project a bi-partite graph to mono-partite and run algorithms on it. To mix it up a little I will demonstrate how to project a network between nodes using cosine similarity of certain features that the node possess.  We will be using the Network of thrones dataset based on A Storm of Swords book. Check out also their analysis on the same network.

To find communities of people, that are in a similar position of power in the network, we will first calculate pagerank, harmonic centrality, triangles count and cluster coefficient for each node and use those values as features from which we will infer a similarity network using cosine similarity. This will allow us to find communities of people who are in a similar position of power.

Requirements:

Graph model:

got-datamodel

William Lyon has also wrote a blog post how to import and analyse this network. I have stolen both the graph model picture aswell as the cypher query for importing the data into Neo4j.

LOAD CSV WITH HEADERS FROM 
"https://www.macalester.edu/~abeverid/data/stormofswords.csv" AS row
MERGE (src:Character {name: row.Source})
MERGE (tgt:Character {name: row.Target})
MERGE (src)-[r:INTERACTS]->(tgt)
ON CREATE SET r.weight = toInt(row.Weight)

Centrality

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph.

Given this assumption we will use centrality indicators as features for the k-nearest neighbors algorithm (k-NN) that will infer a similarity network. As the centralities are not the focus of this blog post I will skip the theory and just run the write back versions of algorithms.

Calculate and write back triangles count and clustering coefficient.

CALL algo.triangleCount('Character', 'INTERACTS',
{write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'}) 
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Calculate and write back pageRank score

CALL algo.pageRank(
 'MATCH (c:Character) RETURN id(c) as id',
 'MATCH (p1:Character)-[:INTERACTS]-(p2:Character) 
  RETURN id(p1) as source, id(p2) as target',
{graph:'cypher'});

Calculate and write back harmonic centrality

CALL algo.harmonic(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c1:Character)-[:INTERACTS]-(c2:Character) 
   RETURN id(c1) as source, id(c2) as target',
{graph:'cypher', writeProperty: 'harmonic'});

k-NearestNeighbour

First we will normalize our features using the min-max method. Cluster coefficient is already normalized so there is no need to do it again. If you want to do something similar on bigger graphs I would suggest you use apoc.periodic.iterate  for batching.

WITH ["pagerank","harmonic","triangles"] as keys
UNWIND keys as key
MATCH (c:Character)
WITH max(c[key]) as max,min(c[key]) as min,key
MATCH (c1:Character)
WITH c1, key + "normalized" AS newKey, 
    (toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) 
YIELD node
RETURN COUNT(*)

First we match all pairs of nodes and compare their cosine similarity. To avoid a complete graph where all nodes are connected between each other, we will set a similarity threshold, meaning that all relationships with cosine similarity less than 0.9 will be ignored and not stored. I am slightly cheating as with the kNN algorithm you define to how many nearest neighbours should the node be connected to and here we are defining how similar or close should the pair of nodes be to store a relationship.

Again for bigger graphs you should use APOC for batching. I wrote a blog post with a very similar example.

MATCH (c1:Character),(c2:Character) where id(c1) < id(c2)
WITH c1,c2,apoc.algo.cosineSimilarity([c1.pageranknormalized,
                                       c1.coefficient,
                                       c1.harmonicnormalized,
                                       c1.trianglesnormalized],
                                      [c2.pageranknormalized,
                                       c2.coefficient,
                                       c2.harmonicnormalized,
                                       c2.trianglesnormalized]) 
                                         as cosine_similarity
WHERE cosine_similarity > 0.9
MERGE (c1)-[s:SIMILAR_POSITION]-(c2)
SET s.cosine = cosine_similarity

Community Detection

A network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into communities of nodes with dense connections internally and sparser connections between communities.

 

Let’s check what community structure will Louvain algorithm find in our network.

CALL algo.louvain.stream('Character', 'SIMILAR_POSITION',
{}) 
YIELD nodeId, community
MATCH (c:Character) where id(c) = nodeId
RETURN community,
       count(*) as communitySize,
       collect(c.name) as members 
ORDER BY communitySize ASC LIMIT 20;

Illyrio Mopatis is all alone in Pentos and probably has no network power at all. The most interesting group is community 8, where all the cream of the book is collected ranging from Starks, Targaryens and Lannisters to interestingly also Stannis, Mance and Davos.

Community 106 looks like a community of captains and maesters and differs from largest community in that the largest community has a higher average cluster coefficient.

GOTlouvain.png

 

Let’s try the Label Propagation algorithm and check what it finds.

CALL algo.labelPropagation(
'MATCH (c:Character) RETURN id(c) as id, 1 as weight, id(c) as value',
'MATCH (c1:Character)-[f:INTERACTS]->(c2:Character) 
RETURN id(c1) as source, id(c2) as target,f.weight as weight',
"OUT",{graph:'cypher',partitionProperty:'lpa',iterations:10});

gotlpa.png

We can immediately observe that LPA returns much more granular communities than Louvain in this network. The biggest community consists of Starks and Lannisters with an addition of Varys. It’s safe to say he deserves this spot.  On the other hand Jon is in the community with all members from Night Watch. Daenerys is also left out of the “strongest” community and shares community with Jorah and ser Barristan. She just wasn’t such a badass in season 3 as she has became in season 7 🙂

 

Advertisements

Neo4j Graph Algorithms projecting a virtual graph

Last time I started to talk about projecting virtual graphs with cypher queries and showed how to project a bidirected graph between nodes of only the largest component (island) in the graph. For now we learned how to project bidirected graphs and apply similarity threshold with “cypher loading”. Another common use that I haven’t talked about yet is projecting virtual relationships (not stored) to algorithms engine.

Requirements:

Dataset:

We will use the actors database that we get following :play movies guide in Neo4j browser. We have persons and movies in our graph and various relationships between them. I like to use this dataset as a example of what Neo4j is capable of.

matrix1.png

Projecting a virtual graph:

I will show how to project an undirected mono-partite graph from a bi-partite graph via cypher queries. This is pretty useful when you want to analyse different networks of your graph, but don’t want to actually save those networks to your database.

Louvain:

Projecting a mono-partite graph from a bi-partite graph can be easily undestood as:

(:Movie)(:Person)(:Movie)

to

(:Movie)←[{weight=Nb of Common persons}]→(:Movie)

We will project a virtual undirected graph with a similarity threshold of 2 and run Louvain algorithm on it. With it we will try to find groups of movies that share more than 2 common persons.

CALL algo.louvain.stream(
 'MATCH (m:Movie) RETURN id(m) as id',
 'MATCH (m1:Movie)<--(:Person)-->(m2:Movie)
 WITH m1,m2,count(*) as common_actors
 WHERE common_actors > 2
 RETURN id(m1) as source,id(m2) as target, common_actors as weight'
,{graph:'cypher'}) yield nodeId,community
MATCH (movie:Movie) where id(movie)=nodeId
RETURN  community,
        count(*) as communitySize,
        collect(movie.title) as movies
ORDER BY communitySize desc limit 5

kajapa.png

There aren’t many communities as this dataset is small. The largest community are movies that are directed by Lana Wachowski and Lilly Wachowski.

Pagerank

We can also project a person to person network from our bi-partite graph.

(:Person)(:Movie)(:Person)

to

(:Person)←[:{weight=Nb of common movies}]→(:Person)

Similarly like before we will project an undirected graph with a similarity threshold of 2 and run pageRank on it.

call algo.pageRank.stream(
  'MATCH (p:Person) return id(p) as id',
  'MATCH (p1:Person)-->(:Movie)<--(p2:Person)
   WITH p1,p2,count(*) as commonMovies        
   WHERE commonMovies > 2
   RETURN id(p1) as source, id(p2) as target',
{graph:'cypher'})
YIELD node,score WITH node,score
ORDER BY score desc limit 10
RETURN node.name as person, score

matrix.png

Camerow Crowe leads in first place by far margin followed by the Wachowski brothers. We have most of the Matrix crew in the first 10 spots as they collaborated in 3 movies and have stronger ties between each others.

Conclusion:

I skipped theory behind graph algorithms on purpose as I have explained it a couple of times now and don’t want to repeat myself too much. The mission of this blog post is to demonstrate the abilities cypher loading offers us in terms of projecting virtual graphs. If you want to learn more check previous posts and the documentation.

Have some feedback or questions?
Join the #neo4j-graph-algorithms channel on  Neo4j slack community

Neo4j Marvel Social Graph Algorithms Centralities

To top off the Marvel Social Graph series we will look at how to use centralities on a projected graph via cypher queries to find influencers or otherwise important nodes in our network using Neo4j and neo4j-graph-algorithms library.

To recap the series:

Graph projections via cypher queries:

As we noticed in the previous part using graph projections via cypher queries or for short “cypher loading”  is really great as it lets us filter and/or project virtual graphs easily and quickly. To let you fully take advantage of this awesome tool we need to get to know exactly how it works.

Unlike default label and relationship type of loading subsets of graphs, where we can in some cases define direction of the relationship to be either “incoming”,”outgoing” or “both”(birected/undirected) , cypher loading does not support loading single relationship as undirected.

While this may seem bad it’s actually not as cypher loading allows us to get creative with trying out graph algorithms on different virtual networks that we can project using cypher queries. We already did this in the previous post, but I didn’t describe it in detail yet.

Imagine that we have two hero nodes and a single directed relationship between them.
Only difference between loading this graph as undirected or directed is specifying the direction of the relationship in the cypher query. When we do not specify the direction of the relationship in the cypher query, cypher engine returns two patterns in both direction for each relationship in our graph. That in turn projects our network bidirected or undirected.

projecting directed network:

MATCH (u1:Hero)-[rel:KNOWS]->(u2:Hero)
RETURN id(u1) as source, id(u2) as target

projecting undirected network:

MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero)
RETURN id(u1) as source, id(u2) as target

Centralities

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.[1]

Pagerank

PageRank is Google’s popular search algorithm. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the node is. The underlying assumption is that more important nodes are likely to receive more links from other websites.

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

call algo.pageRank.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source, id(u2) as target
',{graph:'cypher'}
) yield node,score 
WITH node,score ORDER BY score DESC limit 10 
return node.name as name, score;

Captain America has the highest pagerank score. He is located in the middle of the network with a total of 24 relations and also relations to most of the other important heroes in the network like Thor, Spiderman and Iron Man. If we check all heroes related to Captain America, we can notice, that they have on average higher Pagerank score just because of this relation to Captain America.

marvel_pagerank.png

* Node color from white (less) to red (more): Pagerank

Closeness centrality

Closeness centrality is defined as the total number of links separating a node from all others along the shortest possible paths. In other words, to calculate closeness, one begins by calculating, for each pair of nodes in the network, the length of the shortest path from one to the other (aka the geodesic distance). Then for each node, one sums up the total distance from the node to all other nodes.[2]

Closeness can be interpreted as an index of time-until-arrival of something flowing through the network. The greater the raw closeness score, the greater the time it takes on average for information originating at random points in the network to arrive at the node. Equally, one can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.[2]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships. With closeness centrality it is especially important that we load only a single component.

Unfortunately, when the graph is unconnected, closeness centrality appears to be useless because the distance between two nodes belonging to different components is infinite by convention, which makes the sum in 2 infinite too and therefore its inverse equal to zero. For every node of such a graph, there is always another node belonging to another component: indices of all vertices of the graph are therefore useless and the calculation of the index is limited to the largest component, omitting the roles played by individuals of other components.[3]

CALL algo.closeness.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

Captain America is in such a privileged position, that he will be leading in all categories of centralities. We can observe that nodes in more tight communities have higher closeness centrality indexes while those on the brinks and less connected have smaller values. Second thing we can notice is that also the overall position of nodes in the graph matter as the middle community has on average higher closeness centrality as others. As an example both Iron Man and Vision have higher closeness centrality than Spiderman, while Spiderman has higher Pagerank index than them.

marvel_closeness.png

* Node color from white (less) to red (more): Closeness centrality

Harmonic Centrality

The harmonic mean has been known since the time of Pythagoras and Plato as the mean expressing “harmonious and tuneful ratios”, and later has been employed by musicians to formalize the diatonic scale, and by architects as paradigm for beautiful proportions.[4]

Social network analysis is a rapid expanding interdisciplinary field, growing from work of sociologists, physicists, historians, mathematicians, political scientists, etc. Some methods have been commonly accepted in spite of defects, perhaps because of the rareness of synthetic work like (Freeman, 1978; Faust & Wasserman, 1992). Harmonic centrality was proposed as an alternative index of closeness centrality defined on undirected networks. Results show its computation on real cases are identical to those of the closeness centrality index, with same computational complexity and we give some interpretations. An important property is its use in the case of unconnected networks.[3]

CALL algo.harmonic.stream( 
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id 
',' 
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100 
RETURN id(u1) as source,id(u2) as target '
,{graph:'cypher'}) YIELD nodeId, centrality 
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10 
MATCH (h:Hero) where id(h)=nodeId 
RETURN h.name as hero, centrality

Harmonic centrality was proposed as an alternative for closeness centrality to help solve the problem of disconnected components. Because of this we get back very similar results, given that we also have a single connected component.

marvel_harmonic

Betweenness Centrality

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.[6]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

CALL algo.betweenness.stream(
// Match only the biggest component
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

As always Captain America is in first place and this time Beast being in the second place. This comes as no surprise as we can observe that he is the sole bridge between middle and right community. Spiderman and Incredible Hulk play a similar role as Beast, but have smaller communities behind them and hence also smaller betweenness scores.

marvel_betweeness.png

* Node color from white (less) to red (more): Betweenness centrality

References:

[1] https://en.wikipedia.org/wiki/Centrality

[2] http://qualquant.org/wp-content/uploads/networks/2008%201-7-3.pdf

[3] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf?

[4] https://arxiv.org/pdf/cond-mat/0008357.pdf

[6] https://en.wikipedia.org/wiki/Betweenness_centrality

Neo4j Marvel Social Graph Algorithms Community Detection

In the first part we infered a Hero to Hero network from a bi-partite graph of heroes and comics. Following was the second part where we got to know some basic network information to help us get a sense of what kind of a network we are dealing with.

In this part I will continue on finding resilient communities within our network using Louvain method and Label Propagation algorithm.

* Visualizations are made with Gephi. Check my previous post Neo4j to Gephi for more information. Another option is to use neovis.js to visualize communities.

Graph projection

Neo4j graph algorithms support two ways of loading subset of the graph, as a virtual graph to quickly run the algorithms on.  First one is known as label and relationship-type loading, where we load nodes by labels and relationships by their types.

What if we wanted to run algorithms on very specific subsets of graphs, but labels and relationship types are not descriptive enough or we do not want to update our actual graph?

Luckily we can load or project subsets of our graph using Cypher statements. Use a cypher query to fetch nodes instead of the label parameter and a second cypher query for fetching relationships instead of the relationship-type parameter.

Use graph:'cypher' in the config.

Example:

CALL algo.unionFind(
//First cypher query is used to fetch nodes
    'MATCH (p:User)
     WHERE p.property = 'import'
    RETURN id(p) as id',
//Second cypher query is used to fetch relationships
    'MATCH (p1:User)-[f:FRIEND]->(p2:User) 
     RETURN id(p1) as source, id(p2) as target,f.weight as weight',
{graph:'cypher',write:true});

Projecting and loading graphs via cypher queries allows us to describe the graph we want to run algorithms on in great detail. Not only that, but we can also use it to project virtual graphs from indirect patterns or omit some relationships to be loaded without actually deleting them.

Cypher projection use cases:
  • Filtering nodes and relationships.
  • Loading indirect relationships.
  • Projecting bidirected graph. Example
  • Similarity threshold. (discussed here)

Community detection

In the study of networks, such as computer and information networks, social networks and biological networks, a number of different characteristics have been found to occur commonly, including the small-world property, heavy-tailed degree distributions, and clustering, among others. Another common characteristic of networks is community structure, which is the pattern of connections and groupings. The connections within real-world networks are not homogenous or random which suggests certain natural divisions exist.[1]

Community detection algorithms do not perform well in a very connected graph as most of the nodes are densely connected, hence they belong to the same community. This usually leads to poor results where we end up with one big community that stretches over most of the graph and some small communities.

We introduce similarity threshold concept, where the weight of the relationship has to be above certain value or the relationship is ignored. We can easily exclude these relationships using graph projections via cypher queries.

In this post we will set the weight threshold to 100 so the resulting communities should be tightly-knit and resilient.

Connected components

Connected Components or UnionFind algorithm basically finds sets of connected nodes also known as islands where each node is reachable from any other node in the same set.
In graph theory, a connected component of an undirected graph is a subgraph in which any two nodes are connected to each other by paths, and which is connected to no additional nodes in the graph. More in documentation.

As with any new network I come across I want to first know how many connected components exist in the network and what is their size. Because we will set the weight threshold to 100 we will get back much less connected (sparser) graph than in previous post where threshold 10 was used.

We use the default label and relationship type loading in this example, where we load all nodes labeled “Hero” and relationships of type “KNOWS”. We also set threshold to 100, which means that only relationships with weight greater than 100 are considered by the algorithm.

CALL algo.unionFind.exp3.stream('Hero', 'KNOWS',
{weightProperty:'weight', defaultValue:0.0, threshold:100.0}) 
YIELD nodeId,setId
RETURN setId as component,count(*) as componentSize
ORDER BY componentSize DESC LIMIT 10;

m_ccAs expected our graph is sparse with one big component, that has 101 members and 6 small components with 2-4 members. 116 out of total 6439 heroes have at least 1 relationship with weight more than 100.

If we visualize the largest component of 101 nodes in Neo4j Browser, we can easily observe that there are some intuitive communities hidden here and some bridge nodes between those communities. We will try to define community structure of this subgraph with the help of Louvain method and Label Propagation algorithm.

marvel_cc.png

Louvain

Communities are groups of nodes within a network that are more densely connected to one another than to other nodes. Modularity is a metric that quantifies the quality of an assignment of nodes to communities by evaluating how much more densely connected the nodes within a community are compared to how connected they would be, on average, in a suitably defined random network. The Louvain method of community detection is an algorithm for detecting communities in networks that relies upon a heuristic for maximizing the modularity.[2]

As mentioned we will use graph projecting via cypher queries to load only relationships with weight 100 or more.

CALL algo.louvain.stream(
// load nodes
    'MATCH (u:Hero) RETURN id(u) as id', 
// load relationships
    'MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) 
// similarity threshold 
    WHERE rel.weight > 100
    RETURN id(u1) as source,id(u2) as target',
{graph:"cypher"}) 
YIELD nodeId,community
MATCH (n:Hero) WHERE id(n)=nodeId
RETURN community,
       count(*) as communitySize, 
       collect(n.name) as members 
order by communitySize desc limit 5

I use Gephi for visualizing communities as it is more pleasant and insightful to look at good visualizations instead of tables.

*Node color: Louvain community, node size: Pagerank, name size: Betweenness centrality

marvel_social_louvain.png

I am not an expert in Marvel domain, so I will just give a brief explanation of the results. We get a total of 8 communities. The largest and also the best positioned community is the purple. It consists mostly of team Avengers and S.H.I.E.L.D with Captain America being the leader. On the left we can find Mr. Fantastic and Fantastic Four team in the same purple community . Light-blue is the Spiderman team, with Spiderman being their only connection to the outside world. They keep to themselves and don’t mingle with others. Dark blue are the Asgardians who also keep to themselves and communicate only through Thor to outside world. Incredible Hulk Series has it’s own community and again Hulk being the only connection to outside world. We observe that Beast is in a unique position as a bridge between purple and green community, which is the X-Men community.

Label propagation algorithm

Label Propagation algorithm which was first proposed by Raghavan et al. (2007) uses unique identifiers of nodes as labels and propagates the labels based on an agreement with the majority of the neighbour nodes and each node selects a label from its neighbourhood to adopt it as its label. LPA works as follows: Node x has neighbours and each neighbour carries a label denoting the community to which they belong to. Each node in the network chooses to join the community to which the maximum number of its neighbours belongs to, with ties broken uniformly and randomly. At the beginning, every node is initialized with unique label (called as identifier) and the labels propagate through the network. At every step of propagation, each node updates its label based on the labels of its neighbours. As labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label.[3]

More in documentation.

Similar to Louvain algorithm we will use graph projecting via cypher queries to load only relationships that have weight more than 100. I used the write back version, so that I could export the results to Gephi for visualization.

CALL algo.labelPropagation(
// supports node-weights and defining 
// initial communities using parameter value
    'MATCH (u:Hero) RETURN id(u) as id, 1 as weight,id(u) as value',
// load relationships 
    'MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) 
// Similarity threshold
    WHERE rel.weight > 100
    RETURN id(u1) as source,id(u2) as target, rel.weight as weight',
'OUT',{graph:"cypher",partitionProperty:"lpa" }) 
YIELD computeMillis

We get back 21 communities with some single node communities.
Team Avengers(purple) and Fantastic Four(light blue) get split up into two separate communities. Spiderman(green), Incredible Hulk(turquoise) and Asgardians(red) communities are same as in Louvain results. We observe that X-Men faction also gets split up into two communities and Cannonball group is slightly bigger this time and not so isolated.
marvel_social_lpa.png

*Node color: LPA community, node size: Pagerank, name size: Betweenness centrality

Conclusion

Hope you have noticed by now that neo4j graph algorithms plugin is really awesome and easy to use. Combine that with projecting virtual graphs via cypher queries feature and you get an easy and efficient way to analyse and understand your graph.

Firstly I was going to show how to run centrality algorithms in this blog post also, but decided not to as it would be way too long, so I will shorty post another blog post with examples of centralities and cypher projections.

Stay tuned!

References:

[1] https://en.wikipedia.org/wiki/Community_structure

[2] https://www.quora.com/Is-there-a-simple-explanation-of-the-Louvain-Method-of-community-detection

[3] http://shodhganga.inflibnet.ac.in/bitstream/10603/36003/4/chapter3.pdf

Neo4j Marvel Social Graph Analysis

This is part 2 of Marvel series. In  previous post  I imported the data from kaggle competition and showed how to project a mono-partite graph from a bipartite graph. We used a topological similarity measure, that takes into account how many common comics any pair of heroes have.

For easier understanding we can represent this as the following function:

(:Hero)(:Comic)(:Hero)

to

(:Hero)←[:KNOWS{weight=Nb of Common Comics}]→(:Hero)

To find out more check this Gephi tutorial. We could have also projected mono-partite graph looking like:

(:Comic)←[{weight=Nb of Common Heroes}]→(:Comic)

You will need to have the graph imported to proceed with the following steps in analysis.

Requirements:

Graph model:

g.png

We end up with a simple graph model. We started with a bi-partite(two types/labels of nodes) network of Heroes and Comics and then inferred a mono-partite (single type/label of nodes) graph amongst Heroes based on the number of common Comics.

Analysis

We will analyse the inferred network of Heroes. I usually start with some global statistics to get a sense of the whole graph and then dive deeper.

Weight distribution

First we check the distribution of similarity weights between pairs of heroes. Weight value translates to the number of common Comics Heroes have appeared in.

MATCH ()-[k:KNOWS]->()
RETURN (k.weight / 10) * 10 as weight,count(*) as count 
ORDER BY count DESC LIMIT 20

You might wonder why we use (k.weight / 10) * 10 as it looks silly at first glance. If we divide an integer by an integer in Neo4j we get back integer. I use it as an bucketizing function,that groups numeric data into bins of 10, so that it is easier to read results.

weig

162489 out of total 171644 relationships (94%) in our network has weight 9 or less. This means that most of our Heroes have only briefly met.

Largest weight is between “THING/BENJAMIN J. GR” and “HUMAN TORCH/JOHNNY S” at value 724. I would assume they are good buddies.

Even though we have a well connected graph, most of relationships are “weak” judging by weight. I would assume that most comics have standard teams of heroes, where not necessarily all of the team appear in every comic.

Second assumption I would make is that there are occasinal comics where different “teams” of heroes appear in together, hence so many weak relationships.

To check my assumptions I start with this query to get a basic feel.

MATCH (u:Comic)
RETURN avg(apoc.node.degree(u,'APPEARED_IN')) as average_heroes,
stdev(apoc.node.degree(u,'APPEARED_IN')) as stdev_heroes,
max(apoc.node.degree(u,'APPEARED_IN')) as max_heroes,
min(apoc.node.degree(u,'APPEARED_IN')) as min_heroes

local

I personally prefer distribution over average. We use the “bucketizing” function as before:

MATCH (u:Comic)
RETURN (size((u)-[:APPEARED_IN]-()) / 10) * 10 as heroCount,
count(*) as times
ORDER BY times DESC limit 20

heroCount
Looks like my assumptions are plausible and quite possible. 8999 (71%) of comics have less than 10 heroes playing, with most of them probably at around 5 or less as the total average is 7,5. There are a few comics where we have “family gatherings” with more than 30 heroes. One of them is a comic named COC1, probably Contest of Champions, where 110 heroes play.

Normalize weight:

If we need to, we can normalize the weight using min-max method.
Notice how this time we use (toFloat(k1.weight) - min) / (max - min). By declaring k.weight a float, we are dividing a float by an integer, that returns a float and does not bucketize/group number into bins

MATCH (:Hero)-[k:KNOWS]->(:Hero) 
//get the the max and min value
WITH max(k.weight) as max,min(k.weight) as min
MATCH (:Hero)-[k1:KNOWS]->(:Hero) 
//normalize
SET k1.weight_minmax = (toFloat(k1.weight) - min) / (max - min)

Triangle count / Clustering coefficient

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971;[1] Watts and Strogatz, 1998[2]).

Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local gives an indication of the embeddedness of single nodes.[1]

CALL algo.triangleCount('Hero', 'KNOWS',
{write:true, writeProperty:'triangles',
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Running this algorithms writes back local triangle count and clustering coefficient, while providing total triangle count and average clustering coefficient of the graph as return.
Find out more in documentation.

av

Average clustering is very high at 0.77, considering that 1 would mean we have a complete graph, where everybody knows each other. This comes as no surprise as we observed before that our network is very connected with most of relationships being “weak”.

Connected Components

Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect bugs can result when your algorithm on a disconnected graph.

 

Connected components have other practical use cases, for example, if we are analysing a social network and we want to find all the disconnected groups of people that exist in our graph.

More in documentation.

CALL algo.unionFind.stream('Hero', 'KNOWS', {}) 
YIELD nodeId,setId
WITH setId,count(*) as communitySize
// filter our single node communities
WHERE communitySize > 1 
RETURN setId,communitySize
ORDER BY communitySize DESC LIMIT 20

com.png

Our graph consists of 22 total components with the largest covering almost all the graph(99,5%). There is 18 single node communities, and 3 very small ones with 9, 7 and 2 members.

Lets check who are the members of these small components.

CALL algo.unionFind.stream('Hero', 'KNOWS', {}) 
YIELD nodeId,setId
WITH setId,collect(nodeId) as membersId,count(*) as communitySize 
// skip the largest component
ORDER BY communitySize DESC SKIP 1 LIMIT 20
MATCH (h:Hero) WHERE id(h) in membersId
RETURN setId,collect(h.name) as members 
ORDER BY length(members) DESC

We need to match nodes by their nodeId from result so that we can get back the names of heroes.

othe

18 heroes never appeared in a comic with any other heroes. Some of the are RED WOLF II, RUNE, DEATHCHARGE etc. Second largest component with 9 members seems to be the ASHER family and some of their friends.

Weighted Connected Components

What if decided that two heroes co-appearing in a single comic is not enough of interaction for us to tag them as colleagues and raise the bar to 10 common comics.Here is where weighted Connected Component algorithm can help us.

If we define the property that holds the weight(weightProperty) and the threshold,it means the nodes are only connected, if the threshold on the weight of the relationship is high enough otherwise the relationship is thrown away.

In our case it means that two heroes need at least 10 common comics to be considered connected.

CALL algo.unionFind.stream('Hero', 'KNOWS', 
{weightProperty:'weight', threshold:10.0}) 
YIELD nodeId,setId
RETURN setId,count(*) as communitySize
ORDER BY communitySize DESC LIMIT 20;

Define weightProperty and threshold.

com10.png

As we could expect the biggest component drops significantly in size from 99,5% to 18,6% of total number of nodes. Using threshold can be used to find potentially resilient teams of heroes as more interactions between heroes (we take any co-appearance as positive) means longer history and more bonding, which can translate to resilience.

We can play around with different threshold settings.

Not only community sizes are interesting, but also their members. We check the members of the largest communities skipping the first one as showing 1200 names in a browser can be messy.

CALL algo.unionFind.stream('Hero', 'KNOWS', 
{weightProperty:'weight', defaultValue:0.0, threshold:10.0}) 
YIELD nodeId,setId
WITH setId,collect(nodeId) as membersId,count(*) as communitySize 
// skip the largest component
ORDER BY communitySize DESC SKIP 1 LIMIT 20
MATCH (h:Hero) WHERE id(h) in membersId
RETURN setId,collect(h.name) as members 
ORDER BY length(members) DESC

com10skip.png

Second largest community has really cool names like Bodybag,China Doll and Scatter Brain. As I have no clue of the domain I can’t really give any other comment.

 

Conclusion:

While this post is long, it does not contain everything I want to show on Marvel graph, so there will be one more post using centralities with cypher loading and maybe some community detection with Louvain and Label Propagation Algorithm.

Hope you learned something today!

Reference:

[1] https://en.wikipedia.org/wiki/Clustering_coefficient

Neo4j Facebook ego network analysis

Graph algorithms plugin for Neo4j is getting bigger and bigger by the day, so here are some new algorithms, that have been added to the collection. I will use Facebook’s ego network of friends to show triangle count algorithms and betweenness centrality algorithms on an undirected graph in Neo4j.

Requirements:

Graph model:

We will import a friends social network of Facebook’s users. There are 4039 nodes labeled User and 88234 unweighted relationships indicating friendship between users as shown below. You can always get the schema of your graph using CALL dbms.schema().

user

We create a constraint on User labels, so that our queries will run faster.

CREATE CONSTRAINT ON (u:User) ASSERT u.id is unique

Import:

Direct link to data is here. We will model this network as undirected. There is a twist here, that Neo4j supports storing only directed relationships. That is no problem as we can still treat it as undirected in our queries and algorithms.

USING PERIODIC COMMIT 5000
LOAD CSV FROM 
"file:///facebook_combined.txt" as row fieldterminator ' '
MERGE (u:User{id:row[0]})
MERGE (u1:User{id:row[1]})
MERGE (u)-[:FRIEND]-(u1)

We are using MERGE without specifying direction of the relationship as it does not matter in an undirected graph. Using MERGE like this will check for any existing relationships in both direction and if none exists it will choose a random direction to store the new relationship.

Graph Analysis:

Find (weakly) connected components in the graph ?

I always start with checking how many connected components are there in the graph. Most graph algorithms run best on an connected component, so I usually choose the largest component to run further algorithms on. Check documentation for more.

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
return setId as component, count(*) as size

un

As we can see there exists only one connected component in this graph. This is very good as there are no disconnected components. Graph algorithms library also support finding strongly connected components with algo.scc. In undirected graph every weakly connected component is also strongly connected so we can skip it.

 Get average number of friends and standard deviation.

MATCH (u:User)
return avg(size((u)--())) as average_degree,
       stdev(size((u)--())) as stdev

st

A user has on average 43 friends. Standard deviation is quite high so the number of friends is spread out over a wider range of values. We can check now for users with most friends to start looking for influencers in our network.

Degree:

MATCH (u:User)
RETURN u.id as id,size((u)--()) as degree 
order by degree desc limit 10

deg

Most basic metric is degree, that we can get using plain cypher. This directly translates to the number of friends a user has. It can be used as a good starting ground to finding influencers in our network. Top 4 users have more than 10x the average friends with user 107 being really high at 1045 friends (26% of all users).

Calculate triangle count and clustering coefficient for each node and return total triangle count and average clustering coefficient of the network.

CALL algo.triangleCount('User', 'FRIEND',
{concurrency:4, write:true, writeProperty:'triangles', 
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient,
loadMillis, computeMillis, writeMillis

global

There are 4039 node with 1612010 triangles in our network. Average clustering is quite high at 0.6055. This is expected as evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of relationships. [1]

Check user local triangle count and cluster coefficient.

The local clustering coefficient of a node in a graph quantifies how close its neighbours are to being a complete graph, where complete graph is a simple undirected graph in which every pair of distinct nodes is connected by a unique relationship.[1]

If put simple the clustering coeffient would be 1 when all of my friends knew eachother.
Above algorithms also saved values of individual nodes as properties of nodes, so we can now query the results.

MATCH (u:User)
return u.id,u.triangles as triangles,
            u.coefficient as cluster_coefficient 
order triangles desc limit 10

tri

User 1912 is a member of more triangles then user 107 , who has the most friends. They both have a small clustering coefficient, which indicates, that their friends don’t really know eachother and they have a great potencial for introducing people to one another. User 1912,107 were third and first by the number of friends, so combined with this results they are solidifying their importance in the network. There is quite a drop in triangle count to third place, while on the other hand their neighbour communities are more tightly knit.

Betweenness centrality:

Betweenness centrality is useful in finding nodes that serve as a bridge from one group of users to another in  a graph. Consequently, betweenness is a rudimentary measure of the control, that a specific node exerts over the information flow throughout the full graph.

We use direction:'BOTH'for algorithms on an undirected graph.

Check documentation for more.

CALL algo.betweenness.stream('User', 'FRIEND', 
{direction:'BOTH'}) 
YIELD nodeId,centrality
MATCH (u:User) where id(u) = nodeId
RETURN u.id, centrality order by centrality desc limit 15

betwennness

User 107 tops the betweenness centrality results, which is not surprising as he has the most amount of friends with a very low clustering coeffient. So he in the position to act as a broker of informations between different groups of users. He sure does appear as the main influencer in this network based on the metrics we checked. Based on the combined view of centrality metrics we can create a list with users, who are likely to be important in our network and do some further research on them.

Conclusion:

As you can see graph algorithms plugin greatly improves our ability to analyse the network in Neo4j. You should check them out! If you think some algorithms are still missing or find some bugs please let us know on github or the Neo4j slack community.

 

[1] https://en.wikipedia.org/wiki/Clustering_coefficient

Neo4j London tube system analysis

I recently came across London tube dataset, that was uploaded by Nicola Greco. I thought it would a cool example to show some algorithms from the new Neo4j graph algorithms plugin. They are really useful as they allow network analysis without using external services to run the algorithms on.

Requirements:

Graph model:

We will create a very simple graph model, with stations and connections between them as shown below. You can check the always check the schema of your graph with db.schema() procedure.

london

Lets define the schema for our graph model with a constraint and an index, so that all queries will run faster.

CREATE CONSTRAINT ON (s:Station) ASSERT s.id is unique;

CREATE INDEX ON :Station(name);

Import:

LOAD CSV can import data from local files or from the internet, which is really cool, so that we can access data on github easily without the need of any intermediate steps.

Import stations

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row
MERGE (s:Station{id:row.id})
ON CREATE SET s.name = row.name,
              s.latitude=row.latitude,
              s.longitude=row.longitude,
              s.zone=row.zone,
              s.total_lines=row.total_lines

Import connections between stations.
We create relationships in both directions as trains also travel in both directions. We save the time spent traveling between stations as a relationship property, so that we can use it as a weight in our algorithms.

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row
MATCH (s1:Station{id:row.station1})
MATCH (s2:Station{id:row.station2})
MERGE (s1)-[:CONNECTION{time:row.time,line:row.line}]->(s2)
MERGE (s1)<-[:CONNECTION{time:row.time,line:row.line}]-(s2)

Analysis:

Which stations have the most connections to other stations?

MATCH (n:Station)--(n1)
RETURN n.name as station,
       count(distinct(n1)) as connections 
order by connections desc LIMIT 15

Results:

connections

Find the fastest path based between two stations.

algo.shortestPath is a dijsktra algorithm, that returns shortest path between start and end node. For more info check the documentation.

transfers are not taken into account

MATCH (start:Station{name:"Baker Street"}),(end:Station{name:"Beckton Park"})
CALL algo.shortestPath.stream(start, end, 'time')
YIELD nodeId, cost
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,cost as time

Results:

single_connection

Find out which stations take the longest to get to from Baker Street station.

algo.shortestPaths is a dijsktra algorithm, that return shortest paths from start node to all the other nodes in the network.

transfers are not taken into account

MATCH (n:Station {name:'Baker Street'})
CALL algo.shortestPaths.stream(n, 'time')
YIELD nodeId, distance
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,distance as time 
ORDER BY time desc LIMIT 15

Results:

single_many

Find the longest shortest paths between stations in the network.

algo.allShortestPaths returns shortest paths between all pairs of nodes in the network

transfers are not taken into account

CALL algo.allShortestPaths.stream('time',
{nodeQuery:'Station',relationshipQuery:'CONNECTION',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance 
ORDER BY distance desc LIMIT 20
// We filter out duplicates
WHERE sourceNodeId > targetNodeId
MATCH (s1:Station),(s2:Station) 
WHERE id(s1)=sourceNodeId AND id(s2)=targetNodeId
RETURN s1.name as station1,s2.name as station2,distance as time

Results:

all_paths.png

Find bridge stations in network.

We can use betweenness centrality to help us identify, which are the stations, where people will be passing by the most, based on the structure of the network.

algo.betweenness is a betweenness algorithm, that helps us find bridges in the network. For more info check documentation.

CALL algo.betweenness.exp1.stream('Station','CONNECTION')
YIELD nodeId, centrality
MATCH (n:Station) where id(n)=nodeId
RETURN n.name,centrality 
ORDER BY centrality desc
LIMIT 15

Results:

bett

Conclusion:

It is very easy to enhance Neo4j with graph algorithms. You just copy the .jar file to /plugins folder, restart Neo4j and you are good to go. They offer lots of cool algorithms, that can help you analyse your graph in Neo4j. You should try them out!

Neo4j GoT Graph Analysis

Winter has come!

Thanks to Joakim Skoog, who built this cool API of ice and fire. It offers data about GoT universe, that we will examine and do some further analysis.

I will show you how to use just released neo4j graph algorithms plugin on the data, that we got from the above API. As the name suggests, they are user defined procedures of graph algorithms, that you can call with cypher. For most algorithms there are two procedures, one that writes results back to the graph as node-properties and another (named algo..stream) that returns a stream of data, e.g. node-ids and computed values.

Requirements:

Import:

Our cool community caretaker Michael Hunger spotted this data and posted a blog post, so that we can easily import the data into our Neo4j using cypher shell. I followed his script, imported the data, and am now sharing the graph.db database, that you can simply copy to your neo4j/data/databases folder and you will have the GoT universe in your Neo4j

Graph Model:

got.png
We have imported the data, that describes a network of persons and houses, with some additional meta-data. There are a couple of relationships between persons and houses like SWORN_TO and LED_BY. We can also find family tree hiearchies in PARENT_OF relationship. I will use the SWORN_TO relationship between houses, that describes house hiearchy of allegiances. It is a directed, unweighted network.

Analysis:

Sworn to network

sworn_to.png

At first glance we can observe, that we have some bridges between different part of communities and that they all link towards the center where the power resides. As you will see this network is from the peaceful times, when the 7 kingdoms were at peace.

Connected Components:

As a part of preprocessing we will check how many disconnected components exist within our network with algo.unionFind.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Query:

CALL algo.unionFind('House', 'SWORN_TO', {write:true, partitionProperty:"partition"})

Result:
updateconect.png
We see that there is one big component with 393 members and a small one with 5 members. All the rest are disconnected with only one member. That means that either they are a loner house or we are missing some data. We will use centrality algorithms only on the biggest component with 393 members as graph algorithms work better on connected graphs.

Centralities:

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

Outgoing degree is always one as each house is sworn to only one house.

Incoming degree:

We simply count how many incoming relationships each house has. In this example it directly translates to how many houses are sworn to each house.

MATCH (h:House)<-[:SWORN_TO]-(:House)
RETURN h.name as house,count(*) as in_degree order by in_degree desc limit 10

updatedsworn

House Tyrell leads with a decent margin in front of Lannisters. The main force is house Baratheon, that consists of two branches and has combined number of sworn houses at 77. Others slowly follow, its funny ( sad ? ) to see, that even house Baelish of Harrenhal, which is ruled by Petyr Baelish has more support than house Stark of Winterfell. Books and show divergence a lot for Littlefinger’s character as he is not a ruler of Harrenhal in the show.

Pagerank centrality:

Pagerank is a version of weighted degree centrality with a feedback loop. You only get your “fair share” of your neighbor’s importance. That is, your neighbor’s importance is split between their neighbors, proportional to the number of interactions with that neighbor. Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures where the power resides. Houses that have the support of other big houses will come on top.

CALL algo.pageRank(
'MATCH (h:House) where h.partition = 151 return id(h) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:30, write: true});

pagerank.png

I was surprised with the results to say the least, but after investigating a bit I understood why. Remember in the picture of sworn to network, where I told you that all the different communities point at the center of the graph where the power resides. And at that center are the two Baratheon houses, that are sworn to each other and hence the number is so big. Where incoming degree takes into account only the number of your neighbours, pagerank also takes into account their importance. You can notice that house Stark has a better position than house Baelish, because Stark’s supporters are more important ( have more sworn to houses ).

Betweenness centrality:

Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through that person. Such an intermediary position gives that person power and influence. Betweenness centrality is a raw count of the number of short paths that go through a given node. For example, if a node is located on a bottleneck between two large communities, then it will have high betweenness.

CALL algo.betweenness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty: 'betweenness'});

betweenness.png
House Baratheon is uniquely positioned at the center of the graph, so it will be the first in all of the centrality measures. Strong second candidate are the Tyrells, which is a bit surprising as they beat the Lannisters at both pagerank and betweenness centrality. In our case betweenness centrality measures which are the bridging houses with big support of other houses at their back.

Closeness centrality:

The most central nodes according to closeness centrality can quickly interact to all others because they are close to all others. This measure is preferable to degree centrality, because it does not take into account only direct connections among nodes, but also indirect connections.

CALL algo.closeness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty:'closeness'});

closeness.png
Closeness centrality is different from others in that the lower the closeness weight is, the better. As always house Baratheon leads with a decent margin, Tyrells are standing firmly on the second place. Starks are better than Lannisters at closeness, but are losing in all other measures, so they get fourth place and Lannisters third.

Conclusion:

This is a network from peaceful times and is therefore less interesting than having two strong disconnected components, that fight each other. What would be cool is to get a network based on the current show or even better that G.R.R.M publishes the Words of Winter, where alliances will change and people will die, to see how the graph evolves through time.

Check also https://github.com/neo4j-examples/game-of-graphs, where you can see how to expose this data as a GraphQL endpoint.

Thank for reading!

Neo4j apoc.load.jdbc

I recently made a simple query, that I used for syncing postgreSQL to Neo4j on an hourly basis with a cron job. My job was to make a simple scripts, that would automate the syncing of Neo4j with a SQL database. This is a very simple task thanks to APOC functions, which has loads of very useful functions, we can call with cypher.

I had to import a couple of SQL tables and join them together in a graph to have a better overview of the data. This is an example of query, that was used to import a table with sales data.

Graph Model:

jdbc2.png

As you can see we are importing some sales data from the postgreSQL database in my case. I made a very simple graph model to capture the data we get about our sales. Main focus is the invoice. We use domain instead of company name as an index as this works out better in our case.

You will need:

Just download them and put them into neo4j /plugins directory and that’s it!

Authorization:

Sample call looks like:

CALL apoc.load.jdbc('jdbc:mysql://localhost:3306/proddb1?user=root&password=football',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

with jdbc:mysql://localhost:3306/proddb1?user=root&password=football' being the connection string, that contains all the sensitive information such as ip, username and password of your SQL database.
If you want to hide/alias the connection string, this can be accomplished by adding to the conf/neo4j.conf a parameter similar to:

apoc.jdbc.myalias.url=jdbc:mysql://localhost:3306/proddb1?user=root&password=football

and now the above sample call can be rewritten as:

CALL apoc.load.jdbc('myalias',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

For more info check knowledge base post.

Query:

I made some logic with cypher, where I first get the date of the latest invoice in my Neo4j and then only select those that are were updated on the same day or later. I imagine there is a better way of doing this, but it serves our purpose.

// match existing invoices
MATCH (w:Invoice)
// get the latest date in the desired format
WITH apoc.date.format(max(w.unix),'s','yyyy-MM-dd') as last_invoice
// prepare SQL select statement
WITH 
'SELECT * FROM public.order WHERE updated_at::date >=?::date' as order_query,
last_invoice
// call load.jdbc with the latest date of existing data as a param
CALL apoc.load.jdbcParams(postgresurl, order_query, [last_invoice]) yield row
// we need to parse out domains from the emails
WITH row,split(row.customer_email,"@")[1] as domain
// if the email is in below list, then take whole email as a code
// so that those with gmail are not treated as one customer
WITH *,case when domain in ["gmail.com","yahoo.com","outlook.com","hotmail.com","msn.com","hotmail.co.uk"] 
  then row.customer_email
  else domain end as code
MERGE (i:Invoice{order_id:row.order_number})
// we round all the numbers to 2 digits
ON CREATE SET i.status = row.status,
              i.total_amount = apoc.math.round(row.total,2,"UP"),
              i.quantity = toINT(row.total_line_items_quantity),
              i.total_tax = apoc.math.round(row.total_tax,2,"UP"),
// save a human readable date format
              i.date = apoc.date.format(row.completed_at,"ms","yyyy-MM-dd'T'HH:mm:ss.SSS"),
              i.postcode = row.customer_billing_address_postcode,
              i.address = row.billing_address_address_1,
              i.total_discount = apoc.math.round(row.total_discount,2,"UP"),
              i.total_shipping = apoc.math.round(row.total_shipping,2,"UP"),
              i.unix = (row.completed_at / 1000),
              i.invoice_number = row.order_meta_tzs_invoice_number,
              i.total_refunded = apoc.math.round(row.refunded_total,2,"UP")
// columns, that can change in our invoices in time, we add to ON MATCH statement	
ON MATCH SET i.status = row.status,
             i.total_refunded = apoc.math.round(row.refunded_total,2,"UP"), 
             i.total_tax = apoc.math.round(row.total_tax,2,"UP"), 
             i.total_amount = apoc.math.round(row.total,2,"UP"),
// merge domain and person nodes and connect them 
MERGE (domain:Domain{name:code})
MERGE (person:Person{email:row.customer_email})
MERGE (domain)-[:MEMBER]->(person)
ON CREATE SET 
    person.name = row.billing_address_first_name + " " + row.customer_billing_address_last_name,
              person.ip = row.customer_ip ,
              person.phone = row.customer_billing_address_phone
// merge payment menthod node
MERGE (pay:PaymentMethod{name:row.payment_details_method_id})
MERGE (i)-[:PAYMENT_METHOD]->(pay)
// merge location tree
MERGE (country:Country{code:row.billing_address_country})
MERGE (country)(city)
RETURN distinct 'orders imported'

This is so easy I had to share with you, so you can maybe do a pilot project on your own to see how does Neo4j fit to your needs. Just wrap around it your favourite scripting language and you are good to go.

Game of Thrones part 4 Analysis

If you followed my GoT series so far, you have seen how to import a  GoT dataset, that I found on kaggle.com into Neo4j. There are only a couple days left before new season starts, so let’s do some analysis on the dataset. You need to know, that this data was derived from the books, so it does not contain some battles and data about the events that happened on the show as it has already surpassed the books.

Requirements:

Data:

We have already imported all the data from the CSVs into our Neo4j in previous posts. If we look closely, we can find that some persons have two houses they belong to, and one of them is None, which is the default unknown value in characters_death.csv.

Screen Shot 2017-07-14 at 13.48.42.png

To get rid of this anomaly, we can simply MATCH all persons that belong to House “None” and a separate house also. Then we just delete the relationship pointing to House “None” and we have solved the problem.

match (h1:House{name:"None"})-[r]-(p:Person)--(h2:House)
delete r

Distributions:

We can start by visualizing few distributions with Spoon.js directly in Neo4j Browser. Does not work on 3.2.x as the Neo4j browser was rewritten. 

Which house won/lost most battles ?

match (house:House)
OPTIONAL MATCH (house)-[fight:ATTACKER|:DEFENDER]->()
return house.name as house, 
       sum(case fight.outcome when "win" then 1 else 0 end) as wins,
       sum(case fight.outcome when "loss" then 1 else 0 end) as losses
order by wins + losses desc limit 20

House_battle_results.png

We can easily see the Lannisters are currently on the rise, with 11 wins and 7 losses. Second are the Starks, on the other hand in decline, with 6 wins and 10 losses. Red Wedding and what came afterward was really devastating for the Starks. Freys and Boltons look in a good position, as they were on the winning side of the Red Wedding, but those who watch the show, know that a bitter end is waiting for them.

Distribution of alive/dead people by house:

match (p:Person)--(h:House)
return h.name,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

house_dead.png

It’s a hard life being in a Night’s Watch and we can see that mortality is high. Almost half the characters from Night’s Watch have died. It is also not a good time for Targaryens (yet!), as they were mostly killed off during the Robert’s Rebellion AKA the War of the Usurper. We all know that Starks are dying like flies, but it came as a surprise to see so many dead Lannisters.

Distribution of alive/dead people by culture:

match (p:Person)--(c:Culture)
return c.name as culture,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

culture_dead

We can observe, that the most represented cultures are Northmen and Ironborn. While seeing so many Northmen in the books does not come as a surprise, as the whole story starts with House Stark living in Winterfell. Surprising is that there are so many Ironborns in the books. I think this all has to do with the Kingsmoot at Iron Islands, where we get to know lots of different Ironborn characters.

Valyrians are almost extinct, but they have the hidden joker Daenerys Targaryen, who has a big role yet to play.

Distribution of alive/dead people by books:

match (p:Person)-[r]-(h:Book)
with h.name as book,
     h.sequence as sequence,
     sum(case when type(r)="APPEARED_IN" then 1 else 0 end) as total,
     sum(case when type(r)="DIED_IN" then 1 else 0 end) as dead 
     order by sequence asc
return book,total,dead

book_dead.png

This figures of dead people seem small-ish for GoT, so I guess not all deaths are recorded. We can see, how the author slowly build up the community of all the people in the books. It started already with 400 characters in the first book and rose up to almost 1200 characters in the fourth book. That is a lot of characters to keep track. Storm of swords was the bloodiest book with 97 deaths, while Feast for Crows has unusual little deaths, only 27.

Analysis:

We will use algo.unionFind, that is a (weakly) connected components algorithm.  Taken from the docs.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Which houses fought on the same side ?

We can create a network of Houses from indirect patterns of battles they fought together. We will also assume that an ally of my ally is also my ally. In this case we can use algo.unionFind to solve the problem.

Create a network of houses that fought on the same side

match (house1:House)-[r1]-(battle:Battle)-[r2]-(house2:House)
WHERE type(r1)=type(r2)
MERGE (house1)-[:ALLY]-(house2)

Run the algorithms and write back the results as a property of the node community.

CALL algo.unionFind('House', 'ALLY', {write:true, partitionProperty:"community"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (h:House)
return distinct(h.community) as community,
       collect(h.name) as members,
       count(*) as count 
order by count desc limit 4

House_community.png

We observe 4 different communities (partitions) came back as a result. We can see that Lannisters teamed up with Boltons and Freys to defeat the Starks. Night’s Watch and Baratheon fought on the same side when they defended the Wall from Wildlings (Free folk, Thenns). Stark and Tully have an alliance with marriage (Catelyn Tully and Ned Stark), so it is no wonder to see them fight battles on the same side.

Which persons fought on the same side in the year 299 ?

We will do something similar for persons also, but we will focus only on the year 299 as there were most battles fought that year. The reason why I chose a specific year is also, because alliances change over time and if we looked at all battles, then we get most persons in one community as they fought on the same side at least once.

Create a network of persons that fought on the same side in the year 299.

MATCH (p1:Person)-[r1]-(b:Battle)-[r2]-(p2:Person) 
WHERE b.year = 299 and type(r1)=type(r2)
MERGE (p1)-[:FRIEND]-(p2)

Run the algorithms and write back the results as a property of the node community.

CALL algo.unionFind('Person', 'FRIEND', {write:true, partitionProperty:"community"}) 
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (p:Person)
return distinct(p.community) as community,
       collect(p.name) as members,
       count(*) as count 
order by count desc limit 10

Screen Shot 2017-07-14 at 16.11.52

We observe that 10 different communities (partitions) exist. It is a good representation of the state of alliances, that fought together on the specific year. We could do this analysis for a couple of different years and compare alliances over the year how they change.

Conclusion:

We have only scratched the surface of what is possible to do with this dataset so stay tuned for more!

You can grab the Neo4j database and just copy it into data/databases folder

Got for 3.2.x

Got for Neo4j 3.1.x