Neo4j Graph Algorithms projecting a virtual graph

Last time I started to talk about projecting virtual graphs with cypher queries and showed how to project a bidirected graph between nodes of only the largest component (island) in the graph. For now we learned how to project bidirected graphs and apply similarity threshold with “cypher loading”. Another common use that I haven’t talked about yet is projecting virtual relationships (not stored) to algorithms engine.

Requirements:

Dataset:

We will use the actors database that we get following :play movies guide in Neo4j browser. We have persons and movies in our graph and various relationships between them. I like to use this dataset as a example of what Neo4j is capable of.

matrix1.png

Projecting a virtual graph:

I will show how to project an undirected mono-partite graph from a bi-partite graph via cypher queries. This is pretty useful when you want to analyse different networks of your graph, but don’t want to actually save those networks to your database.

Louvain:

Projecting a mono-partite graph from a bi-partite graph can be easily undestood as:

(:Movie)(:Person)(:Movie)

to

(:Movie)←[{weight=Nb of Common persons}]→(:Movie)

We will project a virtual undirected graph with a similarity threshold of 2 and run Louvain algorithm on it. With it we will try to find groups of movies that share more than 2 common persons.

CALL algo.louvain.stream(
 'MATCH (m:Movie) RETURN id(m) as id',
 'MATCH (m1:Movie)<--(:Person)-->(m2:Movie)
 WITH m1,m2,count(*) as common_actors
 WHERE common_actors > 2
 RETURN id(m1) as source,id(m2) as target, common_actors as weight'
,{graph:'cypher'}) yield nodeId,community
MATCH (movie:Movie) where id(movie)=nodeId
RETURN  community,
        count(*) as communitySize,
        collect(movie.title) as movies
ORDER BY communitySize desc limit 5

kajapa.png

There aren’t many communities as this dataset is small. The largest community are movies that are directed by Lana Wachowski and Lilly Wachowski.

Pagerank

We can also project a person to person network from our bi-partite graph.

(:Person)(:Movie)(:Person)

to

(:Person)←[:{weight=Nb of common movies}]→(:Person)

Similarly like before we will project an undirected graph with a similarity threshold of 2 and run pageRank on it.

call algo.pageRank.stream(
  'MATCH (p:Person) return id(p) as id',
  'MATCH (p1:Person)-->(:Movie)<--(p2:Person)
   WITH p1,p2,count(*) as commonMovies        
   WHERE commonMovies > 2
   RETURN id(p1) as source, id(p2) as target',
{graph:'cypher'})
YIELD node,score WITH node,score
ORDER BY score desc limit 10
RETURN node.name as person, score

matrix.png

Camerow Crowe leads in first place by far margin followed by the Wachowski brothers. We have most of the Matrix crew in the first 10 spots as they collaborated in 3 movies and have stronger ties between each others.

Conclusion:

I skipped theory behind graph algorithms on purpose as I have explained it a couple of times now and don’t want to repeat myself too much. The mission of this blog post is to demonstrate the abilities cypher loading offers us in terms of projecting virtual graphs. If you want to learn more check previous posts and the documentation.

Have some feedback or questions?
Join the #neo4j-graph-algorithms channel on  Neo4j slack community

Advertisements

Neo4j Marvel Social Graph Algorithms Centralities

To top off the Marvel Social Graph series we will look at how to use centralities on a projected graph via cypher queries to find influencers or otherwise important nodes in our network using Neo4j and neo4j-graph-algorithms library.

To recap the series:

Graph projections via cypher queries:

As we noticed in the previous part using graph projections via cypher queries or for short “cypher loading”  is really great as it lets us filter and/or project virtual graphs easily and quickly. To let you fully take advantage of this awesome tool we need to get to know exactly how it works.

Unlike default label and relationship type of loading subsets of graphs, where we can in some cases define direction of the relationship to be either “incoming”,”outgoing” or “both”(birected/undirected) , cypher loading does not support loading single relationship as undirected.

While this may seem bad it’s actually not as cypher loading allows us to get creative with trying out graph algorithms on different virtual networks that we can project using cypher queries. We already did this in the previous post, but I didn’t describe it in detail yet.

Imagine that we have two hero nodes and a single directed relationship between them.
Only difference between loading this graph as undirected or directed is specifying the direction of the relationship in the cypher query. When we do not specify the direction of the relationship in the cypher query, cypher engine returns two patterns in both direction for each relationship in our graph. That in turn projects our network bidirected or undirected.

projecting directed network:

MATCH (u1:Hero)-[rel:KNOWS]->(u2:Hero)
RETURN id(u1) as source, id(u2) as target

projecting undirected network:

MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero)
RETURN id(u1) as source, id(u2) as target

Centralities

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.[1]

Pagerank

PageRank is Google’s popular search algorithm. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the node is. The underlying assumption is that more important nodes are likely to receive more links from other websites.

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

call algo.pageRank.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source, id(u2) as target
',{graph:'cypher'}
) yield node,score 
WITH node,score ORDER BY score DESC limit 10 
return node.name as name, score;

Captain America has the highest pagerank score. He is located in the middle of the network with a total of 24 relations and also relations to most of the other important heroes in the network like Thor, Spiderman and Iron Man. If we check all heroes related to Captain America, we can notice, that they have on average higher Pagerank score just because of this relation to Captain America.

marvel_pagerank.png

* Node color from white (less) to red (more): Pagerank

Closeness centrality

Closeness centrality is defined as the total number of links separating a node from all others along the shortest possible paths. In other words, to calculate closeness, one begins by calculating, for each pair of nodes in the network, the length of the shortest path from one to the other (aka the geodesic distance). Then for each node, one sums up the total distance from the node to all other nodes.[2]

Closeness can be interpreted as an index of time-until-arrival of something flowing through the network. The greater the raw closeness score, the greater the time it takes on average for information originating at random points in the network to arrive at the node. Equally, one can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.[2]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships. With closeness centrality it is especially important that we load only a single component.

Unfortunately, when the graph is unconnected, closeness centrality appears to be useless because the distance between two nodes belonging to different components is infinite by convention, which makes the sum in 2 infinite too and therefore its inverse equal to zero. For every node of such a graph, there is always another node belonging to another component: indices of all vertices of the graph are therefore useless and the calculation of the index is limited to the largest component, omitting the roles played by individuals of other components.[3]

CALL algo.closeness.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

Captain America is in such a privileged position, that he will be leading in all categories of centralities. We can observe that nodes in more tight communities have higher closeness centrality indexes while those on the brinks and less connected have smaller values. Second thing we can notice is that also the overall position of nodes in the graph matter as the middle community has on average higher closeness centrality as others. As an example both Iron Man and Vision have higher closeness centrality than Spiderman, while Spiderman has higher Pagerank index than them.

marvel_closeness.png

* Node color from white (less) to red (more): Closeness centrality

Harmonic Centrality

The harmonic mean has been known since the time of Pythagoras and Plato as the mean expressing “harmonious and tuneful ratios”, and later has been employed by musicians to formalize the diatonic scale, and by architects as paradigm for beautiful proportions.[4]

Social network analysis is a rapid expanding interdisciplinary field, growing from work of sociologists, physicists, historians, mathematicians, political scientists, etc. Some methods have been commonly accepted in spite of defects, perhaps because of the rareness of synthetic work like (Freeman, 1978; Faust & Wasserman, 1992). Harmonic centrality was proposed as an alternative index of closeness centrality defined on undirected networks. Results show its computation on real cases are identical to those of the closeness centrality index, with same computational complexity and we give some interpretations. An important property is its use in the case of unconnected networks.[3]

CALL algo.closeness.harmonic.stream( 
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id 
',' 
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold 
WHERE k.weight >= 100 
RETURN id(u1) as source,id(u2) as target '
,{graph:'cypher'}) YIELD nodeId, centrality 
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10 
MATCH (h:Hero) where id(h)=nodeId 
RETURN h.name as hero, centrality

Harmonic centrality was proposed as an alternative for closeness centrality to help solve the problem of disconnected components. Because of this we get back very similar results, given that we also have a single connected component.

marvel_harmonic

Betweenness Centrality

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.[6]

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

CALL algo.betweenness.stream(
// Match only the biggest component
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Similarity threshold
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

As always Captain America is in first place and this time Beast being in the second place. This comes as no surprise as we can observe that he is the sole bridge between middle and right community. Spiderman and Incredible Hulk play a similar role as Beast, but have smaller communities behind them and hence also smaller betweenness scores.

marvel_betweeness.png

* Node color from white (less) to red (more): Betweenness centrality

References:

[1] https://en.wikipedia.org/wiki/Centrality

[2] http://qualquant.org/wp-content/uploads/networks/2008%201-7-3.pdf

[3] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf?

[4] https://arxiv.org/pdf/cond-mat/0008357.pdf

[6] https://en.wikipedia.org/wiki/Betweenness_centrality

Neo4j Marvel Social Graph Algorithms Community Detection

In the first part we infered a Hero to Hero network from a bi-partite graph of heroes and comics. Following was the second part where we got to know some basic network information to help us get a sense of what kind of a network we are dealing with.

In this part I will continue on finding resilient communities within our network using Louvain method and Label Propagation algorithm.

* Visualizations are made with Gephi. Check my previous post Neo4j to Gephi for more information. Another option is to use neovis.js to visualize communities.

Graph projection

Neo4j graph algorithms support two ways of loading subset of the graph, as a virtual graph to quickly run the algorithms on.  First one is known as label and relationship-type loading, where we load nodes by labels and relationships by their types.

What if we wanted to run algorithms on very specific subsets of graphs, but labels and relationship types are not descriptive enough or we do not want to update our actual graph?

Luckily we can load or project subsets of our graph using Cypher statements. Use a cypher query to fetch nodes instead of the label parameter and a second cypher query for fetching relationships instead of the relationship-type parameter.

Use graph:'cypher' in the config.

Example:

CALL algo.unionFind(
//First cypher query is used to fetch nodes
    'MATCH (p:User)
     WHERE p.property = 'import'
    RETURN id(p) as id',
//Second cypher query is used to fetch relationships
    'MATCH (p1:User)-[f:FRIEND]->(p2:User) 
     RETURN id(p1) as source, id(p2) as target,f.weight as weight',
{graph:'cypher',write:true});

Projecting and loading graphs via cypher queries allows us to describe the graph we want to run algorithms on in great detail. Not only that, but we can also use it to project virtual graphs from indirect patterns or omit some relationships to be loaded without actually deleting them.

Cypher projection use cases:
  • Filtering nodes and relationships.
  • Loading indirect relationships.
  • Projecting bidirected graph. Example
  • Similarity threshold. (discussed here)

Community detection

In the study of networks, such as computer and information networks, social networks and biological networks, a number of different characteristics have been found to occur commonly, including the small-world property, heavy-tailed degree distributions, and clustering, among others. Another common characteristic of networks is community structure, which is the pattern of connections and groupings. The connections within real-world networks are not homogenous or random which suggests certain natural divisions exist.[1]

Community detection algorithms do not perform well in a very connected graph as most of the nodes are densely connected, hence they belong to the same community. This usually leads to poor results where we end up with one big community that stretches over most of the graph and some small communities.

We introduce similarity threshold concept, where the weight of the relationship has to be above certain value or the relationship is ignored. We can easily exclude these relationships using graph projections via cypher queries.

In this post we will set the weight threshold to 100 so the resulting communities should be tightly-knit and resilient.

Connected components

Connected Components or UnionFind algorithm basically finds sets of connected nodes also known as islands where each node is reachable from any other node in the same set.
In graph theory, a connected component of an undirected graph is a subgraph in which any two nodes are connected to each other by paths, and which is connected to no additional nodes in the graph. More in documentation.

As with any new network I come across I want to first know how many connected components exist in the network and what is their size. Because we will set the weight threshold to 100 we will get back much less connected (sparser) graph than in previous post where threshold 10 was used.

We use the default label and relationship type loading in this example, where we load all nodes labeled “Hero” and relationships of type “KNOWS”. We also set threshold to 100, which means that only relationships with weight greater than 100 are considered by the algorithm.

CALL algo.unionFind.exp3.stream('Hero', 'KNOWS',
{weightProperty:'weight', defaultValue:0.0, threshold:100.0}) 
YIELD nodeId,setId
RETURN setId as component,count(*) as componentSize
ORDER BY componentSize DESC LIMIT 10;

m_ccAs expected our graph is sparse with one big component, that has 101 members and 6 small components with 2-4 members. 116 out of total 6439 heroes have at least 1 relationship with weight more than 100.

If we visualize the largest component of 101 nodes in Neo4j Browser, we can easily observe that there are some intuitive communities hidden here and some bridge nodes between those communities. We will try to define community structure of this subgraph with the help of Louvain method and Label Propagation algorithm.

marvel_cc.png

Louvain

Communities are groups of nodes within a network that are more densely connected to one another than to other nodes. Modularity is a metric that quantifies the quality of an assignment of nodes to communities by evaluating how much more densely connected the nodes within a community are compared to how connected they would be, on average, in a suitably defined random network. The Louvain method of community detection is an algorithm for detecting communities in networks that relies upon a heuristic for maximizing the modularity.[2]

As mentioned we will use graph projecting via cypher queries to load only relationships with weight 100 or more.

CALL algo.louvain.stream(
// load nodes
    'MATCH (u:Hero) RETURN id(u) as id', 
// load relationships
    'MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) 
// similarity threshold 
    WHERE rel.weight > 100
    RETURN id(u1) as source,id(u2) as target',
{graph:"cypher"}) 
YIELD nodeId,community
MATCH (n:Hero) WHERE id(n)=nodeId
RETURN community,
       count(*) as communitySize, 
       collect(n.name) as members 
order by communitySize desc limit 5

I use Gephi for visualizing communities as it is more pleasant and insightful to look at good visualizations instead of tables.

*Node color: Louvain community, node size: Pagerank, name size: Betweenness centrality

marvel_social_louvain.png

I am not an expert in Marvel domain, so I will just give a brief explanation of the results. We get a total of 8 communities. The largest and also the best positioned community is the purple. It consists mostly of team Avengers and S.H.I.E.L.D with Captain America being the leader. On the left we can find Mr. Fantastic and Fantastic Four team in the same purple community . Light-blue is the Spiderman team, with Spiderman being their only connection to the outside world. They keep to themselves and don’t mingle with others. Dark blue are the Asgardians who also keep to themselves and communicate only through Thor to outside world. Incredible Hulk Series has it’s own community and again Hulk being the only connection to outside world. We observe that Beast is in a unique position as a bridge between purple and green community, which is the X-Men community.

Label propagation algorithm

Label Propagation algorithm which was first proposed by Raghavan et al. (2007) uses unique identifiers of nodes as labels and propagates the labels based on an agreement with the majority of the neighbour nodes and each node selects a label from its neighbourhood to adopt it as its label. LPA works as follows: Node x has neighbours and each neighbour carries a label denoting the community to which they belong to. Each node in the network chooses to join the community to which the maximum number of its neighbours belongs to, with ties broken uniformly and randomly. At the beginning, every node is initialized with unique label (called as identifier) and the labels propagate through the network. At every step of propagation, each node updates its label based on the labels of its neighbours. As labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label.[3]

More in documentation.

Similar to Louvain algorithm we will use graph projecting via cypher queries to load only relationships that have weight more than 100. I used the write back version, so that I could export the results to Gephi for visualization.

CALL algo.labelPropagation(
// supports node-weights and defining 
// initial communities using parameter value
    'MATCH (u:Hero) RETURN id(u) as id, 1 as weight,id(u) as value',
// load relationships 
    'MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) 
// Similarity threshold
    WHERE rel.weight > 100
    RETURN id(u1) as source,id(u2) as target, rel.weight as weight',
'OUT',{graph:"cypher",partitionProperty:"lpa" }) 
YIELD computeMillis

We get back 21 communities with some single node communities.
Team Avengers(purple) and Fantastic Four(light blue) get split up into two separate communities. Spiderman(green), Incredible Hulk(turquoise) and Asgardians(red) communities are same as in Louvain results. We observe that X-Men faction also gets split up into two communities and Cannonball group is slightly bigger this time and not so isolated.
marvel_social_lpa.png

*Node color: LPA community, node size: Pagerank, name size: Betweenness centrality

Conclusion

Hope you have noticed by now that neo4j graph algorithms plugin is really awesome and easy to use. Combine that with projecting virtual graphs via cypher queries feature and you get an easy and efficient way to analyse and understand your graph.

Firstly I was going to show how to run centrality algorithms in this blog post also, but decided not to as it would be way too long, so I will shorty post another blog post with examples of centralities and cypher projections.

Stay tuned!

References:

[1] https://en.wikipedia.org/wiki/Community_structure

[2] https://www.quora.com/Is-there-a-simple-explanation-of-the-Louvain-Method-of-community-detection

[3] http://shodhganga.inflibnet.ac.in/bitstream/10603/36003/4/chapter3.pdf