To top off the Marvel Social Graph series, we will look at how to run centrality algorithms on a graph projected via cypher queries, in order to find influencers or otherwise important nodes in our network. We will use Neo4j and the neo4j-graph-algorithms library.

To recap the series:

- Neo4j Marvel Social Graph – importing and projecting the Marvel social graph
- Neo4j Marvel Social Graph Analysis – Social graph statistics and distributions
- Marvel Social Graph Community Detection – Finding communities using the Louvain and label propagation algorithms

#### Graph projections via cypher queries:

As we noticed in the previous part, using graph projections via cypher queries, or “cypher loading” for short, is really great, as it lets us filter and/or project virtual graphs easily and quickly. To take full advantage of this awesome tool, we need to know exactly how it works.

The default way of loading a subset of the graph is by label and relationship type, where in some cases we can define the direction of the relationship as “incoming”, “outgoing” or “both” (bidirected/undirected). Cypher loading, in contrast, does not support loading a single relationship as undirected.

While this may seem bad, it actually is not, because cypher loading lets us get creative and try out graph algorithms on different virtual networks that we project using cypher queries. We already did this in the previous post, but I have not described it in detail yet.

Imagine that we have two hero nodes and a single directed relationship between them.

The only difference between loading this graph as directed or undirected is whether we specify the direction of the relationship in the cypher query. When we do not specify the direction, the cypher engine returns two patterns for each relationship in our graph, one in each direction. That in turn projects our network as bidirected, or undirected.

**projecting directed network:**

MATCH (u1:Hero)-[rel:KNOWS]->(u2:Hero) RETURN id(u1) as source, id(u2) as target

**projecting undirected network:**

MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero) RETURN id(u1) as source, id(u2) as target
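To make the difference between the two queries concrete, here is a minimal Python sketch that mimics the rows each pattern would return. It does not talk to Neo4j; the function names and the single toy relationship are hypothetical and only illustrate the behaviour described above.

```python
# Hypothetical toy data: one directed KNOWS relationship stored in the graph.
relationships = [("Captain America", "Thor")]


def project_directed(rels):
    """Mimic (u1)-[:KNOWS]->(u2): one row per stored relationship."""
    return [(source, target) for source, target in rels]


def project_undirected(rels):
    """Mimic (u1)-[:KNOWS]-(u2): every relationship is matched in both directions."""
    rows = []
    for source, target in rels:
        rows.append((source, target))
        rows.append((target, source))
    return rows


print(project_directed(relationships))
print(project_undirected(relationships))
```

The undirected projection yields twice as many rows, which is exactly what makes the algorithm treat each relationship as traversable both ways.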

## Centralities

In graph theory and network analysis, indicators of **centrality** identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.[1]

#### PageRank

PageRank is the algorithm that made Google’s search famous. It works by counting the number and quality of links pointing to a node (originally a web page) to produce a rough estimate of how important that node is. The underlying assumption is that more important nodes are likely to receive more links from other nodes.

More in documentation.
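As a rough illustration of the "number and quality of links" idea, the following Python sketch implements the textbook, normalized power-iteration form of PageRank on a tiny hypothetical star network. It is not the library's implementation; the damping factor, iteration count and toy edges are assumptions, and the library may scale its scores differently.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Textbook PageRank by power iteration over directed (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical undirected toy network, expanded to both directions
# just like an undirected cypher projection would be.
pairs = [("Captain America", "Thor"),
         ("Captain America", "Iron Man"),
         ("Captain America", "Spiderman")]
edges = pairs + [(b, a) for a, b in pairs]

ranks = pagerank(edges)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(name, round(score, 4))
```

In this toy network the hub collects rank from all three neighbours, while each neighbour only receives a third of the hub's rank.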

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

    CALL algo.pageRank.stream(
    // Match only the biggest component
    'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
    ','
    MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
    // Similarity threshold
    WHERE k.weight >= 100
    RETURN id(u1) as source, id(u2) as target
    ',{graph:'cypher'}
    ) YIELD node, score
    WITH node, score ORDER BY score DESC LIMIT 10
    RETURN node.name as name, score;

Captain America has the highest PageRank score. He is located in the middle of the network with a total of 24 relations, including relations to most of the other important heroes in the network such as Thor, Spiderman and Iron Man. If we check all heroes related to Captain America, we can see that they have on average a higher PageRank score simply because of this relation to Captain America.

* Node color from white (less) to red (more): Pagerank

#### Closeness centrality

Closeness centrality is defined as the total number of links separating a node from all others along the shortest possible paths. In other words, to calculate closeness, one begins by calculating, for each pair of nodes in the network, the length of the shortest path from one to the other (aka the geodesic distance). Then for each node, one sums up the total distance from the node to all other nodes.[2]

Closeness can be interpreted as an index of time-until-arrival of something flowing through the network. The greater the raw closeness score, the greater the time it takes on average for information originating at random points in the network to arrive at the node. Equally, one can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.[2]

More in documentation.
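The definition above can be sketched in a few lines of Python: run a breadth-first search from every node and sum the resulting distances. The sketch returns both the raw sum (where smaller means more central) and a normalized inverse (where larger means more central). The queries in this post sort in descending order, which suggests the library reports a score of the inverse kind, but that is an assumption; the toy chain of heroes is hypothetical as well.

```python
from collections import deque


def build_adjacency(pairs):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def bfs_distances(adjacency, start):
    """Shortest-path length (in hops) from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def closeness(adjacency):
    """Return (farness, closeness); closeness = reachable nodes / sum of distances."""
    farness, scores = {}, {}
    for node in adjacency:
        distances = bfs_distances(adjacency, node)
        farness[node] = sum(distances.values())
        reachable = len(distances) - 1
        scores[node] = reachable / farness[node] if farness[node] else 0.0
    return farness, scores


# Hypothetical chain: Iron Man - Captain America - Thor
graph = build_adjacency([("Iron Man", "Captain America"),
                         ("Captain America", "Thor")])
farness, scores = closeness(graph)
print(farness)
print(scores)
```

The node in the middle of the chain has the smallest total distance and therefore the largest inverse score.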

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships. With closeness centrality it is especially important that we load only a single component.

Unfortunately, when the graph is unconnected, closeness centrality appears to be useless because the distance between two nodes belonging to different components is infinite by convention, which makes the sum of distances infinite too and therefore its inverse equal to zero. For every node of such a graph, there is always another node belonging to another component: the indices of all vertices of the graph are therefore useless, and the calculation of the index is limited to the largest component, omitting the roles played by individuals of other components.[3]

    CALL algo.closeness.stream(
    // Match only the biggest component
    'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
    ','
    MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
    // Similarity threshold
    WHERE k.weight >= 100
    RETURN id(u1) as source, id(u2) as target
    ',{graph:'cypher'})
    YIELD nodeId, centrality
    WITH nodeId, centrality ORDER BY centrality DESC LIMIT 10
    MATCH (h:Hero) WHERE id(h) = nodeId
    RETURN h.name as hero, centrality

Captain America is in such a privileged position that he leads in every category of centrality. We can observe that nodes in tighter communities have higher closeness centrality, while those on the fringes and less connected have smaller values. The second thing we can notice is that the overall position of a node in the graph also matters, as the middle community has on average higher closeness centrality than the others. As an example, both Iron Man and Vision have higher closeness centrality than Spiderman, while Spiderman has a higher PageRank score than either of them.

* Node color from white (less) to red (more): Closeness centrality

#### Harmonic Centrality

The harmonic mean has been known since the time of Pythagoras and Plato as the mean expressing “harmonious and tuneful ratios”, and later has been employed by musicians to formalize the diatonic scale, and by architects as paradigm for beautiful proportions.[4]

Social network analysis is a rapid expanding interdisciplinary field, growing from work of sociologists, physicists, historians, mathematicians, political scientists, etc. Some methods have been commonly accepted in spite of defects, perhaps because of the rareness of synthetic work like (Freeman, 1978; Faust & Wasserman, 1992). Harmonic centrality was proposed as an alternative index of closeness centrality defined on undirected networks. Results show its computation on real cases are identical to those of the closeness centrality index, with same computational complexity and we give some interpretations. An important property is its use in the case of unconnected networks.[3]

    CALL algo.closeness.harmonic.stream(
    // Match only the biggest component
    'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
    ','
    MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
    // Similarity threshold
    WHERE k.weight >= 100
    RETURN id(u1) as source, id(u2) as target
    ',{graph:'cypher'})
    YIELD nodeId, centrality
    WITH nodeId, centrality ORDER BY centrality DESC LIMIT 10
    MATCH (h:Hero) WHERE id(h) = nodeId
    RETURN h.name as hero, centrality

Harmonic centrality was proposed as an alternative to closeness centrality to help solve the problem of disconnected components. Because we loaded only a single connected component here, the results we get back are very similar to those of closeness centrality.
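To see why harmonic centrality copes with disconnected components, consider this small Python sketch. It sums the inverse distances instead of the distances themselves, so an unreachable node (infinite distance) simply contributes zero. The raw, unnormalized sum is used here as an assumption; the library may normalize its scores, and the two-component toy graph is hypothetical.

```python
from collections import deque


def build_adjacency(pairs):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def harmonic_centrality(adjacency):
    """Sum of 1/distance to every other reachable node; unreachable nodes add 0."""
    scores = {}
    for start in adjacency:
        distances = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        scores[start] = sum(1.0 / d for node, d in distances.items() if node != start)
    return scores


# Hypothetical graph with two components: A - B - C and D - E
graph = build_adjacency([("A", "B"), ("B", "C"), ("D", "E")])
harmonic = harmonic_centrality(graph)
print(harmonic)
```

Every node still gets a finite, comparable score even though no node can reach the whole graph, which is the property the paragraph above refers to.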

#### Betweenness Centrality

In graph theory, **betweenness centrality** is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.[6]

More in documentation.
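For unweighted graphs, the counting of shortest paths described above can be sketched with Brandes' algorithm. This Python version is only an illustration, not the library's code. It counts ordered pairs of endpoints, so on an undirected graph each pair is counted twice, and when several shortest paths exist each one contributes a fraction. The small network below is a hypothetical toy, not the real Marvel graph.

```python
from collections import deque


def build_adjacency(pairs):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def betweenness(adjacency):
    """Brandes' algorithm for unweighted graphs (counts ordered endpoint pairs)."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Phase 1: BFS that counts shortest paths and records predecessors.
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        path_count[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            visit_order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Phase 2: walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        while visit_order:
            node = visit_order.pop()
            for pred in predecessors[node]:
                ratio = path_count[pred] / path_count[node]
                dependency[pred] += ratio * (1.0 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    return scores


# Hypothetical toy network: Beast is the only bridge to Cyclops and Wolverine.
graph = build_adjacency([("Thor", "Captain America"),
                         ("Captain America", "Beast"),
                         ("Beast", "Cyclops"),
                         ("Beast", "Wolverine"),
                         ("Cyclops", "Wolverine")])
between = betweenness(graph)
print(between)
```

Nodes that sit on the only route between two groups accumulate the most shortest paths, while nodes at the edge of the network score zero.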

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

    CALL algo.betweenness.stream(
    // Match only the biggest component
    'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
    ','
    MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
    // Similarity threshold
    WHERE k.weight >= 100
    RETURN id(u1) as source, id(u2) as target
    ',{graph:'cypher'})
    YIELD nodeId, centrality
    WITH nodeId, centrality ORDER BY centrality DESC LIMIT 10
    MATCH (h:Hero) WHERE id(h) = nodeId
    RETURN h.name as hero, centrality

As always, Captain America is in first place, this time with Beast in second place. This comes as no surprise, as we can observe that Beast is the sole bridge between the middle and the right community. Spiderman and Incredible Hulk play a similar role to Beast, but have smaller communities behind them and hence also smaller betweenness scores.

* Node color from white (less) to red (more): Betweenness centrality

#### References:

[1] https://en.wikipedia.org/wiki/Centrality

[2] http://qualquant.org/wp-content/uploads/networks/2008%201-7-3.pdf

[3] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf?

[4] https://arxiv.org/pdf/cond-mat/0008357.pdf

[6] https://en.wikipedia.org/wiki/Betweenness_centrality
