Article recommendation system on a citation network using Personalized Pagerank and Neo4j

Recently personalized pageRank was added to the Neo4j graph algorithms library and I was looking for a cool example to demonstrate its applications. I found a great example in the Efficient Algorithms for Personalized PageRank research paper written by Peter Lofgren, in which he states:

As an example of how changing the source of the Personalized PageRank algorithm results in different rankings, we consider personalized search on a citation graph. On a citation graph provided by Citeseer, we created a demo which given a keyword and researcher name, finds all papers containing that keyword, and ranks them from the point of view of the given researcher. For a given researcher, we find all papers by that researcher, and define a source distribution giving equal weight to those papers. We then use Personalized PageRank on the citation graph to rank papers matching the given keyword. As an example, the keyword “entropy” means different things to different researchers, so we compare the top results for keyword “entropy” from different points of view.
Let’s recreate this scenario with Neo4j!

Requirements:

The Neo4j graph algorithms library, the APOC library, and the GraphAware framework with the NLP plugin.

Once you have downloaded all the plugins and put them in the Neo4j plugins folder, you have to enable them by adding the following configuration to the neo4j.conf file.

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
dbms.security.procedures.whitelist=ga.nlp.*,algo.*,apoc.*
dbms.security.procedures.unrestricted=apoc.*,algo.*
apoc.import.file.enabled=true

Graph model

[Image: graph model]

We start with a simple graph model consisting of nodes labeled Author, which can have one or more AUTHOR relationships to nodes labeled Article. An Article can in turn have REFERENCES relationships to other Article nodes, indicating its references.

Graph schema

To optimize our queries and make full use of indexes, we define the graph schema with unique constraints: on the “index” property for the label Article and on the “name” property for the label Author.

CALL apoc.schema.assert(
{},
{Article:['index'],Author:['name']})

Import data

We use the citation network v10 made available on aminer.org. It is the latest version of the dataset and, more importantly, the first version stored as JSON.

Find out more about the dataset in the ArnetMiner: Extraction and Mining of Academic Social Networks research paper[1].

Importing the data into Neo4j is done in two steps. First we import all articles and their authors. In the second step we import all references between the articles.
We use the apoc.periodic.iterate procedure for batch importing.

Import articles and authors

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file 
   CALL apoc.load.json("file:///neo4j/import/" + file) 
   yield value return value',
  'MERGE (a:Article{index:value.id}) 
   ON CREATE SET a += apoc.map.clean(value,["id","authors","references"],[0]) 
   WITH a,value.authors as authors 
   UNWIND authors as author 
   MERGE (b:Author{name:author}) 
   MERGE (b)-[:AUTHOR]->(a)'
,{batchSize: 10000, iterateList: true})

Import references

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file 
   CALL apoc.load.json("file:///neo4j/import/" + file) 
   yield value return value',
  'MERGE (a:Article{index:value.id}) 
   WITH a,value.references as references 
   UNWIND references as reference 
   MERGE (b:Article{index:reference}) 
   MERGE (a)-[:REFERENCES]->(b)'
,{batchSize: 10000, iterateList: true})

PageRank algorithm

PageRank was designed to analyze the importance of website pages. It takes into account the number of links a website has and also the quality of those links. It matters quite a bit whether a website gets a link from, for example, the front page of Reddit or from my blog 🙂

The process translates easily to the citation network domain. A citation can be viewed as an “upvote” from one article to another. Which article has the most “upvotes” from other quality articles is a question the PageRank algorithm can help us answer.

We use the pageRank algorithm on the global citation network to find the most important and influential articles within our graph.

Run pageRank and store results as a property of nodes

CALL algo.pageRank('Article', 'REFERENCES')

Most important articles by pagerank

MATCH (a:Article)
RETURN a.title as article,
       a.pagerank as score
ORDER BY score DESC 
LIMIT 10
Results

[Image: top 10 articles by global pageRank]

Natural language processing

We want to recommend articles by keyword selection, so we need to extract keywords from our graph. Thanks to Graphaware’s NLP plugin this is a simple process, even if you have little to no understanding of what is going on under the hood of NLP algorithms.

The process will enrich our graph schema with additional node labels and relationship types, as shown in the picture below.

[Image: enriched NLP graph schema]

Define NLP graph schema

To optimize the NLP process we define a set of unique constraints and indexes neatly available as a single procedure.

CALL ga.nlp.createSchema()

Add pipeline

Define the configuration of our pipeline. Find more details in the documentation.

CALL ga.nlp.processor.addPipeline({
  textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', 
  name: 'defaultPipeline', 
  threadNumber: 4,
  processingSteps: {tokenize: true, 
                     ner: true, 
                     dependency: false}})

Set default pipeline

CALL ga.nlp.processor.pipeline.default('defaultPipeline')

Annotate text

The original text is broken down into words, parts of speech, and functions. This analysis of the text acts as a starting point for the later steps.[2]

If you want to learn more about it, I suggest you check out the Reverse Engineering Book Stories with Neo4j and GraphAware NLP blog post written by Christophe Willemsen.

CALL apoc.periodic.iterate(
  "MATCH (n:Article) WHERE exists (n.title) RETURN n",
  "CALL ga.nlp.annotate({text: n.title, id: id(n)}) 
   YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)",
{batchSize:1, iterateList:true})

Keyword extraction

The TextRank algorithm is a relatively simple, unsupervised method of text summarization directly applicable to the topic extraction task. Its objective is to retrieve keywords and construct key phrases that are most descriptive of a given document by building a graph of word co-occurrences and ranking the importance of individual words using the PageRank algorithm.

Taken from Efficient unsupervised keywords extraction using graphs[3]

CALL apoc.periodic.iterate(
  "MATCH (a:AnnotatedText) RETURN a",
  "CALL ga.nlp.ml.textRank({annotatedText: a}) YIELD result
   RETURN distinct 'done' ",
{batchSize:1,iterateList:true})

Get the 10 most frequently occurring keyphrases in the article titles

Let’s check which keyphrases describe the most articles in our graph.

MATCH (k:Keyword)-[:DESCRIBES]->()
WHERE k.numTerms > 1
RETURN k.value as Keyphrase, 
       count(*) AS n_articles
ORDER BY n_articles DESC
LIMIT 10
Results

[Image: top 10 keyphrases]

Basic article recommendation

If you followed this blog post step by step, you are now able to create a basic article recommendation system based on the global pageRank score of articles and keywords extracted from the NLP process.

Recommend the top ten articles described by the keyword “social networks“.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
RETURN a.title as title, a.pagerank as p
ORDER BY p DESC
LIMIT 10
Results

[Image: top 10 articles for “social networks”]

Personalized Pagerank algorithm

The personalized PageRank algorithm returns the pageRank score of nodes in a graph from the point of view of one or more given source nodes.

Let’s calculate the global pageRank again, but this time using the articles described by the keyword “social networks” as source nodes.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
WITH collect(a) as articles
CALL algo.pageRank.stream('Article', 'REFERENCES', {sourceNodes: articles})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score
Results

[Image: personalized pageRank results]

The Anatomy of a Large-Scale Hypertextual Web Search Engine research paper by Sergey Brin and Larry Page comes out on top. We could conclude that Google’s early work with graphs and pageRank had a big effect on social network research papers. Cool easter egg 🙂

Personalized recommendation system

If you remember, the goal of this blog post was to recreate the following example.

As an example, the keyword “entropy” means different things to different researchers, so we compare the top results for keyword “entropy” from different points of view.

We start by matching all articles written by a specific author and collect them, as we will use them as source nodes in the personalized pageRank algorithm. We then run the algorithm, projecting as nodes only the articles described by the keyword “entropy”. We also project the REFERENCES relationships between these articles.

We don’t have to filter out any relationships in the relationship statement, as the following rule is enforced when projecting with cypher projection.

Relationships described in the relationship-statement will only be projected if both source and target nodes are described in the node-statement. Relationships that don’t have both source and target nodes described in the node-statement will be ignored.

Taken from documentation

Example recommendation

Recommendation of articles described by keyword “entropy” from the point of view of Jose C. Principe.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE author.name="Jose C. Principe"
WITH collect(a) as articles
CALL algo.pageRank.stream(
  'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword) 
   WHERE k.value contains "entropy" RETURN distinct id(a) as id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score
Results

[Image: results for Jose C. Principe]

Recommendation of articles described by keyword “entropy” from the point of view of Hong Wang.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE author.name="Hong Wang"
WITH collect(a) as articles
CALL algo.pageRank.stream(
  'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword) 
   WHERE k.value contains "entropy" RETURN distinct id(a) as id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score
Results

[Image: results for Hong Wang]

Conclusion

As expected, we get different recommendations for the two authors, as their points of view differ.

Neo4j is a powerful tool, but knowing when and which plugins to use for a specific use case makes it an even greater tool.

References

[1] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2008), pp. 990–998.

[2] https://github.com/graphaware/neo4j-nlp#pipelines-and-components

[3] https://graphaware.com/neo4j/2017/10/03/efficient-unsupervised-topic-extraction-nlp-neo4j.html


Using random walk algorithm to shuffle cards with Neo4j and graph algorithms

I found this cool article Group representations in probability and statistics by Persi Diaconis, where he describes how to use the random walk algorithm to shuffle cards. I decided to try it out and see if I could shuffle cards with Neo4j.

We will be simulating a simple two-player game, where each player gets one card and whoever gets the higher card wins.

Requirements:

The Neo4j graph algorithms library.

Import dataset

We use the card dataset made available by Garrett Grolemund in the form of a github gist. I love how you can find all sorts of useful data on github and import it directly into Neo4j with a single cypher query.

We run MERGE on both the face and suit properties of the card, as no single property is a unique identifier in our case. There are other options for handling this scenario, such as creating a single unique identifier (see the sketch after the import query below).

LOAD CSV WITH HEADERS FROM 
"https://gist.githubusercontent.com/garrettgman/9629323/raw/ee5dfc039fd581cb467cc69c226ea2524913c3d8/deck.csv" AS row
MERGE (c:Card{face:row.face,suit:row.suit})
SET c.value = toInteger(row.value)
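
As an illustration of the single-identifier alternative mentioned above, here is a minimal sketch; the key property and the unique constraint on it are my own additions, not part of the dataset.

CREATE CONSTRAINT ON (c:Card) ASSERT c.key IS UNIQUE;

LOAD CSV WITH HEADERS FROM 
"https://gist.githubusercontent.com/garrettgman/9629323/raw/ee5dfc039fd581cb467cc69c226ea2524913c3d8/deck.csv" AS row
// build a single unique identifier from the two properties
MERGE (c:Card{key: row.face + "_" + row.suit})
SET c.face = row.face,
    c.suit = row.suit,
    c.value = toInteger(row.value)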

For now we have imported cards as nodes. We need to create a complete graph where all nodes are connected to each other, as in our game any card can follow another.

We will treat the relationships as bidirectional in our graph, as there is no semantic difference in the direction of the relationship in our case. We achieve this with the filter WHERE id(c1) > id(c2), which avoids duplication, and by not specifying the direction of the relationship in the MERGE statement.

MATCH (c1:Card),(c2:Card)
WHERE id(c1) > id(c2)
MERGE (c1)-[:NEXT_CARD]-(c2)
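
A complete graph on 52 nodes should contain 52 * 51 / 2 = 1326 relationships, so we can quickly sanity-check the result.

// each MERGE above created exactly one relationship, so count them once
MATCH ()-[r:NEXT_CARD]->()
RETURN count(r) as relationships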

Random walk algorithm

Random Walk is an algorithm that provides random paths in a graph.

A random walk means that we start at one node, choose a neighbor to navigate to at random or based on a provided probability distribution, and then do the same from that node, keeping the resulting path in a list. It’s similar to how a drunk person traverses a city.

Taken from the official Neo4j documentation.

Start with a random card and take a random walk of 7 steps from it. Use direction:'BOTH' as we treated our NEXT_CARD relationship as bidirectional during the import.

// Start with a random card
MATCH (card:Card)
WHERE toInteger(rand() * 13) + 1 = toInteger(card.value)
WITH card limit 1
// Run the random walk algorithm
CALL algo.randomWalk.stream(id(card), 7, 1, 
{path: true, direction:'BOTH'})
YIELD nodeIds
UNWIND nodeIds AS nodeId
MATCH (node) WHERE id(node) = nodeId
RETURN node.face + " of " + node.suit as card

Results:

[Image: random walk results]

We get a sequence of 8 cards in our result. In the first game player 1 gets the nine of spades and player 2 gets the king of hearts; the way we defined our game, player 2 wins. In the second game player 1 gets the three of clubs and player 2 wins again with the queen of clubs, and so on.
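
Instead of reading the winners off the list manually, we can let cypher pair up consecutive cards and declare the winners. A hedged sketch built on the same random walk query as above:

MATCH (card:Card)
WHERE toInteger(rand() * 13) + 1 = toInteger(card.value)
WITH card limit 1
CALL algo.randomWalk.stream(id(card), 7, 1, 
{path: true, direction:'BOTH'})
YIELD nodeIds
UNWIND nodeIds AS nodeId
MATCH (node) WHERE id(node) = nodeId
// pair consecutive cards into games of two players
WITH collect(node) as cards
UNWIND range(0, size(cards) - 2, 2) as i
WITH cards[i] as player1, cards[i + 1] as player2
RETURN player1.face + " of " + player1.suit as player1_card,
       player2.face + " of " + player2.suit as player2_card,
       CASE WHEN player1.value > player2.value THEN "player 1"
            WHEN player2.value > player1.value THEN "player 2"
            ELSE "draw" END as winner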

We can now use Neo4j to shuffle cards, which is really cool and can be extended to other card games as well 🙂


Approximation of betweenness centrality on Twitter dataset with Neo4j

Exact betweenness centrality is computationally intensive and isn’t practical for (near) real-time calculation on large graphs. Because of this, approximation algorithms for betweenness centrality were developed to allow for faster calculation.

We will use a Twitter dataset made available by Stanford University to demonstrate how to approximate betweenness centrality with the Neo4j graph algorithms library.

Import

First define the graph schema.

CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;

There are a total of 1.8 million relationships in the dataset, so we will import only a single egonet to keep the graph small for the example.

LOAD CSV FROM "file:///61086747.edges" as row fieldterminator " "
MERGE (u1:User{id:row[0]})
MERGE (u2:User{id:row[1]})
MERGE (u1)-[:FOLLOW]->(u2);
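
A quick check of the imported egonet size (the counts will obviously differ if you pick another egonet file):

// count the users, then the FOLLOW relationships between them
MATCH (u:User)
WITH count(u) as users
MATCH (:User)-[r:FOLLOW]->(:User)
RETURN users, count(r) as follows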

Betweenness Centrality

Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another.

Taken from documentation.

Run the original Brandes betweenness centrality for future reference.

CALL algo.betweenness.stream('User','FOLLOW')
YIELD nodeId, centrality
RETURN nodeId,centrality
ORDER BY centrality desc limit 10;

[Image: Brandes betweenness centrality results]

The RA-Brandes algorithm

RA-Brandes algorithm is different from original Brandes’ algorithm in only one main respect: Brandes’ algorithm considers all nodes within the graph, whereas RA-Brandes considers only a subset of nodes from the graph.

Taken from documentation.

There are two strategies for selecting the subset of nodes.

The default strategy is the “random” strategy, where nodes are selected at random with a defined probability of selection. If we set the probability to one, it behaves as exact betweenness centrality, where all of the nodes are selected and considered in the calculation of the centrality values.

The second strategy is the “dense” strategy, where the mean degree of the nodes is calculated first and then only nodes with a degree higher than the mean are selected, as sketched below.
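
For reference, a minimal sketch of selecting the dense strategy; the strategy parameter name is taken from the library documentation:

// "degree" selects only nodes with degree above the mean
CALL algo.betweenness.sampled.stream('User','FOLLOW',
  {strategy:'degree'})
YIELD nodeId, centrality
RETURN nodeId, centrality
ORDER BY centrality DESC LIMIT 10;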

Run the RA-Brandes algorithm with the parameter probability: 1 to validate the above statement that the two algorithms are fundamentally the same and deliver the same results when all nodes are selected.

CALL algo.betweenness.sampled.stream('User','FOLLOW',
  {probability:1})
YIELD nodeId, centrality
RETURN nodeId,centrality
order by centrality desc limit 10;

[Image: RA-Brandes results with probability 1]

Approximation of Betweenness Centrality

If we set the probability to 0.1, we will consider only one tenth of the nodes, selected at random, in the calculation of betweenness centrality. You could experiment with lower probability values and compare the results.

CALL algo.betweenness.sampled.stream('User','FOLLOW',
  {probability:0.1})
 YIELD nodeId, centrality
 RETURN nodeId,centrality
 ORDER BY centrality DESC LIMIT 10;

[Image: approximated betweenness centrality results]

Ego networks

An ego network has one “ego” node that is at the center of the network. Its neighbours are connected to it with a relationship and are called “alter” nodes.

[Image: ego network]

For more details on ego networks check out this YouTube video.

The first step in calculating betweenness centrality is running the shortest path algorithm. The maxDepth parameter limits the depth of traversal used by the shortest path algorithm. This effectively allows us to run betweenness centrality on ego networks, where for each node we only load its neighbours (or neighbours x hops away) when calculating betweenness centrality.

CALL algo.betweenness.sampled.stream('User','FOLLOW',
  {probability:1,maxDepth:1})
YIELD nodeId, centrality
RETURN nodeId,centrality
ORDER BY centrality DESC LIMIT 10;

[Image: ego network betweenness centrality results]

Conclusion

Now that we have learned how to approximate betweenness centrality with Neo4j graph algorithms, we can import all 1.8 million relationships into our graph and play around with different strategies and traversal limits to optimize the approximation of betweenness centrality.

Finding alternative routes in California road network with Neo4j

The focus of this blog post is to introduce Yen’s k-shortest path algorithm, which was recently added to Neo4j graph algorithms. I will use the California road network dataset made available by Feifei Li.

Next we will enrich our graph using Google’s reverse geocoding API and then proceed to find the shortest path with the Dijkstra algorithm and the second shortest path using Yen’s k-shortest path algorithm.

Graph schema

Our graph has a simple schema of nodes labeled Intersection connected by CONNECTION relationships to other intersections.

[Image: graph schema]

Import

Let’s first define the constraint in our graph schema.

CREATE CONSTRAINT ON (i:Intersection) ASSERT i.id IS UNIQUE;

The dataset is split into node and relationship files. Let’s import both to get all the data we need.

Import nodes

USING PERIODIC COMMIT
LOAD CSV FROM
"https://www.cs.utah.edu/~lifeifei/research/tpq/cal.cnode" 
as row fieldterminator " "
MERGE (i:Intersection{id:row[0]})
ON CREATE SET i.longitude = toFloat(row[1]),
              i.latitude = toFloat(row[2])

Import relationships

USING PERIODIC COMMIT
LOAD CSV FROM
"https://www.cs.utah.edu/~lifeifei/research/tpq/cal.cedge" 
as row fieldterminator " "
MATCH (start:Intersection{id:row[1]})
MATCH (end:Intersection{id:row[2]})
MERGE (start)-[c:CONNECTION{id:row[0]}]->(end)
ON CREATE SET c.length = toFloat(row[3])

Reverse geocode API

For every intersection in our graph we can get the address based on the GPS location with the help of Google’s reverse geocoding API. I used apoc.util.sleep(50) to throttle and wait 50 ms between each API call. It cost me around €7 to get this data as I couldn’t find a free version of the API :/.

MATCH (i:Intersection)
CALL apoc.util.sleep(50)
WITH "https://maps.googleapis.com/maps/api/geocode/json?latlng=" + 
toString(i.latitude) + "," + toString(i.longitude) + "&key={google_api_key}" as json,i
CALL apoc.load.json(json) yield value
SET i.title = value.results[0].formatted_address
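
Since individual API calls can fail, it is worth checking how many intersections actually received an address; count(i.title) only counts non-null values.

MATCH (i:Intersection)
RETURN count(i) as total,
       count(i.title) as geocoded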

Analysis

Let’s start by visualizing the Santa Barbara part of the road network with neovis.js.

[Image: Santa Barbara road network]

Neovis configuration
var config = {
   container_id: "viz",
   server_url: "bolt://localhost:7687",
   server_user: "neo4j",
   server_password: "neo4j",
   labels: {
     "Intersection": {
      "caption": "title"
      }
    },
   relationships: {
     "CONNECTION": {
      "caption": false
      }
    },
   initial_cypher: "MATCH p=(i1:Intersection)-[:CONNECTION]->(i2:Intersection)" +
     "WHERE i1.name contains 'Santa Barbara' AND i2.name contains 'Santa Barbara'" +
     "RETURN p limit 500"
 };

Shortest path

We use algo.shortestPath, which uses the Dijkstra algorithm, to find the shortest path between “Cathedral Peak Trail” and “5750 Stagecoach Rd”. We set direction:'BOTH' to treat our graph as undirected.

MATCH (start:Intersection{title:"Cathedral Peak Trail"}),
      (end:Intersection{title:"5750 Stagecoach Rd"})
CALL algo.shortestPath.stream(start,end,'length',{direction:'BOTH'}) YIELD nodeId,cost
MATCH (n) where id(n)= nodeId
RETURN n.title,cost

Visualization made with neovis.js.

[Image: shortest path visualization]

Neovis configuration
var config = {
   container_id: "viz",
   server_url: "bolt://localhost:7687",
   server_user: "neo4j",
   server_password: "neo4j",
   labels: {
     "Intersection": {
       "caption": "title"
     }
   },
   relationships: {
     "CONNECTION": {
       "thickness": "length",
       "caption": false
     }
   },
   initial_cypher: "MATCH (start:Intersection{title:'Cathedral Peak Trail'}),(end:Intersection{title:'5750 Stagecoach Rd'}) " +
     "CALL algo.shortestPath.stream(start,end,'length',{direction:'BOTH',nodeQuery:'Intersection',relationshipQuery:'CONNECTION'}) YIELD nodeId,cost " +
     "MATCH (n) where id(n)=nodeId " + 
     "WITH collect(n) as nodes " +
     "UNWIND range(0, length(nodes)-2) as index " +
     "WITH nodes[index] as from, nodes[index + 1] as to " + 
     "MATCH p=(from)-[:CONNECTION]-(to) " +
     "RETURN p"
 };

Yen’s k-shortest paths

Yen’s k-shortest paths algorithm computes single-source K-shortest loopless paths for a graph with non-negative relationship weights.

It uses the Dijkstra algorithm to find the shortest path and then proceeds to find k-1 deviations of the shortest path. If we want to find the second shortest path, we use k=2 as shown below.

MATCH (start:Intersection{title:"Cathedral Peak Trail"}),
(end:Intersection{title:"5750 Stagecoach Rd"})
CALL algo.kShortestPaths(start, end, 2, 'length' ,{})
YIELD resultCount
RETURN resultCount

Results are stored as paths in our graph.

MATCH p=()-[:PATH_0|:PATH_1]->()
RETURN p

The shortest path is visualized in red and the second shortest path in yellow. We can easily observe that the paths diverge right at the start and join together two hops before the end.

[Image: shortest and second shortest paths]

Conclusion

With the addition of Yen’s k-shortest path algorithm to the Neo4j graph algorithms library, we can now search for alternative shortest routes in our graph. This can come in handy in various domains.

Neo4j A* Algorithm

Just recently the A* algorithm was added to Neo4j graph algorithms, and I decided to show how nicely APOC spatial functions fit with it, as it uses GPS locations for its heuristic.

Import

I found this cool github repo, geoiq/acetate, which contains geospatial data that we can thank its many contributors for. I picked a file containing a list of 100+ cities in Europe and imported them into Neo4j.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/geoiq/acetate/master/places/Europe-z4-z6.txt"
as row FIELDTERMINATOR "\t"
MERGE (city:City{name: row.name})
ON CREATE SET city.population = toInteger(row.population)
MERGE (country:Country{code: row.`country code`})
MERGE (city)-[:IS_IN]->(country)

Apoc spatial

Even though the GPS locations of the cities are included in the CSV, I did not import them, just so I can show how you can get them yourself using a geocoding API, which is hidden in the apoc.spatial.geocodeOnce procedure. Find more details in my Neo4j to ELK post and the documentation.

MATCH (city:City)-[:IS_IN]->(country)
CALL apoc.spatial.geocodeOnce(city.name + " " + country.code) 
YIELD location
// Save response
SET city.latitude = location.latitude,
    city.longitude = location.longitude

Distance calculation

Let’s assume we want to go on a trip through Europe and visit cities on the list along the way. Our only requirement is that we don’t travel more than 250 km per day, so that we have time and energy left to act as tourists and explore the cities.

We will calculate the distance between cities and, where it is less than 250 km, save it as a relationship property between them. This way we set a threshold to travel less than 250 km per day on our trip and still wind up in one of the cities on the list every day.

WITH 250 as distanceInKm
MATCH (c1:City),(c2:City)
WHERE id(c1) < id(c2)
WITH c1,c2,
distance(point({longitude:c1.longitude,latitude:c1.latitude}), 
         point({longitude:c2.longitude,latitude:c2.latitude})) as distance
WHERE distance < (distanceInKm * 1000) 
MERGE (c1)-[l:LINK]->(c2)
ON CREATE SET l.distance = distance
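
To sanity-check the threshold, we can look at how many cities each city can reach within a day of travel. A minimal sketch:

// count LINK relationships per city, regardless of direction
MATCH (c:City)
RETURN c.name as city,
       size((c)-[:LINK]-()) as reachable_within_250km
ORDER BY reachable_within_250km DESC LIMIT 5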

A* Algorithm

A* is an informed search algorithm, or a best-first search, meaning that it solves problems by searching among all possible paths to the solution (goal) for the one that incurs the smallest cost (least distance travelled, shortest time, etc.), and among these paths it first considers the ones that appear to lead most quickly to the solution. It is formulated in terms of weighted graphs: starting from a specific node of a graph, it constructs a tree of paths starting from that node, expanding paths one step at a time, until one of its paths ends at the predetermined goal node.

Taken from Wikipedia

The current implementation of the A* algorithm in graph algorithms uses geospatial information as the heuristic function to be able to do a best-first search through the graph.

I live near Ljubljana, and let’s say I want to travel to Amsterdam with my backyard helicopter :).

MATCH (start:City{name:"Ljubljana"}),(end:City{name:"Amsterdam"})
CALL algo.shortestPath.astar.stream(start, end, 'distance',
  'latitude','longitude',{
  nodeQuery:'City',relationshipQuery:'LINK',direction:'BOTH'})
YIELD nodeId, cost
MATCH (n) where id(n)=nodeId
RETURN n.name,cost

Our graph is not very detailed as it only contains 100ish cities, so I got back quite interesting results. At first glance it doesn’t really seem like the fastest way to get to Amsterdam, but it is definitely an interesting one judging by the cities we would visit and all the scenery, ranging from the Adriatic sea to the Alps and finally the flat lands of the Netherlands.

[Image: A* path results]

Cost is the distance in meters.

Let’s also visualize the resulted path with neovis.js.

The query that we use to get back the shortest path from the A* algorithm; it could also be used for visualization in the Neo4j Browser.

MATCH (start:City{name:'Ljubljana'}),(end:City{name:'Amsterdam'})
 CALL algo.shortestPath.astar.stream(start, end, 'distance', 
 'latitude','longitude',{ 
 nodeQuery:'City',relationshipQuery:'LINK',direction:'BOTH'}) 
YIELD nodeId, cost
MATCH (n) where id(n)=nodeId 
WITH collect(n) as nodes
UNWIND range(0, length(nodes)-2) as index
WITH nodes[index] as from, nodes[index + 1] as to 
MATCH p=(from)-[:LINK]-(to)
RETURN p

I used the population of the cities to define the size of the nodes and the distance between cities to define the width of the relationships.

[Image: neovis.js path visualization]

Code available on github.

Knowledge graph

Let’s assume I made it to Amsterdam. Now what? What is worth checking out or visiting?

We can use the Google Knowledge Graph API to find attractions we could visit and enrich our graph with the data. Find more details in my previous post Neo4j APOC triggers and web APIs.

WITH "api_key" as google_api_key
MATCH (c:City{name:"Amsterdam"})-[:IS_IN]->(country)
CALL apoc.load.json("https://kgsearch.googleapis.com/v1/entities:search?query=" 
                    + apoc.text.urlencode(c.name) + " " + country.code + 
                     "&key=" + google_api_key + "&limit=20&indent=True")
YIELD value
UNWIND value.itemListElement as row
WITH row.result as results,c 
WHERE results.name is not null
MERGE (p:Attraction{name:results.name})
ON CREATE SET p.description = results.description,
 p.detailedDescription = results.detailedDescription.articleBody
MERGE (p)-[:IS_IN]->(c)
WITH results,p
UNWIND (results.`@type`) as type
MERGE (t:Type{name:type})
MERGE (p)-[:HAS_TYPE]->(t)

Now that we have stored the data from the API in our graph, we can search for things to do or visit in Amsterdam.

MATCH (:City{name:"Amsterdam"})<-[:IS_IN]-(a:Attraction)
RETURN a.name as attraction,
       a.description as description

[Image: attractions in Amsterdam]
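
Since we also stored the Google Knowledge Graph types as Type nodes, we can filter attractions by type. A hedged sketch; "Museum" is an assumed type value from the schema.org vocabulary the API uses:

MATCH (:City{name:"Amsterdam"})<-[:IS_IN]-(a:Attraction)-[:HAS_TYPE]->(:Type{name:"Museum"})
RETURN a.name as museum,
       a.description as description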

Conclusion

I always loved how easily you can scrape the internet with Neo4j and cypher/APOC procedures. Neo4j allows us to easily enrich our graph and make it a proper knowledge graph of our own by combining multiple data sources, ranging from other databases to online APIs. Combined with the graph algorithms, it becomes an even more serious analytics tool that is fun to work with.

Neo4j Graph Visualizations using GoT dataset

Once again I will use the data made available by Andrew Beveridge, first to demonstrate categorical pageRank and break down pageRank by the sequence of the books, which will help us find the winners of the game of thrones, and second to show some visualization options the Neo4j community has to offer.

To find more details about the dataset check Analyzing the Graph of Thrones by William Lyon or my Neo4j GoT social graph analysis.

Import

Let’s first define the schema of our graph.

We only need one constraint on Person label. This will speed up our import and later queries.

CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

As the data for all five books is available on github, we can import all of them using a single cypher query.

We will differentiate the networks from different books by using separate relationship types. We need to use apoc.merge.relationship, as Cypher does not allow using parameters for relationship types. The network from the first book will be stored as relationship type INTERACTS_1, the second as INTERACTS_2, and so on (books four and five are combined in the source data as INTERACTS_45).

UNWIND ['1','2','3','45'] as book
LOAD CSV WITH HEADERS FROM 
'https://raw.githubusercontent.com/mathbeveridge/asoiaf/master/data/asoiaf-book' + book + '-edges.csv' as value
MERGE (source:Person{id:value.Source})
MERGE (target:Person{id:value.Target})
WITH source,target,value.weight as weight,book
CALL apoc.merge.relationship(source,'INTERACTS_' + book, {}, {weight:toFloat(weight)}, target) YIELD rel
RETURN distinct 'done'
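
To verify the import, we can count the interactions stored under each relationship type. A sketch using apoc.cypher.run, since the relationship type is again dynamic:

UNWIND ['1','2','3','45'] as book
CALL apoc.cypher.run('MATCH ()-[r:INTERACTS_' + book + ']->() RETURN count(r) as c', {}) 
YIELD value
RETURN book, value.c as interactions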

Categorical pagerank

As described in my previous blog post, categorical pageRank is a concept where we break down the global pageRank into categories and run pageRank on each category subset of the graph separately to get a better understanding of the global pageRank.

Here we will use the books as categories, so that we get a breakdown of each character’s importance by the sequence of books.

UNWIND ['1','2','3','45'] as sequence
MERGE (book:Book{sequence:sequence})
WITH book,sequence
CALL algo.pageRank.stream(
 'MATCH (p:Person) WHERE (p)-[:INTERACTS_' + sequence + ']-() RETURN id(p) as id',
 'MATCH (p1:Person)-[:INTERACTS_' + sequence + ']-(p2:Person) RETURN id(p1) as source,id(p2) as target',
 {graph:'cypher'})
YIELD node,score
// filter out nodes with default pagerank 
// for nodes with no incoming rels
WITH node,score,book where score > 0.16
MERGE (node)<-[p:PAGERANK]-(book)
SET p.score = score


Biggest winner of the game of thrones by books so far

Basically, we will order the pageRank values by the sequence of the books and return the top ten characters with the highest positive change in pageRank.

MATCH (person:Person)<-[pagerank:PAGERANK]-(book:Book)
// order by book sequence
WITH person,pagerank,book ORDER BY book.sequence ASC
WITH person,collect(pagerank) as scores
RETURN person.id as person,
       scores[0].score as first_score,
       scores[-1].score as last_score,
       length(scores) as number_of_books 
ORDER BY last_score - first_score DESC LIMIT 10

While Jon Snow leads by absolute positive difference in pageRank, Victarion Greyjoy is very interesting. He had a pageRank score of 0.59 in the second book, was missing in the third, and jumped to 4.43 in the fourth and fifth books.

Stannis Baratheon, judging by the show, is probably at the peak of his career and is surprisingly in second place. Other than that, the list is made up of the usual suspects.

[Image: biggest pageRank winners by book]

I also checked the characters with the biggest negative change, but it turns out that they are mostly dead so it’s not all that interesting.

Spoonjs

Thanks to Michael Hunger, spoonJS is back. With it we can visualize charts directly in the Neo4j Browser.

You can get it set up within a few clicks by following the guide.

:play spoon.html

In our example we will visualize characters sorted by pageRank in the last two books combined.

MATCH (p:Person)<-[r:PAGERANK]-(:Book{sequence:'45'})
RETURN p.id as person,r.score as score
ORDER BY score DESC LIMIT 15

[Image: pageRank chart for books 4 and 5]

Three out of the first four places belong to the Lannisters, with the most important being Tyrion. Looking at it from this perspective, you might think GoT is really just a family crisis of the Lannisters with huge collateral damage 🙂

3d force graph

Another cool visualization project by Michael is called 3d-force-graph. It lets us visualize and explore graphs.

We will use pageRank to define the size of the nodes, so that the most important nodes will be the biggest. To represent communities in the graph we use the color of the nodes.

We need to run the label propagation or Louvain algorithm to find communities within our graph and store them as a property of the nodes.

We run label propagation using only the network of characters from the last two books.

CALL algo.labelPropagation('Person','INTERACTS_45','BOTH',{partitionProperty:'lpa',iterations:10})
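
Similarly, the pageRank values used for the node size have to be stored on the nodes. A minimal sketch of that write-back; the writeProperty name is my own choice and has to match what the visualization code reads:

CALL algo.pageRank('Person','INTERACTS_45',{writeProperty:'pagerank'})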

I like this visualization because it is 3d: you can approach the graph from different angles and zoom in and out while exploring it.

[Animation: 3d force graph]

Neovisjs

We can also use neovis.js, developed by William Lyon, to visualize graphs. Similarly as before we use label propagation results to color the nodes. To mix it up a bit we will use betweenness centrality of the nodes, instead of pageRank, to represent the size of the nodes in the graph.

Run betweenness centrality algorithm.

CALL algo.betweenness('Person','INTERACTS_45',{direction:'BOTH',writeProperty:'betweenness'})

In the visualization we also defined the size of the relationships based on the weight and colored them according to the community of the pair of nodes they connect.

[Animation: neovis.js visualization]


The code for the Neovis and 3d-force-graph visualizations used in this post can be found on github. Have fun!

Paradise papers analysis with Neo4j

I haven’t used a real-world dataset on this blog yet, so I decided it’s time to try out some real-world networks. I will be using the Paradise Papers dataset that we can explore thanks to the ICIJ.

The Paradise Papers are a leak of millions of documents containing information about companies and trusts in offshore tax havens, revealing information on tens of thousands of companies and people, including some high-profile figures like the Queen of England.

We can get this dataset and any of the other leaks at the ICIJ official site in CSV or Neo4j Desktop form. If you are lazy like me you can just use the Neo4j online sandbox, where you get your own Neo4j instance with the APOC and graph algorithms libraries set up within a matter of seconds.

Graph model:

We have a graph of officers, who are people or banks in the real world, and entities, which are companies. Entities can also have intermediaries, and they all have one or more registered addresses that we know of.

Today I will focus more on the use of algorithms, but if you want to learn more about the dataset itself and Neo4j, you should check out Analysing the Paradise Papers and An In-Depth Analysis of the Paradise Papers by Michael Hunger and William Lyon.

[Image: Paradise Papers graph model]

Michael Hunger & William Lyon,An In-Depth Graph Analysis of the Paradise Papers, https://neo4j.com/blog/depth-graph-analysis-paradise-papers/


Infer the network

As I mentioned before, we will focus more on graph algorithm use cases for the Paradise Papers.

We will assume that officers who are related to the same entity might know each other, or at least have some contact with one another. With this assumption we will create a social network of officers who are related to the same entities and have a registered address in Switzerland. I filtered for Switzerland only so we might get a better understanding of a local investment network.

MATCH (o1:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(o2:Officer)
WHERE id(o1) > id(o2) AND o1.countries contains "Switzerland" 
AND o2.countries contains "Switzerland"
WITH o1,o2,count(*) as common_investments
MERGE (o1)-[c:COMMON_INVESTMENTS]-(o2)
ON CREATE SET c.weight = common_investments


Analysis

We start by analyzing the degree of the nodes in our network. There are 1130 officers with a registered address in Switzerland. Each officer has on average 6 contacts to other officers through their entities.

MATCH (o:Officer)
WHERE (o)-[:COMMON_INVESTMENTS]-()
WITH o,size((o)-[:COMMON_INVESTMENTS]-()) as degree
RETURN count(o) as officers,
       avg(degree) as average_degree,
       stdev(degree) as stdev_degree,
       max(degree) as max_degree,
       min(degree) as min_degree

[Image: degree statistics]

We can search for the pairs of officers with the most common investments, as we stored this value as a property of the relationship.

MATCH (o1:Officer)-[w:COMMON_INVESTMENTS]->(o2)
RETURN o1.name as officer1,
       o2.name as officer2,
       w.weight as common_investments 
ORDER BY common_investments DESC LIMIT 10

Barnett – Kevin Alan David seems to be very intertwined with the Mackies, as he has 322 common investments with Thomas Louis and 233 with Jacqueline Anne. Actually, eight out of the first ten places belong to Barnett – Kevin Alan David, Hartland – Georgina Louise and the Mackies. This would indicate that they cooperate on a large scale.

paradise_a.png

Weakly connected components

With the weakly connected components algorithm we search for so-called “islands” in our graph. An island, or connected component, is a subgraph in which all nodes are reachable from each other, and any disconnected part of the global graph is its own component.

In our scenario it is a useful algorithm for finding people who have common investments in companies and might know each other, or at least find it easy to get in contact with one another.

CALL algo.unionFind.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
    RETURN id(o) as id
   ','
    MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
    RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,setId
RETURN setId as component,count(*) as componentSize
ORDER BY componentSize desc limit 10

As with most real-world graphs I have encountered so far, we get one large component and some smaller ones. If we wanted, we could dig deeper into the smaller components, check out their members and see if something interesting comes up.

[Image: component sizes]

Let’s visualize component 14 for example.
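
The stream version of the algorithm does not write anything back, so to list the members of a single component we can stream again and filter on its setId. A hedged sketch; the value 14 comes from this particular run and will differ between runs:

CALL algo.unionFind.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,setId
WITH nodeId,setId WHERE setId = 14
MATCH (o) WHERE id(o) = nodeId
RETURN o.name as officer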

[Image: component 14]

Studhalter – Alexander Walter seems to be quite interlaced with Gadzhiev – Nariman, as they have 60 common investments. To complete the triangle there is Studhalter – Philipp Rudolf, with 15 common investments with Alexander Walter and 12 with Nariman. Alexander Walter is positioned at the center of this graph with connections to 8 different officers, and we could assume that he holds some influence over this network.

Pagerank

PageRank was first used to measure the importance of websites to help users find better results when searching the internet. In the domain of websites and links, each link is treated as a vote from one website to another, indicating that there is some quality content over there. When calculating pageRank it is also taken into account how important the voting website is, as a link from amazon.com means something completely different than a link from tbgraph.wordpress.com.

In the Paradise Papers domain we can use it to find potential influencers in our inferred “common investments” network, as officers who have common investments with other important officers will come out on top.

CALL algo.pageRank.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD node,score
WITH node,score order by score desc limit 10
RETURN node.name as officer,score

Cabral – Warren Wilton comes out on top by a large margin. I checked him out and it turns out he is an officer of 430 different entities, and he has connections to 116 other officers from Switzerland through his entities. Find out more about Cabral – Warren Wilton. Next is the Swiss Reinsurance Company, which is a shareholder of 19 different entities. You can get the same detailed look for Swiss Reinsurance thanks to the ICIJ.

[Image: pageRank results]

Harmonic closeness centrality

We can interpret closeness centrality as the potential ability of a node to reach all other nodes as quickly as possible. This works both ways in our example, as other nodes can also reach a specific node quickly through the shortest paths between them. Harmonic centrality is a variation of closeness centrality that deals nicely with disconnected graphs.

In our domain we could interpret it as the potential for “insider trading”, as quick access to other nodes in the network could lead to an advantage, such as having access to (confidential) information before others.

CALL algo.closeness.harmonic.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,centrality
WITH nodeId,centrality order by centrality desc limit 10
MATCH (n) where id(n)=nodeId
RETURN n.name as officer,centrality

Cabral – Warren Wilton also leads by harmonic centrality. He seems to be a big player in Switzerland. The Swiss Reinsurance Company and PricewaterhouseCoopers are the only two that were also in the pageRank top 10. All the others are new candidates we haven’t seen before. We can take a deeper look at Schröder – Stefan and observe that he has connections in SwissRe.

[Image: harmonic centrality results]

Betweenness centrality

Betweenness centrality is useful for finding nodes that serve as a bridge from one group of users to another in a graph. In a social network, betweenness centrality can be interpreted as a rudimentary measure of the control that a specific node exerts over the information flow throughout the graph.

CALL algo.betweenness.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,centrality
WITH nodeId,centrality order by centrality desc limit 10
MATCH (n) where id(n)=nodeId
RETURN n.name as officer,centrality

The usual players occupy the top 3 spots. We can also spot Schröder – Stefan in fifth place, along with other officers we haven’t come across yet. It’s interesting to see Zulauf – Hans-Kaspar up there, as he is an officer of only two entities; it looks like his network position is what makes him so interesting.

[Image: betweenness centrality results]

Label propagation algorithm

The label propagation algorithm is a community detection algorithm. It divides the network into communities of nodes with dense connections internally and sparser connections between communities.

CALL algo.labelPropagation(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[q:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    'OUT',
    {graph:'cypher',partitionProperty:'community',iterations:10})

To help us analyze the communities, we will use Gephi to visualize our network.

Visualize with Gephi:

I like to use Gephi for visualizing networks. It is a really cool tool that lets you draw nice network visualizations based on centrality and community values.

Check out my previous blog post Neo4j to Gephi for more information.

We need to save the centrality and label propagation results as node properties before we can export them to Gephi; a sketch of that write-back follows.
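
A hedged sketch of that write-back: the labelPropagation call above already stored the community property, so only pageRank and betweenness centrality still need their write versions (the writeProperty names are my own choice and must match the list passed to apoc.gephi.add below).

CALL algo.pageRank(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher', writeProperty:'pagerank'});

CALL algo.betweenness(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher', writeProperty:'betweenness'});

With those properties in place, we can use the following query to export data from Neo4j to Gephi.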

MATCH path = (o:Officer)-[:COMMON_INVESTMENTS]->(:Officer)
CALL apoc.gephi.add(null,'workspace1',path,'weight',['pagerank','community','betweenness']) 
yield nodes
return distinct "done"

Here we have a visualization of only the biggest component in the graph, with 344 members. There are 10+ communities that we can easily identify just by looking at this picture. I used the label propagation results for the color of the nodes, betweenness centrality for the node size and pageRank for the node title.

We can’t really see much, except that Cabral – Warren Wilton is very important in our network and positioned at its center.

[Image: biggest component in Gephi]


Let’s zoom in on the center of the network to get a better understanding of the graph.

As we noticed at the start, Barnett – Kevin Alan David is deeply connected with the Mackies and Hartlands. I also noticed Hartland Mackie – Thomas Alan located on the bottom left, which might explain why the Hartlands and Mackies are so deeply connected. We can also find Barnett – Emma Louise in this community, which would make this community (red) primarily a community of Barnetts, Hartlands and Mackies.

On the bottom right we can find Schröder – Stefan very near the Swiss Reinsurance Company.

[Image: center of the network]

Conclusion:

I think that an understanding of the graph combined with a proper visualization tool is powerful in the hands of a data explorer. With Neo4j and Gephi we are able to understand the graph and find insights even when we have little prior knowledge about the data and what exactly we are looking for in the first place.

Neo4j Categorical Pagerank

I found this cool Neo4j blog post written by Kenny Bastani that describes a concept called categorical pagerank.
I will try to recreate it using the neo4j-graph-algorithms library and a GoT dataset.

[Image: categorical pageRank concept]

Kenny Bastani, Categorical PageRank Using Neo4j and Apache Spark, https://neo4j.com/blog/categorical-pagerank-using-neo4j-apache-spark/

The idea behind it is pretty simple. As shown in the example above, we have a graph of pages that link to each other and may also belong to one or more categories. To better understand the global pagerank score of nodes in a network, we can break our graph down into several subgraphs, one for each category, and execute the pagerank algorithm on each of those subgraphs. We store the results as a relationship property between the category and the page.
This way we can break down which categories contribute to a page’s global pagerank score.


Requirements:

The Neo4j graph algorithms library and the APOC library.

Graph Model:

We will use the dataset made available by Joakim Skoog through his API of ice and fire.

I first encountered this dataset when Michael Hunger showed us how to import the data in his game of data blog post. I thought the dataset was pretty nice, and as all I had to do was copy/paste the import queries, I decided to play around with it and wrote a Neo4j GoT Graph Analysis post.

Michael’s import query of house data:

// create Houses and their relationships
call apoc.load.jsonArray('https://raw.githubusercontent.com/joakimskoog/AnApiOfIceAndFire/master/data/houses.json') yield value
// cleanup
with apoc.map.clean(apoc.convert.toMap(value), [],['',[''],[],null]) as data
// lowercase keys
with apoc.map.fromPairs([k in keys(data) | [toLower(substring(k,0,1))+substring(k,1,length(k)), data[k]]]) as data

// create House
MERGE (h:House {id:data.id}) 
// set attributes
SET 
h += apoc.map.clean(data, ['overlord','swornMembers','currentLord','heir','founder','cadetBranches'],[])

// create relationships to people or other houses
FOREACH (id in data.swornMembers | MERGE (o:Person {id:id}) MERGE (o)-[:ALLIED_WITH]->(h))
FOREACH (s in data.seats | MERGE (seat:Seat {name:s}) MERGE (seat)-[:SEAT_OF]->(h))
FOREACH (id in data.cadetBranches | MERGE (b:House {id:id}) MERGE (b)-[:BRANCH_OF]->(h))
FOREACH (id in case data.overlord when null then [] else [data.overlord] end | MERGE (o:House {id:id}) MERGE (h)-[:SWORN_TO]->(o))
FOREACH (id in case data.currentLord when null then [] else [data.currentLord] end | MERGE (o:Person {id:id}) MERGE (h)-[:LED_BY]->(o))
FOREACH (id in case data.founder when null then [] else [data.founder] end | MERGE (o:Person {id:id}) MERGE (h)-[:FOUNDED_BY]->(o))
FOREACH (id in case data.heir when null then [] else [data.heir] end | MERGE (o:Person {id:id}) MERGE (o)-[:HEIR_TO]->(h))
FOREACH (r in case data.region when null then [] else [data.region] end | MERGE (o:Region {name:r}) MERGE (h)-[:IN_REGION]->(o));

After we have imported the dataset, our graph will have the schema shown below. You can always check the schema of your graph using CALL db.schema.

[Image: graph schema]

Categorical pagerank:

As in my previous blog post, we will use the SWORN_TO network of houses to demonstrate categorical pagerank, this time using regions as categories. This way we will try to understand and break down from which regions the houses get their power and support.

We first match all regions, so that we iterate the algorithm over every region. In the node-statement of the cypher projection we use a parameter to project only the nodes belonging to a specific region. As we have already filtered the nodes to a specific region, we don’t have to filter out any relationships: only relationships with both source and target nodes described in the node-statement will be projected, and all others will be ignored.

We will then save the results as a relationship property between region and house.

MATCH (r:Region)
CALL algo.pageRank.stream('
    MATCH (h:House)-[:IN_REGION]->(r:Region)
    WHERE r.name ="' + r.name +
    '" RETURN id(h) as id
    ','
    MATCH (h1:House)-[:SWORN_TO]->(h2:House)
    RETURN id(h1) as source,id(h2) as target',
    {graph:'cypher'}) 
YIELD nodeId,score
MATCH (h:House) where id(h) = nodeId
MERGE (r)-[p:PAGERANK]->(h)
ON CREATE SET p.score = score

Let’s first examine the North.

MATCH (r:Region{name:"The North"})-[p:PAGERANK]->(h)
RETURN h.name as house,p.score as pagerank 
ORDER BY pagerank DESC LIMIT 5

House Bolton leads, with House Stark following in second place. This might be disheartening to some fans, as the Starks are more lovable than the Boltons, but we all know how things ended for the Boltons in the TV series.

[Image: top houses in the North]

The Westerlands region is the home of House Lannister. Let’s see how well they do in their home region.

MATCH (r:Region{name:"The Westerlands"})-[p:PAGERANK]->(h)
RETURN h.name as house,p.score as pagerank
ORDER BY pagerank DESC LIMIT 5

The Lannisters have very strong direct support in their home region. This shows in the fact that House Farman is the only other house in the Westerlands that has the support of at least one house.

[Image: top houses in the Westerlands]


Second version:

As I was writing the blog post and running the above algorithm, I thought to myself that even though a house might not be located in a specific region, it might still have support from a house in that region, and hence support from that region.

For that reason I turned the projection around a bit: we now project all the houses in our graph and keep only the SWORN_TO relationships whose source node is based in the specific region. This directly translates to support from a house in that region.

We filter out pagerank scores below 0.151, as 0.15 is the default value for a node with no inbound relationships, and save the results as a relationship between a region and a house. This way we keep our graph tidy.

MATCH (r:Region)
CALL algo.pageRank.stream('
    MATCH (h:House)
    RETURN id(h) as id
    ','
    MATCH (r:Region)<-[:IN_REGION]-(h1:House)-[:SWORN_TO]->(h2:House)
    WHERE r.name ="' + r.name +
    '" RETURN id(h1) as source,id(h2) as target',
    {graph:'cypher'})
YIELD nodeId,score
WITH nodeId,score,r where score > 0.151
MATCH (h:House) where id(h) = nodeId
MERGE (r)-[p:SUPPORT]->(h)
ON CREATE SET p.score = score

As we get back only 51 created relationships, we can easily visualize this network in the Neo4j Browser. It is pretty obvious that House Baratheon of King’s Landing has support from most regions, lacking only The Neck and Beyond the Wall.

[Image: regional support network]

Check the top 20 individual regional pagerank scores.

MATCH (h:House)<-[s:SUPPORT]-(r:Region)
RETURN r.name as region,h.name as house,s.score as score 
ORDER BY score DESC LIMIT 20

Both Baratheon houses are very dominant in the Crownlands region. House Tyrell comes in third with its regional pagerank score from The Reach. House Tyrell is sworn to House Baratheon of King’s Landing, and solely because of this relationship House Baratheon comes in immediately after House Tyrell in support from The Reach. This pattern occurs through most of the graph, except for the North region, where House Baratheon comes in before the Starks and Boltons, having support from both of them.

[Image: top 20 regional pageRank scores]

Conclusion:

With cypher projections we get all the freedom the cypher query language provides. We can even parametrize graph algorithms to run on specific subgraphs only, as shown in this blog post. Cypher projections are a very powerful tool for extracting useful insights from our graphs, and if you are familiar with cypher, they are also quite easy to use.

Neo4j GoT social graph analysis

Lately I have been showing how to project a bipartite graph to a monopartite one and run algorithms on it. To mix it up a little, I will demonstrate how to project a network between nodes using the cosine similarity of certain features the nodes possess. We will be using the Network of Thrones dataset based on the book A Storm of Swords. Also check out their analysis of the same network.

To find communities of people that are in a similar position of power in the network, we will first calculate the pagerank, harmonic centrality, triangle count and clustering coefficient for each node, and use those values as features from which we infer a similarity network using cosine similarity.

Requirements:

The Neo4j graph algorithms library and the APOC library.

Graph model:

[Image: graph model]

William Lyon has also written a blog post on how to import and analyze this network. I have borrowed both the graph model picture as well as the cypher query for importing the data into Neo4j.

LOAD CSV WITH HEADERS FROM 
"https://www.macalester.edu/~abeverid/data/stormofswords.csv" AS row
MERGE (src:Character {name: row.Source})
MERGE (tgt:Character {name: row.Target})
MERGE (src)-[r:INTERACTS]->(tgt)
ON CREATE SET r.weight = toInt(row.Weight)

Centrality

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph.

Given this assumption, we will use the centrality indicators as features for the k-nearest neighbors algorithm (k-NN) that will infer the similarity network. As the centralities are not the focus of this blog post, I will skip the theory and just run the write-back versions of the algorithms.

Calculate and write back triangles count and clustering coefficient.

CALL algo.triangleCount('Character', 'INTERACTS',
{write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'}) 
YIELD nodeCount, triangleCount, averageClusteringCoefficient;

Calculate and write back pageRank score

CALL algo.pageRank(
 'MATCH (c:Character) RETURN id(c) as id',
 'MATCH (p1:Character)-[:INTERACTS]-(p2:Character) 
  RETURN id(p1) as source, id(p2) as target',
// the score is written back to the default "pagerank" node property
{graph:'cypher'});

Calculate and write back harmonic centrality

CALL algo.closeness.harmonic(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c1:Character)-[:INTERACTS]-(c2:Character) 
   RETURN id(c1) as source, id(c2) as target',
{graph:'cypher', writeProperty: 'harmonic'});

k-Nearest Neighbours

First we will normalize our features using the min-max method. The clustering coefficient is already on a scale from 0 to 1, so there is no need to normalize it. If you want to do something similar on bigger graphs, I would suggest you use apoc.periodic.iterate for batching.

WITH ["pagerank","harmonic","triangles"] as keys
UNWIND keys as key
MATCH (c:Character)
WITH max(c[key]) as max,min(c[key]) as min,key
MATCH (c1:Character)
WITH c1, key + "normalized" AS newKey, 
    (toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) 
YIELD node
RETURN COUNT(*)
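
To make sure the normalization behaved as expected, we can check that every feature now spans exactly the range from 0 to 1. The property names below match the newKey values produced by the query above.

// each normalized feature should have min 0.0 and max 1.0
UNWIND ["pageranknormalized","harmonicnormalized","trianglesnormalized"] as key
MATCH (c:Character)
RETURN key, min(c[key]) as minValue, max(c[key]) as maxValue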

Next we match all pairs of nodes and compute their cosine similarity. To avoid a complete graph where every node is connected to every other node, we set a similarity threshold: all relationships with a cosine similarity of less than 0.9 are ignored and not stored. I am slightly cheating here, as with the k-NN algorithm you define how many nearest neighbours each node should be connected to, while here we define how similar a pair of nodes must be for a relationship to be stored.

Again, for bigger graphs you should use APOC for batching. I wrote a blog post with a very similar example.

// compare each unique pair of characters only once
MATCH (c1:Character),(c2:Character) where id(c1) < id(c2)
WITH c1,c2,apoc.algo.cosineSimilarity([c1.pageranknormalized,
                                       c1.coefficient,
                                       c1.harmonicnormalized,
                                       c1.trianglesnormalized],
                                      [c2.pageranknormalized,
                                       c2.coefficient,
                                       c2.harmonicnormalized,
                                       c2.trianglesnormalized]) 
                                         as cosine_similarity
// similarity threshold
WHERE cosine_similarity > 0.9
MERGE (c1)-[s:SIMILAR_POSITION]-(c2)
SET s.cosine = cosine_similarity
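
Before running community detection on it, it is useful to see how dense the inferred network turned out. A rough degree summary might look like this:

// degree summary of the inferred similarity network
MATCH (c:Character)
OPTIONAL MATCH (c)-[s:SIMILAR_POSITION]-()
WITH c, count(s) as degree
RETURN min(degree) as minDegree,
       max(degree) as maxDegree,
       avg(degree) as averageDegree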

Community Detection

A network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into communities of nodes with dense connections internally and sparser connections between communities.


Let’s check what community structure the Louvain algorithm will find in our network.

CALL algo.louvain.stream('Character', 'SIMILAR_POSITION',
{}) 
YIELD nodeId, community
MATCH (c:Character) where id(c) = nodeId
RETURN community,
       count(*) as communitySize,
       collect(c.name) as members 
ORDER BY communitySize ASC LIMIT 20;

Illyrio Mopatis is all alone in Pentos and probably has no network power at all. The most interesting group is community 8, where all the cream of the book is collected, ranging from the Starks, Targaryens and Lannisters to, interestingly, also Stannis, Mance and Davos.

Community 106 looks like a community of captains and maesters, and it differs from the largest community in that the largest community has a higher average clustering coefficient.

GOTlouvain.png


Let’s try the Label Propagation algorithm and check what it finds.

CALL algo.labelPropagation(
'MATCH (c:Character) RETURN id(c) as id, 1 as weight, id(c) as value',
'MATCH (c1:Character)-[f:INTERACTS]->(c2:Character) 
RETURN id(c1) as source, id(c2) as target,f.weight as weight',
"OUT",{graph:'cypher',partitionProperty:'lpa',iterations:10});

gotlpa.png

We can immediately observe that LPA returns much more granular communities than Louvain in this network. The biggest community consists of the Starks and the Lannisters, with the addition of Varys. It’s safe to say he deserves this spot. On the other hand, Jon is in a community with all the members of the Night’s Watch. Daenerys is also left out of the “strongest” community and shares a community with Jorah and ser Barristan. She just wasn’t such a badass in season 3 as she became in season 7 🙂


Neo4j Marvel Social Graph Algorithms Centralities

To top off the Marvel social graph series, we will look at how to use centralities on a projected graph via cypher queries to find influencers or otherwise important nodes in our network, using Neo4j and the neo4j-graph-algorithms library.

To recap the series:

Graph projections via cypher queries:

As we noticed in the previous part, using graph projections via cypher queries, or “cypher loading” for short, is really great, as it lets us filter and/or project virtual graphs easily and quickly. To let you take full advantage of this awesome tool, we need to understand exactly how it works.

Unlike the default label and relationship type loading of subsets of graphs, where we can in some cases define the direction of the relationship to be either “incoming”, “outgoing” or “both” (bidirected/undirected), cypher loading does not support loading a single relationship as undirected.

While this may seem like a limitation, it is actually not, as cypher loading allows us to get creative and try out graph algorithms on different virtual networks that we can project using cypher queries. We already did this in the previous post, but I haven’t described it in detail yet.

Imagine that we have two hero nodes and a single directed relationship between them. The only difference between loading this graph as directed or undirected is whether we specify the direction of the relationship in the cypher query. When we do not specify the direction, the cypher engine returns each relationship twice, once in each direction, which in turn projects our network as bidirected or undirected.

Projecting a directed network:

MATCH (u1:Hero)-[rel:KNOWS]->(u2:Hero)
RETURN id(u1) as source, id(u2) as target

Projecting an undirected network:

MATCH (u1:Hero)-[rel:KNOWS]-(u2:Hero)
RETURN id(u1) as source, id(u2) as target
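
You can observe the doubling directly. On an empty scratch database with a single directed KNOWS relationship (a hypothetical two-node example), the undirected pattern returns two rows, one per direction:

// hypothetical example: a single directed relationship
CREATE (:Hero {name:'A'})-[:KNOWS]->(:Hero {name:'B'});

// the undirected pattern matches it twice, once in each direction
MATCH (u1:Hero)-[:KNOWS]-(u2:Hero)
RETURN u1.name as source, u2.name as target
// returns (A,B) and (B,A)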

Centralities

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.[1]

Pagerank

PageRank is the algorithm made popular by Google Search. It works by counting the number and quality of links to a node to arrive at a rough estimate of how important that node is. The underlying assumption is that more important nodes are likely to receive more links from other nodes.
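
As a rough sketch of the math behind it, the textbook formulation of the pagerank of a node u, with damping factor d (typically 0.85), N nodes, In(u) the set of nodes linking to u and |Out(v)| the out-degree of v, is:

PR(u) = \frac{1-d}{N} + d \sum_{v \in In(u)} \frac{PR(v)}{|Out(v)|}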

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

call algo.pageRank.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero)
// Weight threshold 
WHERE k.weight >= 100
RETURN id(u1) as source, id(u2) as target
',{graph:'cypher'}
) YIELD nodeId, score
WITH nodeId, score ORDER BY score DESC LIMIT 10
MATCH (node:Hero) where id(node) = nodeId
RETURN node.name as name, score;

Captain America has the highest pagerank score. He is located in the middle of the network with a total of 24 relationships, including relationships to most of the other important heroes in the network, like Thor, Spiderman and Iron Man. If we check all the heroes related to Captain America, we can notice that they have on average a higher pagerank score simply because of this relationship to Captain America.

marvel_pagerank.png

* Node color from white (less) to red (more): Pagerank

Closeness centrality

Closeness centrality is defined as the total number of links separating a node from all others along the shortest possible paths. In other words, to calculate closeness, one begins by calculating, for each pair of nodes in the network, the length of the shortest path from one to the other (aka the geodesic distance). Then for each node, one sums up the total distance from the node to all other nodes.[2]

Closeness can be interpreted as an index of time-until-arrival of something flowing through the network. The greater the raw closeness score, the greater the time it takes on average for information originating at random points in the network to arrive at the node. Equally, one can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.[2]
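
Put into formulas, with d(u,v) the shortest-path distance between nodes u and v and n the number of nodes, the raw score (farness) and the closeness index derived from it are (a standard textbook formulation; normalization details can vary between implementations):

farness(u) = \sum_{v \neq u} d(u,v) \qquad closeness(u) = \frac{n-1}{farness(u)}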

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships. With closeness centrality it is especially important that we load only a single component.

Unfortunately, when the graph is unconnected, closeness centrality appears to be useless, because the distance between two nodes belonging to different components is infinite by convention, which makes the sum of distances infinite too and therefore its inverse equal to zero. For every node of such a graph, there is always another node belonging to another component: the indices of all vertices of the graph are therefore useless, and the calculation of the index is limited to the largest component, omitting the roles played by individuals of other components.[3]

CALL algo.closeness.stream(
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Weight threshold 
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

Captain America is in such a privileged position that he leads in all categories of centralities. We can observe that nodes in tighter communities have higher closeness centrality scores, while those on the fringes of the network and less connected have smaller values. Secondly, we can notice that the overall position of a node in the graph also matters, as the middle community has on average a higher closeness centrality than the others. As an example, both Iron Man and Vision have a higher closeness centrality than Spiderman, while Spiderman has a higher pagerank score than them.

marvel_closeness.png

* Node color from white (less) to red (more): Closeness centrality

Harmonic Centrality

The harmonic mean has been known since the time of Pythagoras and Plato as the mean expressing “harmonious and tuneful ratios”, and later has been employed by musicians to formalize the diatonic scale, and by architects as paradigm for beautiful proportions.[4]

Social network analysis is a rapidly expanding interdisciplinary field, growing from the work of sociologists, physicists, historians, mathematicians, political scientists, etc. Some methods have been commonly accepted in spite of defects, perhaps because of the rareness of synthetic work like (Freeman, 1978; Faust & Wasserman, 1992). Harmonic centrality was proposed as an alternative index to closeness centrality, defined on undirected networks. Results show that its computation on real cases is nearly identical to that of the closeness centrality index, with the same computational complexity. An important property is its applicability to unconnected networks.[3]
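
The trick that makes harmonic centrality usable on unconnected networks is that it sums inverse distances instead of inverting a sum of distances, with the convention 1/\infty = 0 for unreachable pairs:

harmonic(u) = \sum_{v \neq u} \frac{1}{d(u,v)}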

CALL algo.closeness.harmonic.stream( 
// Match only the biggest component 
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id 
',' 
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Weight threshold 
WHERE k.weight >= 100 
RETURN id(u1) as source,id(u2) as target '
,{graph:'cypher'}) YIELD nodeId, centrality 
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10 
MATCH (h:Hero) where id(h)=nodeId 
RETURN h.name as hero, centrality

Harmonic centrality was proposed as an alternative to closeness centrality to help solve the problem of disconnected components. Because of this, we get back very similar results, given that we are again working with a single connected component.

marvel_harmonic

Betweenness Centrality

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.[6]
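
In formula form, with \sigma_{st} the number of shortest paths between s and t, and \sigma_{st}(v) the number of those paths passing through v:

betweenness(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}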

More in documentation.

We will use cypher loading to load only the nodes of the biggest component and set a weight threshold of 100 for relationships.

CALL algo.betweenness.stream(
// Match only the biggest component
'MATCH (u:Hero) WHERE u.component = 136 RETURN id(u) as id
','
MATCH (u1:Hero)-[k:KNOWS]-(u2:Hero) 
// Weight threshold
WHERE k.weight >= 100
RETURN id(u1) as source,id(u2) as target
',{graph:'cypher'}) YIELD nodeId, centrality
WITH nodeId,centrality 
ORDER BY centrality DESC LIMIT 10
MATCH (h:Hero) where id(h)=nodeId
RETURN h.name as hero, centrality

As always, Captain America is in first place, this time with Beast in second place. This comes as no surprise, as we can observe that Beast is the sole bridge between the middle and the right community. Spiderman and the Incredible Hulk play a similar role to Beast, but have smaller communities behind them and hence also smaller betweenness scores.

marvel_betweeness.png

* Node color from white (less) to red (more): Betweenness centrality

References:

[1] https://en.wikipedia.org/wiki/Centrality

[2] http://qualquant.org/wp-content/uploads/networks/2008%201-7-3.pdf

[3] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf

[4] https://arxiv.org/pdf/cond-mat/0008357.pdf

[6] https://en.wikipedia.org/wiki/Betweenness_centrality