Chaining graph algorithms in Neo4j on the countries of the world graph

The idea for this blog post is to show how to chain similarity algorithms with community detection algorithms in Neo4j without ever storing any information back to the graph. If for example you have only read rights to Neo4j or you don’t want to store anything to the graph while analyzing it, then chaining algorithms is for you.

As I used a new dataset to play around, the post also shows how to import and preprocess the graph for the analysis.


We use Countries of the World dataset, made available by Fernando Lasso on Kaggle. It contains information such as birthrate, GDP, infant mortality and others about 227 countries of the world.

Let’s start by researching how many missing values are there in the CSV file.

Countries of the world.csv file must be copied to the Neo4j/import folder to be able to run the missing values query.

Missing values query

LOAD CSV WITH HEADERS FROM "file:///countries%20of%20the%20world.csv" as row
UNWIND keys(row) as key
  sum(CASE WHEN row[key] is null THEN 1 ELSE 0 END) as missing_value
ORDER BY missing_value DESC LIMIT 15

Missing values results


As expected with any real-world dataset there are some missing values. Fortunately, only five features have some missing values while others have close to zero. Features with more than five percent of missing values (11+) are not considered in our analysis.


Graph model

Schema of the graph is simple. We have one type of nodes representing countries. They have some features stored as properties and are also connected to the to the region they belong in with a relationship.


We could define unique constraints for labels Country and Region, but since our graph is so tiny we’ll skip this today and proceed with the import.

Import query

LOAD CSV WITH HEADERS FROM "file:///countries%20of%20the%20world.csv" as row
MERGE (c:Country{name:row.Country})
ON CREATE SET c +=,['Country','Region'],[])
MERGE (r:Region{name:row.Region})
MERGE (c)-[:IN]->(r)


When I first explored this dataset in Neo4j I got back weird results and didn’t exactly understand why. After a couple of minutes, I found the culprit. If we run the following cypher query RETURN toFloat("0,1")  it returns a null value. I later found out that this is a Java limitation and not specifically Neo4j. The numbers in this dataset use the comma as a decimal point. This doesn’t work for us and we need to replace the commas with the dots to be able to store them as a float.

Let’s replace the commas for dots and store the new values as floats.

MATCH (c:Country)
UNWIND keys(c) as key
WITH c, key, toFloat(replace(c[key],',','.')) as fixed_float
WHERE NOT key = 'name'
CALL apoc.create.setProperty(c, key, fixed_float) YIELD node
RETURN distinct 'done'

Feature selection

We don’t use all the features of countries that are available in our analysis. I cherry-picked a couple of features that have little to no missing values, specifically:

  • Birthrate
  • Infant mortality
  • GDP
  • Arable %
  • Crops %
  • Phones (per 1000)

Let’s check basic statistics about our features with apoc.agg.statistics procedure.

UNWIND ["Birthrate", "Infant mortality (per 1000 births)","GDP ($ per capita)",
"Arable (%)", "Crops (%)", "Phones (per 1000)"] as potential_feature
MATCH (c:Country) 
WITH potential_feature, apoc.agg.statistics(c[potential_feature],[0.5,0.75,0.9,0.95,0.99]) as stats
RETURN potential_feature, 
       apoc.math.round(stats.min,2) as min, 
       apoc.math.round(stats.max,2) as max, 
       apoc.math.round(stats.mean,2) as mean, 
       apoc.math.round(stats.stdev,2) as stdev,
       apoc.math.round(stats.`0.5`,2) as p50,
       apoc.math.round(stats.`0.75`,2) as p75,
       apoc.math.round(stats.`0.95`,2) as p95,
       apoc.math.round(stats.`0.99`,2) as p99



I found it interesting that Monaco has 1035 phones per 1000 persons and so more phones than people. On the other side of the spectrum, I found it shocking that Angola has (not sure exactly which year) 191,19 infant deaths per 1000 birth.

MinMax Normalization

MinMax normalization is used to scale all values of features between 0 and 1. It is better known as a process of feature scaling or data normalization.

WITH ["Birthrate", 
  "Infant mortality (per 1000 births)",
  "GDP ($ per capita)",
  "Arable (%)",
  "Crops (%)",
  "Phones (per 1000)"] as features
UNWIND features as key
MATCH (c:Country)
WITH max(c[key]) as max,min(c[key]) as min,key
MATCH (c1:Country)
WITH c1, "normalized_" + key AS newKey, 
(toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) 
YIELD node
RETURN distinct 'normalization done'

Missing values

As we observed at the start, some features of the countries have missing values. A simple average of the region is used to fill in the missing values.

UNWIND ["normalized_Birthrate",
  "normalized_Infant mortality (per 1000 births)",
  "normalized_GDP ($ per capita)",
  "normalized_Arable (%)",
  "normalized_Crops (%)",
  "normalized_Phones (per 1000)"] as feature
MATCH (c:Country)
WHERE c[feature] IS null
MATCH (c)-[:IN]->(r:Region)<-[:IN]-(other:Country)
WHERE other[feature] IS NOT null
WITH c,feature,avg(other[feature]) as avg_value
CALL apoc.create.setProperty(c, feature, avg_value) 
YIELD node
RETURN distinct 'missing values populated'


The analysis is split into two parts.

In the first part we find similar countries using cosine similarity algorithm.

In the second part we chain cosine similarity with Louvain algorithm to find communities of similar countries.

Cosine similarity algorithm

If you are new to cosine similarity metric I suggest you check out the documentation or the Wikipedia.

To get a feeling of how similar our countries are we run algo.similarity.cosine algorithm with write:false parameter. It returns global statistics about the similarity graph.

MATCH (c:Country) 
WITH id(c) as id, [c["normalized_Birthrate"], 
                   c["normalized_Infant mortality (per 1000 births)"], 
                   c["normalized_GDP ($ per capita)"], 
                   c["normalized_Arable (%)"], 
                   c["normalized_Crops (%)"], 
                   c["normalized_Phones (per 1000)"]] as weights 
WITH {item:id, weights: weights} as userData 
WITH collect(userData) as data 
CALL algo.similarity.cosine(data, {similarityCutoff: 0.0, write:false}) 
YIELD nodes, similarityPairs, min, max, mean, stdDev, p25, p50, p75, p90
RETURN nodes,similarityPairs,apoc.math.round(min,2) as min,
                             apoc.math.round(max,2) as max,
                             apoc.math.round(mean,2) as mean,
                             apoc.math.round(stdDev,2) as stdDev,
                             apoc.math.round(p25,2) as p25,
                             apoc.math.round(p50,2) as p50,
                             apoc.math.round(p75,2) as p75, 
                             apoc.math.round(p90,2) as p90



Even with similarityCutoff: 0.0, cosine similarity ranges between -1 and 1, we get an average of more than 100 relationships per node. Community detection algorithms don’t perform well on a very dense network. In our results each node is connected to about 50% of the nodes in the graph and community detection algorithms will most likely find one giant community that spans over most of the graph.

This is not a desired outcome when searching for communities. To avoid it we use a combination of similarityCutoff:0.8 and topK:10. Read more in the documentation.

Stream cosine similarity query

MATCH (c:Country)
WITH id(c) as id,
   c["normalized_Infant mortality (per 1000 births)"],
   c["normalized_GDP ($ per capita)"],
   c["normalized_Arable (%)"],
   c["normalized_Crops (%)"],
   c["normalized_Phones (per 1000)"]] as weights 
WITH {item:id, weights: weights} as userData
WITH collect(userData) as data
YIELD item1, item2, count1, count2, similarity
WITH item1,item2,similarity 
where item1 > item2
RETURN algo.getNodeById(item1).name AS from,
       algo.getNodeById(item2).name AS to,
ORDER BY similarity DESC LIMIT 10



Results show that African countries are the most similar countries in the world judging by our features.

Chaining Cosine Similarity and Louvain algorithm

We have prepared everything so now we can finally chain cosine similarity with Louvain algorithm. Louvain algorithm is run using cypher projection.

To project a graph with cypher projection we must provide two cypher statements. The first one, so-called node-statement, is where we provide all the ids of nodes the algorithm should consider. In the second statement, also called relationship-statement, we provide all the relationships of our graph. This is done by providing source and target node ids for all relationships we want to consider.

Node-statement is fairly trivial as we just provide all the countries. Cosine similarity algorithm will be put into the relationship-statement to provide all the relationships of the similarity graph to the Louvain algorithm.

  'MATCH (c:Country) RETURN id(c) as id',
  'MATCH (c:Country)
   WITH id(c) as id,[c["normalized_Birthrate"],
                     c["normalized_Infant mortality (per 1000 births)"],
                     c["normalized_GDP ($ per capita)"],
                     c["normalized_Arable (%)"],
                     c["normalized_Crops (%)"],
                     c["normalized_Phones (per 1000)"]] as weights 
   WITH {item:id, weights: weights} as userData
   WITH collect(userData) as data
   YIELD item1, item2, similarity
   RETURN item1 AS source, item2 AS target',
   {graph: 'cypher'})
YIELD nodeId,community
WITH algo.getNodeById(nodeId) as node, community
RETURN community,count(*) as community_size,
       apoc.math.round(avg(node["Birthrate"]),2) as avg_birthrate,
       apoc.math.round(avg(node["Infant mortality (per 1000 births)"]),2) as infant_mortality,
       apoc.math.round(avg(node["GDP ($ per capita)"]),2) as gdp,
       apoc.math.round(avg(node["Arable (%)"]),2) as arable, 
       apoc.math.round(avg(node["Crops (%)"]),2) as crops, 
       apoc.math.round(avg(node["Phones (per 1000)"]),2) as phones



Louvain algorithm found 7 communities in our graph. First community is the richest judging by the GDP, has the lowest infant mortality rate and interestingly also almost the lowest birthrate.

On the other side, there is the poorest community of countries with an average value of GDP of only 1633$. They lead in infant mortality, which is quite high at 88,61 deaths per 1000 births and also lead in birthrate.


I was also thinking of showing how to visualize communities in the graph using only streaming algorithms, but I decided not to as we would have to run the similarity algorithm twice in a single query and it doesn’t make sense.

Still, I want to end on a visualization of communities in the graph. Made in Gephi.



ArticleRank algorithm on a citation network in Neo4j

This article is a preview of the ArticleRank algorithm that will be available in the next release of Neo4j graph algorithms. If you can’t wait till then you can build the latest build yourself and try it out.

ArticleRank algorithm is a slight variation of the PageRank algorithm. It reduces the bias that PageRank has towards assigning higher scores to nodes with relationships from nodes that have few outgoing relationships.

Specifically, to find the score for a given node, PageRank counts up all the incoming scores and divides them by the number of outgoing links for each given incoming page. ArticleRank assigns a score by counting up all the incoming scores divides, and divides them by the average number of outgoing links plus the number of outgoing links for each given incoming page.[1]


We will be using High-energy physics theory citation network made available by Stanford Network Analysis Project. It contains 352,807 relationships (citations) between 27,770 articles.


Define graph schema


Import citations

LOAD CSV FROM "file:///Cit-HepTh.txt" as row FIELDTERMINATOR " "
WITH row skip 4
MERGE (a1:Article{id:row[0]})
MERGE (a2:Article{id:row[1]})
MERGE (a1)-[:CITES]->(a2);

Import titles of articles

For each article in the graph, there exists an .abs file containing its title and description. To import them programmatically we start with a MATCH of all articles and concatenate the filenames from the unique ids of the articles. In the second step we load the .abs file and write back the title of the article. For batching purposes, we are using apoc.periodic.iterate procedure with the parallel parameter set true.

CALL apoc.periodic.iterate("MATCH (a:Article)
    WITH a,case length( WHEN 4 THEN '000' + toString(
                             WHEN 5 THEN '00' + toString(
                             WHEN 6 THEN '0' + toString(
                             ELSE toString( END AS filename 
                             RETURN a,filename",
   "LOAD CSV FROM 'file:///' + filename + '.abs' as row FIELDTERMINATOR '\n'
    WITH a,row
    UNWIND row as r
    WITH a,split(r,'Title: ')[1] as title where r contains 'Title: '
    SET a.title = title",{batchSize:10,parallel:true})


Let’s find top ten articles using ArticleRank algorithm.

ArticleRank Query

CALL'Article', 'CITES', 
  {iterations:20, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).title AS article,score

ArticleRank Results

article score
“The Large N Limit of Superconformal Field Theories and Supergravity” 14.572952999999998
“String Theory Dynamics In Various Dimensions” 11.064807499999999
“Monopole Condensation, And Confinement In N=2 Supersymmetric Yang-Mills” 11.061662500000002
“Dirichlet-Branes and Ramond-Ramond Charges” 10.9979975
“Anti De Sitter Space And Holography” 10.440066000000002
“Gauge Theory Correlators from Non-Critical String Theory” 9.797559499999998
“M Theory As A Matrix Model: A Conjecture” 9.138546
“Unity of Superstring Dualities” 7.868136000000001
“Monopoles, Duality and Chiral Symmetry Breaking in N=2 Supersymmetric” 7.5854685
“Bound States Of Strings And $p$-Branes” 7.5395855

Let’s also run PageRank algorithm for comparison

PageRank Query

CALL'Article', 'CITES', 
{iterations:20, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).title AS article,score

PageRank Results

article score
“Monopole Condensation, And Confinement In N=2 Supersymmetric Yang-Mills” 83.25198399999998
“Noncompact Symmetries in String Theory” 78.86716550000001
“An Algorithm to Generate Classical Solutions for String Effective Action” 70.79724850000001
“String Theory Dynamics In Various Dimensions” 61.099139499999986
“Dirichlet-Branes and Ramond-Ramond Charges” 57.558133000000005
“Exact Results on the Space of Vacua of Four Dimensional SUSY Gauge” 52.274805
“The Large N Limit of Superconformal Field Theories and Supergravity” 46.086371500000006
“Unity of Superstring Dualities” 44.9701855
“Monopoles, Duality and Chiral Symmetry Breaking in N=2 Supersymmetric” 42.74964549999999

Article “The Large N Limit of Superconformal Field Theories and Supergravity” leads the score when using ArticleRank algorithm, but is only seventh when using PageRank. We can observe a couple of changes in the top ten list.


As mentioned ArticleRank algorithm is a slight variant of the PageRank algorithm. It can be used in various domains and scenarios just like its predecessor.

kNN Classification of members of congress using similarity algorithms in Neo4j

k-Nearest Neighbors algorithm

Image taken from wikipedia,

Couple of days ago I was presenting “How to use similarity algorithms” in a Neo4j online meetup with Mark Needham. Among other use-cases we discussed how they could be used as a part of kNN classification algorithm. kNN classification algorithm works as follows. To predict the class of a node, we find its top k neighbours and select their majority class as the predicted class.

This blog post will show how to use kNN classification in Neo4j.



This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).

The information about the dataset and the dataset itself, made available by Jeff Schlimmer, can be found at UCI machine learning repository.

LOAD CSV FROM "" as row
CREATE (p:Person)
SET p.class = row[0],
    p.features = row[1..];

Missing votes

Let’s check how many members of congress have at least one missing vote.

MATCH (n:Person) 
WHERE "?" in n.features
RETURN count(n)



Almost half of the members of the congress have missing votes. That’s quite significant so let’s dig deeper. We’ll check out what’s the distribution of missing votes per member.

MATCH (p:Person)
WHERE '?' in p.features
WITH p,apoc.coll.occurrences(p.features,'?') as missing
RETURN missing,count(*) as times ORDER BY missing ASC


missing times
1 124
2 43
3 16
4 6
5 5
6 4
7 1
9 1
14 1
15 1
16 1

Three members almost never voted (14,15,16 missing votes) and two of the them (7,8  missing votes) have more than 50% missing votes. We will exclude them from our further analysis in order to try to reduce noise.

MATCH (p:Person)
WITH p,apoc.coll.occurrences(p.features,'?') as missing
WHERE missing > 6

Training and test data

Lets divide our dataset into two subsets, where 80% of nodes will be marked as training data and the rest 20% as test data. There is a total of 430 nodes in our graph. We will be mark 344 nodes as the training subset and the rest as test.

Mark training data

MATCH (p:Person)
SET p:Training;

Mark test data

MATCH (p:Person)
SET p:Test;

Feature vector

There are three possible values in the feature sets. We will map them as follows:

  • “y” to 1
  • “n” to 0
  • “?” to 0.5

Transform to feature vector

MATCH (n:Person)
UNWIND n.features as feature
WITH n,collect(CASE feature WHEN 'y' THEN 1
                            WHEN 'n' THEN 0
                            ELSE 0.5 END) as feature_vector
SET n.feature_vector = feature_vector

Example feature set transformation to feature vector

features feature_vector
[“n”, “y”, “n”, “y”, “y”, “n”, “n”, “n”, “n”, “n”, “?”, “?”, “y”, “y”, “n”, “n”] [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0.5, 0.5, 1, 1, 0, 0]

kNN classifier algorithm

We will use euclidian distance as the distance function and topK value of 3. It is advisable to use an odd number as K to avoid producing edge cases, where for example with top two neighbours and each having a different class, we end up with no majority class, but a 50/50 split between the two.

We have a specific situation where we want to start with all nodes labelled Test and find top three neighbour nodes from Training subset only. Otherwise all the nodes labelled for Test would also be considered part of Training data, which is something we want to avoid. This is exactly the reason why we can’t use a simple query like:

CALL algo.similarity.euclidean(data, {topK: 3, write:true})

We will need to use in order to limit match results per row instead of per query. This will help us get back top 3 neighbours per node. Find more in the knowledge base.

We will also use a combination of apoc.coll.frequencies and apoc.coll.sortMaps to get back the most occurring element in a collection.

apoc.coll.sortMaps(apoc.coll.frequencies(collection), '^count')[-1]

Find more in documentation.



MATCH (test:Test)
WITH test,test.feature_vector as feature_vector
CALL'MATCH (training:Training)
    // calculate euclidian distance between each test node and all training nodes
    WITH training,algo.similarity.euclideanDistance($feature_vector, training.feature_vector) AS similarity
    // return only top 3 nodes
    ORDER BY similarity ASC LIMIT 3
    RETURN collect(training.class) as classes',
    {feature_vector:feature_vector}) YIELD value
WITH test.class as class, apoc.coll.sortMaps(apoc.coll.frequencies(value.classes), '^count')[-1].item as predicted_class
WITH sum(CASE when class = predicted_class THEN 1 ELSE 0 END) as correct_predictions, count(*) as total_predictions
RETURN correct_predictions,total_predictions, correct_predictions / toFloat(total_predictions) as ratio


correct_predictions total_predictions ratio
77 86 0.8953488372093024

We correctly predicted the result 77 out of 86 times, which is really nice.


I hope in this blog post you have learned how to use graph algorithms and APOC together to run kNN classification algorithm in Neo4j and will now be able to use it on your own graph.

In theory, this approach could be used for more than just binary classification as the algorithm takes the majority of neighbors classes and there is no technical barrier to using more than two classes.

Feature extraction on a peer to peer network with DeepGL embedding and Neo4j

I came across an article Extracting ML-Features from Graph Data with DeepGL on Neo4j, written by Mark Needham, where he introduced the latest addition to the Neo4j Statistics/Machine Learning Procedures project, which is DeepGL embedding algorithm. It looked cool so I decided to try it out myself.

I am not a math wizard, but as far as I understand, DeepGL embedding algorithm extracts features of each node based on the position of the node in the graph and other characteristics of the node such as in-degree and out-degree. We can also define some custom characteristics. Check the above article for more information.


We will be using Gnutella peer-to-peer network made available by Stanford Academy. Dataset contains information about hosts in the Gnutella network and connections between the Gnutella hosts.

First we define the schema.


Import the data.

"file:///p2p-Gnutella09.txt" as row fieldterminator ' '
MERGE (h1:Host{id:row[0]})
MERGE (h2:Host{id:row[1]})
MERGE (h1)-[:CONNECTION]->(h2)

Graph embedding

We will include pageRank values as base features for the DeepGL algorithm, so we need to calculate it and write it back.

CALL algo.pageRank('Host', 'CONNECTION')

Run DeepGL embedding with 2 iterations and pageRank as an additional node feature.

CALL embedding.deepgl("Host","CONNECTION", {
  nodeFeatures: ['pagerank'],
  iterations: 2

Cosine similarity

Now that we have extracted node’s features with DeepGL algorithm, we can infer a new network based on the cosine similarity of these features between hosts.

I prefer to use similarityCutoff parameter that represents the similarity threshold above which relationships should be returned or stored as in our case, but you could also use topK parameter which would effectively turn this algorithm into a kNN algorithm.

MATCH (h:Host)
WITH {item:id(h), weights: h.embedding} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine(data, {similarityCutoff: 0.8, write:true})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs,

Community detection

Run Louvain algorithm on this new inferred network to find communities of similar hosts. I will not pretend that I understand how deepGL embedding works, but I saw that it takes in-degree and out-degree as base features, so we will compare average in-degree, out-degree and pageRank values between biggest five communities.

CALL'Host', 'SIMILAR', {})
YIELD nodeId, community
MATCH (n) WHERE id(n)=nodeId
RETURN community,
       avg(size((n)-[:CONNECTION]->())) as out_degree,
       avg(size((n)<-[:CONNECTION]-())) as in_degree,
       avg(n.pagerank) as pagerank,
       count(*) as size
ORDER by size DESC limit 5

deepgl_louvain.pngThe biggest community seems to be also the most important community judging by pageRank. Members of biggest community have high out-degree and in-degree compared to others. They could be regarded as the heart that keeps the network alive. Second largest community is the least important judging by pageRank and usually members have no outgoing relationship and only one incoming relationship. They just silently sit in the corner. Third community could be regarded as “receivers” as they have little to none outgoing relationships, but more than 8 incoming relationships on average and so on…


Next step would be to find a cool use-case, where we would feed DeepGL features to a machine learning model and see how it behaves. Maybe that will be my next blog post, who knows 🙂

Discovering hiearchical community structure in GoT world using Louvain Algorithm and Neo4j

In the recent past phase two of Louvain algorithm was implemented in Neo4j graph algorithms library. It may not seem much, but it has great implications on our ability to explore community structures in Neo4j.

Here I will show you some of those implications.


As always lets first define the schema.


We will be using Network of thrones dataset made available by Andrew Beveridge, specifically volume 4 through 5 titled “A feast with dragons”. It contains information about 506 characters and 1329 relationships between them.

MERGE (src:Character {name: row.Source}) 
MERGE (tgt:Character {name: row.Target}) 
MERGE (src)-[r:INTERACTS]->(tgt) 
ON CREATE SET r.weight = toInt(row.Weight)

Louvain algorithm

Louvain algorithm is roughly defined as follows:

The method consists of repeated application of two steps. The first step is a “greedy” assignment of nodes to communities, favoring local optimizations of modularity. The second step is the definition of a new coarse-grained network, based on the communities found in the first step. These two steps are repeated until no further modularity-increasing reassignments of communities are possible.

Taken from the documentation

As stated Louvain algorithm repeats two steps until no further modularity reassignments are possible. The great thing about this is that we can check community structure at the end of each iteration of two steps mentioned above and get a glimpse of how small communities are expanding into bigger ones.

Run the Louvain algorithm with includeIntermediateCommunities parameter to be true. This will save community id at each iteration into an array property named communities.

CALL algo.louvain('Character', 'INTERACTS', 
  {includeIntermediateCommunities: true})

Lets check how many distinct communities exist at the end of each iteration to get a rough sense of results .

MATCH (c:Character)
UNWIND range(0,length(c.communities) - 1) as x
RETURN x + 1 as iteration,
       count(distinct c.communities[x]) as number_of_communities,
       count(*) / count(distinct c.communities[x]) as average_community_size
ORDER BY iteration ASC

We can observe that at the end of first iteration we have quite small communities with an average community size of only 5. This quickly changes at the end of second iteration as average community size rises up 42. Third iteration feels like a minor fix and the average community size ends up 50.

iteration number_of_communities average_community_size
1 88 5
2 12 42
3 10 50

Jon Snow

Lets focus on Jon Snow for a moment and check how his community looks like at the end of each iteration. As we saw before we should get a fairly small community after first iteration and his community will expand with each sequential iteration.

MATCH (c:Character{name:"Jon-Snow"})
UNWIND range(0,length(c.communities) - 1) as x
MATCH (c1:Character)
WHERE c.communities[x] = c1.communities[x]
RETURN x + 1 as iteration, 
       collect( as community,
       count(*) as community_size
ORDER BY iteration ASC

After first iteration Jon Snow’s community has only 8 members, which compose of a couple of free folks and some Night’s Watch men if my GoT knowledge is any accurate. As expected second and third iteration community expand quite a bit.

iteration community community_size
1 [“Jon-Snow”, “Alliser-Thorne”, “Benjen-Stark”, “Cotter-Pyke”, “Denys-Mallister”, “Soren-Shieldbreaker”, “Tom-Barleycorn”, “Ulmer”] 8
2 [“Aegon-V-Targaryen”, “Aemon-Targaryen-(Maester-Aemon)”, “Clydas”, “Dareon”,  “Dalla”, “Eddison-Tollett”, “Melisandre”, …] 70
3 [“Aegon-V-Targaryen”, “Aemon-Targaryen-(Maester-Aemon)”, “Alleras”, “Clydas” …] 91

Gephi visualizations

I am more visual type than anything else so I decided to use Gephi to visualize how communities expand over each iteration. For more detailed instructions how to combine Neo4j and Gephi read my previous blog.

We need to store community id of each iteration as a separate property in order to be able to export to and visualize with Gephi.

MATCH (c:Character)
UNWIND range(0,length(c.communities) - 1) as x
CALL apoc.create.setProperty(c, 'communityy_' + toString(x), c.communities[x]) YIELD node
RETURN distinct 'done'

Finally send the data to Gephi.

MATCH path = (:Character)-[:INTERACTS]->(:Character)
CALL apoc.gephi.add(null,'workspace1',path,'weight',
  ['community_0','community_1','community_2']) yield nodes
return distinct true

First iteration


Communities: 88
Average community size: 5

Second iteration


Communities: 12
Average community size: 42

Third iteration


Communities: 10
Average community size: 50


Checking intermediate community structure at the end of each iteration in Louvain algorithm is a powerful tool that lets us find small communities that are otherwise overlooked. It also allows us to search for hierarchical community structures as for example a single community at the third iteration of the algorithm can be composed out of three different communities at the second iteration. This would imply some sort of hierarchical connection and structure.

Running weighted pageRank on a telecommunications network in Neo4j

Couple of days guys at Neo4j graph algorithms library added weighted pageRank support. Here I’ll demonstrate it and show you how to use it in your own graph analysis.

We will be using AT&T Network dataset made available by MritunjayMohitesh on kaggle. Dataset contains data about connections between IPs with additional meta-data such as as number of bytes sent/received.

First we define the graph schema.


Download the dataset and copy it into $neo4j/import folder. We use PERIODIC COMMIT option as there is almost 340k relationships in the dataset.

LOAD CSV WITH HEADERS FROM "file:///network_data.csv" as row
MERGE (source:IP{ip:row.source_ip})
MERGE (destination:IP{ip:row.destination_ip})
CREATE (source)-[:CONNECTION{bytes:toINT(row.num_bytes)}]->(destination);

Non-weighted pageRank

Lets run non-weighted pageRank for future reference. Check my previous blog posts for more details about pageRank.

YIELD nodeId,score
RETURN algo.getNodeById(nodeId).ip as ip, score
ip score
“135.0777d.04511.232” 131.4211185
“135.0777d.04511.237” 86.0445825
“135.76479.7ef60.21” 2.288821
“135.76479.7ef60.13” 1.6532760000000002
“135.fa7cd.19ca1.167” 1.1636505000000004
“135.b1d10.115f8.14” 1.0137785
“135.b1d10.115f8.10” 1.0137785
“155.8f532.f0935.124” 0.9813935000000001
“64.903ce.0266e.184” 0.9667224999999998
“135.b1d10.ed3d2.151” 0.9667224999999998

Results show two key players among IPs that are far away from the third one with more than 40x+ the pageRank value.

Weighted pageRank

With weighted pageRank we can take into account also how much traffic was transferred in bytes total between IPs when calculating pageRank.

To keep it easy to understand we will first infer a graph where we store the total bytes transferred between IPs as a property of a new relationship.

WITH i1,i2,sum(c.bytes) as weight
MERGE (i1)-[s:SUM]->(i2)
ON CREATE SET s.weight = weight

We can now run weighted pageRank on this new network

YIELD nodeId,score
RETURN algo.getNodeById(nodeId).ip as ip, score
ip score
“135.0777d.04511.237” 191.78568750000002
“135.0777d.04511.232” 49.986859999999986
“135.b1d10.d1c38.89” 6.7915685
“135.b1d10.d1c38.90” 6.2789505000000005
“135.b1d10.d1c38.197” 6.006040999999999
“135.b1d10.9cfdf.111” 5.547508500000002
“135.b1d10.b1d10.82” 5.360330000000001
“135.b1d10.b1d10.73” 5.312220000000001
“135.b1d10.d1c38.196” 5.302249499999999
“135.b1d10.6f3ef.100” 5.291539499999999

Two key players shift positions with IP “135.0777d.04511.237” now having four times the “135.0777d.04511.232” pageRank score whereas before it had just 65% of it. That’s a big difference.

Cypher projection

To keep our graph tidy and avoid storing new relationships we can use cypher projection. This way we project virtual relationships between IPs and use total bytes transferred between IPs as a weight in one step without having to store new relationship.

  "MATCH (i:IP) RETURN id(i) as id",
  "MATCH (s:IP)-[c:CONNECTION]->(t:IP) 
   RETURN id(s) as source,id(t) as target, sum(c.bytes) as weight",
YIELD nodeId,score
RETURN algo.getNodeById(nodeId).ip as ip, score
ip score
“135.0777d.04511.237” 191.78568750000002
“135.0777d.04511.232” 49.986859999999986
“135.b1d10.d1c38.89” 6.7915685
“135.b1d10.d1c38.90” 6.2789505000000005
“135.b1d10.d1c38.197” 6.006040999999999
“135.b1d10.9cfdf.111” 5.547508500000002
“135.b1d10.b1d10.82” 5.360330000000001
“135.b1d10.b1d10.73” 5.312220000000001
“135.b1d10.d1c38.196” 5.302249499999999
“135.b1d10.6f3ef.100” 5.291539499999999

As expected we get the same results as before when we stored new relationships and ran the weighted pageRank algorithm on it.

It’s that easy to run weighted pageRank algorithm!


Community detection based on Jaccard similarity index with Neo4j

Jaccard index Keyword Tool Similarity Keyword research Hash function - intersection @kisspng

Recently similarity algorithms were introduced in Neo4j graph algorithms library, so I decided to show how easy it has become to infer a graph using Jaccard similarity and then run Community detection algorithms on it.

We will be using the Electronic Products and Pricing Data from Kaggle.

Lets first define the schema.

CALL apoc.schema.assert(

Download the dataset and put in into the $neo4j/import folder. Use the following query to import the data into Neo4j.

"file:///DatafinitiElectronicsProductsPricingData.csv" as row
MERGE (p:Product{})
              p.weight = row.weight,
              p.price_max = row.`prices.amountMax`, 
              p.price_min = row.`prices.amountMin`
WITH p,split(row.categories,",") as categories
UNWIND categories as category
MERGE (c:Category{name:category})
MERGE (p)-[:PART_OF]->(c)

Infer graph with Jaccard index similarity

The Jaccard index is a statistic used for comparing the similarity between pairs of sample sets or nodes in our example. It is defined as the size of the intersection divided by the size of the union of the sample sets.

We will infer a similarity graph of categories based on Jaccard similarity of the set of products that are in a specific category.

An example would be category A has 5 products and category B has 2 products it shares with category A and 2 products that aren’t common.

  • A = {0,1,2,3,4}
  • B = {3,4,5,6}

Similarity: J(A,B) = |A∩B| / |A∪B| = |{3,4}| / |{0,1,2,3,4,5,6}| = 2/7 = 0.28

MATCH (p:Product)-[:PART_OF]->(category)
WITH {item:id(category), categories: collect(id(p))} as productData
WITH collect(productData) as data
CALL algo.similarity.jaccard(data, {similarityCutoff:0.5, write:true})
YIELD nodes, similarityPairs, stdDev, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, stdDev, p25, p50, p75, p90, p95

Community detection

We will use label propagation algorithm to find communities of similar categories in our graph. Algorithm returns non-overlapping communities.

YIELD nodeId, label
RETURN label, 
       collect(algo.getNodeById(nodeId).name) as categories, 
       count(*) as size 

Results came out quite nice. One of the communities has the following members:

  • “Wall Switches”
  • “Network Hubs & Routers – Works with Google Assistant”
  • “Smart Home Assistants & Voice Control”,
  • “Smart Energy and Lighting”,
  • “See more Logitech Harmony Smart Control with Smartphone…”,
  • “Smart Switches”,
  • “Outlets & Dimmers”,
  • “See more Logitech Harmony Smart Control Universal Remot…”

We have done the whole analysis with only 2 queries, which is very neat and allows for fast and fun explorations of various datasets. Analysis could be easily turned around to find similar products based on Jaccard index and then find communities with the highest price per kilogram value or whatever might interest you.

Article recommendation system on a citation network using Personalized Pagerank and Neo4j

Recently personalized pageRank was added to Neo4j graph algorithms library and I was looking for a cool example to demonstrate its applications. I found a great example in a Efficient Algorithms for Personalized PageRank research paper written by Peter Lofgren, in which he states:

As an example of how changing the source of the Personalized PageRank algorithm results in different rankings, we consider personalized search on a citation graph. On a citation graph provided by Citeseer, we created a demo which given a keyword and researcher name, finds all papers containing that keyword, and ranks them from the point of view of the given researcher. For a given researcher, we find all papers by that researcher, and define a source distribution giving equal weight to those papers. We then use Personalized PageRank on the citation graph to rank papers matching the given keyword. As an example, the keyword “entropy” means different things to different researchers, so we compare the top results for keyword “entropy” from different points of view.
Let’s recreate this scenario with Neo4j!


Once you have downloaded all the plugins and put them in the Neo4j plugins folder you have to enable them by adding the following configuration to the Neo4j.conf file.


Graph model


We start with a simple graph model consisting of nodes labeled Author, which can have one or more relationships AUTHOR to nodes labeled Article. At the same time Article can have a relationship REFERENCE to other nodes labeled Article, indicating the reference.

Graph schema

To optimize our queries and make full use of indexes, we define the graph schema with unique constraints for label Article on “index” property and label Author on “name” property.

CALL apoc.schema.assert(

Import data

We use the citation network v10 made available on It is the latest version of the dataset and more importantly first version stored as JSON.

Find more about the dataset in the ArnetMiner: Extraction and Mining of Academic Social Networks research paper[1].

Importing data into Neo4j is done in two steps. First we import all articles and their authors. In the second step we import all references of the articles.
We use apoc.periodic.iterate procedure for batch importing.

Import articles and authors

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file 
   CALL apoc.load.json("file:///neo4j/import/" + file) 
   yield value return value',
  'MERGE (a:Article{}) 
   ON CREATE SET a +=,["id","authors","references"],[0]) 
   WITH a,value.authors as authors 
   UNWIND authors as author 
   MERGE (b:Author{name:author}) 
   MERGE (b)-[:AUTHOR]->(a)'
,{batchSize: 10000, iterateList: true})

Import references

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file 
   CALL apoc.load.json("file:///neo4j/import/" + file) 
   yield value return value',
  'MERGE (a:Article{}) 
   WITH a,value.references as references 
   UNWIND references as reference 
   MERGE (b:Article{index:reference}) 
   MERGE (a)-[:REFERENCES]->(b)'
,{batchSize: 10000, iterateList: true})

PageRank algorithm

PageRank was designed to analyze the importance of website pages. It takes into account the number of links a website has and also the quality of those links. It’s quite different if for example a website has a link from the first page of Reddit or for example my blog 🙂

The process can be easily translated to the citation network domain. The citation itself can be viewed as an “upvote” from one article to another. Which article has the most “upvotes” from other quality articles is a question PageRank algorithm can help us answer.

We use pageRank algorithm on the global citation network to find the most important and influential articles within our graph.

Run pageRank and store results as a property of nodes

CALL algo.pageRank('Article', 'REFERENCES')

Most important articles by pagerank

MATCH (a:Article)
RETURN a.title as article,
       a.pagerank as score


Natural language processing

We want to recommend articles by keyword selection, so we need to extract keywords from our graph. Thanks to Graphaware’s NLP plugin this is a simple process even if you have little to no understanding what is going on beneath the hood of NLP algorithms.

The process will enrich our graph schema with additional node labels and relationship types as shown in the below picture.


Define NLP graph schema

To optimize the NLP process we define a set of unique constraints and indexes neatly available as a single procedure.

CALL ga.nlp.createSchema()

Add pipeline

Define the configuration of our pipeline. Find more details in the documentation.

CALL ga.nlp.processor.addPipeline({
  textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', 
  name: 'defaultPipeline', 
  threadNumber: 4
  processingSteps: {tokenize: true, 
                     ner: true, 
                     dependency: false}})

Set default pipeline

CALL ga.nlp.processor.pipeline.default('defaultPipeline')

Annotate text

The original text is broken down into words, parts of speech, and functions. This analysis of the text acts as a starting point for the later steps.[2]

If you want to learn more about it I suggest you to check out Reverse Engineering Book Stories with Neo4j and GraphAware NLP blog post written by Christophe Willemsen.

CALL apoc.periodic.iterate(
  "MATCH (n:Article) WHERE exists (n.title) RETURN n",
  "CALL ga.nlp.annotate({text: n.title, id: id(n)}) 
   YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)",
{batchSize:1, iterateList:true})

Keyword extraction

The TextRank algorithm is a relatively simple, unsupervised method of text summarization directly applicable to the topic extraction task. Its objective is to retrieve keywords and construct key phrases that are most descriptive of a given document by building a graph of word co-occurrences and ranking the importance of individual words using the PageRank algorithm.

Taken from Efficient unsupervised keywords extraction using graphs[3]

CALL apoc.periodic.iterate(
  "MATCH (a:AnnotatedText) RETURN a",
  "CALL{annotatedText: a}) YIELD result
   RETURN distinct 'done' ",

Get the 10 most occurring keywords in the articles titles

Lets check which are the keywords that describe the most articles in our graph.

MATCH (k:Keyword)-[:DESCRIBES]->()
WHERE k.numTerms > 1
RETURN k.value as Keyphrase, 
       count(*) AS n_articles
ORDER BY n_articles DESC


Basic article recommendation

If you followed this blog post step by step, you are now able to create a basic article recommendation system based on the global pageRank score of articles and keywords extracted from the NLP process.

Recommend top ten articles described by the keyword “social networks“.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
RETURN a.title as title, a.pagerank as p


Personalized Pagerank algorithm

Personalized PageRank algorithm returns the pageRank score of nodes in a graph from the point of view of one or more given source node.

Lets calculate the global pageRank again, but this time using articles described by keyword “social networks” as source nodes.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
WITH collect(a) as articles
CALL'Article', 'REFERENCES', {sourceNodes: articles})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score


The anatomy of a large-scale hypertextual Web search engine research paper by Sergey Brin and Larry Page comes out top. We could conclude that Google’s early work with graphs and pageRank had a big effect on social network research papers. Cool easter egg 🙂

Personalized recommendation system

If you remember the goal of this blog post was to recreate the following example.

As an example, the keyword “entropy” means different things to different researchers, so we compare the top results for keyword “entropy” from different points of view.

We start by matching all articles written by a specific author. We collect them as we will use them as source nodes in the personalized pageRank algorithm. We then run the algorithm and project as nodes only articles that are described by the keyword “entropy”. We also project the REFERENCES relationships between these articles.

We don’t have to filter out any relationships in the relationship statement as the following rule is enforced while projecting with cypher projection.

Relationships described in the relationship-statement will only be projected if both source and target nodes are described in the node-statement. Relationships that don’t have both source and target nodes described in the node-statement will be ignored.

Taken from documentation

Example recommendation

Recommendation of articles described by keyword “entropy” from the point of view of Jose C. Principe.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE"Jose C. Principe"
WITH collect(a) as articles
  'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword) 
   WHERE k.value contains "entropy" RETURN distinct id(a) as id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score


Recommendation of articles described by keyword “entropy” from the point of view of Hong Wang.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE"Hong Wang"
WITH collect(a) as articles
  'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword) 
   WHERE k.value contains "entropy" RETURN distinct id(a) as id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score



As expected we get different recommendations for the two authors as their point of view is different.

Neo4j is a powerful tool, but knowing when and which plugins to use in a specific use case, makes it an even greater tool.


[1] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp.990-998. [PDF] [Slides] [System] [API]



Using random walk algorithm to shuffle cards with Neo4j and graph algorithms

I found this cool article Group representations in probability and statistics by Persi Diaconis where he describes how to use random walk algorithm to shuffle cards. I decided to try it out and see if I can shuffle cards with Neo4j.

We will be simulating a simple 2 player game, where each players gets one card and whoever get the higher card wins.


Import dataset

We use the card dataset made available by Garrett Grolemund in the form of a github gistI love how you can find all sorts of useful data on github and import it directly to Neo4j with a single cypher query.

We run MERGE by both face and suit property of the card as any single property is not an unique identifier in our case. There are other options how to handle this scenario like creating a single unique identifier.

"" AS row
MERGE (c:Card{face:row.face,suit:row.suit})
SET c.value = toINT(row.value)

For now we have imported cards as nodes. We need to create a complete graph where all nodes are connected to each other as in our game any card can follow another.

We will treat the relationships as bidirectional in our graph as there is no semantic difference in the direction of the relationship in our case. We will achieve this with the filter WHERE id(c1) > id(c2) to avoid duplication and not specifying the direction of the relationship in the MERGE statement.

MATCH (c1:Card),(c2:Card)
WHERE id(c1) > id(c2)
MERGE (c1)-[:NEXT_CARD]-(c2)

Random walk algorithm

Random Walk is an algorithm that provides random paths in a graph.

A random walk means that we start at one node, choose a neighbor to navigate to at random or based on a provided probability distribution, and then do the same from that node, keeping the resulting path in a list. It’s similar to how a drunk person traverses a city.

Taken from the official Neo4j documentation.

Start with a random card and make 7 random walks from it. Use direction:BOTH as we treated our relationship NEXT_CARD as bidirectional during the import.

// Start with a random card
MATCH (card:Card)
WHERE toINT(rand() * 13) + 1 = toINT(card.value)
WITH card limit 1
// Run the random walk algorithm
CALL, 7, 1, 
{path: true, direction:'BOTH'})
YIELD nodeIds
UNWIND nodeIds AS nodeId
MATCH (node) WHERE id(node) = nodeId
RETURN node.face + " of " + node.suit as card



We return a sequence of 8 cards in our result. In the first game player 1 gets the nine of spades and player 2 gets the king of hearts. The way we defined our game player 2 wins the first game. In the second game player 1 gets the three of clubs and player 2 wins again with the queen of clubs and so on.

We can now use Neo4j to shuffle cards, which is really cool and can be expanded to other card games aswell 🙂



Approximation of betweenness centrality on Twitter dataset with Neo4j

Exact betweenness centrality is computationally intensive and isn’t practical for (near) real-time calculation on large graphs. Because of this approximation algorithms of betweenness centrality were developed to allow for a faster calculation.

We will use Twitter dataset made available by Stanford University to demonstrate how to approximate betweenness centrality with  Neo4j graph algorithms library.


First define the graph schema.


There are a total of 1,8 million relationships in the dataset, so we will import only a single egonet to keep the graph small for the example.

LOAD CSV FROM "file:///61086747.edges" as row fieldterminator " "
MERGE (u1:User{id:row[0]})
MERGE (u2:User{id:row[1]})
MERGE (u1)-[:FOLLOW]->(u2);

Betweenness Centrality

Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another.

Taken from documentation.

Run the original Brandes betweenness centrality for future reference.

YIELD nodeId, centrality
RETURN nodeId,centrality
ORDER BY centrality desc limit 10;


The RA-Brandes algorithm

RA-Brandes algorithm is different from original Brandes’ algorithm in only one main respect: Brandes’ algorithm considers all nodes within the graph, whereas RA-Brandes considers only a subset of nodes from the graph.

Taken from documentation.

There are two strategies for selecting the subset of nodes.

Default strategy is the “random” strategy where nodes are selected at random with defined probability of selection. If we set probability to one it behaves as exact betweenness centrality where all of the nodes are selected and considered in calculation of the centrality values.

Second strategy is the “dense” strategy where first the mean degree of nodes is calculated and then only nodes with degree higher than the mean are selected.

Run the RA-Brandes algorithm with parameter probability : 1 to validate above statement that the two algorithms are fundamentally the same and deliver the same results if all nodes are selected.

YIELD nodeId, centrality
RETURN nodeId,centrality
order by centrality desc limit 10;


Approximation of Betweenness Centrality

If we set the probability : 0.1 we will consider only one tenth of nodes at random in the calculation of betweenness centrality. You could experiment with lower values of probability and compare results.

 YIELD nodeId, centrality
 RETURN nodeId,centrality
 ORDER BY centrality DESC LIMIT 10;


Ego networks

Ego network has one “ego” node that is in the center of the network. Neighbours are connected to it with a relationship and are called “altar” nodes.


For more details on ego networks check out the following YT video.

First step in calculating betweenness centrality is running the shortest path algorithm. Parameter maxDepth limits the depth of traversal used by the shortest path algorithm. This effectively allows us to run betweenness centrality on ego networks where for each node we only load its neighbours or neighbours x hops away when calculating betweenness centrality.

YIELD nodeId, centrality
RETURN nodeId,centrality
ORDER BY centrality DESC LIMIT 10;



Now that we learned how to approximate betweenness centrality with Neo4j graph algorithms we can import the whole 1.8 million relationships into our graph and play around with different strategies and traversal limits to optimize the approximation of betweenness centrality.