Graph model of Facebook post reactions in Neo4j part 1

I was watching the Neo4j Online conference, conveniently named NODES 2019, and stumbled upon a great presentation titled Tuning Cypher by Andrew Bowman. I highly recommend checking it out. It inspired me to write my own blog post about Cypher query optimizations.

In this blog post, we will take a look at how to model Facebook posts and users’ reactions to them. As some of you might know, there are six different types of reactions available on Facebook today:

  • Like
  • Love
  • Haha
  • Wow
  • Sad
  • Angry

There are a couple of different ways to model this domain in Neo4j. We will compare Cypher query performance across three different graph models. The dataset will be generated synthetically. It contains 100 Facebook posts with 10000 reactions each, and the reactions are evenly split between all types. Our goal is to count, as fast as possible, how many reactions of each type every post has.

We will compare both the execution plans of the Cypher queries and the actual execution times. Cypher execution plans can be visualized using the PROFILE statement. For measuring the execution time we will use:

  • result_available_after (the time it took for the server to have the result available)
  • result_consumed_after (the time it took for the server to consume the result)

These two parameters are available in the summary of the response in the official Neo4j Python Bolt driver. The idea is to run each query 1000 times and calculate basic statistics about the execution time.
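To make this concrete, here is a minimal sketch of such a measurement loop; the connection URI, credentials, and warm-up call are placeholder assumptions, and the summary attribute names follow the 1.x Python driver (both are reported in milliseconds).

# Minimal sketch of the measurement loop (connection details are placeholders;
# with a 1.x driver the import may be "from neo4j.v1 import GraphDatabase")
import statistics
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
query = """
MATCH (p:Post)<-[rel]-()
RETURN p.id as id, rel.type as type, count(*) AS count
ORDER BY count DESC LIMIT 5
"""

timings = []
with driver.session() as session:
    session.run("CALL apoc.warmup.run()")       # warm up the page cache first
    for _ in range(1000):
        summary = session.run(query).summary()  # consumes the result
        timings.append(summary.result_available_after +
                       summary.result_consumed_after)

print("average: {:.1f}ms".format(statistics.mean(timings)))
print("min: {}ms, max: {}ms".format(min(timings), max(timings)))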

In this blog post I used:

  • Neo4j Enterprise 3.5.11
  • APOC 3.5.0.5
  • Neo4j bolt driver for python 1.6.3

All graph models are visualized with the Arrows tool.

Reaction type as an attribute of a relationship

Graph model:

attribute.png

In all three graph models, there will be nodes labeled User and Post. In the first graph model, we store the reaction type as an attribute of the relationship between a User and a Post.

Import

We use apoc.periodic.iterate() to speed up our import.

CALL apoc.periodic.iterate(
  "UNWIND range(0,100) as i
   return i
  ",
  "CREATE (post:Post{id:i})
   WITH i,post
   UNWIND range(0,10000) as j
   WITH i,j,post, CASE j%6 WHEN 0 THEN 'like'
                           WHEN 1 THEN 'love'
                           WHEN 2 THEN 'haha'
                           WHEN 3 THEN 'wow'
                           WHEN 4 THEN 'sad'
                           WHEN 5 THEN 'angry' END as reaction
   CREATE (post)<-[:REACTION{type:reaction}]-(:User)",
   {parallel:true})

Testing execution time

Before each test, apoc.warmup.run() was executed to quickly warm the database. As mentioned before, each query will be executed 1000 times to calculate the average response time and other metrics.

The first query is pretty straightforward. We visit all the nodes labeled Post in the graph and expand all their incoming relationships. The second step is to aggregate by the node’s “id” attribute and the relationship’s “type” attribute. We finish the query with ordering and a limit.

PROFILE
MATCH (p:Post)<-[rel]-()
RETURN p.id as id, 
       rel.type as type, 
       count(*) AS count
ORDER by count DESC
LIMIT 5

sm1.png

Results

Average response: 799.739ms
Max response: 1860ms
Min response: 768ms
25th percentile: 783.0ms
50th percentile: 787.0ms
75th percentile: 792.0ms
90th percentile: 806.0ms
95th percentile: 860.1ms
99th percentile: 1071.03ms

The execution plan reports 3030506 total db hits and it takes about 800ms on average to execute the query. We can definitely improve this.

The presentation by Andrew Bowman made me realize that if you want to aggregate and group by a node’s unique id, it is better to group by the node and only then return its unique id. This way we do the aggregations, ordering, and limit first and only then access the “id” attribute of the top five rows.

PROFILE
MATCH (p:Post)<-[rel]-()
WITH p, 
     rel.type as type, 
     count(*) AS count
ORDER by count DESC
LIMIT 5
RETURN p.id as id,type,count

attribute2.png

Results

Average response: 657.88ms
Max response: 908ms
Min response: 644ms
25th percentile: 650.0ms
50th percentile: 653.0ms
75th percentile: 657.0ms
90th percentile: 667.0ms
95th percentile: 687.0ms
99th percentile: 760.11ms

Just by grouping by the node, instead of by the node’s unique property, we saved a million db hits and are left with a total of 2020410 db hits. The average response time dropped down to 650ms.

Not much else could be optimized here, so let’s move to the next possible graph model.

Reaction type as a node label

Graph model

label.png

Here we introduce an intermediary node between the user and the post. The reaction type will be stored as a label of the intermediary node. This node has no other purpose; it will always have exactly one incoming and one outgoing relationship.

Import

CALL apoc.periodic.iterate(
  "UNWIND range(0,100) as i
   return i
  ",
  "CREATE (post:Post{id:i})
   WITH i,post
   UNWIND range(0,10000) as j
   WITH i,post, CASE j%6 WHEN 0 THEN 'like'
                         WHEN 1 THEN 'love'
                         WHEN 2 THEN 'haha'
                         WHEN 3 THEN 'wow'
                         WHEN 4 THEN 'sad'
                         WHEN 5 THEN 'angry' END as reaction
  // Create node with a dynamic label
   call apoc.create.node([reaction], {}) yield node
   CREATE (post)<-[:TO]-(node)
   CREATE (node)<-[:REACTION]-(:User)",
   {parallel:true})

Testing execution time

As we are only interested in counting reaction types for a given post, we don’t have to traverse all the way to the user. We start by matching all the posts in our graph, expand their incoming relationships and count the labels of the nodes on the other end of the relationships.

PROFILE
MATCH (p:Post)<--(reaction)
WITH p,
     labels(reaction) as labels,
     count(*) as count
ORDER BY count DESC 
LIMIT 5
RETURN p.id as id,labels[0] as type,count

label1.png

Results

Average response: 569.038ms
Max response: 1255ms
Min response: 548ms
25th percentile: 553.0ms
50th percentile: 556.0ms
75th percentile: 560.0ms
90th percentile: 570.1ms
95th percentile: 637.1ms
99th percentile: 903.31ms

Even though the total db hits did not change compared to the previous cypher query (2020410 db hits), the execution time improved by almost 100ms. This indicates that not all db hits are the same. It could be concluded that counting labels is faster than counting attributes, even though they both have the same cost in db hits.

Reaction type as a relationship type

Graph model

rel_type.png

As the title suggests, we will store the reaction type as the type of the relationship between a user and a post. To me, this one feels the most natural, but it always depends on what you want to achieve. There are no silver bullets when modeling graphs, and you should use whatever works best for your use case.

Import

CALL apoc.periodic.iterate(
  "UNWIND range(0,100) as i
   return i
  ",
  "CREATE (post:Post{id:i})
   WITH i,post
   UNWIND range(0,10000) as j
   CREATE (u:User)
   WITH i,post,u, CASE j%6 WHEN 0 THEN 'like'
                           WHEN 1 THEN 'love'
                           WHEN 2 THEN 'haha'
                           WHEN 3 THEN 'wow'
                           WHEN 4 THEN 'sad'
                           WHEN 5 THEN 'angry' END as reaction
  // Create relationship with a dynamic type
   CALL apoc.create.relationship(u, reaction, {}, post) YIELD rel
   RETURN count(*)",
  {parallel:true})

Testing execution time

The first query uses a naive approach. We match all the posts, iterate over their incoming relationships and count the types of those relationships. Nothing special here.

PROFILE
MATCH (p:Post)<-[rel]-()
WITH p,
     type(rel) as type,
     count(*) as count
ORDER BY count DESC
LIMIT 5
RETURN p.id as id,type,count

rel_type 1

Results

Average response: 629.604ms
Max response: 903ms
Min response: 607ms
25th percentile: 616.0ms
50th percentile: 628.0ms
75th percentile: 635.0ms
90th percentile: 648.1ms
95th percentile: 656.0ms
99th percentile: 698.07ms

When I first saw that we dropped down to 1010309 total db hits, I was expecting a faster execution time, but to my surprise, it was actually a bit slower than when we were counting labels. Definitely, not all db hits cost the same.

Access node’s degree store

There isn’t much documentation available about the degree store, but I found this knowledge base explanation. Basically, each node stores the count of its relationships, grouped by relationship type and direction, in its “degree store”. We can access it using the size() function.

PROFILE
MATCH (p:Post)
WITH p
UNWIND [{key:'like',count:size((p)<-[:like]-())},
        {key:'love',count:size((p)<-[:love]-())},
        {key:'haha',count:size((p)<-[:haha]-())},
        {key:'wow', count: size((p)<-[:wow]-())},
        {key:'sad', count: size((p)<-[:sad]-())},
        {key:'angry',count: size((p)<-[:angry]-())}] as k
WITH p,
     k.key as key,
     k.count as count
ORDER BY count DESC 
LIMIT 5
RETURN p.id as id,key,count

degree_store.png

Results

Average response: 0.998ms
Max response: 82ms
Min response: 0ms
25th percentile: 1.0ms
50th percentile: 1.0ms
75th percentile: 1.0ms
90th percentile: 1.0ms
95th percentile: 2.0ms
99th percentile: 2.0ms

Instead of iterating over all relationships and counting their types, we just iterate over nodes and check the degree value for each relationship type stored in the degree store of the node. We have dropped down to 1319 total db hits, which is 3 orders of magnitude better than before. Execution times are down to 1ms.

Predefining all the relationship types as in the above example feels a bit ugly and is something we want to avoid in our applications. To solve this, we can use the APOC node functions, specifically apoc.node.relationship.types() and apoc.node.degree.in().

MATCH (p:Post)
// Get all relationship types for a given node
UNWIND apoc.node.relationship.types(p) as type
WITH p, 
     type,
     // Get the incoming degree value for the relationship type
     apoc.node.degree.in(p, type) as count
ORDER BY count DESC
LIMIT 5
RETURN p.id as post, type, count

The Cypher profiler is not accurate when APOC procedures are involved, so we will skip it. Looking at the execution times, it appears that APOC has a bit of overhead in this scenario, as the average response time almost doubled to 2ms. This is still very nice. Having said that, APOC usually speeds things up, so I would recommend experimenting with it when optimizing your queries. This is just an example where we trade a bit of speed for a prettier Cypher query, as we don’t have to specify the relationship types in advance. And I am sure that if we optimized the APOC procedure for this exact scenario, we could get better execution times.

Average response: 1.869ms
Max response: 74ms
Min response: 1ms
25th percentile: 1.0ms
50th percentile: 2.0ms
75th percentile: 2.0ms
90th percentile: 3.0ms
95th percentile: 3.0ms
99th percentile: 4.0ms

Conclusion

We explored a few different graph models and compared execution times for a very simple counting task. In the next part, I will add the date attribute of the posts to the graph, and explore what happens when we want to do a simple date filter and order by. Stay tuned!

All the code is available on GitHub.

Network analysis of The Prisoner of Zenda book with SpaCy and Neo4j

Throughout the years, the internet has become a social network of sorts: connections and relationships are hiding in plain sight — waiting to be discovered. Due to the sheer volume of data, the relationships are hard to spot with the naked eye. However, that doesn’t mean they don’t exist or that it’s impossible to find them.

The idea for this blog post came when I was brainstorming with Jeremy Davies from Signalfish.io. They use SpaCy, Neo4j, and Golang in production to deliver curated news feeds. SpaCy features an entity recognition system. It provides a default NLP model which identifies a variety of named entities, including persons, organizations and more. Neo4j is a native graph database designed from the ground up to work with relationships and graphs.

I was hugely inspired by the Network of Thrones analysis. If I remember correctly, Andrew Beveridge used the distance between persons in the text of the Game of Thrones books to infer relationships between them. I decided to do something similar using SpaCy and Neo4j.

Requirements

Agenda

  • Preprocess text
  • Extract relationships from text
  • Import into Neo4j
  • Run Louvain and PageRank algorithms
  • Visualize the results

Data

I was looking for books that are out of copyright, so that everyone can follow this tutorial. Along the way, I found Project Gutenberg, which provides free books whose copyright has mostly expired. We will be using The Prisoner of Zenda, written by Anthony Hope.

Graph Model

The graph model consists of nodes with the label Person. Each person can have one or more RELATED relationships to other persons. Each relationship has a score attribute, which in our case represents the number of interactions the pair of persons had in the text. Another important thing to note is that we will treat the relationships as undirected. Check this article for more details.

Graph model

Preprocessing

Project Gutenberg is kind enough to provide a text version of The Prisoner of Zenda. We will fetch the text file, remove special characters, and split the text into chapters.

# https://www.gutenberg.org/ebooks/95 The Prisoner of Zenda
import re
import urllib.request

# Fetch the data
target_url = 'https://www.gutenberg.org/files/95/95-0.txt'
data = urllib.request.urlopen(target_url)
raw_data = data.read().decode('utf8').strip()
# Preprocess text into chapters
chapters = re.sub('[^A-Za-z0-9 -]', ' ', raw_data).split('CHAPTER')[1:]
chapters[-1] = chapters[-1].split('End of the Project Gutenberg EBook')[0]

We are ready to run SpaCy’s named entity recognition. To simplify our analysis, we will use only the first chapter of the book. Otherwise, we would need to come up with some name mapping system, as sometimes the full name of a person is used and sometimes only part of it. This requires a more manual approach, which we will avoid here.

We will also replace the names of persons with a single-word id in the text to simplify our algorithm.

# import spacy and load a NLP model
import spacy
nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser"])
# Analyze the first chapter
c = chapters[0]
# Get a list of persons 
doc=nlp(c)
involved = list(set([ent.text for ent 
            in doc.ents if ent.label_=='PERSON']))
# replace names of involved in the text
# with an id and save the mapping
decode = dict()
for i,x in enumerate(involved):
    # Get mapping    
    decode['$${}$$'.format(i)] = x
    # Preprocess text
    c = c.replace(x,' $${}$$ '.format(i))

This is a very simple demonstration of SpaCy in action. We only analyzed a single chunk of text. If you want to analyze bigger sets of data, you should probably use the nlp.pipe(texts) function. Read more in the documentation.
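As a rough illustration (not part of the original analysis), streaming every chapter through nlp.pipe could look like this; the batch_size value is an arbitrary example.

# Stream all chapters through the pipeline instead of calling nlp() per chapter
people_per_chapter = []
for doc in nlp.pipe(chapters, batch_size=20):
    # Collect the distinct PERSON entities found in each chapter
    people_per_chapter.append({ent.text for ent in doc.ents
                               if ent.label_ == 'PERSON'})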

Here is the Cypher query that stores relationships according to the graph model described above.

# Connect to Neo4j (the URI and credentials are placeholders; with the 1.x
# driver the import is "from neo4j.v1 import GraphDatabase" instead)
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
session = driver.session()

save_query ="""
    MERGE (p1:Person{name:$name1})
    MERGE (p2:Person{name:$name2})
    MERGE (p1)-[r:RELATED]-(p2)
    ON CREATE SET r.score = 1
    ON MATCH SET r.score = r.score + 1"""

The algorithm is simple. We infer a relationship between a pair of persons if they appear less than 14 words apart in the text. The number of such occurrences between the pair is stored as an attribute of the relationship.

The algorithm iterates through the words searching for persons. When it finds one (remember the single-word id system), it looks ahead within the 14-word window for any other person.

# Get an array of words
ws = c.split()
l = len(ws) 
# Iterate through words
for wi,w in enumerate(ws):
    # Skip if the word is not a person
    if not w[:2] == '$$':
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1,wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if not ws[i][:2] == '$$':
            continue
        # Store to Neo4j
        params = {'name1':decode[ws[wi]],'name2':decode[ws[i]]}
        session.run(save_query, params)
        print(decode[ws[wi]],decode[ws[i]])

Graph Algorithms

We will run the PageRank and Louvain algorithms in our analysis. PageRank is a centrality measure that is commonly used to represent the importance of a node in the graph. The Louvain algorithm is a community detection algorithm that finds communities within our graph.

The Neo4j graph algorithms engine distinguishes between directed and undirected graphs. We have to use the parameter direction: ’BOTH’ to tell the engine to treat the relationships as undirected.

pagerank ="""
CALL algo.pageRank('Person','RELATED',{direction:'BOTH'})
"""
louvain = """
CALL algo.louvain('Person','RELATED',{direction:'BOTH'})
"""
with driver.session() as session:
    session.run(pagerank)
    session.run(louvain)

Visualizations

I think that for small networks a picture is worth more than 1000 words. We will use NeovisJS to visualize the results of our analysis. The code for the visualization was shamelessly copied from the Neo4j graph notebooks repository made by Mark Needham.

NeovisJS config

We need to provide a config file to the function that generates NeovisJS visualizations. It consists of connection parameters for Neo4j (host, user, password) and three parameters to define the visualization.

Visualization parameters:

  • cypher: Cypher query which defines the (sub)graph to fetch from Neo4j
  • labels_json: Define the visualization of nodes (caption, size, and color)
  • relationships_json: Define the visualization of relationships (caption, size)

cypher = "MATCH (p1:Person)-[r:RELATED]->(p2:Person) RETURN *"
labels_json = {
    "Person": {
        "caption": "name",
        "size": "pagerank",
        "community": "community"
    }
}
relationships_json = {
    "RELATED": {
        "thickness": "score",
        "caption": False
    }
}
generate_vis(host, user, password, cypher, labels_json, relationships_json)

Result

Rudolf seems to be the main character in the first chapter. Another important character is Burlesdon. Interestingly, he doesn’t have any relationship with Rudolf. In the right corner is Rudolf the Third. I don’t know if this is the same person as Rudolf, as I haven’t read the book. This is an example where some manual name cleaning would be helpful. We can find four communities in our graph, all of a similar size.

Data enrichment

That was just the beginning of what we can achieve. We can add other types of labels that SpaCy finds in our text. Let’s, for example, add organizations to our graph. We simply look at the organizations in our text and check the previous and next five words for persons. The number five is chosen completely arbitrarily; you can use whatever number works best for you.

Preprocess the data

c = chapters[0]
doc = nlp(c)

# Define the mapping
persons = list(set([ent.text for ent in doc.ents if ent.label_=='PERSON']))
orgs = list(set([ent.text for ent in doc.ents if ent.label_=='ORG']))
decode_org = dict()
decode_person = dict()

# Replace person with an id
for i,p in enumerate(persons):
    decode_person['$${}$$'.format(i)] = p
    c = c.replace(p,' $${}$$ '.format(i))

# Replace organizations with an id
for i,o in enumerate(orgs):
    decode_org['&&{}&&'.format(i)] = o
    c = c.replace(o,' &&{}&& '.format(i))

We will use a similar cypher query as before, only now we are storing relationships between persons and organizations.

save_org_query = """MERGE (p:Person{name:$person})
MERGE (o:Organization{name:$org})
MERGE (p)-[r:PART_OF]->(o)
ON CREATE SET r.score = 1
ON MATCH SET r.score = r.score + 1"""

Run the algorithm and store the results to Neo4j

ws = c.split()
l = len(ws)
for wi,w in enumerate(ws):
    # Skip if the word is not a organization
    if not w[:2] == '&&':
        continue
    # Check previous and next x words for any involved person
    x = 5
    for i in range(wi-x,wi+x):
        # Avoid list index errors (a negative index would wrap around)
        if i < 0:
            continue
        if i >= l:
            break
        # Skip if the word is not a person
        if (ws[i][:2]!='$$') or (i==wi):
            continue
        # Store to Neo4j
        params = {'org':decode_org[ws[wi]],'person':decode_person[ws[i]]}
        session.run(save_org_query, params)
        print(decode_org[ws[wi]],decode_person[ws[i]])

The results are not that interesting for organizations, but with this technique you can store all the different types of labels your favorite NLP engine returns. This allows you to build a knowledge graph representing relationships between entities in any given text. We could have also used a different similarity metric, like the Jaccard index or Overlap similarity, to infer relationships between entities.
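For example, here is a hypothetical sketch of scoring a pair of persons with the Jaccard index of the chapters they appear in (the appearances mapping below is an assumed example, not something built earlier in this post):

# Jaccard index of two sets: |intersection| / |union|
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical input: for each person, the set of chapter indices they appear in
appearances = {'Rudolf': {0, 1, 2, 3}, 'Antoinette': {1, 2}}
print(jaccard(appearances['Rudolf'], appearances['Antoinette']))  # prints 0.5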

P.S. You can find the IPython notebook with all the code here.

If you are interested in more network analysis I recommend:

Depth-first search analysis on London Tube network with Neo4j graph algorithms

The Depth-first search algorithm is an algorithm used to search or traverse a graph. It is the basis for many other graph algorithms, but was not directly exposed in the Neo4j graph algorithms library until recently.

Here is a visual representation of the order in which the Depth-first search algorithm traverses a graph.

390px-depth-first-tree.svg_

Image source: Wikipedia
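To make the visit order concrete, here is a small illustrative Python sketch of a depth-first traversal over a toy adjacency list. It is purely didactic and not how the Neo4j procedure is implemented.

# Recursive depth-first traversal; prints nodes in the order they are visited
def dfs(graph, node, visited=None):
    if visited is None:
        visited = set()
    visited.add(node)
    print(node)
    for neighbour in graph.get(node, []):
        if neighbour not in visited:
            dfs(graph, neighbour, visited)

toy_graph = {1: [2, 5], 2: [3, 4], 5: [6]}
dfs(toy_graph, 1)  # visits 1, 2, 3, 4, 5, 6 - one branch is exhausted before the next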

Graph schema

The graph has a very simple schema. It consists of nodes labeled Station that are connected to other stations with relationships of type CONNECTION.

We will treat the relationships in our graph as undirected or bidirectional.

schema

Define the constraint and index to optimize all further queries.

CREATE CONSTRAINT ON (s:Station) ASSERT s.id is unique;

CREATE INDEX ON :Station(name);

Import

We will use the London tube dataset made available by Nicola Greco.

Import all nodes (stations with their metadata)

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row 
MERGE (s:Station{id:row.id}) 
SET s += apoc.map.clean(row,['id'],[])

Import all relationships (lines between stations)

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row 
MATCH (s1:Station{id:row.station1}),(s2:Station{id:row.station2}) 
MERGE (s1)-[:CONNECTION{time:toInteger(row.time),line:row.line}]->(s2)

Analysis

We will focus on the Picadilly Circus station in this demonstration. It will be used as the starting node for the Depth-first search algorithm.

Let’s start with visualizing the Picadilly Circus neighbor stations in Neo4j Browser.

MATCH p=(s:Station{name:"Picadilly Circus"})-[*..3]-()
RETURN p

Results

picadilly

Depth-first search algorithm

The Depth-first search algorithm in Neo4j supports three different exit strategies that stop the algorithm from further traversal through the graph. This is to avoid unnecessary traversal of the whole graph.

To treat relationships as undirected, we use the value ‘BOTH’ for the direction parameter.

Max depth strategy

As the name suggests, the algorithm stops traversing the graph once it reaches the specified maximum depth.

MATCH (s:Station{name:'Picadilly Circus'})
WITH id(s) as start_id 
CALL algo.dfs.stream('Station', 'CONNECTION', 'BOTH', start_id, 
  {maxDepth:2})
YIELD nodeIds
UNWIND nodeIds as nodeId
RETURN algo.asNode(nodeId).name as station

Results

station
“Picadilly Circus”
“Charing Cross”
“Embankment”
“Oxford Circus”
“Regent’s Park”
“Tottenham Court Road”
“Warren Street”
“Bond Street”
“Leicester Square”
“Covent Garden”
“Green Park”
“Hyde Park Corner”
“Victoria”


Target nodes strategy

The algorithm keeps on traversing the graph until it reaches any of the target nodes specified.

The targetNodes parameter expects Neo4j internal ids of the nodes to be provided.

MATCH (start:Station{name:'Picadilly Circus'}),
      (target:Station{name:'Canada Water'}) 
CALL algo.dfs.stream('Station', 'CONNECTION', 'BOTH', id(start), 
  {targetNodes:[id(target)]}) YIELD nodeIds 
UNWIND nodeIds as nodeId 
RETURN algo.asNode(nodeId).name as station

Results

station
“Picadilly Circus”
“Charing Cross”
“Embankment”
“Waterloo”
“Lambeth North”
“Elephant & Castle”
“Borough”
“London Bridge”
“Bermondsey”
“Canada Water”

Max cost strategy

The max cost strategy expects relationships to be weighted. It finds all the nodes in the graph whose distance from the starting node is less than or equal to the maxCost parameter.

In this case, we use the time attribute of relationships as weights.

MATCH (start:Station{name:'Picadilly Circus'})
CALL algo.dfs.stream('Station', 'CONNECTION', 'BOTH', id(start), 
  {maxCost:3, weightProperty:'time'})
YIELD nodeIds
UNWIND nodeIds as nodeId
RETURN algo.asNode(nodeId).name as station

Results

station
“Picadilly Circus”
“Charing Cross”
“Embankment”
“Oxford Circus”
“Bond Street”
“Leicester Square”
“Covent Garden”
“Green Park”

Conclusion

The Depth-first search algorithm is a nice addition to the Neo4j graph algorithms library and can be used in many scenarios. We can also use more than one exit strategy at once if we want to.


Eigenvector graph centrality analysis on Subreddit hyperlink network with Neo4j and Cypher projection

New algorithms are regularly being added to the Neo4j graph algorithms library, and the Eigenvector centrality algorithm is one of them. To demonstrate its usage, we will be using the Reddit Hyperlink Network dataset made available by the Stanford Network Analysis Project. Along the way, we will also take a deeper dive into how the projected graph model of the graph algorithms library works, so you will be able to take full advantage of it in your own analysis.

Graph Schema

eigen(2)

Our graph consists of nodes labeled Subreddit that can have one or more relationships LINK to other Subreddits. The relationships contain information about the date, link sentiment, and id of the post that was linked. Link sentiment has two distinct values, 1 for positive and -1 for negative.

There can be more than one LINK relationship between a specific pair of nodes, which makes our graph what is known in mathematics as a multigraph.

Unique constraints

Given our graph schema, we define a unique constraint.

CREATE CONSTRAINT ON (s:Subreddit) ASSERT s.id IS UNIQUE;

Import

The first step is to download the Subreddit dataset and copy it into the $Neo4j/import folder. Then, we can proceed with importing the data into Neo4j with the following cypher query.

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///soc-redditHyperlinks-title.tsv" as row FIELDTERMINATOR "\t"
MERGE (s:Subreddit{id:row.SOURCE_SUBREDDIT})
MERGE (t:Subreddit{id:row.TARGET_SUBREDDIT})
CREATE (s)-[:LINK{post_id:row.POST_ID,
   link_sentiment:toInteger(row.LINK_SENTIMENT),
   date:localDateTime(replace(row['TIMESTAMP'],' ','T'))}]->(t)

Analysis

We will start by running a few exploratory cypher queries to get to know the dataset better.

Links by year

Get the count of links by year.

MATCH ()-[r:LINK]->() 
RETURN r.date.year as year,count(*) as count
ORDER BY year

Results

year count
2013 27
2014 128345
2015 175096
2016 194287
2017 74172

The data contains links extracted between January 2014 and April 2017. This explains the massive drop of links in the year 2017.

We can assume that Reddit’s popularity grew steadily, as people were linking more and more posts over the years.

Links by posts

Let’s check if there are any posts in the graph that could be important and have been linked many times.

MATCH (s:Subreddit)-[r:LINK]->()
RETURN r.post_id as post,count(*) as count
ORDER BY count DESC LIMIT 10

Results

post count
“3na5zus” 2
“4adtxws” 2
“3oz3jos” 2
“2m4lpis” 2
“35ff4ss” 2
“4oz396s” 1
“3n74jfs” 1
“2ck8nrs” 1
“3s49g8s” 1
“5iq6ats” 1

As we can observe, there are no posts with many links from or to other Subreddits. We will ignore posts in our further analysis.

P.S. If posts were more important in our graph, I would probably use a different graph model.

Eigenvector centrality

Eigenvector centrality is a measure of importance or influence of a node in a graph. Similarly to PageRank, it takes into account the number of neighbors a node has combined with their influence. A high eigenvector centrality means that a node has relationships to many other nodes with high centrality.
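To build some intuition first, here is a toy power-iteration sketch in Python. It is purely illustrative; the procedure in the graph algorithms library works on the projected graph, not on an explicit matrix like this.

# Eigenvector centrality via power iteration on a small undirected adjacency matrix
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

score = np.ones(A.shape[0])
for _ in range(100):
    score = A @ score                      # neighbours pass on their influence
    score = score / np.linalg.norm(score)  # renormalize so the vector converges

print(score)  # node 2, which is connected to every other node, scores highest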

We will begin by running the Eigenvector centrality algorithm on the whole network.

Multigraph

Our graph can have more than one relationship between a specific pair of nodes. If you carefully read the projected graph model documentation of the graph algorithms you can find:

The projected graph model does not support multiple relationships between a single pair of nodes.

In practice, this means that the projected graph model will load only one relationship per pair of nodes, even though we might be dealing with a multigraph. If we want to consider the number of relationships between a pair of nodes in our algorithms, the best way is to use a weighted algorithm such as weighted PageRank and provide the number of relationships as the weight. This can easily be achieved using cypher projection, but we’ll get to that later.

Run eigenvector centrality algorithm.

CALL algo.eigenvector.stream('Subreddit','LINK')
YIELD nodeId,score
RETURN algo.getNodeById(nodeId).id as subreddit,score
ORDER BY score DESC LIMIT 10

Results

subreddit score
“askreddit” 3223.8067184928345
“iama” 2342.020188492834
“worldnews” 2145.6238984928345
“pics” 1820.1818884928337
“news” 1689.625108492834
“videos” 1677.9684684928343
“todayilearned” 1526.197358492834
“politics” 1508.5808184928342
“funny” 1507.3354784928338
“the_donald” 1039.4189084928341

Askreddit ranks first in Eigenvector centrality, followed by IAMA and worldnews. I was intrigued by the_donald subreddit in tenth place, so I checked it out and found that it is a Subreddit dedicated to Donald Trump.

Cypher projection

Cypher projection is the second option for loading projected graphs. Instead of defining the labels and relationship types of the graph we want to analyze, we use one cypher statement to define the ids of the nodes the algorithm should consider and a second cypher statement to define the relationships we want to take into account. Like the label and relationship-type projection, it does not support projecting multiple relationships between a pair of nodes.

Cypher projection can also be used to project virtual graphs; check my previous blog post for an example.

I mentioned before that when analyzing multigraphs, we can use the number of relationships between a pair of nodes as a relationship weight. Here is an example of how to accomplish this with weighted PageRank.

CALL algo.pageRank.stream(
  // Node statement
  'MATCH (s:Subreddit) RETURN id(s) as id',
  // Relationship statement
  'MATCH (s:Subreddit)-[:LINK]->(t:Subreddit)
   RETURN id(s) as source, id(t) as target, count(*) as weight',
  {graph:'cypher',weightProperty:'weight'})
YIELD nodeId,score
RETURN algo.getNodeById(nodeId).id as subreddit, score
ORDER BY score DESC LIMIT 10

Results

subreddit score
“iama” 560.0028400000001
“askreddit” 515.7935835
“pics” 286.43782
“funny” 271.5604145
“videos” 221.54803800000002
“the_donald” 186.77218349999998
“gaming” 168.99667750000003
“music” 155.40514349999998
“todayilearned” 152.09683300000003
“worldnews” 150.478909

Eigenvector centrality by link sentiment

Although the Eigenvector centrality algorithm does not support the weighted graph type (yet?), we can explore the possibilities of combining Eigenvector centrality with cypher projection further.

We will calculate two Eigenvector centrality scores for all nodes. The first score will consider only relationships with positive link sentiment, and the second score only relationships with negative link sentiment.

UNWIND [-1, 1] as sentiment
CALL algo.eigenvector.stream(
  'MATCH (s:Subreddit) return id(s) as id',
  'MATCH (s:Subreddit)-[r:LINK]->(t:Subreddit)
   // Use parameter
   WHERE r.link_sentiment = $sentiment
   // Deduplicate relationships
   WITH id(s) as source,id(t) as target,count(*) as count
   RETURN source,target',
  {graph:'cypher', params:{sentiment:sentiment}})
YIELD nodeId,score
WITH sentiment,algo.getNodeById(nodeId).id as id,score
ORDER BY score DESC
RETURN sentiment,collect(id)[..5] as top5

Results

sentiment top5
1 [“iama”, “askreddit”, “pics”, “funny”, “videos”]
-1 [“askreddit”, “pics”, “worldnews”, “videos”, “funny”]

It is interesting to see that the top five Subreddits for both positive and negative link sentiment subgraphs have four common members.

The IAMA Subreddit, on the other hand, shows up only on the positive list, where it ranks first.

Eigenvector centrality by link sentiment and year

We can get even more creative and calculate Eigenvector centrality by link sentiment through the years.

We will use similar logic for storing the results as in the Neo4j Categorical PageRank post. For each year we create a node labeled Year and connect Subreddits to it with a relationship that has the Eigenvector centrality score stored as an attribute.

Store results

// Get year and sentiment
UNWIND [-1, 1] as sentiment
UNWIND range(2014,2017) as year
CALL algo.eigenvector.stream(
  'MATCH (s:Subreddit) return id(s) as id',
  'MATCH (s:Subreddit)-[r:LINK]->(t:Subreddit)
  // Use parameters 
   WHERE r.link_sentiment = $sentiment AND r.date.year = $year
   // Deduplicate relationships
   WITH id(s) as source,id(t) as target,count(*) as count
   RETURN source,target',
   {graph:'cypher', params:{sentiment:sentiment,year:year}})
YIELD nodeId,score
WITH sentiment,year,algo.getNodeById(nodeId) as node,score
// Filter out very low eigenvector centrality scores
// before storing results
WHERE score > 1
MERGE (y:Year{value:year})
MERGE (node)-[e:EIGEN{sentiment:sentiment}]->(y)
ON CREATE SET e.score = score

Find top 5 Subreddits per year/sentiment

MATCH (sub:Subreddit)-[e:EIGEN]->(year)
WITH year,e.sentiment as sentiment,e.score as score,sub
ORDER BY score DESC
// One way of limiting by row is by using collect()[..x]
RETURN year.value as year,
       sentiment,
       collect(sub['id'])[..5] as top_5
ORDER BY year,sentiment

Results

year sentiment top_5
2014 -1 [“askreddit”, “adviceanimals”, “iama”, “todayilearned”, “videos”]
2014 1 [“iama”, “askreddit”, “pics”, “videos”, “funny”]
2015 -1 [“askreddit”, “worldnews”, “videos”, “news”, “pics”]
2015 1 [“iama”, “askreddit”, “videos”, “pics”, “funny”]
2016 -1 [“askreddit”, “the_donald”, “worldnews”, “politics”, “news”]
2016 1 [“iama”, “askreddit”, “videos”, “pics”, “funny”]
2017 -1 [“askreddit”, “the_donald”, “politics”, “worldnews”, “news”]
2017 1 [“askreddit”, “iama”, “the_donald”, “videos”, “pics”]

If we follow the IAMA Subreddit through the years, we can observe that in 2014 it ranked within the top five in both the negative and the positive subgraph. From then on something changed for the better, as it shows up only on the positive list.

Follow a single Subreddit through years

For the last part of the analysis, we will track a single Subreddit through the years by sentiment.

MATCH (s:Subreddit{id:'funny'})-[e:EIGEN]->(y:Year) 
RETURN s.id as subreddit, y.value as year,
  sum(CASE WHEN e.sentiment = 1 THEN e.score END) as positive_sentiment,
  sum(CASE WHEN e.sentiment = -1 THEN e.score END) as negative_sentiment
ORDER BY year

Results

subreddit year positive_sentiment negative_sentiment
“funny” 2014 111.95422849283402 34.94747849283404
“funny” 2015 127.09759849283402 37.86008849283403
“funny” 2016 139.10600849283404 33.24432849283402
“funny” 2017 83.16026849283402 17.705438492834027

Conclusion

While this blog post was intended to introduce the Eigenvector centrality algorithm in the Neo4j graph algorithms library, I hope you also learned how to analyze multigraphs along the way.


Community detection of survey responses based on Pearson correlation coefficient with Neo4j

Just a few days ago, a new version of the Neo4j graph algorithms plugin was released. With the new release come new algorithms, and the Pearson correlation algorithm is one of them.

To demonstrate how to use the Pearson correlation algorithm in Neo4j, we will use the data from the “Young People Survey” dataset made available by Miroslav Sabo on Kaggle.

It contains the results of 1010 filled-out surveys, with questions ranging from music preferences and hobbies & interests to phobias.

After importing and preprocessing the data, we will run Louvain, a community detection algorithm, on the similarity network inferred from the Pearson correlation between responses.

Import

Download the dataset and copy it into the $Neo4j/import folder. Each row in the responses.csv file represents a single survey with 150 questions filled out. We store each survey in Neo4j as a single node.

LOAD CSV WITH HEADERS FROM "file:///responses.csv" as row
CREATE (p:Person)
SET p += row

Preprocessing

Most of the answers range from one to five, where five is defined as “Strongly agree” and one as “Strongly disagree”. They appear as strings in the CSV file, so we have to convert them to integers first.

MATCH (p:Person)
UNWIND keys(p) as key
WITH p,key where not key in ['Gender',
                'Left - right handed',
                'Lying','Alcohol',
                'Education','Smoking',
                'House - block of flats',
                'Village - town','Punctuality',
                'Internet usage']
CALL apoc.create.setProperty(p, key, toInteger(p[key])) YIELD node
RETURN distinct 'done'

Category properties

Some of the answers are categorical. An example is the alcohol question, where possible answers are never, social drinker and drink a lot.

As we would like to convert some of them to vectors, let’s examine all the possible answers they have.

MATCH (p:Person)
UNWIND ['Gender',
        'Left - right handed',
        'Lying','Alcohol',
        'Education','Smoking',
        'House - block of flats',
        'Village - town','Punctuality',
        'Internet usage'] as property
RETURN property,collect(distinct(p[property])) as unique_values

Results

property unique_values
“Gender” [“female”, “male”]
“Left – right handed” [“left handed”, “right handed”]
“Lying” [“sometimes”, “everytime it suits me”, “only to avoid hurting someone”, “never”]
“Alcohol” [“social drinker”, “never”, “drink a lot”]
“Education” [“secondary school”, “primary school”, “college/bachelor degree”, “masters degree”, “doctorate degree”, “currently a primary school pupil”]
“Smoking” [“current smoker”, “tried smoking”, “never smoked”, “former smoker”]
“House – block of flats” [“block of flats”, “house/bungalow”]
“Village – town” [“city”, “village”]
“Punctuality” [“i am often early”, “i am often running late”, “i am always on time”]
“Internet usage” [“less than an hour a day”, “few hours a day”, “most of the day”, “no time at all”]

Let’s vectorize the gender, internet, and alcohol answers. We will scale them between one and five to match the range of the integer answers.

Gender encoding

MATCH (p:Person)
WITH p, CASE p['Gender'] WHEN 'female' THEN 1
                         WHEN 'male' THEN 5
                         ELSE 3
                         END as gender
SET p.Gender_vec = gender

Internet encoding

MATCH (p:Person)
WITH p, CASE p['Internet usage'] WHEN 'no time at all' THEN 1
                        WHEN 'less than an hour a day' THEN 2
                        WHEN 'few hours a day' THEN 4
                        WHEN 'most of the day' THEN 5 
                        END as internet
SET p.Internet_vec = internet

Alcohol encoding

MATCH (p:Person)
WITH p, CASE p['Alcohol'] WHEN 'never' THEN 1
                          WHEN 'social drinker' THEN 3
                          WHEN 'drink a lot' THEN 5
                          ELSE 3 END as alcohol
SET p.Alcohol_vec = alcohol

Dimensionality reduction

There are 150 answers in our dataset that we could use as features. This is a great opportunity to perform some basic dimensionality reduction of the features.

I came across an article about dimensionality reduction techniques written by Pulkit Sharma. It describes twelve dimensionality reduction techniques, and in this post, we will use the first two, which are the low variance filter and the high correlation filter.

Low variance filter

Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable will have zero variance.

source

We will use the standard deviation metric, which is just the square root of the variance.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage']) as all_keys
UNWIND all_keys as key
MATCH (p:Person)
RETURN key,avg(p[key]) as average,stdev(p[key]) as std 
ORDER BY std ASC LIMIT 10

Results

key average std
“Personality” 3.2922465208747513 0.6434356809234291
“Music” 4.7318768619662395 0.6640489340478045
“Dreams” 3.297029702970299 0.6831477667880559
“Movies” 4.613545816733064 0.694699901420266
“Fun with friends” 4.557654075546722 0.7371830636089882
“Comedy” 4.494538232373384 0.7797894145803115
“Internet_vec” 3.838613861386139 0.8213540389444354
“Happiness in life” 3.7057654075546718 0.8243233683199777
“Slow songs or fast songs” 3.328373015873016 0.8339307217064151
“Parents’ advice” 3.265873015873013 0.8657364119644098

We can observe that everybody likes to listen to music, watch movies and have fun with friends.

Due to the low variance, we will eliminate the following questions from our further analysis:

  • “Personality”
  • “Music”
  • “Dreams”
  • “Movies”
  • “Fun with friends”
  • “Comedy”

High correlation filter

High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance).

source

We will use the Pearson correlation coefficient for this task. Pearson correlation adjusts for different location and scale of features, so any kind of linear scaling (normalization) is unnecessary.
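To see what that invariance means in practice, here is a tiny numpy check (an illustrative aside, not part of the original analysis):

# Pearson correlation is unchanged by shifting and scaling a feature
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 4, 3, 5], dtype=float)

r_original = np.corrcoef(x, y)[0, 1]
r_rescaled = np.corrcoef(10 * x + 3, y)[0, 1]  # linearly rescale x
print(r_original, r_rescaled)                  # both print the same value, 0.8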

Find the top 10 correlations for the gender feature.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND ['Gender_vec'] as key_1
UNWIND all_keys as key_2
WITH key_1,key_2, collect(coalesce(p1[key_1],0)) as vector_1,collect(coalesce(p1[key_2] ,0)) as vector_2
WHERE key_1 <> key_2
RETURN key_1,key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC limit 10

Results

key_1 key_2 pearson
“Gender_vec” “Weight” 0.541795647440021
“Gender_vec” “PC” 0.4595381175639033
“Gender_vec” “Cars” 0.43821572092706285
“Gender_vec” “Action” 0.4093180552569303
“Gender_vec” “War” 0.40744466090777826
“Gender_vec” “Science and technology” 0.3575550988826724
“Gender_vec” “Western” 0.3482424112983126
“Gender_vec” “Sci-fi” 0.3092600003234222
“Gender_vec” “Physics” 0.3051120080067347
“Gender_vec” “Height” 0.28050171932024015

The feature most correlated with gender is weight, which makes sense. The list includes some other stereotypical gender differences, like the preference for cars, action, and PC.

Let’s now calculate the Pearson correlation between all the features.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key_1
UNWIND all_keys as key_2
WITH key_1,key_2,p1
WHERE key_1 > key_2
WITH key_1,key_2, collect(coalesce(p1[key_1],0)) as vector_1,collect(coalesce(p1[key_2],0)) as vector_2
RETURN key_1,key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC limit 10

Results

key_1 key_2 pearson
“Medicine” “Biology” 0.6751690175278219
“Chemistry” “Biology” 0.6580361718554997
“Fantasy/Fairy tales” “Animated” 0.6508308637290211
“Shopping centres” “Shopping” 0.6443774884976909
“Medicine” “Chemistry” 0.6119966796637772
“Physics” “Mathematics” 0.5870842251467656
“Opera” “Classical music” 0.5809496903367943
“Snakes” “Rats” 0.5681984607930817
“Weight” “Gender_vec” 0.541795647440021
“Punk” “Metal or Hardrock” 0.5415133015568586

The results show nothing surprising. The only pair I found interesting was the correlation between snakes and rats.

We will exclude the following questions due to high correlation from further analysis:

  • “Medicine”
  • “Chemistry”
  • “Shopping centres”
  • “Physics”
  • “Opera”

Pearson similarity algorithm

Now that we have completed the preprocessing step, we will infer a similarity network between nodes based on the Pearson correlation of the features (answers) that we haven’t excluded.

In this step, all the features we use in our analysis need to be normalized to the same one-to-five range, because we will now fit all the features of a node into a single vector and calculate correlations between these vectors.

Min-max normalization

Three of the features are not normalized between one and five. These are:

  • ‘Height’
  • “Number of siblings”
  • ‘Weight’

Normalize the height property between one and five.

MATCH (p:Person)
//get the max and min value
WITH max(p.Height) as max,min(p.Height) as min
MATCH (p1:Person)
//scale the value between one and five
SET p1.Height_nor = 1 + 4.0 * (p1.Height - min) / (max - min)

Similarity network

Infer the similarity network based on the Pearson correlation of the selected features. We use similarityCutoff: 0.75 and topK: 5. Find more information in the documentation.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Height',"Number of siblings",'Weight','Medicine', 'Chemistry', 'Shopping centres', 'Physics', 'Opera', 'Comedy']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
WITH {item:id(p1), weights: collect(p1[key])} as personData
WITH collect(personData) as data
CALL algo.similarity.pearson(data, {similarityCutoff: 0.75,topK:5,write:true})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs

Results

nodes similarityPairs
1010 3649

Community detection

Now that we have inferred a similarity network in our graph, we will try to find communities of similar persons with the help of Louvain algorithm.

CALL algo.louvain('Person','SIMILAR')
YIELD nodes,communityCount
RETURN nodes,communityCount

Results

nodes communityCount
1010 197

Apoc.nodes.group

For a quick overview of the community detection results in Neo4j Browser, we can use apoc.nodes.group. We define the labels we want to include and group by a certain property. In the config part, we define which aggregations we want to perform and have returned in the visualization. Find more in the documentation.

CALL apoc.nodes.group(['Person'],['community'], 
[{`*`:'count', Age:['avg','std'],Alcohol_vec:['avg']}, {`*`:'count'} ])
YIELD nodes, relationships
UNWIND nodes as node 
UNWIND relationships as rel
RETURN node, rel;

Results

gooasd.gif

Each node represents a single community and the caption of the node represents the count of members.

Community preferences

To get to know our communities better, we will examine their top and bottom three average preferences. We will explore only the biggest community, but feel free to play around on your own.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Height','Number of siblings','Weight','Medicine', 'Chemistry', 'Shopping centres', 'Physics', 'Opera','Age','community','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
WITH p1.community as community,
              count(*) as size,
              key,
              avg(p1[key]) as average,
              stdev(p1[key]) as std
ORDER BY average DESC
WITH community,
     size,
     collect({key:key,avg:average,std:std}) as all_avg
ORDER BY size DESC limit 1
RETURN community,size,
       all_avg[..3] as top_3,
       all_avg[-3..] as bottom_3

Results

biggest_comm

Interestingly, it looks like the largest community (210 members) consists of ladies who don’t like westerns and metal, but are compassionate, romantic, and like fantasy/fairy tales.

Gephi

Let’s finish off with a nice visualization of our communities in Gephi.

Use the following query to export the similarity graph to Gephi.

MATCH path = (:Person)-[:SIMILAR]->(:Person)
CALL apoc.gephi.add(null,'workspace1',path,'weight',['community']) yield nodes
return distinct 'done'

After a bit of tweaking in Gephi, I came up with this visualization. Similarly as with apoc.nodes.group, we can observe that the biggest communities are quite connected to each other. Members of the purple community are the ladies mentioned above.

pruple.png


Chaining graph algorithms in Neo4j on the countries of the world graph

The idea for this blog post is to show how to chain similarity algorithms with community detection algorithms in Neo4j without ever storing any intermediate information back to the graph. If, for example, you only have read rights to Neo4j, or you don’t want to store anything to the graph while analyzing it, then chaining algorithms is for you.

As I used a new dataset to play around with, this post also shows how to import and preprocess the data for the analysis.

Data

We use the Countries of the World dataset, made available by Fernando Lasso on Kaggle. It contains information such as birthrate, GDP, and infant mortality for 227 countries of the world.

Let’s start by checking how many missing values there are in the CSV file.

The countries of the world.csv file must be copied into the Neo4j/import folder before running the missing values query.

Missing values query

LOAD CSV WITH HEADERS FROM "file:///countries%20of%20the%20world.csv" as row
UNWIND keys(row) as key
RETURN key,
  sum(CASE WHEN row[key] is null THEN 1 ELSE 0 END) as missing_value
ORDER BY missing_value DESC LIMIT 15

Missing values results

missing

As expected with any real-world dataset, there are some missing values. Fortunately, only five features have a noticeable number of missing values, while the others have close to zero. Features with more than five percent of values missing (11+) are not considered in our analysis.

Import

Graph model

The schema of the graph is simple. We have one type of node representing countries. Countries have some features stored as properties and are also connected with a relationship to the region they belong to.

graph_mdel.png

We could define unique constraints for labels Country and Region, but since our graph is so tiny we’ll skip this today and proceed with the import.

Import query

LOAD CSV WITH HEADERS FROM "file:///countries%20of%20the%20world.csv" as row
MERGE (c:Country{name:row.Country})
ON CREATE SET c += apoc.map.clean(row,['Country','Region'],[])
MERGE (r:Region{name:row.Region})
MERGE (c)-[:IN]->(r)

Preprocessing

When I first explored this dataset in Neo4j, I got back weird results and didn’t exactly understand why. After a couple of minutes, I found the culprit. If we run the cypher query RETURN toFloat("0,1"), it returns a null value. I later found out that this is a Java limitation and not specific to Neo4j. The numbers in this dataset use the comma as a decimal point. This doesn’t work for us, so we need to replace the commas with dots to be able to store the values as floats.

Let’s replace the commas for dots and store the new values as floats.

MATCH (c:Country)
UNWIND keys(c) as key
WITH c, key, toFloat(replace(c[key],',','.')) as fixed_float
WHERE NOT key = 'name'
CALL apoc.create.setProperty(c, key, fixed_float) YIELD node
RETURN distinct 'done'

Feature selection

We don’t use all the available features of the countries in our analysis. I cherry-picked a couple of features that have little to no missing values, specifically:

  • Birthrate
  • Infant mortality
  • GDP
  • Arable %
  • Crops %
  • Phones (per 1000)

Let’s check basic statistics about our features with apoc.agg.statistics procedure.

UNWIND ["Birthrate", "Infant mortality (per 1000 births)","GDP ($ per capita)",
"Arable (%)", "Crops (%)", "Phones (per 1000)"] as potential_feature
MATCH (c:Country) 
WITH potential_feature, apoc.agg.statistics(c[potential_feature],[0.5,0.75,0.9,0.95,0.99]) as stats
RETURN potential_feature, 
       apoc.math.round(stats.min,2) as min, 
       apoc.math.round(stats.max,2) as max, 
       apoc.math.round(stats.mean,2) as mean, 
       apoc.math.round(stats.stdev,2) as stdev,
       apoc.math.round(stats.`0.5`,2) as p50,
       apoc.math.round(stats.`0.75`,2) as p75,
       apoc.math.round(stats.`0.95`,2) as p95,
       apoc.math.round(stats.`0.99`,2) as p99

Results

feature_stats

I found it interesting that Monaco has 1035 phones per 1000 persons, and so more phones than people. On the other side of the spectrum, I found it shocking that Angola has (not sure exactly for which year) 191.19 infant deaths per 1000 births.

MinMax Normalization

MinMax normalization is used to scale all feature values between 0 and 1. It is better known as feature scaling or data normalization.

WITH ["Birthrate", 
  "Infant mortality (per 1000 births)",
  "GDP ($ per capita)",
  "Arable (%)",
  "Crops (%)",
  "Phones (per 1000)"] as features
UNWIND features as key
MATCH (c:Country)
WITH max(c[key]) as max,min(c[key]) as min,key
MATCH (c1:Country)
WITH c1, "normalized_" + key AS newKey, 
(toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) 
YIELD node
RETURN distinct 'normalization done'

Missing values

As we observed at the start, some features of the countries have missing values. A simple average of the region is used to fill in the missing values.

UNWIND ["normalized_Birthrate",
  "normalized_Infant mortality (per 1000 births)",
  "normalized_GDP ($ per capita)",
  "normalized_Arable (%)",
  "normalized_Crops (%)",
  "normalized_Phones (per 1000)"] as feature
MATCH (c:Country)
WHERE c[feature] IS null
MATCH (c)-[:IN]->(r:Region)<-[:IN]-(other:Country)
WHERE other[feature] IS NOT null
WITH c,feature,avg(other[feature]) as avg_value
CALL apoc.create.setProperty(c, feature, avg_value) 
YIELD node
RETURN distinct 'missing values populated'

Analysis

The analysis is split into two parts.

In the first part, we find similar countries using the cosine similarity algorithm.

In the second part, we chain the cosine similarity algorithm with the Louvain algorithm to find communities of similar countries.

Cosine similarity algorithm

If you are new to the cosine similarity metric, I suggest you check out the documentation or Wikipedia.
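If a quick refresher helps, here is a tiny numpy example of the metric itself (an illustrative aside with made-up vectors, not part of the analysis pipeline):

# Cosine similarity: the dot product of two vectors divided by the product of their lengths
import numpy as np

a = np.array([0.8, 0.1, 0.3])  # feature vector of one country (made-up values)
b = np.array([0.7, 0.2, 0.4])  # feature vector of another country

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # roughly 0.98 - the two profiles point in almost the same direction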

To get a feeling for how similar our countries are, we run the algo.similarity.cosine algorithm with the write:false parameter. It returns global statistics about the similarity graph.

MATCH (c:Country) 
WITH id(c) as id, [c["normalized_Birthrate"], 
                   c["normalized_Infant mortality (per 1000 births)"], 
                   c["normalized_GDP ($ per capita)"], 
                   c["normalized_Arable (%)"], 
                   c["normalized_Crops (%)"], 
                   c["normalized_Phones (per 1000)"]] as weights 
WITH {item:id, weights: weights} as userData 
WITH collect(userData) as data 
CALL algo.similarity.cosine(data, {similarityCutoff: 0.0, write:false}) 
YIELD nodes, similarityPairs, min, max, mean, stdDev, p25, p50, p75, p90
RETURN nodes,similarityPairs,apoc.math.round(min,2) as min,
                             apoc.math.round(max,2) as max,
                             apoc.math.round(mean,2) as mean,
                             apoc.math.round(stdDev,2) as stdDev,
                             apoc.math.round(p25,2) as p25,
                             apoc.math.round(p50,2) as p50,
                             apoc.math.round(p75,2) as p75, 
                             apoc.math.round(p90,2) as p90

Results

csim.png

Even with similarityCutoff: 0.0 (cosine similarity ranges between -1 and 1), we get an average of more than 100 relationships per node. Community detection algorithms don’t perform well on very dense networks. In our results each node is connected to about 50% of the nodes in the graph, so community detection algorithms would most likely find one giant community that spans most of the graph.

This is not a desired outcome when searching for communities. To avoid it, we use a combination of similarityCutoff: 0.8 and topK: 10. Read more in the documentation.

Stream cosine similarity query

MATCH (c:Country)
WITH id(c) as id,
  [c["normalized_Birthrate"],
   c["normalized_Infant mortality (per 1000 births)"],
   c["normalized_GDP ($ per capita)"],
   c["normalized_Arable (%)"],
   c["normalized_Crops (%)"],
   c["normalized_Phones (per 1000)"]] as weights 
WITH {item:id, weights: weights} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine.stream(data,{similarityCutoff:0.8,topK:10})
YIELD item1, item2, count1, count2, similarity
WITH item1, item2, similarity
WHERE item1 > item2
RETURN algo.getNodeById(item1).name AS from,
       algo.getNodeById(item2).name AS to,
       similarity
ORDER BY similarity DESC LIMIT 10

Results

neire.png

Results show that African countries are the most similar countries in the world judging by our features.

Chaining Cosine Similarity and Louvain algorithm

We have prepared everything, so now we can finally chain cosine similarity with the Louvain algorithm. The Louvain algorithm is run using a cypher projection.

To project a graph with cypher projection we must provide two cypher statements. The first one, the so-called node-statement, is where we provide the ids of all the nodes the algorithm should consider. In the second statement, also called the relationship-statement, we provide all the relationships of our graph. This is done by providing the source and target node ids of every relationship we want to consider.

The node-statement is fairly trivial, as we just return all the countries. The cosine similarity algorithm goes into the relationship-statement, providing all the relationships of the similarity graph to the Louvain algorithm.

CALL algo.louvain.stream(
  'MATCH (c:Country) RETURN id(c) as id',
  'MATCH (c:Country)
   WITH id(c) as id,[c["normalized_Birthrate"],
                     c["normalized_Infant mortality (per 1000 births)"],
                     c["normalized_GDP ($ per capita)"],
                     c["normalized_Arable (%)"],
                     c["normalized_Crops (%)"],
                     c["normalized_Phones (per 1000)"]] as weights 
   WITH {item:id, weights: weights} as userData
   WITH collect(userData) as data
   CALL algo.similarity.cosine.stream(data,{similarityCutoff:0.8,topK:10})
   YIELD item1, item2, similarity
   RETURN item1 AS source, item2 AS target',
   {graph: 'cypher'})
YIELD nodeId,community
WITH algo.getNodeById(nodeId) as node, community
RETURN community,count(*) as community_size,
       apoc.math.round(avg(node["Birthrate"]),2) as avg_birthrate,
       apoc.math.round(avg(node["Infant mortality (per 1000 births)"]),2) as infant_mortality,
       apoc.math.round(avg(node["GDP ($ per capita)"]),2) as gdp,
       apoc.math.round(avg(node["Arable (%)"]),2) as arable, 
       apoc.math.round(avg(node["Crops (%)"]),2) as crops, 
       apoc.math.round(avg(node["Phones (per 1000)"]),2) as phones

Results

res

The Louvain algorithm found 7 communities in our graph. The first community is the richest judging by GDP, has the lowest infant mortality rate and, interestingly, also almost the lowest birthrate.

On the other side, there is the poorest community of countries, with an average GDP of only $1,633. They lead in infant mortality, which is quite high at 88.61 deaths per 1,000 births, and they also lead in birthrate.

Conclusion

I was also thinking of showing how to visualize the communities in the graph using only streaming algorithms, but I decided against it, as we would have to run the similarity algorithm twice in a single query, which doesn’t make much sense.

Still, I want to end with a visualization of the communities in the graph, made in Gephi.

countries.png

 


ArticleRank algorithm on a citation network in Neo4j

This article is a preview of the ArticleRank algorithm that will be available in the next release of Neo4j graph algorithms. If you can’t wait till then, you can build the latest version yourself and try it out.

ArticleRank algorithm is a slight variation of the PageRank algorithm. It reduces the bias that PageRank has towards assigning higher scores to nodes with relationships from nodes that have few outgoing relationships.

Specifically, to find the score for a given node, PageRank counts up all the incoming scores and divides them by the number of outgoing links for each given incoming page. ArticleRank assigns a score by counting up all the incoming scores and dividing them by the average number of outgoing links plus the number of outgoing links for each given incoming page.[1]
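
Written out, the two update rules look roughly like this (a sketch based on the description above; $T_1, \dots, T_n$ are the pages linking to page $A$, $C(T_i)$ is the number of outgoing links of $T_i$, $\bar{C}$ is the average number of outgoing links across the network, and $d$ is the damping factor):

$PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}$

$AR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{AR(T_i)}{C(T_i) + \bar{C}}$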

Dataset

We will be using High-energy physics theory citation network made available by Stanford Network Analysis Project. It contains 352,807 relationships (citations) between 27,770 articles.

citation

Define graph schema

CREATE CONSTRAINT ON (a:Article) ASSERT a.id IS UNIQUE;

Import citations

LOAD CSV FROM "file:///Cit-HepTh.txt" as row FIELDTERMINATOR " "
WITH row skip 4
MERGE (a1:Article{id:row[0]})
MERGE (a2:Article{id:row[1]})
MERGE (a1)-[:CITES]->(a2);

Import titles of articles

For each article in the graph, there exists an .abs file containing its title and description. To import them programmatically, we start with a MATCH of all articles and construct the filenames from the unique ids of the articles. In the second step we load the .abs file and write back the title of the article. For batching purposes, we are using the apoc.periodic.iterate procedure with the parallel parameter set to true.

CALL apoc.periodic.iterate("MATCH (a:Article)
    WITH a,case length(a.id) WHEN 4 THEN '000' + toString(a.id)
                             WHEN 5 THEN '00' + toString(a.id)
                             WHEN 6 THEN '0' + toString(a.id)
                             ELSE toString(a.id) END AS filename 
                             RETURN a,filename",
   "LOAD CSV FROM 'file:///' + filename + '.abs' as row FIELDTERMINATOR '\n'
    WITH a,row
    UNWIND row as r
    WITH a,split(r,'Title: ')[1] as title where r contains 'Title: '
    SET a.title = title",{batchSize:10,parallel:true})

ArticleRank

Let’s find top ten articles using ArticleRank algorithm.

ArticleRank Query

CALL algo.articleRank.stream('Article', 'CITES', 
  {iterations:20, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).title AS article,score
ORDER BY score DESC LIMIT 10

ArticleRank Results

article score
“The Large N Limit of Superconformal Field Theories and Supergravity” 14.572952999999998
“String Theory Dynamics In Various Dimensions” 11.064807499999999
“Monopole Condensation, And Confinement In N=2 Supersymmetric Yang-Mills” 11.061662500000002
“Dirichlet-Branes and Ramond-Ramond Charges” 10.9979975
“Anti De Sitter Space And Holography” 10.440066000000002
“Gauge Theory Correlators from Non-Critical String Theory” 9.797559499999998
“M Theory As A Matrix Model: A Conjecture” 9.138546
“Unity of Superstring Dualities” 7.868136000000001
“Monopoles, Duality and Chiral Symmetry Breaking in N=2 Supersymmetric” 7.5854685
“Bound States Of Strings And $p$-Branes” 7.5395855

Let’s also run PageRank algorithm for comparison

PageRank Query

CALL algo.pageRank.stream('Article', 'CITES', 
{iterations:20, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).title AS article,score
ORDER BY score DESC LIMIT 10

PageRank Results

article score
“Monopole Condensation, And Confinement In N=2 Supersymmetric Yang-Mills” 83.25198399999998
“Noncompact Symmetries in String Theory” 78.86716550000001
“An Algorithm to Generate Classical Solutions for String Effective Action” 70.79724850000001
“String Theory Dynamics In Various Dimensions” 61.099139499999986
“Dirichlet-Branes and Ramond-Ramond Charges” 57.558133000000005
“Exact Results on the Space of Vacua of Four Dimensional SUSY Gauge” 52.274805
“The Large N Limit of Superconformal Field Theories and Supergravity” 46.086371500000006
“Unity of Superstring Dualities” 44.9701855
“Monopoles, Duality and Chiral Symmetry Breaking in N=2 Supersymmetric” 42.74964549999999

The article “The Large N Limit of Superconformal Field Theories and Supergravity” tops the list when using the ArticleRank algorithm, but is only seventh when using PageRank. We can observe a couple of other changes in the top ten list as well.

Conclusion

As mentioned, the ArticleRank algorithm is a slight variant of the PageRank algorithm. It can be used in various domains and scenarios, just like its predecessor.


kNN Classification of members of congress using similarity algorithms in Neo4j

k-Nearest Neighbors algorithm

Image taken from wikipedia, https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

A couple of days ago I was presenting “How to use similarity algorithms” at a Neo4j online meetup with Mark Needham. Among other use-cases, we discussed how they could be used as part of a kNN classification algorithm, which works as follows: to predict the class of a node, we find its top k neighbours and select their majority class as the predicted class.

This blog post will show how to use kNN classification in Neo4j.

Requirements

Dataset

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).

The information about the dataset and the dataset itself, made available by Jeff Schlimmer, can be found at UCI machine learning repository.

LOAD CSV FROM "http://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data" as row
CREATE (p:Person)
SET p.class = row[0],
    p.features = row[1..];

Missing votes

Let’s check how many members of congress have at least one missing vote.

MATCH (n:Person) 
WHERE "?" in n.features
RETURN count(n)

Results

count
203

Almost half of the members of congress have at least one missing vote. That’s quite significant, so let’s dig deeper and check the distribution of missing votes per member.

MATCH (p:Person)
WHERE '?' in p.features
WITH p,apoc.coll.occurrences(p.features,'?') as missing
RETURN missing,count(*) as times ORDER BY missing ASC

Results

missing times
1 124
2 43
3 16
4 6
5 5
6 4
7 1
9 1
14 1
15 1
16 1

Three members almost never voted (14, 15 and 16 missing votes), and two more (7 and 9 missing votes) failed to vote on around half of the issues. We will exclude all of them from further analysis in order to reduce noise.

MATCH (p:Person)
WITH p,apoc.coll.occurrences(p.features,'?') as missing
WHERE missing > 6
DELETE p

Training and test data

Let’s divide our dataset into two subsets, where 80% of the nodes will be marked as training data and the remaining 20% as test data. There is a total of 430 nodes in our graph, so we will mark 344 nodes as the training subset and the rest as test.

Mark training data

MATCH (p:Person)
WITH p LIMIT 344
SET p:Training;

Mark test data

MATCH (p:Person)
WITH p SKIP 344
SET p:Test;
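
A quick count (a sketch) confirms the 344/86 split between the two subsets:

MATCH (p:Person)
RETURN sum(CASE WHEN p:Training THEN 1 ELSE 0 END) AS training_nodes,
       sum(CASE WHEN p:Test THEN 1 ELSE 0 END) AS test_nodes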

Feature vector

There are three possible values in the feature sets. We will map them as follows:

  • “y” to 1
  • “n” to 0
  • “?” to 0.5

Transform to feature vector

MATCH (n:Person)
UNWIND n.features as feature
WITH n,collect(CASE feature WHEN 'y' THEN 1
                            WHEN 'n' THEN 0
                            ELSE 0.5 END) as feature_vector
SET n.feature_vector = feature_vector

Example feature set transformation to feature vector

features feature_vector
[“n”, “y”, “n”, “y”, “y”, “n”, “n”, “n”, “n”, “n”, “?”, “?”, “y”, “y”, “n”, “n”] [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0.5, 0.5, 1, 1, 0, 0]

kNN classifier algorithm

We will use Euclidean distance as the distance function and a topK value of 3. It is advisable to use an odd number for k to avoid ties: with only two neighbours, for example, each could belong to a different class, leaving us with no majority but a 50/50 split between the two.
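
For reference, the Euclidean distance between two feature vectors a and b of length n is:

$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

The smaller the distance, the more similar the two voting records.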

We have a specific situation: for each node labelled Test, we want to find its top three neighbours among the Training subset only. Otherwise the nodes labelled Test would also be treated as part of the training data, which is something we want to avoid. This is exactly why we can’t use a simple query like:

CALL algo.similarity.euclidean(data, {topK: 3, write:true})

We will need to use apoc.cypher.run in order to limit the match results per row instead of per query. This helps us get back the top 3 neighbours per node. Find more in the knowledge base.

We will also use a combination of apoc.coll.frequencies and apoc.coll.sortMaps to get the most frequent element of a collection.

apoc.coll.sortMaps(apoc.coll.frequencies(collection), '^count')[-1]

Find more in documentation.
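
To get a feeling for what this expression does, here is a toy example with a made-up list of classes:

// returns 'democrat', the most frequent element of the toy list
RETURN apoc.coll.sortMaps(
         apoc.coll.frequencies(['democrat', 'republican', 'democrat']),
         '^count')[-1].item AS majority_class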

Query

 

MATCH (test:Test)
WITH test,test.feature_vector as feature_vector
CALL apoc.cypher.run('MATCH (training:Training)
    // calculate the Euclidean distance between the test node and every training node
    WITH training, algo.similarity.euclideanDistance($feature_vector, training.feature_vector) AS distance
    // keep only the 3 closest training nodes
    ORDER BY distance ASC LIMIT 3
    RETURN collect(training.class) as classes',
    {feature_vector:feature_vector}) YIELD value
WITH test.class as class, apoc.coll.sortMaps(apoc.coll.frequencies(value.classes), '^count')[-1].item as predicted_class
WITH sum(CASE when class = predicted_class THEN 1 ELSE 0 END) as correct_predictions, count(*) as total_predictions
RETURN correct_predictions,total_predictions, correct_predictions / toFloat(total_predictions) as ratio

Results

correct_predictions total_predictions ratio
77 86 0.8953488372093024

We correctly predicted the class 77 out of 86 times, or roughly 90% accuracy, which is really nice.

Conclusion

I hope this blog post has shown you how to use graph algorithms and APOC together to run the kNN classification algorithm in Neo4j, and that you will now be able to use it on your own graph.

In theory, this approach could be used for more than just binary classification, as the algorithm simply takes the majority class of the neighbours and there is no technical barrier to using more than two classes.


Feature extraction on a peer to peer network with DeepGL embedding and Neo4j

I came across the article Extracting ML-Features from Graph Data with DeepGL on Neo4j, written by Mark Needham, where he introduced the latest addition to the Neo4j Statistics/Machine Learning Procedures project, the DeepGL embedding algorithm. It looked cool, so I decided to try it out myself.

I am not a math wizard, but as far as I understand, the DeepGL embedding algorithm extracts features for each node based on its position in the graph and other characteristics of the node, such as in-degree and out-degree. We can also define custom characteristics. Check the above article for more information.

Dataset

We will be using the Gnutella peer-to-peer network made available by the Stanford Network Analysis Project. The dataset contains information about hosts in the Gnutella network and the connections between them.

First we define the schema.

CREATE CONSTRAINT ON (h:Host) ASSERT h.id IS UNIQUE;

Import the data.

LOAD CSV FROM
"file:///p2p-Gnutella09.txt" as row fieldterminator ' '
WITH row SKIP 4
MERGE (h1:Host{id:row[0]})
MERGE (h2:Host{id:row[1]})
MERGE (h1)-[:CONNECTION]->(h2)

Graph embedding

We will include pageRank values as base features for the DeepGL algorithm, so we need to calculate them and write them back first.

CALL algo.pageRank('Host', 'CONNECTION')
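
Before running DeepGL, a quick check (a sketch assuming the default write property, pagerank) confirms the scores were written:

MATCH (h:Host)
WHERE exists(h.pagerank)
RETURN count(h) AS hosts_with_pagerank, max(h.pagerank) AS max_pagerank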

Run DeepGL embedding with 2 iterations and pageRank as an additional node feature.

CALL embedding.deepgl("Host","CONNECTION", {
  nodeFeatures: ['pagerank'],
  iterations: 2
})

Cosine similarity

Now that we have extracted the nodes’ features with the DeepGL algorithm, we can infer a new network based on the cosine similarity of these features between hosts.

I prefer to use the similarityCutoff parameter, which represents the similarity threshold above which relationships are returned or, as in our case, stored. You could also use the topK parameter, which would effectively turn this into a kNN algorithm.

MATCH (h:Host)
WITH {item:id(h), weights: h.embedding} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine(data, {similarityCutoff: 0.8, write:true})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs
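
Once the relationships are stored, the inferred similarity network can be inspected directly (a sketch assuming the library defaults, i.e. a SIMILAR relationship type with a score property):

MATCH (h1:Host)-[s:SIMILAR]->(h2:Host)
RETURN h1.id AS from, h2.id AS to, s.score AS similarity
ORDER BY s.score DESC
LIMIT 10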

Community detection

We run the Louvain algorithm on this newly inferred network to find communities of similar hosts. I will not pretend to fully understand how the DeepGL embedding works, but since it takes in-degree and out-degree as base features, we will compare the average in-degree, out-degree and pageRank values between the five biggest communities.

CALL algo.louvain.stream('Host', 'SIMILAR', {})
YIELD nodeId, community
MATCH (n) WHERE id(n)=nodeId
RETURN community,
       avg(size((n)-[:CONNECTION]->())) as out_degree,
       avg(size((n)<-[:CONNECTION]-())) as in_degree,
       avg(n.pagerank) as pagerank,
       count(*) as size
ORDER by size DESC limit 5

deepgl_louvain.png

The biggest community also seems to be the most important one judging by pageRank. Members of the biggest community have a high out-degree and in-degree compared to the others; they could be regarded as the heart that keeps the network alive. The second largest community is the least important judging by pageRank, and its members usually have no outgoing relationships and only one incoming relationship. They just silently sit in the corner. The third community could be regarded as “receivers”, as they have little to no outgoing relationships but more than 8 incoming relationships on average, and so on…

Conclusion

The next step would be to find a cool use-case where we feed the DeepGL features to a machine learning model and see how it behaves. Maybe that will be my next blog post, who knows 🙂


Discovering hierarchical community structure in GoT world using Louvain Algorithm and Neo4j

Recently, phase two of the Louvain algorithm was implemented in the Neo4j graph algorithms library. It may not seem like much, but it has great implications for our ability to explore community structures in Neo4j.

Here I will show you some of those implications.

Import

As always, let’s first define the schema.

CREATE CONSTRAINT ON (c:Character) ASSERT c.name IS UNIQUE;

We will be using the Network of Thrones dataset made available by Andrew Beveridge, specifically volumes 4 and 5, combined under the title “A feast with dragons”. It contains information about 506 characters and 1329 relationships between them.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mathbeveridge/asoiaf/master/data/asoiaf-book45-edges.csv" AS row 
MERGE (src:Character {name: row.Source}) 
MERGE (tgt:Character {name: row.Target}) 
MERGE (src)-[r:INTERACTS]->(tgt) 
ON CREATE SET r.weight = toInt(row.Weight)

Louvain algorithm

The Louvain algorithm is roughly defined as follows:

The method consists of repeated application of two steps. The first step is a “greedy” assignment of nodes to communities, favoring local optimizations of modularity. The second step is the definition of a new coarse-grained network, based on the communities found in the first step. These two steps are repeated until no further modularity-increasing reassignments of communities are possible.

Taken from the documentation

As stated, the Louvain algorithm repeats two steps until no further modularity-increasing reassignments are possible. The great thing about this is that we can check the community structure at the end of each iteration of the two steps mentioned above and get a glimpse of how small communities expand into bigger ones.

Run the Louvain algorithm with the includeIntermediateCommunities parameter set to true. This will save the community id at each iteration into an array property named communities.

CALL algo.louvain('Character', 'INTERACTS', 
  {includeIntermediateCommunities: true})
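
To see what was stored, we can peek at the communities array of a single character (a sketch; the concrete community ids will differ between runs):

MATCH (c:Character {name: 'Jon-Snow'})
RETURN c.name AS character, c.communities AS communities_per_iteration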

Let’s check how many distinct communities exist at the end of each iteration to get a rough sense of the results.

MATCH (c:Character)
UNWIND range(0,length(c.communities) - 1) as x
RETURN x + 1 as iteration,
       count(distinct c.communities[x]) as number_of_communities,
       count(*) / count(distinct c.communities[x]) as average_community_size
ORDER BY iteration ASC

We can observe that at the end of the first iteration we have quite small communities, with an average community size of only 5. This quickly changes at the end of the second iteration, as the average community size rises to 42. The third iteration feels like a minor fix and the average community size ends up at 50.

iteration number_of_communities average_community_size
1 88 5
2 12 42
3 10 50

Jon Snow

Let’s focus on Jon Snow for a moment and check what his community looks like at the end of each iteration. As we saw before, we should get a fairly small community after the first iteration, which will then expand with each sequential iteration.

MATCH (c:Character{name:"Jon-Snow"})
UNWIND range(0,length(c.communities) - 1) as x
MATCH (c1:Character)
WHERE c.communities[x] = c1.communities[x]
RETURN x + 1 as iteration, 
       collect(c1.name) as community,
       count(*) as community_size
ORDER BY iteration ASC

After the first iteration, Jon Snow’s community has only 8 members, consisting of a couple of free folk and some Night’s Watch men, if my GoT knowledge is at all accurate. As expected, the community expands quite a bit in the second and third iterations.

iteration community community_size
1 [“Jon-Snow”, “Alliser-Thorne”, “Benjen-Stark”, “Cotter-Pyke”, “Denys-Mallister”, “Soren-Shieldbreaker”, “Tom-Barleycorn”, “Ulmer”] 8
2 [“Aegon-V-Targaryen”, “Aemon-Targaryen-(Maester-Aemon)”, “Clydas”, “Dareon”,  “Dalla”, “Eddison-Tollett”, “Melisandre”, …] 70
3 [“Aegon-V-Targaryen”, “Aemon-Targaryen-(Maester-Aemon)”, “Alleras”, “Clydas” …] 91

Gephi visualizations

I am more of a visual type than anything else, so I decided to use Gephi to visualize how the communities expand with each iteration. For more detailed instructions on how to combine Neo4j and Gephi, read my previous blog post.

We need to store the community id of each iteration as a separate property in order to export it to and visualize it with Gephi.

MATCH (c:Character)
UNWIND range(0,length(c.communities) - 1) as x
CALL apoc.create.setProperty(c, 'community_' + toString(x), c.communities[x]) YIELD node
RETURN distinct 'done'

Finally send the data to Gephi.

MATCH path = (:Character)-[:INTERACTS]->(:Character)
CALL apoc.gephi.add(null,'workspace1',path,'weight',
  ['community_0','community_1','community_2']) yield nodes
return distinct true

First iteration

first_one.png

Communities: 88
Average community size: 5

Second iteration

second_one.png

Communities: 12
Average community size: 42

Third iteration

third_one.png

Communities: 10
Average community size: 50

Conclusion

Checking the intermediate community structure at the end of each iteration of the Louvain algorithm is a powerful tool that lets us find small communities that are otherwise overlooked. It also allows us to search for hierarchical community structures: for example, a single community at the third iteration of the algorithm can be composed of three different communities from the second iteration, which implies some sort of hierarchical connection and structure.
