Neo4j Facebook ego network analysis

The graph algorithms plugin for Neo4j is growing by the day, so here are some of the new algorithms that have been added to the collection. I will use Facebook's ego network of friends to demonstrate the triangle count and betweenness centrality algorithms on an undirected graph in Neo4j.

Requirements:

Graph model:

We will import a friends social network of Facebook users. There are 4039 nodes labeled User and 88234 unweighted relationships indicating friendship between users, as shown below. You can always get the schema of your graph using CALL db.schema().


We create a constraint on the User label, so that our queries will run faster.

CREATE CONSTRAINT ON (u:User) ASSERT u.id is unique

Import:

A direct link to the data is here. We will model this network as undirected. There is one twist: Neo4j only supports storing directed relationships. That is no problem, as we can still treat them as undirected in our queries and algorithms.

USING PERIODIC COMMIT 5000
LOAD CSV FROM 
"file:///facebook_combined.txt" as row fieldterminator ' '
MERGE (u:User{id:row[0]})
MERGE (u1:User{id:row[1]})
MERGE (u)-[:FRIEND]-(u1)

We use MERGE without specifying the direction of the relationship, as direction does not matter in an undirected graph. Used like this, MERGE will check for an existing relationship in both directions, and if none exists it will create the new relationship in an arbitrary direction.
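
As a quick sanity check (my own addition, not part of the original import), we can verify that no pair of users ended up with FRIEND relationships in both directions:

MATCH (u:User)-[:FRIEND]->(u1:User)
WHERE (u1)-[:FRIEND]->(u)
RETURN count(*) as reciprocal_pairs

This should return 0 if MERGE behaved as described.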

Graph Analysis:

Find (weakly) connected components in the graph.

I always start by checking how many connected components there are in the graph. Most graph algorithms run best on a single connected component, so I usually choose the largest component to run further algorithms on. Check the documentation for more.

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
return setId as component, count(*) as size


As we can see, there exists only one connected component in this graph. This is very good, as there are no disconnected components. The graph algorithms library also supports finding strongly connected components with algo.scc. In an undirected graph every weakly connected component is also strongly connected, so we can skip it.
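
If you ever need it on a directed graph, the call would look something like this (a sketch; the exact YIELD columns can differ between plugin versions):

CALL algo.scc.stream('User', 'FRIEND')
YIELD nodeId, partition
RETURN partition as component, count(*) as size
ORDER BY size DESC LIMIT 10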

Get the average number of friends and the standard deviation.

MATCH (u:User)
return avg(size((u)--())) as average_degree,
       stdev(size((u)--())) as stdev


A user has 43 friends on average. The standard deviation is quite high, so the number of friends is spread out over a wide range of values. We can now check for the users with the most friends to start looking for influencers in our network.

Degree:

MATCH (u:User)
RETURN u.id as id,size((u)--()) as degree 
order by degree desc limit 10


The most basic metric is degree, which we can get using plain cypher. It directly translates to the number of friends a user has, and can be used as a good starting point for finding influencers in our network. The top 4 users have more than 10x the average number of friends, with user 107 standing out at 1045 friends (26% of all users).
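
To back up the claim about the spread, we can also look at the degree distribution with plain cypher percentiles (my own addition):

MATCH (u:User)
WITH size((u)--()) as degree
RETURN percentileCont(degree, 0.5) as median_degree,
       percentileCont(degree, 0.9) as percentile_90,
       max(degree) as max_degree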

Calculate the triangle count and clustering coefficient for each node, and return the total triangle count and the average clustering coefficient of the network.

CALL algo.triangleCount('User', 'FRIEND',
{concurrency:4, write:true, writeProperty:'triangles', 
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient,
loadMillis, computeMillis, writeMillis


There are 4039 nodes with 1612010 triangles in our network. The average clustering coefficient is quite high at 0.6055. This is expected, as evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of relationships. [1]

Check users' local triangle count and clustering coefficient.

The local clustering coefficient of a node in a graph quantifies how close its neighbours are to being a complete graph, where a complete graph is a simple undirected graph in which every pair of distinct nodes is connected by a unique relationship.[1]

Put simply, my clustering coefficient would be 1 if all of my friends knew each other.
The above algorithm also saved the values for individual nodes as node properties, so we can now query the results.

MATCH (u:User)
return u.id,u.triangles as triangles,
            u.coefficient as cluster_coefficient 
order by triangles desc limit 10


User 1912 is a member of more triangles than user 107, who has the most friends. They both have a small clustering coefficient, which indicates that their friends don't really know each other and that they have great potential for introducing people to one another. Users 1912 and 107 were third and first by number of friends, so combined with these results they are solidifying their importance in the network. There is quite a drop in triangle count to third place, whose neighbour community is, on the other hand, more tightly knit.
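
As a sanity check (my own addition), we can recompute the clustering coefficient of user 107 from its stored triangle count and its degree k, using the textbook formula C = 2T / (k * (k - 1)); the result should agree with the coefficient property written by the algorithm:

MATCH (u:User{id:"107"})
WITH u, size((u)--()) as k
RETURN u.coefficient as stored_coefficient,
       2.0 * u.triangles / (k * (k - 1)) as recomputed_coefficient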

Betweenness centrality:

Betweenness centrality is useful for finding nodes that serve as a bridge from one group of users to another in a graph. Consequently, betweenness is a rudimentary measure of the control that a specific node exerts over the information flow throughout the full graph.

We use direction:'BOTH' for algorithms on an undirected graph.

Check documentation for more.

CALL algo.betweenness.stream('User', 'FRIEND', 
{direction:'BOTH'}) 
YIELD nodeId,centrality
MATCH (u:User) where id(u) = nodeId
RETURN u.id, centrality order by centrality desc limit 15


User 107 tops the betweenness centrality results, which is not surprising, as he has the most friends combined with a very low clustering coefficient. He is therefore in a position to act as a broker of information between different groups of users. He certainly appears to be the main influencer in this network based on the metrics we checked. Based on the combined view of centrality metrics, we can create a list of users who are likely to be important in our network and do some further research on them.
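
To wrap up, here is one way (my own combination, using only the properties we have already written) to pull the metrics together into a single candidate list:

MATCH (u:User)
RETURN u.id as id,
       size((u)--()) as degree,
       u.triangles as triangles,
       u.coefficient as coefficient
ORDER BY degree desc LIMIT 10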

Conclusion:

As you can see, the graph algorithms plugin greatly improves our ability to analyse networks in Neo4j. You should check it out! If you think some algorithms are still missing, or you find some bugs, please let us know on github or the Neo4j slack community.

 

[1] https://en.wikipedia.org/wiki/Clustering_coefficient


Neo4j London tube system analysis

I recently came across a London tube dataset that was uploaded by Nicola Greco. I thought it would be a cool example to show off some algorithms from the new Neo4j graph algorithms plugin. They are really useful, as they allow network analysis without using external services to run the algorithms on.

Requirements:

Graph model:

We will create a very simple graph model, with stations and connections between them as shown below. You can always check the schema of your graph with the db.schema() procedure.


Let's define the schema for our graph model with a constraint and an index, so that all our queries will run faster.

CREATE CONSTRAINT ON (s:Station) ASSERT s.id is unique;

CREATE INDEX ON :Station(name);

Import:

LOAD CSV can import data from local files or from the internet, which is really cool, as we can easily access data on github without any intermediate steps.

Import stations

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row
MERGE (s:Station{id:row.id})
ON CREATE SET s.name = row.name,
              s.latitude=row.latitude,
              s.longitude=row.longitude,
              s.zone=row.zone,
              s.total_lines=row.total_lines

Import connections between stations.
We create relationships in both directions, as trains also travel in both directions. We save the time spent traveling between stations as a relationship property, so that we can use it as a weight in our algorithms.

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row
MATCH (s1:Station{id:row.station1})
MATCH (s2:Station{id:row.station2})
// store time as a number so it can later be used as a weight
MERGE (s1)-[:CONNECTION{time:toInt(row.time),line:row.line}]->(s2)
MERGE (s1)<-[:CONNECTION{time:toInt(row.time),line:row.line}]-(s2)

Analysis:

Which stations have the most connections to other stations?

MATCH (n:Station)--(n1)
RETURN n.name as station,
       count(distinct(n1)) as connections 
order by connections desc LIMIT 15

Results:


Find the fastest path between two stations.

algo.shortestPath is a Dijkstra algorithm that returns the shortest path between a start and an end node. For more info check the documentation.

transfers are not taken into account

MATCH (start:Station{name:"Baker Street"}),(end:Station{name:"Beckton Park"})
CALL algo.shortestPath.stream(start, end, 'time')
YIELD nodeId, cost
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,cost as time

Results:


Find out which stations take the longest to get to from Baker Street station.

algo.shortestPaths is a Dijkstra algorithm that returns the shortest paths from a start node to all the other nodes in the network.

transfers are not taken into account

MATCH (n:Station {name:'Baker Street'})
CALL algo.shortestPaths.stream(n, 'time')
YIELD nodeId, distance
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,distance as time 
ORDER BY time desc LIMIT 15

Results:


Find the longest shortest paths between stations in the network.

algo.allShortestPaths returns the shortest paths between all pairs of nodes in the network.

transfers are not taken into account

CALL algo.allShortestPaths.stream('time',
{nodeQuery:'Station',relationshipQuery:'CONNECTION',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance 
ORDER BY distance desc LIMIT 20
// We filter out duplicates
WHERE sourceNodeId > targetNodeId
MATCH (s1:Station),(s2:Station) 
WHERE id(s1)=sourceNodeId AND id(s2)=targetNodeId
RETURN s1.name as station1,s2.name as station2,distance as time

Results:


Find bridge stations in the network.

We can use betweenness centrality to help us identify the stations people will pass through the most, based on the structure of the network.

algo.betweenness is a betweenness centrality algorithm that helps us find bridges in the network. For more info check the documentation.

CALL algo.betweenness.exp1.stream('Station','CONNECTION')
YIELD nodeId, centrality
MATCH (n:Station) where id(n)=nodeId
RETURN n.name,centrality 
ORDER BY centrality desc
LIMIT 15

Results:


Conclusion:

It is very easy to enhance Neo4j with graph algorithms. You just copy the .jar file to the /plugins folder, restart Neo4j, and you are good to go. The plugin offers lots of cool algorithms that can help you analyse your graph in Neo4j. You should try them out!

Neo4j GoT Graph Analysis

Winter has come!

Thanks to Joakim Skoog, who built this cool API of ice and fire. It offers data about the GoT universe, which we will examine and analyse further.

I will show you how to use the just released Neo4j graph algorithms plugin on the data we got from the above API. As the name suggests, these are user defined procedures of graph algorithms that you can call with cypher. For most algorithms there are two procedures: one that writes results back to the graph as node properties, and another (named algo.<algorithm>.stream) that returns a stream of data, e.g. node ids and computed values.
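
As an illustration, here are the two flavours side by side for the connected components algorithm used later in this post; the first writes a partition property to the nodes, the second only streams the results back:

// write variant: stores the computed set id on each node
CALL algo.unionFind('House', 'SWORN_TO', {write:true, partitionProperty:'partition'})

// stream variant: returns node ids and computed values, writes nothing
CALL algo.unionFind.stream('House', 'SWORN_TO', {})
YIELD nodeId, setId
RETURN nodeId, setId LIMIT 10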

Requirements:

Import:

Our cool community caretaker Michael Hunger spotted this data and wrote a blog post showing how to easily import it into Neo4j using cypher shell. I followed his script, imported the data, and am now sharing the graph.db database; you can simply copy it to your neo4j/data/databases folder and you will have the GoT universe in your Neo4j.

Graph Model:

We have imported data that describes a network of persons and houses, with some additional meta-data. There are a couple of relationship types between persons and houses, like SWORN_TO and LED_BY. We can also find family tree hierarchies in the PARENT_OF relationship. I will use the SWORN_TO relationship between houses, which describes the house hierarchy of allegiances. It is a directed, unweighted network.

Analysis:

Sworn to network


At first glance we can observe that we have some bridges between different parts of communities, and that they all link towards the center, where the power resides. As you will see, this network is from the times when the seven kingdoms were at peace.

Connected Components:

As part of preprocessing, we will check how many disconnected components exist within our network using algo.unionFind.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Query:

CALL algo.unionFind('House', 'SWORN_TO', {write:true, partitionProperty:"partition"})

Result:
We see that there is one big component with 393 members and a small one with 5 members. All the rest are single disconnected houses. That means either they are loner houses or we are missing some data. We will run the centrality algorithms only on the biggest component with 393 members, as graph algorithms work better on connected graphs.
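
To find the id of the biggest component (151 in my run; the value will differ between runs), we can group by the partition property we just wrote:

MATCH (h:House)
RETURN h.partition as component, count(*) as members
ORDER BY members DESC LIMIT 2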

Centralities:

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

The outgoing degree is always one, as each house is sworn to only one other house.
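
We can verify that with a quick count (my own addition); every house with an outgoing SWORN_TO relationship should show an out-degree of exactly one:

MATCH (h:House)-[:SWORN_TO]->()
WITH h, count(*) as out_degree
RETURN out_degree, count(*) as houses
ORDER BY out_degree DESC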

Incoming degree:

We simply count how many incoming relationships each house has. In this example it directly translates to how many houses are sworn to each house.

MATCH (h:House)<-[:SWORN_TO]-(:House)
RETURN h.name as house,count(*) as in_degree order by in_degree desc limit 10


House Tyrell leads with a decent margin in front of the Lannisters. The main force is house Baratheon, which consists of two branches and has a combined number of 77 sworn houses. Others slowly follow. It's funny (sad?) to see that even house Baelish of Harrenhal, ruled by Petyr Baelish, has more support than house Stark of Winterfell. The books and the show diverge a lot for Littlefinger's character, as he is not the ruler of Harrenhal in the show.

Pagerank centrality:

Pagerank is a version of weighted degree centrality with a feedback loop. You only get your "fair share" of your neighbor's importance. That is, your neighbor's importance is split between their neighbors, proportional to the number of interactions with each neighbor. Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures where the power resides. Houses that have the support of other big houses will come out on top.

CALL algo.pageRank(
'MATCH (h:House) where h.partition = 151 return id(h) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:30, write: true});


I was surprised with the results, to say the least, but after investigating a bit I understood why. Remember the picture of the sworn to network, where I told you that all the different communities point at the center of the graph, where the power resides. At that center are the two Baratheon houses, which are sworn to each other, and hence their numbers are so big. Where incoming degree takes into account only the number of your neighbours, pagerank also takes into account their importance. You can notice that house Stark has a better position than house Baelish, because Stark's supporters are more important (have more houses sworn to them).
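
Since we ran the write variant, we can read the scores back with plain cypher (assuming the default writeProperty of my plugin version, pagerank):

MATCH (h:House) WHERE exists(h.pagerank)
RETURN h.name as house, h.pagerank as pagerank
ORDER BY pagerank DESC LIMIT 10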

Betweenness centrality:

Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through that person. Such an intermediary position gives that person power and influence. Betweenness centrality is a raw count of the number of short paths that go through a given node. For example, if a node is located on a bottleneck between two large communities, then it will have high betweenness.

CALL algo.betweenness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty: 'betweenness'});

House Baratheon is uniquely positioned at the center of the graph, so it will be first in all of the centrality measures. A strong second candidate are the Tyrells, which is a bit surprising, as they beat the Lannisters at both pagerank and betweenness centrality. In our case betweenness centrality identifies the bridging houses that have big support of other houses at their back.

Closeness centrality:

The most central nodes according to closeness centrality can quickly interact with all other nodes, because they are close to all of them. This measure is preferable to degree centrality because it takes into account not only direct connections between nodes, but also indirect ones.

CALL algo.closeness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty:'closeness'});

Closeness centrality is different from the others in that here, the lower the value, the better. As always house Baratheon leads with a decent margin, and the Tyrells stand firmly in second place. The Starks are better than the Lannisters at closeness, but lose in all the other measures, so they get fourth place and the Lannisters third.
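
Again we can read the written property back to inspect the full ordering (writeProperty:'closeness' was set in the call above, and lower means more central here):

MATCH (h:House) WHERE exists(h.closeness)
RETURN h.name as house, h.closeness as closeness
ORDER BY closeness ASC LIMIT 10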

Conclusion:

This is a network from peaceful times and is therefore less interesting than having two strong disconnected components that fight each other. What would be cool is to get a network based on the current show, or even better, for G.R.R.M. to publish The Winds of Winter, where alliances will change and people will die, so we could see how the graph evolves through time.

Check also https://github.com/neo4j-examples/game-of-graphs, where you can see how to expose this data as a GraphQL endpoint.

Thanks for reading!

Neo4j apoc.load.jdbc

I recently wrote a simple query that I used for syncing postgreSQL to Neo4j on an hourly basis with a cron job. My job was to write a simple script that would automate the syncing of Neo4j with a SQL database. This is a very simple task thanks to the APOC plugin, which has loads of very useful procedures we can call with cypher.

I had to import a couple of SQL tables and join them together in a graph to have a better overview of the data. This is an example of the query that was used to import a table with sales data.

Graph Model:


As you can see, we are importing some sales data, from a postgreSQL database in my case. I made a very simple graph model to capture the data we get about our sales. The main focus is the invoice. We use the domain instead of the company name as an index, as this works out better in our case.

You will need:

Just download them and put them into neo4j /plugins directory and that’s it!

Authorization:

Sample call looks like:

CALL apoc.load.jdbc('jdbc:mysql://localhost:3306/proddb1?user=root&password=football',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

with 'jdbc:mysql://localhost:3306/proddb1?user=root&password=football' being the connection string that contains all the sensitive information, such as the ip, username and password of your SQL database.
If you want to hide/alias the connection string, this can be accomplished by adding to the conf/neo4j.conf a parameter similar to:

apoc.jdbc.myalias.url=jdbc:mysql://localhost:3306/proddb1?user=root&password=football

and now the above sample call can be rewritten as:

CALL apoc.load.jdbc('myalias',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

For more info check knowledge base post.

Query:

I wrote some logic with cypher, where I first get the date of the latest invoice in my Neo4j and then only select the rows that were updated on the same day or later. I imagine there is a better way of doing this, but it serves our purpose.

// match existing invoices
MATCH (w:Invoice)
// get the latest date in the desired format
WITH apoc.date.format(max(w.unix),'s','yyyy-MM-dd') as last_invoice
// prepare SQL select statement
WITH 
'SELECT * FROM public.order WHERE updated_at::date >=?::date' as order_query,
last_invoice
// call load.jdbc with the latest date of existing data as a param
// 'postgresurl' is assumed to be an alias set up in neo4j.conf as shown above
CALL apoc.load.jdbcParams('postgresurl', order_query, [last_invoice]) yield row
// we need to parse out domains from the emails
WITH row,split(row.customer_email,"@")[1] as domain
// if the email is in below list, then take whole email as a code
// so that those with gmail are not treated as one customer
WITH *,case when domain in ["gmail.com","yahoo.com","outlook.com","hotmail.com","msn.com","hotmail.co.uk"] 
  then row.customer_email
  else domain end as code
MERGE (i:Invoice{order_id:row.order_number})
// we round all the numbers to 2 digits
ON CREATE SET i.status = row.status,
              i.total_amount = apoc.math.round(row.total,2,"UP"),
              i.quantity = toINT(row.total_line_items_quantity),
              i.total_tax = apoc.math.round(row.total_tax,2,"UP"),
// save a human readable date format
              i.date = apoc.date.format(row.completed_at,"ms","yyyy-MM-dd'T'HH:mm:ss.SSS"),
              i.postcode = row.customer_billing_address_postcode,
              i.address = row.billing_address_address_1,
              i.total_discount = apoc.math.round(row.total_discount,2,"UP"),
              i.total_shipping = apoc.math.round(row.total_shipping,2,"UP"),
              i.unix = (row.completed_at / 1000),
              i.invoice_number = row.order_meta_tzs_invoice_number,
              i.total_refunded = apoc.math.round(row.refunded_total,2,"UP")
// columns that can change over time are added to the ON MATCH statement
ON MATCH SET i.status = row.status,
             i.total_refunded = apoc.math.round(row.refunded_total,2,"UP"), 
             i.total_tax = apoc.math.round(row.total_tax,2,"UP"), 
             i.total_amount = apoc.math.round(row.total,2,"UP")
// merge domain and person nodes and connect them 
MERGE (domain:Domain{name:code})
MERGE (person:Person{email:row.customer_email})
MERGE (domain)-[:MEMBER]->(person)
ON CREATE SET 
    person.name = row.billing_address_first_name + " " + row.customer_billing_address_last_name,
              person.ip = row.customer_ip ,
              person.phone = row.customer_billing_address_phone
// merge payment method node
MERGE (pay:PaymentMethod{name:row.payment_details_method_id})
MERGE (i)-[:PAYMENT_METHOD]->(pay)
// merge location tree
MERGE (country:Country{code:row.billing_address_country})
// the relationship pattern was garbled in the original post; reconstructed
// below, with the city column name being an assumption
MERGE (city:City{name:row.customer_billing_address_city})
MERGE (country)<-[:IS_IN]-(city)
RETURN distinct 'orders imported'

This was so easy that I had to share it with you, so that you can maybe do a pilot project of your own to see how Neo4j fits your needs. Just wrap your favourite scripting language around it and you are good to go.

Game of Thrones part 4 Analysis

If you have followed my GoT series so far, you have seen how to import the GoT dataset I found on kaggle.com into Neo4j. There are only a couple of days left before the new season starts, so let's do some analysis on the dataset. You need to know that this data was derived from the books, so it does not contain some battles and events that happened on the show, as the show has already surpassed the books.

Requirements:

Data:

We have already imported all the data from the CSVs into our Neo4j in the previous posts. If we look closely, we can find that some persons belong to two houses, one of which is None, the default unknown value in character_deaths.csv.


To get rid of this anomaly, we can simply MATCH all persons that belong to House "None" and also to a separate house. Then we just delete the relationship pointing to House "None" and the problem is solved.

match (h1:House{name:"None"})-[r]-(p:Person)--(h2:House)
delete r

Distributions:

We can start by visualizing a few distributions with Spoon.js directly in the Neo4j Browser. This does not work on 3.2.x, as the Neo4j Browser was rewritten.

Which house won/lost the most battles?

match (house:House)
OPTIONAL MATCH (house)-[fight:ATTACKER|:DEFENDER]->()
return house.name as house, 
       sum(case fight.outcome when "win" then 1 else 0 end) as wins,
       sum(case fight.outcome when "loss" then 1 else 0 end) as losses
order by wins + losses desc limit 20


We can easily see that the Lannisters are currently on the rise, with 11 wins and 7 losses. Second are the Starks, who are on the other hand in decline, with 6 wins and 10 losses. The Red Wedding and what came afterward was really devastating for the Starks. The Freys and Boltons look to be in a good position, as they were on the winning side of the Red Wedding, but those who watch the show know that a bitter end awaits them.

Distribution of alive/dead people by house:

match (p:Person)--(h:House)
return h.name,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15


It's a hard life being in the Night's Watch, and we can see that mortality is high. Almost half the characters from the Night's Watch have died. It is also not a good time for the Targaryens (yet!), as they were mostly killed off during Robert's Rebellion, AKA the War of the Usurper. We all know that the Starks are dying like flies, but it came as a surprise to see so many dead Lannisters.

Distribution of alive/dead people by culture:

match (p:Person)--(c:Culture)
return c.name as culture,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15


We can observe that the most represented cultures are the Northmen and the Ironborn. Seeing so many Northmen in the books does not come as a surprise, as the whole story starts with House Stark living in Winterfell. What is surprising is that there are so many Ironborn in the books. I think this has to do with the Kingsmoot at the Iron Islands, where we get to know lots of different Ironborn characters.

Valyrians are almost extinct, but they have the hidden joker Daenerys Targaryen, who has a big role yet to play.

Distribution of alive/dead people by books:

match (p:Person)-[r]-(h:Book)
with h.name as book,
     h.sequence as sequence,
     sum(case when type(r)="APPEARED_IN" then 1 else 0 end) as total,
     sum(case when type(r)="DIED_IN" then 1 else 0 end) as dead 
     order by sequence asc
return book,total,dead


These figures of dead people seem small-ish for GoT, so I guess not all deaths are recorded. We can see how the author slowly built up the community of people in the books. It started with 400 characters in the first book and rose to almost 1200 characters in the fourth book. That is a lot of characters to keep track of. Storm of Swords was the bloodiest book with 97 deaths, while Feast for Crows has unusually few deaths, only 27.

Analysis:

We will use algo.unionFind, which is a (weakly) connected components algorithm. Taken from the docs:

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Which houses fought on the same side?

We can create a network of houses from the indirect patterns of battles they fought together. We will also assume that an ally of my ally is also my ally. In this case we can use algo.unionFind to solve the problem.

Create a network of houses that fought on the same side

match (house1:House)-[r1]-(battle:Battle)-[r2]-(house2:House)
WHERE type(r1)=type(r2)
MERGE (house1)-[:ALLY]-(house2)

Run the algorithm and write the results back as a community property on the nodes.

CALL algo.unionFind('House', 'ALLY', {write:true, partitionProperty:"community"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (h:House)
return distinct(h.community) as community,
       collect(h.name) as members,
       count(*) as count 
order by count desc limit 4


We observe that 4 different communities (partitions) came back as a result. We can see that the Lannisters teamed up with the Boltons and Freys to defeat the Starks. The Night's Watch and the Baratheons fought on the same side when they defended the Wall from the Wildlings (Free folk, Thenns). Stark and Tully have an alliance through marriage (Catelyn Tully and Ned Stark), so it is no wonder to see them fight battles on the same side.

Which persons fought on the same side in the year 299?

We will do something similar for persons, but we will focus only on the year 299, as most battles were fought that year. The other reason I chose a specific year is that alliances change over time; if we looked at all battles, we would get most persons in one community, as they fought on the same side at least once.
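
We can quickly confirm that 299 was the busiest year, as the year property was imported with the battles in part 1:

MATCH (b:Battle)
RETURN b.year as year, count(*) as battles
ORDER BY battles DESC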

Create a network of persons that fought on the same side in the year 299.

MATCH (p1:Person)-[r1]-(b:Battle)-[r2]-(p2:Person) 
WHERE b.year = 299 and type(r1)=type(r2)
MERGE (p1)-[:FRIEND]-(p2)

Run the algorithm and write the results back as a community property on the nodes.

CALL algo.unionFind('Person', 'FRIEND', {write:true, partitionProperty:"community"}) 
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (p:Person)
return distinct(p.community) as community,
       collect(p.name) as members,
       count(*) as count 
order by count desc limit 10


We observe that 10 different communities (partitions) exist. This is a good representation of the state of the alliances that fought together in that specific year. We could do this analysis for a couple of different years and compare how the alliances change over time.

Conclusion:

We have only scratched the surface of what is possible to do with this dataset so stay tuned for more!

You can grab the Neo4j database and just copy it into the data/databases folder.

Got for Neo4j 3.2.x

Got for Neo4j 3.1.x

Neo4j Game of Thrones part 3

This is part 3 of the GoT series, so if you haven't seen the previous posts, I suggest you check them out.

In this part we will import the last CSV, so that we will be able to use data from all three sources in our analysis/visualizations.

Data:

We will be using the kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. In this third part we will import character_predictions.csv. This CSV takes an expanded view on character deaths, including predictions of how likely they are to die.

Graph model:


Our graph model is slowly evolving with every new CSV we import. We add a new Culture node and a RELATED_TO relationship between persons, which creates a social network of persons. By the time we are done with importing, we will have 2091 Person and 352 House nodes in our knowledge graph, which is not bad for a GoT dataset.

Import:

As you will see, importing character_predictions.csv means doing the FOREACH trick 10 times, because the data has lots of null values in it that we need to skip in order for the MERGE statements to work as needed.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
MERGE (p:Person{name:row.name})
// set properties on the person node
SET p.title = row.title,
    p.death_year = toINT(row.DateoFdeath),
    p.birth_year = toINT(row.dateOfBirth),
    p.age = toINT(row.age),
    p.gender = case when row.male = "1" then "male" else "female" end
// doing the foreach trick to skip null values
FOREACH
  (ignoreMe IN CASE WHEN row.mother is not null THEN [1] ELSE [] END |
    MERGE (mother:Person{name:row.mother})
    MERGE (p)-[:RELATED_TO{name:"mother"}]->(mother)
)
FOREACH
  (ignoreMe IN CASE WHEN row.spouse is not null THEN [1] ELSE [] END |
    MERGE (spouse:Person{name:row.spouse})
    MERGE (p)-[:RELATED_TO{name:"spouse"}]->(spouse)
)
FOREACH
  (ignoreMe IN CASE WHEN row.father is not null THEN [1] ELSE [] END |
    MERGE (father:Person{name:row.father})
    MERGE (p)-[:RELATED_TO{name:"father"}]->(father)
)
FOREACH
  (ignoreMe IN CASE WHEN row.heir is not null THEN [1] ELSE [] END |
    MERGE (heir:Person{name:row.heir})
    MERGE (p)-[:RELATED_TO{name:"heir"}]->(heir)
)
// we remove "House " from the value for better linking of data
FOREACH 
  (ignoreMe IN CASE WHEN row.house is not null THEN [1] ELSE [] END | 
    MERGE (house:House{name:replace(row.house,"House ","")}) 
    MERGE (p)-[:BELONGS_TO]->(house) 
)

I split the query into two parts for more clarity, but as you can see there is nothing fancy about it; we just use the FOREACH trick over and over again.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
// match person
MERGE (p:Person{name:row.name})
// doing the foreach trick... we lower row.culture for better linking
FOREACH
  (ignoreMe IN CASE WHEN row.culture is not null THEN [1] ELSE [] END |
    MERGE (culture:Culture{name:lower(row.culture)})
    MERGE (p)-[:MEMBER_OF_CULTURE]->(culture)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book1 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:1})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book2 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:2})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book3 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:3})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book4 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:4})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book5 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:5})
    MERGE (p)-[:APPEARED_IN]->(book)
)

Multi-labeled nodes:

Nodes can have none, one or more labels. Think of labels as a hierarchy of rules, where first-level labels group entities together ((:Person),(:Company),(:House)) and additional, second-level labels act as a preprocessed way of filtering nodes faster and more easily ((:Person:Dead),(:Person:King),(:Company:Reseller)) etc…

Check also the AirBnB talk at GraphConnect.

Let's mark all the dead people in our database first, since this is a GoT database. Next we will mark all kings and knights with additional labels.

Dead persons:

Query 1:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
// do CASE statements
with row,
case when row.isAlive = "0" THEN [1] ELSE [] END as dead_person,
case when row.isAliveMother = "0" THEN [1] ELSE [] END as dead_mother,
case when row.isAliveFather = "0" THEN [1] ELSE [] END as dead_father,
case when row.isAliveHeir = "0" THEN [1] ELSE [] END as dead_heir,
case when row.isAliveSpouse = "0" THEN [1] ELSE [] END as dead_spouse
// MATCH all the persons
MATCH (p:Person{name:row.name})
// We use OPTIONAL MATCH so that the query doesn't stop when there is no match
OPTIONAL MATCH (mother:Person{name:row.mother})
OPTIONAL MATCH (father:Person{name:row.father})
OPTIONAL MATCH (heir:Person{name:row.heir})
OPTIONAL MATCH (spouse:Person{name:row.spouse})
// Set the label of the dead persons
FOREACH (d in dead_person | set p:Dead)
FOREACH (d in dead_mother | set mother:Dead)
FOREACH (d in dead_father | set father:Dead)
FOREACH (d in dead_heir | set heir:Dead)
FOREACH (d in dead_spouse | set spouse:Dead)

Query 2:

MATCH (p:Person) where exists (p.death_year)
SET p:Dead

Kings:

Query 1:

MATCH (p:Person)-[:DEFENDER_KING|ATTACKER_KING]-()
SET p:King

Query 2:

MATCH (p:Person) where lower(p.title) contains "king"
SET p:King

Knight:

Query 1:

MATCH (p:Person) where p.title = "Ser"
SET p:Knight
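
With these labels in place, queries that would otherwise need property filters become cheap label intersections. For example (my own example query):

MATCH (p:Person:King:Dead)
RETURN p.name as dead_king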

Refactor:

After importing, you can check the cleanliness of your data: how well the persons in your database match, and whether there are many typos, etc. Ordering the returned names alphabetically lets you quickly spot mistakes if they exist.

Example query:

MATCH (n:Person)
RETURN n.name as name order by name

I cleaned up the culture column in character_predictions.csv, but left in one mistake, so that I can show you how to clean up node duplication caused by a typo in the data.


You can use the apoc plugin to help you with refactoring, merging the two nodes into a single one while retaining all the connections from both nodes.

match (c1:Culture{name:"wildlings"}),(c2:Culture{name:"wildlingss"}) 
call apoc.refactor.mergeNodes([c2,c1]) yield node 
return distinct "done"

Conclusion:

I hope you got a feeling for what it looks like to import different CSVs. If you managed to understand most of what I used in this series, you are probably capable of importing almost anything into Neo4j. We are done with importing data, so next time we can start analysing our GoT dataset.

You can grab the Neo4j database and just copy it into the /data/databases folder.

Got for Neo4j 3.2.x

Got for Neo4j 3.1.x

 

Neo4j Game of Thrones part 2

This is the second part of the GoT series; if you haven't seen the first part, I suggest you check it out. In this part I will show you how I approach connecting different sources together in a knowledge graph.

Data:

We will be using the kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. In this second part we will import character_deaths.csv. It contains information about characters and when they died.

Graph model:


We will build on top of battles.csv from the previous part. We will add Book and Status nodes and a BELONGS_TO relationship from Person to House.

Linking:

When dealing with real world data, we run into dirty data sooner or later. We have information about Person and House nodes in all three CSVs. Before importing the second CSV on top of the first one, I always check how well they match and whether there is anything I can do to improve it. I will show you an example of matching House nodes.

First I check how many rows in the CSV have row.Allegiances (name of the house) column non-null.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
return count(row.Allegiances)

We get back a result of 917 rows.

Then I check how many houses match existing nodes. Using MATCH in this case works like an INNER JOIN in SQL, where you only get back rows that match the first table on a specific property (node in our case).

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
MATCH (h:House{name:row.Allegiances})
return count(row.Allegiances)

We get back a result of 414 rows, so the remainder did not match. There are two options: either there are new houses in the second CSV that did not appear in the first CSV, or there is a slightly different record of the same houses (typos etc…) in the second CSV.

After inspecting the data, I noticed that sometimes the record in row.Allegiances is "Stark" and other times it is "House Stark". Here lies our problem: some row.Allegiances values do not match existing House nodes. We can check how removing "House " from the value affects our linking.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
// use a replace function to remove "House "
MATCH (h:House{name:replace(row.Allegiances,"House ","")})
return count(row.Allegiances)

We get back a result of 521 rows, which is a nice improvement. This is how I generally approach linking different sources: looking at the raw data and trying different things to see what works best. Once you find the best solution (you might also fix the raw data), you can import the data so that it fits the existing graph as well as possible.

A MATCH statement used as above works as an INNER JOIN, where we could only update or count existing nodes. On the other hand, MERGE in this context works as an OUTER JOIN in SQL, where we combine records from the two sources into a single graph.
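
To make the contrast concrete: swapping the MATCH in the counting query above for a MERGE would create the missing House nodes instead of skipping those rows, so all 917 rows would come back (note that running this actually writes to the graph):

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
WITH row WHERE row.Allegiances is not null
MERGE (h:House{name:replace(row.Allegiances,"House ","")})
return count(row.Allegiances)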

I did something similar for Person node and corrected a few names in the CSVs.

Import:

Once we are done with linking the data, we can import it into Neo4j. The character_deaths.csv is not so complicated, so we can import it in a single query. You will notice that parts of the query repeat from the first one. I hope that by the end of this series you will be able to import any CSV, as these are mostly all the things you need to know when importing into Neo4j. I will also show you how to use a CASE statement.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
// we can use CASE in a WITH statement
with row,
     case when row.Nobility = "1" then "Noble" else "Commoner" end as status_value
// as seen above we remove "House " for better linking
MERGE (house:House{name:replace(row.Allegiances,"House ","")})
MERGE (person:Person{name:row.Name})
// we can also use CASE statement inline
SET person.gender = case when row.Gender = "1" then "male" else "female" end,
    person.book_intro_chapter = row.`Book Intro Chapter`, 
    person.book_death_chapter = row.`Death Chapter`,
    person.death_year = toINT(row.`Death Year`)
MERGE (person)-[:BELONGS_TO]->(house)
MERGE (status:Status{name:status_value})
MERGE (person)-[:HAS_STATUS]->(status)
// doing the foreach trick to skip null values
FOREACH
  (ignoreMe IN CASE WHEN row.GoT = "1" THEN [1] ELSE [] END | 
    MERGE (book1:Book{sequence:1}) 
    ON CREATE SET book1.name = "Game of thrones" 
    MERGE (person)-[:APPEARED_IN]->(book1))
FOREACH
  (ignoreMe IN CASE WHEN row.CoK = "1" THEN [1] ELSE [] END | 
    MERGE (book2:Book{sequence:2}) 
    ON CREATE SET book2.name = "Clash of kings" 
    MERGE (person)-[:APPEARED_IN]->(book2))
FOREACH
  (ignoreMe IN CASE WHEN row.SoS = "1" THEN [1] ELSE [] END | 
    MERGE (book3:Book{sequence:3}) 
    ON CREATE SET book3.name = "Storm of swords" 
    MERGE (person)-[:APPEARED_IN]->(book3))
FOREACH
  (ignoreMe IN CASE WHEN row.FfC = "1" THEN [1] ELSE [] END | 
    MERGE (book4:Book{sequence:4}) 
    ON CREATE SET book4.name = "Feast for crows" 
    MERGE (person)-[:APPEARED_IN]->(book4))
FOREACH
  (ignoreMe IN CASE WHEN row.DwD = "1" THEN [1] ELSE [] END | 
    MERGE (book5:Book{sequence:5}) 
    ON CREATE SET book5.name = "Dance with dragons" 
    MERGE (person)-[:APPEARED_IN]->(book5))
FOREACH
  (ignoreMe IN CASE WHEN row.`Book of Death` is not null THEN [1] ELSE [] END | 
    MERGE (book:Book{sequence:toInt(row.`Book of Death`)}) 
    MERGE (person)-[:DIED_IN]->(book))

Conclusion:

We have successfully imported the second CSV of the dataset. There is only one left to import, and then we can start to play around with analysing the dataset and visualizing it with spoon.js.

Example visualization:

How many battles did each house participate in, either as attacker or defender?

match (h:House)-[:ATTACKER]->()
WITH h,count(*) as attacker_side
optional match (h)-[:DEFENDER]->(b1)
return h.name,attacker_side,count(b1) as defender_side order by attacker_side desc limit 10


Neo4j Game of Thrones part 1

We are only a month away from the seventh Game of Thrones season, so the time is right to import GoT data into our favourite graph database, Neo4j. The data is from the books, so some things that happened on the TV show haven't happened yet in the books. The data available on kaggle is good material to get started with, as I can show you all the queries you will need when importing and linking different sources of data together. This will probably end up a 3 to 4 part series, so that we can cover all the available material.

Data:

We will be using the kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. I cleaned up the names of the persons a bit, so that they match better. There are 3 CSVs available.

  • battles.csv — contains information about all of the battles in game of thrones.
  • character-deaths.csv — contains information about characters and when they died.
  • character-predictions.csv — This CSV takes an expanded view on character deaths, including predictions of how likely they are to die.

In this first part we will import battles.csv.

Graph model:


We will start with a simple graph model containing Person, House and Battle nodes and a very simple location tree. Unfortunately we cannot link Person to House nodes, as the links are not explicitly marked. We will get these connections from the other two CSVs.

Import:

First we will create constraints and indexes, so that our queries will be faster.

CALL apoc.schema.assert(
{Location:['name'],Region:['name']},
{Battle:['name'],Person:['name'],House:['name']});

We will split the import into 4 different queries for easier understanding.

Battle:

First we create :Battle nodes using MERGE.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
//merge node labeled Battle 
MERGE (b:Battle{name:row.name})
ON CREATE SET b.year = toINT(row.year),
              b.summer = row.summer,
              b.major_death = row.major_death,
              b.major_capture = row.major_capture,
              b.note = row.note,
              b.battle_type = row.battle_type,
              b.attacker_size = toINT(row.attacker_size),
              b.defender_size = toINT(row.defender_size)

Foreach trick:

If we check the data, we can see that some battles have one attacker and some have four. Attackers are stored in separate columns, so when there is only one attacker, there are 3 columns with a null value. When there are null values in a column and we want to skip (ignore) them when using MERGE on that property, we can use the following foreach trick.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// there is only attacker_outcome in the data, 
// so we do a CASE statement for defender_outcome
WITH row,
case when row.attacker_outcome = "win" THEN "loss" ELSE "win" END as defender_outcome
// match the battle
MATCH (b:Battle{name:row.name})
// all battles have at least one attacker so we don't have to use the foreach trick
MERGE (attacker1:House{name:row.attacker_1}) 
MERGE (attacker1)-[a1:ATTACKER]->(b) 
ON CREATE SET a1.outcome = row.attacker_outcome

// When we want to skip null values we can use foreach trick
FOREACH
  (ignoreMe IN CASE WHEN row.defender_1 is not null THEN [1] ELSE [] END | 
    MERGE (defender1:House{name:row.defender_1})
    MERGE (defender1)-[d1:DEFENDER]->(b)
    ON CREATE SET d1.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.defender_2 is not null THEN [1] ELSE [] END | 
    MERGE (defender2:House{name:row.defender_2})
    MERGE (defender2)-[d2:DEFENDER]->(b)
    ON CREATE SET d2.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_2 is not null THEN [1] ELSE [] END | 
    MERGE (attacker2:House{name:row.attacker_2})
    MERGE (attacker2)-[a2:ATTACKER]->(b)
    ON CREATE SET a2.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_3 is not null THEN [1] ELSE [] END | 
    MERGE (attacker3:House{name:row.attacker_3})
    MERGE (attacker3)-[a3:ATTACKER]->(b)
    ON CREATE SET a3.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_4 is not null THEN [1] ELSE [] END | 
    MERGE (attacker4:House{name:row.attacker_4})
    MERGE (attacker4)-[a4:ATTACKER]->(b)
    ON CREATE SET a4.outcome = row.attacker_outcome)

Coalesce:

What if we didn't want to skip null values, but rather mark them with a default value of "Unknown"? In this case we can use coalesce to set a default value for all null values. Because there is only one location column with null values, we do not have to use local merges, but you can get more information in Neo4j Location Trees.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
MATCH (b:Battle{name:row.name})
// We use coalesce, so that null values are replaced with "Unknown" 
MERGE (location:Location{name:coalesce(row.location,"Unknown")})
MERGE (b)-[:IS_IN]->(location)
MERGE (region:Region{name:row.region})
MERGE (location)-[:IS_IN]->(region)

Multiple unwinds:

When using multiple unwinds in a single query, we need to take care of the cardinality (number of rows), as UNWIND transforms any list back into individual rows. We need to separate the unwinds using an aggregation function, so they do not become nested (like a nested .map function in JS).

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// We split the columns that may contain more than one person
WITH row,
     split(row.attacker_commander,",") as att_commanders,
     split(row.defender_commander,",") as def_commanders,
     split(row.attacker_king,"/") as att_kings,
     split(row.defender_king,"/") as def_kings,
     row.attacker_outcome as att_outcome,
     CASE when row.attacker_outcome = "win" THEN "loss" 
     ELSE "win" END as def_outcome
MATCH (b:Battle{name:row.name})
// we unwind a list
UNWIND att_commanders as att_commander
MERGE (p:Person{name:trim(att_commander)})
MERGE (p)-[ac:ATTACKER_COMMANDER]->(b)
ON CREATE SET ac.outcome=att_outcome
// to end the unwind and correct cardinality(number of rows)
// we use any aggregation function ( e.g. count(*))
WITH b,def_commanders,def_kings,att_kings,att_outcome,def_outcome,count(*) as c1
UNWIND def_commanders as def_commander
MERGE (p:Person{name:trim(def_commander)})
MERGE (p)-[dc:DEFENDER_COMMANDER]->(b)
ON CREATE SET dc.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,def_kings,att_kings,att_outcome,def_outcome,count(*) as c2
UNWIND def_kings as def_king
MERGE (p:Person{name:trim(def_king)})
MERGE (p)-[dk:DEFENDER_KING]->(b)
ON CREATE SET dk.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,att_kings,att_outcome,count(*) as c3
UNWIND att_kings as att_king
MERGE (p:Person{name:trim(att_king)})
MERGE (p)-[ak:ATTACKER_KING]->(b)
ON CREATE SET ak.outcome = att_outcome

Conclusion:

As we have to import two more CSVs, I will not run any queries on the data yet, but will focus first on the importing and linking part of the process and then do an analysis of the data from all three CSVs. The second part should be coming in a few days, so stay tuned!

Neo4j Marvel Social Graph

Today I will show you how to import the Marvel graph into Neo4j. We will use a very simple graph model. I will show you the advantages of having a social network stored in a graph database.

Requirements:

Data:

We will use the marvel social network data, which was downloaded from a kaggle competition post. There are 3 CSVs available.

  • nodes.csv — Node name and type
  • edges.csv — Heroes and the comic in which they appear.
  • hero-network.csv — Edges between heroes that appear in the same comic.

We only need the edges.csv file, because knowing which hero appeared in which comic is enough for us to import the social graph. We also do not need hero-network.csv, which holds information about the unweighted hero network, because we will create our own weighted hero network with the help of cypher and apoc.

Graph model:

We will use a very simple graph model, with nodes labeled Hero and Comic and an APPEARED_IN relationship between them.


Import:

First we create constraints on the labels Comic and Hero, so that our queries will be better optimized and faster.

CALL apoc.schema.assert(
{},
{Comic:['name'],Hero:['name']});

I uploaded edges.csv to my git repository, so you can import the graph in one step using the query below. It is very basic: we just MERGE two nodes and a relationship between them. When we are importing larger amounts of data, we want to batch it with USING PERIODIC COMMIT.

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-marvel/master/data/edges.csv" as row
MERGE (h:Hero{name:row.hero})
MERGE (c:Comic{name:row.comic})
MERGE (h)-[:APPEARED_IN]->(c)

Create a weighted undirected hero network:

Now that we have imported the data, we will create a weighted hero network, where we will assume that the more comics two heroes appeared in together, the better they know each other. We will save the number of comics heroes appeared in together as a weight property of the KNOWS relationship.

We will use apoc.periodic.iterate to help us with batching. As our graph gets bigger, we do not want to run global refactorings or calculations in one transaction; we want to batch them. Apoc comes in handy for batching with cypher.

Taken from apoc documentation:

With apoc.periodic.iterate you provide 2 statements, the first outer statement is providing a stream of values to be processed. The second, inner statement processes one element at a time or with iterateList:true the whole batch at a time.

We can use following query to create a weighted undirected hero social graph.

CALL apoc.periodic.iterate(
"MATCH (p1:Hero)-->(:Comic)<--(p2:Hero) where id(p1) < id(p2) RETURN p1,p2",
"MERGE (p1)-[r:KNOWS]-(p2) ON CREATE SET r.weight = 1 ON MATCH SET r.weight = r.weight + 1"
, {batchSize:5000, parallel:false,iterateList:true})

Now let's dissect the query for better understanding.

First outer statement:

The first outer statement should return a stream of values to be processed. It consists of a simple cypher query that matches two heroes that appeared in the same comic. There is a small but important detail in the WHERE id(p1) < id(p2) statement. This way we avoid duplicate results, because using <> instead of < would return every pair twice, with the first MATCH returning hero1 as p1 and hero2 as p2, and the second MATCH returning hero1 as p2 and hero2 as p1.

MATCH (p1:Hero)-->(:Comic)<--(p2:Hero) where id(p1) < id(p2) RETURN p1,p2

Second inner statement:

The second inner statement processes one element at a time, or with iterateList:true the whole batch at a time. We use MERGE to create the relationship: it will try to match the relationship, and if it doesn't find one, it will create one. Check the docs for more info. Cypher has built-in logic that supports actions based on whether the node or relationship was created or found by MERGE. We can use ON CREATE SET for the scenario where the pattern is created, or ON MATCH SET if it was found (already exists).

We can also notice that we do not specify the direction of the relationship. We treat this network as undirected, so the direction of the relationship has no meaning. MERGE provides the ability to not specify direction, which in practice means that it checks whether a relationship exists between the given nodes in either direction, and if none is found, it creates one in a random direction.

MERGE (p1)-[r:KNOWS]-(p2)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = r.weight + 1

Heroes with top weighted degree:

We can check which heroes have the biggest weighted degree, which is just the sum of all weights of the relationships connected to a given node.

Even though we treat our network as undirected, Neo4j still stores the relationships as directed. That is not a problem, as we can MATCH without specifying the direction of the relationship, querying it as undirected.

MATCH (h:Hero)-[t:KNOWS]-()
RETURN h.name as hero,sum(t.weight) as weighted_degree
ORDER BY weighted_degree DESC LIMIT 10

Results:


 

This was the first part, where we learned how to import a social graph and create a weighted undirected network between our heroes. Now we can start to test graph algorithms, like the neo4j graph algorithms or the apoc graph algorithms, on our network. Stay tuned, as I will talk about algorithms more next time.

Neo4j APOC graph algorithms part 2

As I said in the previous blog post, the APOC plugin offers more graph algorithms, which you can easily use by simply calling APOC procedures in cypher. Today I will show you how to use centrality and community algorithms. There are also some path-finding algorithms in APOC, like dijkstra or A*, but those may be covered next time. I also found this cool blog post from William Lyon, where he does something similar to what I am doing, but uses pictures and describes the graph algorithms nicely.

Dataset:

I will just use the standard actors/movies dataset, which you can get by running :play movies in your Neo4j Browser console.

I will use the same dummy features as in the previous part.

Algorithms:

As we noticed in part one of the series, our actors are very similar given the two features we used to describe them (age and number of movies). In my case I will use hard similarity, meaning I will set a similarity threshold that determines whether a relationship between two persons is created or not. If we do not do this, we end up with a fully connected graph, meaning every Person node is connected to all the other Person nodes. Community detection algorithms, for example, do not work well with that.

Hard similarity:

MATCH (p1:Person),(p2:Person) where id(p1) < id(p2)
AND exists(p1.age) and exists(p2.age)
WITH p1,p2,apoc.algo.euclideanSimilarity([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value 
//Hard similarity with a threshold of 0.6 
WITH p1,p2,value where value > 0.6
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.e_similarity = value


We have created 3300 relationships between 123 people. We can also see that the result is 6 different isolated clusters.

Community detection:

Apoc uses a simple label propagation kernel for community detection. More info about the algorithm is in the documentation.

CALL apoc.algo.community(25,null,'community','SIMILARITY','BOTH','e_similarity' ,10000)

Let's check the results:

MATCH (n:Person) 
RETURN distinct(n.community) as community, count(n) as Members,
       avg(n.age) as avg_age, avg(n.count) as avg_count,
       stdev(n.age) as age_stdev, stdev(n.count) as count_stdev


We can easily see that the biggest cluster, with 77 members, has an average age of 58.5 years, and its members worked on one movie. We can also see that our clusters differ from each other based on the number of movies the members worked on, and that each cluster consists of members of similar age.

Let's try some visualization with APOC virtual nodes and relationships.

MATCH (p1:Person) with p1.community as community,collect(p1) as members 
CALL apoc.create.vNodes(['Community'], [{name:community}]) YIELD node as comm 
UNWIND members as member 
CALL apoc.create.vRelationship(member,'MEMBER_OF_COMMUNITY',{}, comm) YIELD rel 
RETURN member,rel,comm


I also want to give you some examples of other algorithms. The theory descriptions are from Wikipedia.

Closeness centrality:

In a connected graph, the closeness centrality (or closeness) of a node is a measure of centrality in a network, calculated as the sum of the length of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes.

MATCH (node:Person)
WITH collect(node) AS nodes
CALL apoc.algo.closeness(['SIMILARITY'],nodes,'BOTH') YIELD node, score
RETURN node, score
ORDER BY score DESC

Betweenness centrality:

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.

MATCH (node:Person)
WITH collect(node) AS nodes
CALL apoc.algo.betweenness(['SIMILARITY'],nodes,'BOTH') YIELD node, score
RETURN node, score
ORDER BY score DESC

Pagerank:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

MATCH (node:Person)
WITH collect(node) AS nodes
// perform 10 page rank iterations, computing only over relationships of type SIMILARITY
CALL apoc.algo.pageRankWithConfig(nodes,{iterations:10,types:'SIMILARITY'}) YIELD node, score
RETURN node, score
ORDER BY score DESC

Awesome algorithms:

Apoc also supports running pagerank and betweenness centrality over cypher projections, which do not need a direct network, but can use the power of cypher to run the algorithms on indirect relationships in the graph. That is pretty awesome.

Pagerank with cypher:

CALL apoc.algo.pageRankWithCypher(
{iterations:5, write:true, 
node_cypher:'MATCH (p:Person) return id(p) as id',
rel_cypher: 'MATCH (p:Person)-->()<--(o:Person) return id(p) as source,id(o) as target, 1 as weight'
});

Betweenness with cypher:

CALL apoc.algo.betweennessCypher(
{write:true, 
node_cypher:'MATCH (p:Person) return id(p) as id',
rel_cypher: 'MATCH (p:Person)-->()<--(o:Person) return id(p) as source,id(o) as target, 1 as weight'
})

Conclusion:

I like just how easy it is to run graph algorithm procedures with cypher. You can analyse direct networks or even indirect networks using algorithms with cypher, without any external tools, which is pretty impressive.

Thanks for reading.