Neo4j GoT Graph Analysis

Winter has come!

Thanks to Joakim Skoog, who built this cool API of ice and fire. It offers data about GoT universe, that we will examine and do some further analysis.

I will show you how to use just released neo4j graph algorithms plugin on the data, that we got from the above API. As the name suggests, they are user defined procedures of graph algorithms, that you can call with cypher. For most algorithms there are two procedures, one that writes results back to the graph as node-properties and another (named algo..stream) that returns a stream of data, e.g. node-ids and computed values.

Requirements:

Import:

Our cool community caretaker Michael Hunger spotted this data and posted a blog post, so that we can easily import the data into our Neo4j using cypher shell. I followed his script, imported the data, and am now sharing the graph.db database, that you can simply copy to your neo4j/data/databases folder and you will have the GoT universe in your Neo4j

Graph Model:

got.png
We have imported the data, that describes a network of persons and houses, with some additional meta-data. There are a couple of relationships between persons and houses like SWORN_TO and LED_BY. We can also find family tree hiearchies in PARENT_OF relationship. I will use the SWORN_TO relationship between houses, that describes house hiearchy of allegiances. It is a directed, unweighted network.

Analysis:

Sworn to network

sworn_to.png

At first glance we can observe, that we have some bridges between different part of communities and that they all link towards the center where the power resides. As you will see this network is from the peaceful times, when the 7 kingdoms were at peace.

Connected Components:

As a part of preprocessing we will check how many disconnected components exist within our network with algo.unionFind.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Query:

CALL algo.unionFind('House', 'SWORN_TO', {write:true, partitionProperty:"partition"})

Result:
updateconect.png
We see that there is one big component with 393 members and a small one with 5 members. All the rest are disconnected with only one member. That means that either they are a loner house or we are missing some data. We will use centrality algorithms only on the biggest component with 393 members as graph algorithms work better on connected graphs.

Centralities:

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

Outgoing degree is always one as each house is sworn to only one house.

Incoming degree:

We simply count how many incoming relationships each house has. In this example it directly translates to how many houses are sworn to each house.

MATCH (h:House)<-[:SWORN_TO]-(:House)
RETURN h.name as house,count(*) as in_degree order by in_degree desc limit 10

updatedsworn

House Tyrell leads with a decent margin in front of Lannisters. The main force is house Baratheon, that consists of two branches and has combined number of sworn houses at 77. Others slowly follow, its funny ( sad ? ) to see, that even house Baelish of Harrenhal, which is ruled by Petyr Baelish has more support than house Stark of Winterfell. Books and show divergence a lot for Littlefinger’s character as he is not a ruler of Harrenhal in the show.

Pagerank centrality:

Pagerank is a version of weighted degree centrality with a feedback loop. You only get your “fair share” of your neighbor’s importance. That is, your neighbor’s importance is split between their neighbors, proportional to the number of interactions with that neighbor. Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures where the power resides. Houses that have the support of other big houses will come on top.

CALL algo.pageRank(
'MATCH (h:House) where h.partition = 151 return id(h) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:30, write: true});

pagerank.png

I was surprised with the results to say the least, but after investigating a bit I understood why. Remember in the picture of sworn to network, where I told you that all the different communities point at the center of the graph where the power resides. And at that center are the two Baratheon houses, that are sworn to each other and hence the number is so big. Where incoming degree takes into account only the number of your neighbours, pagerank also takes into account their importance. You can notice that house Stark has a better position than house Baelish, because Stark’s supporters are more important ( have more sworn to houses ).

Betweenness centrality:

Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through that person. Such an intermediary position gives that person power and influence. Betweenness centrality is a raw count of the number of short paths that go through a given node. For example, if a node is located on a bottleneck between two large communities, then it will have high betweenness.

CALL algo.betweenness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty: 'betweenness'});

betweenness.png
House Baratheon is uniquely positioned at the center of the graph, so it will be the first in all of the centrality measures. Strong second candidate are the Tyrells, which is a bit surprising as they beat the Lannisters at both pagerank and betweenness centrality. In our case betweenness centrality measures which are the bridging houses with big support of other houses at their back.

Closeness centrality:

The most central nodes according to closeness centrality can quickly interact to all others because they are close to all others. This measure is preferable to degree centrality, because it does not take into account only direct connections among nodes, but also indirect connections.

CALL algo.closeness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty:'closeness'});

closeness.png
Closeness centrality is different from others in that the lower the closeness weight is, the better. As always house Baratheon leads with a decent margin, Tyrells are standing firmly on the second place. Starks are better than Lannisters at closeness, but are losing in all other measures, so they get fourth place and Lannisters third.

Conclusion:

This is a network from peaceful times and is therefore less interesting than having two strong disconnected components, that fight each other. What would be cool is to get a network based on the current show or even better that G.R.R.M publishes the Words of Winter, where alliances will change and people will die, to see how the graph evolves through time.

Check also https://github.com/neo4j-examples/game-of-graphs, where you can see how to expose this data as a GraphQL endpoint.

Thank for reading!

Advertisements

Neo4j apoc.load.jdbc

I recently made a simple query, that I used for syncing postgreSQL to Neo4j on an hourly basis with a cron job. My job was to make a simple scripts, that would automate the syncing of Neo4j with a SQL database. This is a very simple task thanks to APOC functions, which has loads of very useful functions, we can call with cypher.

I had to import a couple of SQL tables and join them together in a graph to have a better overview of the data. This is an example of query, that was used to import a table with sales data.

Graph Model:

jdbc2.png

As you can see we are importing some sales data from the postgreSQL database in my case. I made a very simple graph model to capture the data we get about our sales. Main focus is the invoice. We use domain instead of company name as an index as this works out better in our case.

You will need:

Just download them and put them into neo4j /plugins directory and that’s it!

Authorization:

Sample call looks like:

CALL apoc.load.jdbc('jdbc:mysql://localhost:3306/proddb1?user=root&password=football',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

with jdbc:mysql://localhost:3306/proddb1?user=root&password=football' being the connection string, that contains all the sensitive information such as ip, username and password of your SQL database.
If you want to hide/alias the connection string, this can be accomplished by adding to the conf/neo4j.conf a parameter similar to:

apoc.jdbc.myalias.url=jdbc:mysql://localhost:3306/proddb1?user=root&password=football

and now the above sample call can be rewritten as:

CALL apoc.load.jdbc('myalias',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

For more info check knowledge base post.

Query:

I made some logic with cypher, where I first get the date of the latest invoice in my Neo4j and then only select those that are were updated on the same day or later. I imagine there is a better way of doing this, but it serves our purpose.

// match existing invoices
MATCH (w:Invoice)
// get the latest date in the desired format
WITH apoc.date.format(max(w.unix),'s','yyyy-MM-dd') as last_invoice
// prepare SQL select statement
WITH 
'SELECT * FROM public.order WHERE updated_at::date >=?::date' as order_query,
last_invoice
// call load.jdbc with the latest date of existing data as a param
CALL apoc.load.jdbcParams(postgresurl, order_query, [last_invoice]) yield row
// we need to parse out domains from the emails
WITH row,split(row.customer_email,"@")[1] as domain
// if the email is in below list, then take whole email as a code
// so that those with gmail are not treated as one customer
WITH *,case when domain in ["gmail.com","yahoo.com","outlook.com","hotmail.com","msn.com","hotmail.co.uk"] 
  then row.customer_email
  else domain end as code
MERGE (i:Invoice{order_id:row.order_number})
// we round all the numbers to 2 digits
ON CREATE SET i.status = row.status,
              i.total_amount = apoc.math.round(row.total,2,"UP"),
              i.quantity = toINT(row.total_line_items_quantity),
              i.total_tax = apoc.math.round(row.total_tax,2,"UP"),
// save a human readable date format
              i.date = apoc.date.format(row.completed_at,"ms","yyyy-MM-dd'T'HH:mm:ss.SSS"),
              i.postcode = row.customer_billing_address_postcode,
              i.address = row.billing_address_address_1,
              i.total_discount = apoc.math.round(row.total_discount,2,"UP"),
              i.total_shipping = apoc.math.round(row.total_shipping,2,"UP"),
              i.unix = (row.completed_at / 1000),
              i.invoice_number = row.order_meta_tzs_invoice_number,
              i.total_refunded = apoc.math.round(row.refunded_total,2,"UP")
// columns, that can change in our invoices in time, we add to ON MATCH statement	
ON MATCH SET i.status = row.status,
             i.total_refunded = apoc.math.round(row.refunded_total,2,"UP"), 
             i.total_tax = apoc.math.round(row.total_tax,2,"UP"), 
             i.total_amount = apoc.math.round(row.total,2,"UP"),
// merge domain and person nodes and connect them 
MERGE (domain:Domain{name:code})
MERGE (person:Person{email:row.customer_email})
MERGE (domain)-[:MEMBER]->(person)
ON CREATE SET 
    person.name = row.billing_address_first_name + " " + row.customer_billing_address_last_name,
              person.ip = row.customer_ip ,
              person.phone = row.customer_billing_address_phone
// merge payment menthod node
MERGE (pay:PaymentMethod{name:row.payment_details_method_id})
MERGE (i)-[:PAYMENT_METHOD]->(pay)
// merge location tree
MERGE (country:Country{code:row.billing_address_country})
MERGE (country)(city)
RETURN distinct 'orders imported'

This is so easy I had to share with you, so you can maybe do a pilot project on your own to see how does Neo4j fit to your needs. Just wrap around it your favourite scripting language and you are good to go.

Game of Thrones part 4 Analysis

If you followed my GoT series so far, you have seen how to import a  GoT dataset, that I found on kaggle.com into Neo4j. There are only a couple days left before new season starts, so let’s do some analysis on the dataset. You need to know, that this data was derived from the books, so it does not contain some battles and data about the events that happened on the show as it has already surpassed the books.

Requirements:

Data:

We have already imported all the data from the CSVs into our Neo4j in previous posts. If we look closely, we can find that some persons have two houses they belong to, and one of them is None, which is the default unknown value in characters_death.csv.

Screen Shot 2017-07-14 at 13.48.42.png

To get rid of this anomaly, we can simply MATCH all persons that belong to House “None” and a separate house also. Then we just delete the relationship pointing to House “None” and we have solved the problem.

match (h1:House{name:"None"})-[r]-(p:Person)--(h2:House)
delete r

Distributions:

We can start by visualizing few distributions with Spoon.js directly in Neo4j Browser. Does not work on 3.2.x as the Neo4j browser was rewritten. 

Which house won/lost most battles ?

match (house:House)
OPTIONAL MATCH (house)-[fight:ATTACKER|:DEFENDER]->()
return house.name as house, 
       sum(case fight.outcome when "win" then 1 else 0 end) as wins,
       sum(case fight.outcome when "loss" then 1 else 0 end) as losses
order by wins + losses desc limit 20

House_battle_results.png

We can easily see the Lannisters are currently on the rise, with 11 wins and 7 losses. Second are the Starks, on the other hand in decline, with 6 wins and 10 losses. Red Wedding and what came afterward was really devastating for the Starks. Freys and Boltons look in a good position, as they were on the winning side of the Red Wedding, but those who watch the show, know that a bitter end is waiting for them.

Distribution of alive/dead people by house:

match (p:Person)--(h:House)
return h.name,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

house_dead.png

It’s a hard life being in a Night’s Watch and we can see that mortality is high. Almost half the characters from Night’s Watch have died. It is also not a good time for Targaryens (yet!), as they were mostly killed off during the Robert’s Rebellion AKA the War of the Usurper. We all know that Starks are dying like flies, but it came as a surprise to see so many dead Lannisters.

Distribution of alive/dead people by culture:

match (p:Person)--(c:Culture)
return c.name as culture,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

culture_dead

We can observe, that the most represented cultures are Northmen and Ironborn. While seeing so many Northmen in the books does not come as a surprise, as the whole story starts with House Stark living in Winterfell. Surprising is that there are so many Ironborns in the books. I think this all has to do with the Kingsmoot at Iron Islands, where we get to know lots of different Ironborn characters.

Valyrians are almost extinct, but they have the hidden joker Daenerys Targaryen, who has a big role yet to play.

Distribution of alive/dead people by books:

match (p:Person)-[r]-(h:Book)
with h.name as book,
     h.sequence as sequence,
     sum(case when type(r)="APPEARED_IN" then 1 else 0 end) as total,
     sum(case when type(r)="DIED_IN" then 1 else 0 end) as dead 
     order by sequence asc
return book,total,dead

book_dead.png

This figures of dead people seem small-ish for GoT, so I guess not all deaths are recorded. We can see, how the author slowly build up the community of all the people in the books. It started already with 400 characters in the first book and rose up to almost 1200 characters in the fourth book. That is a lot of characters to keep track. Storm of swords was the bloodiest book with 97 deaths, while Feast for Crows has unusual little deaths, only 27.

Analysis:

We will use algo.unionFind, that is a (weakly) connected components algorithm.  Taken from the docs.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Which houses fought on the same side ?

We can create a network of Houses from indirect patterns of battles they fought together. We will also assume that an ally of my ally is also my ally. In this case we can use algo.unionFind to solve the problem.

Create a network of houses that fought on the same side

match (house1:House)-[r1]-(battle:Battle)-[r2]-(house2:House)
WHERE type(r1)=type(r2)
MERGE (house1)-[:ALLY]-(house2)

Run the algorithms and write back the results as a property of the node community.

CALL algo.unionFind('House', 'ALLY', {write:true, partitionProperty:"community"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (h:House)
return distinct(h.community) as community,
       collect(h.name) as members,
       count(*) as count 
order by count desc limit 4

House_community.png

We observe 4 different communities (partitions) came back as a result. We can see that Lannisters teamed up with Boltons and Freys to defeat the Starks. Night’s Watch and Baratheon fought on the same side when they defended the Wall from Wildlings (Free folk, Thenns). Stark and Tully have an alliance with marriage (Catelyn Tully and Ned Stark), so it is no wonder to see them fight battles on the same side.

Which persons fought on the same side in the year 299 ?

We will do something similar for persons also, but we will focus only on the year 299 as there were most battles fought that year. The reason why I chose a specific year is also, because alliances change over time and if we looked at all battles, then we get most persons in one community as they fought on the same side at least once.

Create a network of persons that fought on the same side in the year 299.

MATCH (p1:Person)-[r1]-(b:Battle)-[r2]-(p2:Person) 
WHERE b.year = 299 and type(r1)=type(r2)
MERGE (p1)-[:FRIEND]-(p2)

Run the algorithms and write back the results as a property of the node community.

CALL algo.unionFind('Person', 'FRIEND', {write:true, partitionProperty:"community"}) 
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (p:Person)
return distinct(p.community) as community,
       collect(p.name) as members,
       count(*) as count 
order by count desc limit 10

Screen Shot 2017-07-14 at 16.11.52

We observe that 10 different communities (partitions) exist. It is a good representation of the state of alliances, that fought together on the specific year. We could do this analysis for a couple of different years and compare alliances over the year how they change.

Conclusion:

We have only scratched the surface of what is possible to do with this dataset so stay tuned for more!

You can grab the Neo4j database and just copy it into data/databases folder

Got for 3.2.x

Got for Neo4j 3.1.x