Neo4j GoT Graph Analysis

Winter has come!

Thanks to Joakim Skoog, who built this cool API of ice and fire. It offers data about the GoT universe, which we will examine and analyse further.

I will show you how to use the just-released Neo4j graph algorithms plugin on the data we got from the above API. As the name suggests, these are user-defined procedures of graph algorithms that you can call with cypher. For most algorithms there are two procedures: one that writes results back to the graph as node properties, and another (named algo.&lt;name&gt;.stream) that returns a stream of data, e.g. node ids and computed values.
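To make the distinction concrete, a stream call might look like the sketch below (a hypothetical example: the exact YIELD columns and parameter names vary between plugin versions):

```cypher
// stream variant: returns a stream of node ids and scores instead of writing back
CALL algo.pageRank.stream('House', 'SWORN_TO', {iterations:20})
YIELD nodeId, score
// resolve the internal id back to the node for a readable result
MATCH (h:House) WHERE id(h) = nodeId
RETURN h.name as house, score order by score desc limit 10
```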

Requirements:

Import:

Our cool community caretaker Michael Hunger spotted this data and posted a load script, so that we can easily import the data into Neo4j using cypher shell. I followed his script, imported the data, and am now sharing the graph.db database, which you can simply copy to your neo4j/data/databases folder to have the GoT universe in your Neo4j.

Graph Model:

got.png
We have imported the data, which describes a network of persons and houses with some additional metadata. There are a couple of relationship types between persons and houses, like SWORN_TO and LED_BY. We can also find family tree hierarchies in the PARENT_OF relationship. I will use the SWORN_TO relationship between houses, which describes the house hierarchy of allegiances. It is a directed, unweighted network.

Analysis:

Sworn to network

sworn_to.png

At first glance we can observe that we have some bridges between different parts of the network and that they all link towards the center, where the power resides. As you will see, this network is from the peaceful times, when the seven kingdoms were at peace.

Connected Components:

As a part of preprocessing we will check how many disconnected components exist within our network with algo.unionFind.

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Query:

CALL algo.unionFind('House', 'SWORN_TO', {write:true, partitionProperty:"partition"})

Result:
updateconect.png
We see that there is one big component with 393 members and a small one with 5 members. All the rest are single-member components. That means they are either lone houses or we are missing some data. We will use centrality algorithms only on the biggest component with 393 members, as graph algorithms work better on connected graphs.
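If you want to verify the component sizes yourself, a simple aggregation over the partition property written by the call above should do (a sketch, assuming the write call has already run):

```cypher
// count the members of each component, biggest first
MATCH (h:House) WHERE exists(h.partition)
RETURN h.partition as component, count(*) as members
order by members desc limit 5
```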

Centralities:

In graph theory and network analysis, indicators of centrality identify the most important nodes within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

Outgoing degree is always one as each house is sworn to only one house.

Incoming degree:

We simply count how many incoming relationships each house has. In this example it directly translates to how many houses are sworn to each house.

MATCH (h:House)<-[:SWORN_TO]-(:House)
RETURN h.name as house,count(*) as in_degree order by in_degree desc limit 10

updatedsworn

House Tyrell leads with a decent margin in front of the Lannisters. The main force is house Baratheon, which consists of two branches and has a combined number of 77 sworn houses. Others slowly follow. It’s funny (sad?) to see that even house Baelish of Harrenhal, which is ruled by Petyr Baelish, has more support than house Stark of Winterfell. The books and the show diverge a lot for Littlefinger’s character, as he is not the ruler of Harrenhal in the show.

Pagerank centrality:

Pagerank is a version of weighted degree centrality with a feedback loop. You only get your “fair share” of your neighbor’s importance. That is, your neighbor’s importance is split between their neighbors, proportional to the number of interactions with that neighbor. Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures where the power resides. Houses that have the support of other big houses will come on top.

CALL algo.pageRank(
'MATCH (h:House) where h.partition = 151 return id(h) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:30, write: true});

pagerank.png

I was surprised with the results, to say the least, but after investigating a bit I understood why. Remember the picture of the sworn to network, where I told you that all the different communities point at the center of the graph, where the power resides. At that center are the two Baratheon houses, which are sworn to each other, and hence the number is so big. Where incoming degree takes into account only the number of your neighbours, pagerank also takes into account their importance. You can notice that house Stark has a better position than house Baelish, because Stark’s supporters are more important (have more houses sworn to them).

Betweenness centrality:

Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through them. Such an intermediary position gives a node power and influence. Betweenness centrality is a raw count of the number of shortest paths that go through a given node. For example, if a node is located on a bottleneck between two large communities, it will have high betweenness.

CALL algo.betweenness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty: 'betweenness'});

betweenness.png
House Baratheon is uniquely positioned at the center of the graph, so it will be first in all of the centrality measures. A strong second candidate are the Tyrells, which is a bit surprising, as they beat the Lannisters at both pagerank and betweenness centrality. In our case betweenness centrality identifies the bridging houses that have big support of other houses at their back.

Closeness centrality:

The most central nodes according to closeness centrality can quickly interact with all others because they are close to all others. This measure is preferable to degree centrality, because it takes into account not only direct connections among nodes but also indirect ones.

CALL algo.closeness(
'MATCH (p:House) WHERE p.partition = 151 RETURN id(p) as id',
'MATCH (p1:House)-[:SWORN_TO]->(p2:House) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true, writeProperty:'closeness'});

closeness.png
Closeness centrality differs from the others in that the lower the closeness value is, the better. As always, house Baratheon leads with a decent margin, and the Tyrells stand firmly in second place. The Starks beat the Lannisters at closeness, but lose in all other measures, so they get fourth place and the Lannisters third.
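To read the written scores back, something along these lines should work (a sketch; note the ascending order, since lower closeness values are better here):

```cypher
// houses with the best (lowest) closeness values
MATCH (h:House) WHERE exists(h.closeness)
RETURN h.name as house, h.closeness as closeness
order by closeness asc limit 10
```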

Conclusion:

This is a network from peaceful times and is therefore less interesting than having two strong disconnected components that fight each other. What would be cool is to get a network based on the current show, or even better, for G.R.R.M. to publish The Winds of Winter, where alliances will change and people will die, so we can see how the graph evolves through time.

Check also https://github.com/neo4j-examples/game-of-graphs, where you can see how to expose this data as a GraphQL endpoint.

Thanks for reading!

Neo4j apoc.load.jdbc

I recently wrote a simple query that I used for syncing PostgreSQL to Neo4j on an hourly basis with a cron job. My job was to make a simple script that would automate the syncing of Neo4j with a SQL database. This is a very simple task thanks to APOC, which offers loads of very useful functions we can call with cypher.

I had to import a couple of SQL tables and join them together in a graph to have a better overview of the data. This is an example of a query that was used to import a table with sales data.

Graph Model:

jdbc2.png

As you can see, we are importing some sales data, from a PostgreSQL database in my case. I made a very simple graph model to capture the data we get about our sales. The main focus is the invoice. We use the domain instead of the company name as an index, as this works out better in our case.

You will need:

Just download them and put them into neo4j /plugins directory and that’s it!

Authorization:

Sample call looks like:

CALL apoc.load.jdbc('jdbc:mysql://localhost:3306/proddb1?user=root&password=football',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

with 'jdbc:mysql://localhost:3306/proddb1?user=root&password=football' being the connection string, which contains all the sensitive information such as the IP, username and password of your SQL database.
If you want to hide/alias the connection string, this can be accomplished by adding a parameter similar to the following to conf/neo4j.conf:

apoc.jdbc.myalias.url=jdbc:mysql://localhost:3306/proddb1?user=root&password=football

and now the above sample call can be rewritten as:

CALL apoc.load.jdbc('myalias',
'select title from movies where studio=\'MGM Studios\'') 
YIELD row

For more info check the knowledge base post.

Query:

I wrote some logic with cypher, where I first get the date of the latest invoice in my Neo4j and then only select the rows that were updated on the same day or later. I imagine there is a better way of doing this, but it serves our purpose.

// match existing invoices
MATCH (w:Invoice)
// get the latest date in the desired format
WITH apoc.date.format(max(w.unix),'s','yyyy-MM-dd') as last_invoice
// prepare SQL select statement
WITH 
'SELECT * FROM public.order WHERE updated_at::date >=?::date' as order_query,
last_invoice
// call apoc.load.jdbcParams with the latest date of existing data as a param
// ('postgresurl' is assumed to be a connection string alias defined in neo4j.conf)
CALL apoc.load.jdbcParams('postgresurl', order_query, [last_invoice]) yield row
// we need to parse out domains from the emails
WITH row,split(row.customer_email,"@")[1] as domain
// if the email is in below list, then take whole email as a code
// so that those with gmail are not treated as one customer
WITH *,case when domain in ["gmail.com","yahoo.com","outlook.com","hotmail.com","msn.com","hotmail.co.uk"] 
  then row.customer_email
  else domain end as code
MERGE (i:Invoice{order_id:row.order_number})
// we round all the numbers to 2 digits
ON CREATE SET i.status = row.status,
              i.total_amount = apoc.math.round(row.total,2,"UP"),
              i.quantity = toInt(row.total_line_items_quantity),
              i.total_tax = apoc.math.round(row.total_tax,2,"UP"),
// save a human readable date format
              i.date = apoc.date.format(row.completed_at,"ms","yyyy-MM-dd'T'HH:mm:ss.SSS"),
              i.postcode = row.customer_billing_address_postcode,
              i.address = row.billing_address_address_1,
              i.total_discount = apoc.math.round(row.total_discount,2,"UP"),
              i.total_shipping = apoc.math.round(row.total_shipping,2,"UP"),
              i.unix = (row.completed_at / 1000),
              i.invoice_number = row.order_meta_tzs_invoice_number,
              i.total_refunded = apoc.math.round(row.refunded_total,2,"UP")
// columns that can change over time are added to the ON MATCH statement
ON MATCH SET i.status = row.status,
             i.total_refunded = apoc.math.round(row.refunded_total,2,"UP"),
             i.total_tax = apoc.math.round(row.total_tax,2,"UP"),
             i.total_amount = apoc.math.round(row.total,2,"UP")
// merge domain and person nodes and connect them 
MERGE (domain:Domain{name:code})
MERGE (person:Person{email:row.customer_email})
MERGE (domain)-[:MEMBER]->(person)
ON CREATE SET 
    person.name = row.billing_address_first_name + " " + row.customer_billing_address_last_name,
              person.ip = row.customer_ip ,
              person.phone = row.customer_billing_address_phone
// merge payment method node
MERGE (pay:PaymentMethod{name:row.payment_details_method_id})
MERGE (i)-[:PAYMENT_METHOD]->(pay)
// merge location tree (assuming a City node from the billing city and an IS_IN relationship)
MERGE (country:Country{code:row.billing_address_country})
MERGE (city:City{name:row.billing_address_city})
MERGE (country)<-[:IS_IN]-(city)
RETURN distinct 'orders imported'

This was so easy that I had to share it with you, so you can maybe do a pilot project on your own to see how Neo4j fits your needs. Just wrap your favourite scripting language around it and you are good to go.

Game of Thrones part 4 Analysis

If you have followed my GoT series so far, you have seen how to import a GoT dataset, that I found on kaggle.com, into Neo4j. There are only a couple of days left before the new season starts, so let’s do some analysis on the dataset. You need to know that this data was derived from the books, so it does not contain some battles and events that happened on the show, as the show has already surpassed the books.

Requirements:

Data:

We have already imported all the data from the CSVs into Neo4j in previous posts. If we look closely, we can find that some persons belong to two houses, one of which is None, the default unknown value in character_deaths.csv.

Screen Shot 2017-07-14 at 13.48.42.png

To get rid of this anomaly, we can simply MATCH all persons that belong to House “None” and to a separate house as well. Then we just delete the relationship pointing to House “None” and the problem is solved.

match (h1:House{name:"None"})-[r]-(p:Person)--(h2:House)
delete r
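To double-check that the cleanup worked, a count like this should return zero afterwards (a sketch, assuming the BELONGS_TO relationship type used elsewhere in this series):

```cypher
// persons that still belong to House "None" and to another house
MATCH (h1:House{name:"None"})<-[:BELONGS_TO]-(p:Person)-[:BELONGS_TO]->(h2:House)
RETURN count(p) as remaining_anomalies
```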

Distributions:

We can start by visualizing a few distributions with Spoon.js directly in the Neo4j Browser. This does not work on 3.2.x, as the Neo4j Browser was rewritten.

Which house won/lost most battles ?

match (house:House)
OPTIONAL MATCH (house)-[fight:ATTACKER|:DEFENDER]->()
return house.name as house, 
       sum(case fight.outcome when "win" then 1 else 0 end) as wins,
       sum(case fight.outcome when "loss" then 1 else 0 end) as losses
order by wins + losses desc limit 20

House_battle_results.png

We can easily see that the Lannisters are currently on the rise, with 11 wins and 7 losses. Second are the Starks, who are on the other hand in decline, with 6 wins and 10 losses. The Red Wedding and what came afterward was really devastating for the Starks. The Freys and Boltons look to be in a good position, as they were on the winning side of the Red Wedding, but those who watch the show know that a bitter end awaits them.

Distribution of alive/dead people by house:

match (p:Person)--(h:House)
return h.name,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

house_dead.png

It’s a hard life in the Night’s Watch, and we can see that mortality is high. Almost half the characters from the Night’s Watch have died. It is also not a good time for the Targaryens (yet!), as they were mostly killed off during Robert’s Rebellion, AKA the War of the Usurper. We all know that Starks are dying like flies, but it came as a surprise to see so many dead Lannisters.

Distribution of alive/dead people by culture:

match (p:Person)--(c:Culture)
return c.name as culture,
       sum(case when not p:Dead then 1 else 0 end) as alive,
       sum(case when p:Dead then 1 else 0 end) as dead 
order by dead desc limit 15

culture_dead

We can observe that the most represented cultures are the Northmen and the Ironborn. Seeing so many Northmen in the books does not come as a surprise, as the whole story starts with House Stark living in Winterfell. What is surprising is that there are so many Ironborn in the books. I think this has to do with the Kingsmoot at the Iron Islands, where we get to know lots of different Ironborn characters.

Valyrians are almost extinct, but they have the hidden joker Daenerys Targaryen, who has a big role yet to play.

Distribution of alive/dead people by books:

match (p:Person)-[r]-(h:Book)
with h.name as book,
     h.sequence as sequence,
     sum(case when type(r)="APPEARED_IN" then 1 else 0 end) as total,
     sum(case when type(r)="DIED_IN" then 1 else 0 end) as dead 
     order by sequence asc
return book,total,dead

book_dead.png

These figures of dead people seem small-ish for GoT, so I guess not all deaths are recorded. We can see how the author slowly built up the community of all the people in the books. It started with 400 characters in the first book and rose to almost 1200 characters in the fourth book. That is a lot of characters to keep track of. A Storm of Swords was the bloodiest book with 97 deaths, while A Feast for Crows has unusually few deaths, only 27.

Analysis:

We will use algo.unionFind, which is a (weakly) connected components algorithm. The following description is taken from the docs:

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

Which houses fought on the same side ?

We can create a network of houses from the indirect patterns of the battles they fought together. We will also assume that an ally of my ally is also my ally. In this case we can use algo.unionFind to solve the problem.

Create a network of houses that fought on the same side

match (house1:House)-[r1]-(battle:Battle)-[r2]-(house2:House)
WHERE type(r1)=type(r2)
MERGE (house1)-[:ALLY]-(house2)

Run the algorithm and write the results back as a community property on each node.

CALL algo.unionFind('House', 'ALLY', {write:true, partitionProperty:"community"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (h:House)
return distinct(h.community) as community,
       collect(h.name) as members,
       count(*) as count 
order by count desc limit 4

House_community.png

We observe that 4 different communities (partitions) came back as a result. We can see that the Lannisters teamed up with the Boltons and Freys to defeat the Starks. The Night’s Watch and the Baratheons fought on the same side when they defended the Wall from the Wildlings (Free folk, Thenns). Stark and Tully have an alliance through marriage (Catelyn Tully and Ned Stark), so it is no wonder to see them fight battles on the same side.

Which persons fought on the same side in the year 299 ?

We will do something similar for persons, but we will focus only on the year 299, as most battles were fought that year. Another reason I chose a single year is that alliances change over time; if we looked at all battles, we would get most persons in one community, as they fought on the same side at least once.

Create a network of persons that fought on the same side in the year 299.

MATCH (p1:Person)-[r1]-(b:Battle)-[r2]-(p2:Person) 
WHERE b.year = 299 and type(r1)=type(r2)
MERGE (p1)-[:FRIEND]-(p2)

Run the algorithm and write the results back as a community property on each node.

CALL algo.unionFind('Person', 'FRIEND', {write:true, partitionProperty:"community"}) 
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Check results

match (p:Person)
return distinct(p.community) as community,
       collect(p.name) as members,
       count(*) as count 
order by count desc limit 10

Screen Shot 2017-07-14 at 16.11.52

We observe that 10 different communities (partitions) exist. This is a good representation of the state of the alliances that fought together in that specific year. We could run this analysis for a couple of different years and compare how the alliances change over time.
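Repeating the exercise for another year would just mean swapping the year and using a year-specific relationship type so the communities do not mix (a sketch; the year 300 and the FRIEND_300 type are illustrative):

```cypher
// build a separate ally network for the year 300
MATCH (p1:Person)-[r1]-(b:Battle)-[r2]-(p2:Person)
WHERE b.year = 300 and type(r1) = type(r2)
MERGE (p1)-[:FRIEND_300]-(p2)
```

and then run algo.unionFind over the FRIEND_300 relationship as before.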

Conclusion:

We have only scratched the surface of what is possible to do with this dataset so stay tuned for more!

You can grab the Neo4j database and just copy it into data/databases folder

Got for 3.2.x

Got for Neo4j 3.1.x

Neo4j Game of Thrones part 3

This is part 3 of the GoT series, so if you haven’t seen the previous posts, I suggest you check them out.

In this part we will import the last CSV, so that we will be able to use data from all three sources in our analysis/visualizations.

Data:

We will be using the kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. In this part we will import character_predictions.csv. This CSV takes an expanded view of character deaths, including predictions of how likely each character is to die.

Graph model:

Screen Shot 2017-06-25 at 14.53.14

Our graph model is slowly evolving with every new CSV we import. We add a new Culture node and a RELATED_TO relationship between persons, which creates a social network of persons. By the time we are done with importing, we will have 2091 Person and 352 House nodes in our knowledge graph, which is not bad for a GoT dataset.

Import:

As you will see, importing character_predictions.csv means doing the FOREACH trick ten times, because the data has lots of null values in it that we need to skip in order for the MERGE statements to work as needed.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
MERGE (p:Person{name:row.name})
// set properties on the person node
SET p.title = row.title,
    p.death_year = toInt(row.DateoFdeath),
    p.birth_year = toInt(row.dateOfBirth),
    p.age = toInt(row.age),
    p.gender = case when row.male = "1" then "male" else "female" end
// doing the foreach trick to skip null values
FOREACH
  (ignoreMe IN CASE WHEN row.mother is not null THEN [1] ELSE [] END |
    MERGE (mother:Person{name:row.mother})
    MERGE (p)-[:RELATED_TO{name:"mother"}]->(mother)
)
FOREACH
  (ignoreMe IN CASE WHEN row.spouse is not null THEN [1] ELSE [] END |
    MERGE (spouse:Person{name:row.spouse})
    MERGE (p)-[:RELATED_TO{name:"spouse"}]->(spouse)
)
FOREACH
  (ignoreMe IN CASE WHEN row.father is not null THEN [1] ELSE [] END |
    MERGE (father:Person{name:row.father})
    MERGE (p)-[:RELATED_TO{name:"father"}]->(father)
)
FOREACH
  (ignoreMe IN CASE WHEN row.heir is not null THEN [1] ELSE [] END |
    MERGE (heir:Person{name:row.heir})
    MERGE (p)-[:RELATED_TO{name:"heir"}]->(heir)
)
// we remove "House " from the value for better linking of data
FOREACH 
  (ignoreMe IN CASE WHEN row.house is not null THEN [1] ELSE [] END | 
    MERGE (house:House{name:replace(row.house,"House ","")}) 
    MERGE (p)-[:BELONGS_TO]->(house) 
)

I split the query into two parts for more clarity, but as you can see there is nothing fancy about it, just using the FOREACH trick over and over again.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
// match person
MERGE (p:Person{name:row.name})
// doing the foreach trick... we lower row.culture for better linking
FOREACH
  (ignoreMe IN CASE WHEN row.culture is not null THEN [1] ELSE [] END |
    MERGE (culture:Culture{name:lower(row.culture)})
    MERGE (p)-[:MEMBER_OF_CULTURE]->(culture)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book1 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:1})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book2 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:2})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book3 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:3})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book4 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:4})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book5 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:5})
    MERGE (p)-[:APPEARED_IN]->(book)
)

Multi-labeled nodes:

Nodes can have zero, one or more labels. Think of labels as a hierarchy of rules, where first-level labels group entities together, e.g. (:Person), (:Company), (:House), and additional second-level labels act as a preprocessed way of filtering nodes faster and easier, e.g. (:Person:Dead), (:Person:King), (:Company:Reseller) etc.
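Once such labels are in place, filtering becomes a cheap label scan instead of a property check. For example (a sketch using the Dead label we set further down):

```cypher
// count the dead members of house Stark using the multi-label pattern
MATCH (p:Person:Dead)-[:BELONGS_TO]->(h:House{name:"Stark"})
RETURN count(p) as dead_starks
```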

Check also the AirBnB talk at GraphConnect.

Let’s mark all the dead people in our database first, since this is a GoT database. Next we will mark all kings and knights with additional labels.

Dead persons:

Query 1:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
// do CASE statements
with row,
case when row.isAlive = "0" THEN [1] ELSE [] END as dead_person,
case when row.isAliveMother = "0" THEN [1] ELSE [] END as dead_mother,
case when row.isAliveFather = "0" THEN [1] ELSE [] END as dead_father,
case when row.isAliveHeir = "0" THEN [1] ELSE [] END as dead_heir,
case when row.isAliveSpouse = "0" THEN [1] ELSE [] END as dead_spouse
// MATCH all the persons
MATCH (p:Person{name:row.name})
// We use optional match so that it doesnt stop the query if not found
OPTIONAL MATCH (mother:Person{name:row.mother})
OPTIONAL MATCH (father:Person{name:row.father})
OPTIONAL MATCH (heir:Person{name:row.heir})
OPTIONAL MATCH (spouse:Person{name:row.spouse})
// Set the label of the dead persons
FOREACH (d in dead_person | set p:Dead)
FOREACH (d in dead_mother | set mother:Dead)
FOREACH (d in dead_father | set father:Dead)
FOREACH (d in dead_heir | set heir:Dead)
FOREACH (d in dead_spouse | set spouse:Dead)

Query 2:

MATCH (p:Person) where exists (p.death_year)
SET p:Dead

Kings:

Query 1:

MATCH (p:Person)-[:DEFENDER_KING|ATTACKER_KING]-()
SET p:King

Query 2:

MATCH (p:Person) where lower(p.title) contains "king"
SET p:King

Knight:

Query 1:

MATCH (p:Person) where p.title = "Ser"
SET p:Knight

Refactor:

After importing, you can check the cleanliness of your data: how well the Persons in your database match, whether there are many typos, etc. Ordering the result alphabetically lets you quickly spot mistakes if they exist.

Example query:

MATCH (n:Person)
RETURN n.name as name order by name

I cleaned up the culture column in character_predictions.csv, but left one mistake in, so that I can show you how to clean up node duplication caused by a typo in the data.

Screen Shot 2017-06-25 at 22.18.53.png

You can use the apoc plugin to help you with refactoring, so that you can merge the two nodes into a single node while retaining all the connections from both.

match (c1:Culture{name:"wildlings"}),(c2:Culture{name:"wildlingss"}) 
call apoc.refactor.mergeNodes([c2,c1]) yield node 
return distinct "done"

Conclusion:

I hope you got a feeling for what importing different CSVs looks like. If you understood most of what I used in this series, you are probably capable of importing almost anything into Neo4j. We are done with importing data, so next time we can start analysing our GoT dataset.

You can grab the Neo4j database and just copy it into /data/databases folder.

Got for Neo4j 3.2.x

Got for Neo4j 3.1.x

 

Neo4j Game of Thrones part 2

This is the second part of the GoT series; if you haven’t seen the first part, I suggest you check it out. In this part I will show you how I approach connecting different sources together in a knowledge graph.

Data:

We will be using the kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. In this second part we will import character_deaths.csv, which contains information about characters and when they died.

Graph model:

Screen Shot 2017-06-24 at 16.38.05

We will build on top of battles.csv from the previous part. We will add Book and Status nodes and a BELONGS_TO relationship from Person to House.

Linking:

When dealing with real-world data, we run into dirty data sooner or later. We have information about Person and House nodes in all three CSVs. Before importing the second CSV on top of the first one, I always check how well they match and whether there is anything I can do to improve it. I will show you an example of matching House nodes.

First I check how many rows in the CSV have a non-null row.Allegiances (name of the house) column.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
return count(row.Allegiances)

We get back a result of 917 rows.

Then I check how many houses match existing nodes. Using a MATCH in this case works like an INNER JOIN in SQL, where you only get back the rows that match the first table on a specific property (a node in our case).

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
MATCH (h:House{name:row.Allegiances})
return count(row.Allegiances)

We get back a result of 414 rows, so the remainder did not match. There are two options: either there are new houses in the second CSV that did not appear in the first CSV, or the same houses are recorded slightly differently (typos etc.) in the second CSV.

After inspecting the data I noticed that sometimes the record in row.Allegiances is “Stark”, and other times it is “House Stark”. Here lies our problem: some row.Allegiances values do not match existing House nodes. We can check how removing “House ” from the value affects our linking.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
// use a replace function to remove "House "
MATCH (h:House{name:replace(row.Allegiances,"House ","")})
return count(row.Allegiances)

We get back a result of 521 rows, which is a slight improvement. This is how I generally approach linking different sources: looking at the raw data and trying different things to see what works best. Once you find the best solution (you might also fix the raw data), you can import the data so that it fits your existing graph as well as possible.

A MATCH statement used as above works as an INNER JOIN, where we could only update existing nodes if we wanted, or just count them. On the other hand, MERGE in this context works as an OUTER JOIN in SQL, where we combine different persons from two sources into a single graph.
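To make the contrast concrete, the same linking query with MERGE instead of MATCH would create the missing houses rather than skip them (a sketch of the outer-join behaviour; run it only when you actually want those nodes created):

```cypher
LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
// MERGE creates a House node when no match exists (outer-join behaviour)
MERGE (h:House{name:replace(row.Allegiances,"House ","")})
RETURN count(row.Allegiances)
```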

I did something similar for Person node and corrected a few names in the CSVs.

Import:

Once we are done with linking the data, we can import it into Neo4j. The character_deaths.csv is not so complicated, so we can import it in a single query. You will notice that parts of the query repeat from the first one. I hope by the end of this series you will be able to import any CSV, as these are mostly all the things you need to know when importing into Neo4j. I will also show you how to use a CASE statement.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv" as row
// we can use CASE in a WITH statement
with row,
     case when row.Nobility = "1" then "Noble" else "Commoner" end as status_value
// as seen above we remove "House " for better linking
MERGE (house:House{name:replace(row.Allegiances,"House ","")})
MERGE (person:Person{name:row.Name})
// we can also use CASE statement inline
SET person.gender = case when row.Gender = "1" then "male" else "female" end,
    person.book_intro_chapter = row.`Book Intro Chapter`, 
    person.book_death_chapter = row.`Death Chapter`,
    person.death_year = toInt(row.`Death Year`)
MERGE (person)-[:BELONGS_TO]->(house)
MERGE (status:Status{name:status_value})
MERGE (person)-[:HAS_STATUS]->(status)
// doing the foreach trick to skip null values
FOREACH
  (ignoreMe IN CASE WHEN row.GoT = "1" THEN [1] ELSE [] END | 
    MERGE (book1:Book{sequence:1}) 
    ON CREATE SET book1.name = "Game of thrones" 
    MERGE (person)-[:APPEARED_IN]->(book1))
FOREACH
  (ignoreMe IN CASE WHEN row.CoK = "1" THEN [1] ELSE [] END | 
    MERGE (book2:Book{sequence:2}) 
    ON CREATE SET book2.name = "Clash of kings" 
    MERGE (person)-[:APPEARED_IN]->(book2))
FOREACH
  (ignoreMe IN CASE WHEN row.SoS = "1" THEN [1] ELSE [] END | 
    MERGE (book3:Book{sequence:3}) 
    ON CREATE SET book3.name = "Storm of swords" 
    MERGE (person)-[:APPEARED_IN]->(book3))
FOREACH
  (ignoreMe IN CASE WHEN row.FfC = "1" THEN [1] ELSE [] END | 
    MERGE (book4:Book{sequence:4}) 
    ON CREATE SET book4.name = "Feast for crows" 
    MERGE (person)-[:APPEARED_IN]->(book4))
FOREACH
  (ignoreMe IN CASE WHEN row.DwD = "1" THEN [1] ELSE [] END | 
    MERGE (book5:Book{sequence:5}) 
    ON CREATE SET book5.name = "Dance with dragons" 
    MERGE (person)-[:APPEARED_IN]->(book5))
FOREACH
  (ignoreMe IN CASE WHEN row.`Book of Death` is not null THEN [1] ELSE [] END | 
    MERGE (book:Book{sequence:toInt(row.`Book of Death`)}) 
    MERGE (person)-[:DIED_IN]->(book))

Conclusion:

We have successfully imported the second CSV of the dataset. There is only one left to import, and then we can start to play around with analysing the dataset and visualizing it with spoon.js.

Example visualization:

How many battles did each house participate in, either as attacker or defender?

MATCH (h:House)-[:ATTACKER]->()
WITH h,count(*) as attacker_side
OPTIONAL MATCH (h)-[:DEFENDER]->(b1)
RETURN h.name,attacker_side,count(b1) as defender_side
ORDER BY attacker_side DESC LIMIT 10
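The same tally is easy to sanity-check offline. A small Python sketch, using made-up participation records rather than the real CSV:

```python
from collections import Counter

# Hypothetical participation records extracted from battles.csv:
# each entry is (house, role), where role is "attacker" or "defender".
records = [
    ("Lannister", "attacker"),
    ("Stark", "defender"),
    ("Lannister", "attacker"),
    ("Tully", "defender"),
]

attacker_side = Counter(h for h, r in records if r == "attacker")
defender_side = Counter(h for h, r in records if r == "defender")

for house in sorted(set(h for h, _ in records)):
    print(house, attacker_side[house], defender_side[house])
```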

Screen Shot 2017-06-24 at 22.07.41

Neo4j Game of Thrones part 1

We are only a month away from the seventh Game of Thrones season, so the time is right to import GoT data into our favourite graph database, Neo4j. The data is from the books, so some things that happened on the TV show haven’t happened yet in the books. The data available on Kaggle is good material to get started with, as I can show you all the queries you will need when importing and linking different sources of data together. This will probably end up being a 3 to 4 part series, so that we can cover all the material available.

Data:

We will be using the Kaggle dataset of Game of Thrones, which I saved to my GitHub repo for easier access. I cleaned up the names of the persons a bit, so that they match better. There are 3 CSVs available.

  • battles.csv — contains information about all of the battles in Game of Thrones.
  • character-deaths.csv — contains information about characters and when they died.
  • character-predictions.csv — this CSV takes an expanded view on character deaths, including predictions of how likely they are to die.

In this first part we will import battles.csv.

Graph model:

Screen Shot 2017-06-23 at 21.35.53

We will start with a simple graph model containing Person, House and Battle nodes and a very simple location tree. Unfortunately we cannot link Person to House nodes, as those links are not explicitly marked. We will get these connections from the other two CSVs.

Import:

First we will create constraints and indexes, so that our queries will be faster.

CALL apoc.schema.assert(
{Location:['name'],Region:['name']},
{Battle:['name'],Person:['name'],House:['name']});

We will split the import into 4 different queries for easier understanding.

Battle:

First we create :Battle nodes using MERGE.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
//merge node labeled Battle 
MERGE (b:Battle{name:row.name})
ON CREATE SET b.year = toInt(row.year),
              b.summer = row.summer,
              b.major_death = row.major_death,
              b.major_capture = row.major_capture,
              b.note = row.note,
              b.battle_type = row.battle_type,
              b.attacker_size = toInt(row.attacker_size),
              b.defender_size = toInt(row.defender_size)

Foreach trick:

If we check the data, we can see that some battles have one attacker and some have four. They are stored in separate columns, so when there is only one attacker, there are 3 columns with a null value. When there are null values in a column and we want to skip them when using MERGE on that property, we can use the following FOREACH trick.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// there is only attacker_outcome in the data, 
// so we do a CASE statement for defender_outcome
WITH row,
case when row.attacker_outcome = "win" THEN "loss" ELSE "win" END as defender_outcome
// match the battle
MATCH (b:Battle{name:row.name})
// all battles have at least one attacker so we don't have to use the foreach trick
MERGE (attacker1:House{name:row.attacker_1}) 
MERGE (attacker1)-[a1:ATTACKER]->(b) 
ON CREATE SET a1.outcome = row.attacker_outcome

// When we want to skip null values we can use foreach trick
FOREACH
  (ignoreMe IN CASE WHEN row.defender_1 is not null THEN [1] ELSE [] END | 
    MERGE (defender1:House{name:row.defender_1})
    MERGE (defender1)-[d1:DEFENDER]->(b)
    ON CREATE SET d1.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.defender_2 is not null THEN [1] ELSE [] END | 
    MERGE (defender2:House{name:row.defender_2})
    MERGE (defender2)-[d2:DEFENDER]->(b)
    ON CREATE SET d2.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_2 is not null THEN [1] ELSE [] END | 
    MERGE (attacker2:House{name:row.attacker_2})
    MERGE (attacker2)-[a2:ATTACKER]->(b)
    ON CREATE SET a2.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_3 is not null THEN [1] ELSE [] END | 
    MERGE (attacker3:House{name:row.attacker_3})
    MERGE (attacker3)-[a3:ATTACKER]->(b)
    ON CREATE SET a3.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_4 is not null THEN [1] ELSE [] END | 
    MERGE (attacker4:House{name:row.attacker_4})
    MERGE (attacker4)-[a4:ATTACKER]->(b)
    ON CREATE SET a4.outcome = row.attacker_outcome)
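The trick works because the CASE expression yields a one-element list when the column has a value and an empty list when it is null, so the FOREACH body runs once or not at all. A plain Python sketch of that control flow, with made-up rows:

```python
rows = [
    {"name": "Battle of the Green Fork", "defender_1": "Stark"},
    {"name": "Siege of Storm's End", "defender_1": None},
]

merged = []
for row in rows:
    # CASE WHEN row.defender_1 IS NOT NULL THEN [1] ELSE [] END
    guard = [1] if row["defender_1"] is not None else []
    for _ in guard:  # FOREACH runs the body once, or not at all
        merged.append((row["defender_1"], row["name"]))

print(merged)  # only the non-null defender produced a relationship
```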

Coalesce:

What if we didn’t want to skip null values, but instead mark them with a default value “Unknown”? In this case we can use coalesce to set a default value for all null values. Because there is only one location with a null value, we do not have to use local merges, but you can get more information in Neo4j Location Trees.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
MATCH (b:Battle{name:row.name})
// We use coalesce, so that null values are replaced with "Unknown" 
MERGE (location:Location{name:coalesce(row.location,"Unknown")})
MERGE (b)-[:IS_IN]->(location)
MERGE (region:Region{name:row.region})
MERGE (location)-[:IS_IN]->(region)
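As a rough analogy, coalesce simply returns its first non-null argument. A tiny Python sketch of the idea:

```python
def coalesce(*values):
    # return the first value that is not None, mirroring Cypher's coalesce()
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, "Unknown"))       # a null location falls back to "Unknown"
print(coalesce("Riverrun", "Unknown")) # a real value passes through untouched
```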

Multiple unwinds:

When using multiple UNWINDs in a single query, we need to take care of the cardinality (number of rows), as UNWIND transforms any list back into individual rows. We need to separate them using an aggregation function, so they do not become nested (like a nested .map function in JS).

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// We split the columns that may contain more than one person
WITH row,
     split(row.attacker_commander,",") as att_commanders,
     split(row.defender_commander,",") as def_commanders,
     split(row.attacker_king,"/") as att_kings,
     split(row.defender_king,"/") as def_kings,
     row.attacker_outcome as att_outcome,
     CASE when row.attacker_outcome = "win" THEN "loss" 
     ELSE "win" END as def_outcome
MATCH (b:Battle{name:row.name})
// we unwind a list
UNWIND att_commanders as att_commander
MERGE (p:Person{name:trim(att_commander)})
MERGE (p)-[ac:ATTACKER_COMMANDER]->(b)
ON CREATE SET ac.outcome=att_outcome
// to end the unwind and correct cardinality(number of rows)
// we use any aggregation function ( e.g. count(*))
WITH b,def_commanders,def_kings,att_kings,att_outcome,def_outcome,count(*) as c1
UNWIND def_commanders as def_commander
MERGE (p:Person{name:trim(def_commander)})
MERGE (p)-[dc:DEFENDER_COMMANDER]->(b)
ON CREATE SET dc.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,def_kings,att_kings,att_outcome,def_outcome,count(*) as c2
UNWIND def_kings as def_king
MERGE (p:Person{name:trim(def_king)})
MERGE (p)-[dk:DEFENDER_KING]->(b)
ON CREATE SET dk.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,att_kings,att_outcome,count(*) as c3
UNWIND att_kings as att_king
MERGE (p:Person{name:trim(att_king)})
MERGE (p)-[ak:ATTACKER_KING]->(b)
ON CREATE SET ak.outcome = att_outcome
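Why the count(*) reset matters is easiest to see with plain lists: without it, the second UNWIND runs once per row of the first, multiplying the rows. A Python sketch with made-up commander lists:

```python
att_commanders = ["Jaime Lannister", "Gregor Clegane"]
def_commanders = ["Edmure Tully", "Tytos Blackwood", "Brynden Tully"]

# Without a cardinality reset, the second UNWIND runs once per row
# produced by the first one, giving a cartesian product:
nested = [(a, d) for a in att_commanders for d in def_commanders]
print(len(nested))  # 6 rows: 2 attackers x 3 defenders

# The count(*) aggregation collapses the rows back to one, so each
# list is unwound independently:
independent = len(att_commanders) + len(def_commanders)
print(independent)  # 5 MERGE operations in total
```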

Conclusion:

As we have to import two more CSVs, I will not run any queries on the data yet, but will focus first on the importing and linking part of the process and then do an analysis on the data from all three CSVs. The second part should be coming in a few days, so stay tuned!

Neo4j Marvel Social Graph

Today I will show you how to import the Marvel graph into Neo4j. We will use a very simple graph model. I will show you the advantages of having a social network stored in a graph database.

Requirements:

Data:

We will use the marvel social network data, which was downloaded from a kaggle competition post. There are 3 CSVs available.

  • nodes.csv — Node name and type
  • edges.csv — Heroes and the comic in which they appear.
  • hero-network.csv — Edges between heroes that appear in the same comic.

We only need the edges.csv file, because knowing which hero appeared in which comic is enough for us to import the social graph. We also do not need hero-network.csv, which holds information about the unweighted hero network, because we will create our own weighted hero network with the help of cypher and APOC.

Graph model:

We will use a very simple graph model, with nodes labeled Hero and Comic, and a relationship APPEARED_IN between them.

Screen Shot 2017-06-10 at 13.53.41

Import:

First we create constraints on the labels Comic and Hero, so that our queries will be faster.

CALL apoc.schema.assert(
{},
{Comic:['name'],Hero:['name']});

I uploaded edges.csv to my git repository, so you can import the graph in one step using the query below. It is very basic; we just MERGE two nodes and a relationship between them. When we are importing larger amounts of data, we want to batch it with USING PERIODIC COMMIT.

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-marvel/master/data/edges.csv" as row
MERGE (h:Hero{name:row.hero})
MERGE (c:Comic{name:row.comic})
MERGE (h)-[:APPEARED_IN]->(c)

Create a weighted undirected hero network:

Now that we have imported the data, we will create a weighted hero network, where we will assume that the more comics two heroes appeared in together, the better they know each other. We will save the number of comics heroes appeared in together as a weight property of the KNOWS relationship.

We will use apoc.periodic.iterate to help us with batching. As our graph gets bigger, we do not want to run global refactorings or calculations in one transaction; we want to batch them. APOC comes in handy for batching with cypher.

Taken from apoc documentation:

With apoc.periodic.iterate you provide 2 statements, the first outer statement is providing a stream of values to be processed. The second, inner statement processes one element at a time or with iterateList:true the whole batch at a time.

We can use following query to create a weighted undirected hero social graph.

CALL apoc.periodic.iterate(
"MATCH (p1:Hero)-->(:Comic)<--(p2:Hero) where id(p1) < id(p2) RETURN p1,p2",
"MERGE (p1)-[r:KNOWS]-(p2) ON CREATE SET r.weight = 1 ON MATCH SET r.weight = r.weight + 1"
, {batchSize:5000, parallel:false,iterateList:true})
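Conceptually, apoc.periodic.iterate streams the rows of the outer statement and feeds them to the inner statement in fixed-size chunks, one transaction per chunk. A Python sketch of that batching pattern, with a stand-in process_batch function:

```python
def chunks(stream, batch_size):
    """Yield successive lists of at most batch_size items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

processed = []

def process_batch(batch):
    # stand-in for the inner Cypher statement run in one transaction
    processed.extend(batch)

for batch in chunks(range(12), 5):
    process_batch(batch)

print(len(processed))  # 12 items, handled in batches of 5, 5 and 2
```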

Now let's dissect the query for better understanding.

First outer statement:

The first outer statement should return a stream of values to be processed. It consists of a simple cypher query that matches two heroes that appeared in the same comic. There is a small but important detail in the WHERE id(p1) < id(p2) statement. This way we avoid doubling the results: with <> instead of <, MATCH would return every pair twice, the first time with hero1 as p1 and hero2 as p2, and the second time with hero1 as p2 and hero2 as p1.

MATCH (p1:Hero)-->(:Comic)<--(p2:Hero) where id(p1) < id(p2) RETURN p1,p2

Second inner statement:

The second inner statement processes one element at a time, or with iterateList:true the whole batch at a time. We use MERGE to create the relationship, which means it will try to match the relationship and create it only if it doesn’t find one. Check the docs for more info. Cypher has built-in logic that supports actions based on whether the node or relationship was created or found by MERGE. We can use ON CREATE SET for the scenario where it is created, or ON MATCH SET if it was found (already exists).

We can also notice that we do not specify the direction of the relationship. We treat this network as undirected, so the direction of the relationship has no meaning. MERGE lets us leave the direction out, which in practice means that it checks whether there is a relationship between the given nodes in either direction, and if none is found, it creates one in an arbitrary direction.

MERGE (p1)-[r:KNOWS]-(p2)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = r.weight + 1
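The ON CREATE / ON MATCH pair effectively turns the relationship into a counter: the first co-appearance sets the weight to 1 and every further one increments it. A Python sketch of the same aggregation over made-up (hero, comic) appearances:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows from edges.csv: (hero, comic) appearances.
appearances = [
    ("Spider-Man", "ASM 1"), ("Human Torch", "ASM 1"),
    ("Spider-Man", "ASM 2"), ("Human Torch", "ASM 2"),
    ("Daredevil", "ASM 2"),
]

by_comic = defaultdict(list)
for hero, comic in appearances:
    by_comic[comic].append(hero)

weights = defaultdict(int)  # undirected: key is a sorted hero pair
for heroes in by_comic.values():
    for h1, h2 in combinations(sorted(heroes), 2):
        weights[(h1, h2)] += 1  # ON CREATE sets 1, ON MATCH increments

print(weights[("Human Torch", "Spider-Man")])  # appeared together twice
```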

Heroes with top weighted degree:

We can check which heroes have the biggest weighted degree, which is simply the sum of the weights of all relationships connected to a given node.

Even though we treat our network as undirected, Neo4j still stores the relationships as directed. That is not a problem, as we can MATCH without specifying the direction of the relationship, effectively querying the graph as undirected.

MATCH (h:Hero)-[t:KNOWS]-()
RETURN h.name as hero,sum(t.weight) as weighted_degree
ORDER BY weighted_degree DESC LIMIT 10

Results:

Screen Shot 2017-06-10 at 19.06.31.png

 

This is the first part, where we learned how to import a social graph and create a weighted undirected network between our heroes. Now we can start to test graph algorithms like the neo4j graph algorithms or the APOC graph algorithms on our network. Stay tuned, as I will talk about algorithms more next time.

Neo4j APOC graph algorithms part 2

As I said in the previous blog post, the APOC plugin offers more graph algorithms, which you can easily use by simply calling APOC procedures in cypher. Today I will show you how to use centrality and community algorithms. There are also some path-finding algorithms in APOC, like Dijkstra or A*, but those may be covered next time. I also found this cool blog post from William Lyon, where he does something similar to what I am doing, but uses pictures and describes the graph algorithms nicely.

Dataset:

I will just use the standard actors/movies dataset, which you can get by running play :movies in your Neo4j Browser console.

I will use the same dummy features as in the previous part.

Algorithms:

As we noticed in part one of the series, our actors are very similar given the two features we used to describe them (age and number of movies). In my case I will use hard similarity, meaning I will set a threshold value of similarity, which determines whether a relationship between two persons is created or not. If we do not do this, we end up with a fully connected graph, meaning every Person node is connected to all the other Person nodes. Community detection algorithms, for example, do not work well with that.

Hard similarity:

MATCH (p1:Person),(p2:Person) where id(p1) < id(p2)
AND exists(p1.age) and exists(p2.age)
WITH p1,p2,apoc.algo.euclideanSimilarity([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value
//Hard similarity with a threshold of 0.6 
WITH p1,p2,value where value > 0.6
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.e_similarity = value

Screen Shot 2017-04-27 at 23.01.57

We have created 3300 relationships between 123 people. We can also see that the kNN algorithm produced 6 different isolated clusters.

Community detection:

APOC uses a simple label propagation kernel for community detection. More info about the algorithm can be found in the documentation.

CALL apoc.algo.community(25,null,'community','SIMILARITY','BOTH','e_similarity' ,10000)
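To illustrate what label propagation does (a sketch of the general algorithm, not of APOC's exact implementation): every node starts in its own community and repeatedly adopts the most frequent label among its neighbours, with ties broken by the smallest label for determinism:

```python
from collections import Counter

# Toy undirected graph: two disconnected triangles, i.e. two communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
neighbors = {n: set() for e in edges for n in e}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

labels = {n: n for n in neighbors}  # start: every node is its own community
for _ in range(10):
    for n in sorted(neighbors):
        counts = Counter(labels[m] for m in neighbors[n])
        best = max(counts.values())
        # adopt the most frequent neighbour label, smallest label on ties
        labels[n] = min(l for l, c in counts.items() if c == best)

print(labels)  # each triangle converges to a single shared label
```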

Let's check the results:

MATCH (n:Person) 
RETURN distinct(n.community) as community,count(n) as Members,
avg(n.age) as avg_age,avg(n.count) as avg_count,stdev(n.age) as age_stdev,stdev(n.count) as count_stdev

community_results

We can easily see that the biggest cluster, with 77 members, has an average age of 58.5 years and its members worked on one movie on average. We can also see that our clusters differ from each other based on the number of movies their members worked on, and that members within a cluster are of similar age.

Let's try some visualization with APOC virtual nodes and relationships.

MATCH (p1:Person) with p1.community as community,collect(p1) as members 
CALL apoc.create.vNodes(['Community'], [{name:community}]) YIELD node as comm 
UNWIND members as member 
CALL apoc.create.vRelationship(member,'MEMBER_OF_COMMUNITY',{}, comm) YIELD rel 
RETURN member,rel,comm

virtual_.png

I also want to give you some examples of other algorithms. The theory is from Wikipedia.

Closeness centrality:

In a connected graph, the closeness centrality (or closeness) of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes.

MATCH (node:Person)
WITH collect(node) AS nodes
CALL apoc.algo.closeness(['SIMILARITY'],nodes,'BOTH') YIELD node, score
RETURN node, score
ORDER BY score DESC
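A minimal Python illustration of the definition, computing closeness with a breadth-first search on a small path graph (not APOC's implementation):

```python
from collections import deque

# Path graph 0-1-2-3-4: node 2 sits in the middle and is the most central.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def closeness(node):
    # breadth-first search for shortest-path lengths from node
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    # reciprocal of the sum of shortest-path lengths to all other nodes
    return 1.0 / sum(d for n, d in dist.items() if n != node)

scores = {n: closeness(n) for n in adj}
print(max(scores, key=scores.get))  # -> 2
```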

Betweenness centrality:

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.

MATCH (node:Person)
WITH collect(node) AS nodes
CALL apoc.algo.betweenness(['SIMILARITY'],nodes,'BOTH') YIELD node, score
RETURN node, score
ORDER BY score DESC

Pagerank:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

MATCH (node:Person)
WITH collect(node) AS nodes
// perform 10 page rank iterations, computing only over relationships of type SIMILARITY
CALL apoc.algo.pageRankWithConfig(nodes,{iterations:10,types:'SIMILARITY'}) YIELD node, score
RETURN node, score
ORDER BY score DESC
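The classic power-iteration form of PageRank can be sketched in a few lines of Python; the graph and number of iterations below are made up for illustration:

```python
# Tiny directed graph: both 0 and 1 link to 2, node 2 links back to both.
out_links = {0: [2], 1: [2], 2: [0, 1]}
nodes = list(out_links)
damping = 0.85
rank = {n: 1.0 / len(nodes) for n in nodes}

for _ in range(20):  # power iterations
    incoming = {n: 0.0 for n in nodes}
    for src, targets in out_links.items():
        share = rank[src] / len(targets)  # spread rank over outgoing links
        for t in targets:
            incoming[t] += share
    rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
            for n in nodes}

print(max(rank, key=rank.get))  # -> 2, the node that receives the most links
```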

Awesome algorithms:

APOC also supports PageRank and betweenness with cypher, which do not need a direct network, but can use the power of cypher to run the algorithms on indirect relationships in the graph. That is pretty awesome.

Pagerank with cypher:

CALL apoc.algo.pageRankWithCypher(
{iterations:5, write:true, 
node_cypher:'MATCH (p:Person) return id(p) as id',
rel_cypher: 'MATCH (p:Person)-->()<--(o:Person) return id(p) as source,id(o) as target, 1 as weight'
});

Betweenness with cypher:

CALL apoc.algo.betweennessCypher(
{write:true, 
node_cypher:'MATCH (p:Person) return id(p) as id',
rel_cypher: 'MATCH (p:Person)-->()<--(o:Person) return id(p) as source,id(o) as target, 1 as weight'
})

Conclusion:

I like just how easy it is to run graph algorithm procedures with cypher. You can analyse a direct network, or even indirect networks, with cypher and without any external tools, which is pretty impressive.

Thanks for reading.

Neo4j APOC graph algorithms part 1

In the latest release of the APOC plugin there are some new graph algorithms, and one of them is a kNN algorithm, which is cool and easy to use. I created my own kNN euclidean distance algorithm a few months ago with cypher, and yes, it worked, but it was slow, because you are basically doing a cartesian product. I was pleasantly surprised how fast the APOC version is.

Requirements:

Download the latest APOC release.

Data:

I will use the standard movies dataset, which I feel is good benchmarking data to show what cypher combined with APOC is capable of.

:play movies

Features:

We will use age and the count of movies a person worked on as our two dummy features, so we can run some new apoc.algo functions, which provide us with lots of cool algorithms.

MATCH (p:Person) 
set p.age = 2017 - p.born
MATCH (p:Person)
with p,size((p)-->(:Movie)) as s
set p.count = s

Let's draw distributions with spoonJS, which you can easily attach to the Neo4j browser. It augments it with chart visualization capabilities, which are very useful and easy to use.

Age distribution:

We run this query and visualize the results.

// a few :Person nodes have a null age property, so we must filter them out
MATCH (p:Person) where exists (p.age) 
return distinct(p.age) as age,count(*) as count order by count desc

Screen Shot 2017-04-20 at 22.39.51

Number of movies distribution:

MATCH (p:Person) return distinct(p.count) as number_of_movies,
count(*) as count order by count desc

Screen Shot 2017-04-20 at 22.45.19

As we can easily tell, most persons are between 40 and 60 years old and have worked on a movie or two. So if we want to use both features together, we should do some sort of normalization.

Normalization:

Version 1:

I came up with a simple function that copies what I understand min-max normalization to do.

//filter out outliers
MATCH (p:Person) where p.age > 25 and p.count < 10
//get the span
WITH max(p.age) - min(p.age) as age_span,max(p.count) - min(p.count) as count_span
WITH toFLOAT(age_span) / toFLOAT(count_span) as coefficient
MATCH (p1:Person)
SET p1.age_nor = p1.age / coefficient

Version 2:

You can also just normalize each feature between 0 and 1. An example for one feature:

//filter out outliers
MATCH (p:Person) where p.age > 25 and p.count < 10
//get the the max and min value
WITH max(p.age) as max,min(p.age) as min
MATCH (p1:Person)
//normalize
SET p1.age_nor = (1.0 * p1.age - min) / (max - min)
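The same 0-to-1 scaling as a small standalone Python function, applied to made-up age values:

```python
def min_max(values):
    """Scale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(1.0 * v - lo) / (hi - lo) for v in values]

ages = [29, 43, 57, 71]  # hypothetical ages
print(min_max(ages))  # smallest value maps to 0.0, largest to 1.0
```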

I used the first version in all the visualizations below.

kNN queries:

Cosine similarity:

MATCH (p1:Person),(p2:Person) where id(p1) < id(p2) and exists(p1.age) and exists(p2.age)
WITH p1,p2,apoc.algo.cosineSimilarity([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.cosine = value
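For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. A small Python sketch:

```python
from math import sqrt

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity([3, 0.5], [3, 0.5]))  # identical vectors -> 1.0
print(cosine_similarity([1, 0], [0, 1]))      # orthogonal vectors -> 0.0
```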

Query for distribution:

We need to bucketize a bit for a better visualization.

match ()-[s:SIMILARITY]->()
WITH distinct(s.cosine) as cosine
return round(cosine * 100),count(*)

Screen Shot 2017-04-20 at 23.56.07

Euclidean distance:

MATCH (p1:Person),(p2:Person) where id(p1) < id(p2) and exists(p1.age) and exists(p2.age)
WITH p1,p2,apoc.algo.euclideanDistance([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.e_distance = value

Screen Shot 2017-04-20 at 23.44.33.png

Euclidean similarity:

MATCH (p1:Person),(p2:Person) where id(p1) < id(p2) 
and exists(p1.age) and exists(p2.age)
WITH p1,p2,apoc.algo.euclideanSimilarity([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.e_similarity = value
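Euclidean similarity is derived from the euclidean distance; as far as I understand, APOC computes it as 1 / (1 + distance), so identical vectors score 1 and the score decays towards 0 (this formula is my assumption, check the APOC docs). A Python sketch:

```python
from math import sqrt

def euclidean_distance(v1, v2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def euclidean_similarity(v1, v2):
    # assumed definition: similarity = 1 / (1 + distance)
    return 1.0 / (1.0 + euclidean_distance(v1, v2))

print(euclidean_distance([0, 0], [3, 4]))        # -> 5.0
print(euclidean_similarity([1, 0.5], [1, 0.5]))  # identical vectors -> 1.0
```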

Screen Shot 2017-04-20 at 23.49.23

Conclusion:

With the help of APOC and spoonJS we can easily run graph algorithms and quickly visualize the results, to get a feeling for what the data looks like. You do not need any external tools for simple chart visualizations, which is pretty amazing. APOC holds more graph algorithms, so the next blog post will probably come soon. Stay tuned!

Neo4j to Mapbox

Introduction:

This is part three of my Hospitals series with Neo4j. If you have not seen the first two parts, you should check them out to see how we got here.

Part 1 : Neo4j Location Trees
Part 2: Cypher Location queries

Basically what I have tried to do is to take these two maps

and combine both of their views into one visualization as a cool project. Let me show you how you can easily create map visualizations using Neo4j and Mapbox/Leaflet JS. One nice use case could be a map of potential customers to help you find dense areas, etc.

Screen Shot 2017-04-08 at 14.07.21
Data:

I got the data from the National Map website. They have quite a collection of maps for us to examine, from governmental boundaries to railroad maps. We just need to transform the Shapefile to WKT and we are good to go. I found out the GDAL library does that very simply, using the following command.

$ ogr2ogr -f CSV polygon_state.csv cb_2015_us_state_20m.shp -nlt POLYGON -lco GEOMETRY=AS_WKT

Note that ideally you would want to use multipolygons, because, as I later found out, some states have islands, like Alaska, and if we save them as a single polygon, the map visualization does not work properly.

Now we have to save our WKT data in Neo4j and set them as properties of states.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/state_polygon.csv" as row
MATCH (s:State) WHERE s.name = row.STUSPS
SET s.state = row.NAME,s.wkt = row.WKT

MapBox:

I stole most of the code from the Legis and Panama map projects listed above. As you will see, I am not experienced with JS, and I welcome anybody who would like to improve this project’s JS 🙂

Polygons:

We will need to retrieve the WKT polygons and some meta-data for the states using cypher. We filter out Alaska intentionally, because its islands ruin our visualization.

var districtQuery =
    "MATCH (state:State)<-[:IS_IN*5..5]-(hospital) " +
    "WHERE exists(state.wkt) AND NOT state.name = 'AK' " +
    "OPTIONAL MATCH (hospital)-[:HAS_RATING]->(r) " +
    "RETURN state.wkt as wkt, " +
    "{state:state.state, count:count(hospital), rating:avg(toInt(r.name))} as text";

Next we need to convert Polygon WKT string to an array of arrays of [x,y] points.

function parseWKTPolygons(wkt) {
   var pointArr = [];
   var points = wkt.slice(10,-3).split(",")
   for(var i = 0; i < points.length;i++){
       var xy =points[i].split(" ")
       if (!(isNaN(xy[0]) || isNaN(xy[1]))){
          pointArr.push({lat:parseFloat(xy[1]),lng:parseFloat(xy[0])})
          }
        }
    return pointArr;
}
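For reference, the same parsing logic in Python (assuming simple POLYGON strings without holes, as in the JS version):

```python
def parse_wkt_polygon(wkt):
    """Turn 'POLYGON ((x y, x y, ...))' into a list of lat/lng point dicts."""
    body = wkt[wkt.index("((") + 2 : wkt.rindex("))")]
    points = []
    for pair in body.split(","):
        x, y = pair.split()  # WKT stores longitude first, latitude second
        points.append({"lat": float(y), "lng": float(x)})
    return points

print(parse_wkt_polygon("POLYGON ((-98.5 39.8, -97.2 40.1))"))
```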

Now finally we can draw them as a polygon on the map and bind a popup to them.

function drawPolygons(polygons, text) {
    var bounds = null;
    popuptext = buildStatePopup(text)
    var popup = L.popup({keepInView: true, minWidth: 350, maxWidth: 1000});
    popup.setContent(popuptext)
    var polyline = L.polygon(polygons).addTo(map);
    polyline.on('click', function(e){
           popup.setLatLng(e.latlng).openOn(map)
           this.openPopup();
          })
}

We have now created a map of simple polygons with clickable popup, like shown below in the picture.

Screen Shot 2017-04-08 at 17.07.51

Markers:

Now let's prepare a function that draws markers on our map, so that we can drill down to a single hospital. We will define our custom icon and also bind a popup on click.

function addAddress(lat,lon,data) {
    var hospitalIcon = L.icon({
        iconUrl: 'icons/Hospital.png',
        iconSize: [38, 95],     // size of the icon
        shadowSize: [50, 64],   // size of the shadow
        iconAnchor: [22, 94],   // point of the icon which will correspond to marker's location
        shadowAnchor: [4, 62],  // the same for the shadow
        popupAnchor: [-3, -76]  // point from which the popup should open relative to the iconAnchor
    });

    var marker = L.marker([lat,lon],{icon:hospitalIcon}).addTo(map);
    popuptext = buildAddressPopup(data);

    var popup = L.popup({keepInView: true, minWidth: 350, maxWidth: 1000});
    popup.setLatLng({lat:lat,lng:lon})
        .setContent(popuptext);
    marker.bindPopup(popup);
    marker.on('click', function(e){
        this.openPopup();
    });
}

If we zoom in, we can drill down to each hospital and click on it to get a popup. This is a very basic visualization, as MapBox supports all sorts of cool visualizations. There are all sorts of improvements possible; I will try to update it, and I invite you to join with your ideas/code.
Screen Shot 2017-04-08 at 17.27.16.png

Now all that is left is for us to set the rules of interacting with the map.

 L.mapbox.accessToken = MB_API_TOKEN;
    var map = L.mapbox.map('map', 'mapbox.streets')
    // run getStates when map loads
    map.on('load',function(){
        getStates();
   })
   // set starting view
   map.setView([39.8282, -98.5795], 4); 
   // disable zoom on doubleClick
   map.doubleClickZoom.disable(); 
   // run getAddresses and draw all hospitals within 100km on the map
   // on double click
   map.on('dblclick', function(e) {
        clearMap(map);
        getAddresses(e.latlng, 100)
    });

Conclusion:

I have found that without much JS skill you are able to create all sorts of custom map visualizations, which is pretty cool. Since you can visualize almost anything that is stored in your Neo4j, the options are practically unlimited.

If you made it this far, thanks for reading through, and if you liked it, please share it with the community.

p.s.