Neo4j Categorical Pagerank

I found a cool Neo4j blog post by Kenny Bastani that describes a concept called categorical pagerank.
I will try to recreate it using the neo4j-graph-algorithms library and the Game of Thrones (GoT) dataset.

 

[Figure: categorical pagerank example]

Kenny Bastani, Categorical PageRank Using Neo4j and Apache Spark, https://neo4j.com/blog/categorical-pagerank-using-neo4j-apache-spark/

The idea behind it is pretty simple. As shown in the example above, we have a graph of pages that link to each other and may belong to one or more categories. To better understand the global pagerank score of nodes in a network, we can break our graph down into several subgraphs, one per category, and run the pagerank algorithm on each of those subgraphs. We store the results as a relationship property between the category and its pages.
This way we can see which categories contribute to a page’s global pagerank score.
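
To make the idea concrete, here is a minimal Python sketch that runs pagerank once per category on a toy page graph. The pages, links and categories are hypothetical, and the `pagerank` helper uses a simple unnormalized formulation that I am assuming for illustration; it is not the library's implementation.

```python
from collections import defaultdict

def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Unnormalized PageRank: PR(n) = (1 - d) + d * sum(PR(m) / outdeg(m))."""
    nodes = list(nodes)
    out_degree = defaultdict(int)
    for source, _ in edges:
        out_degree[source] += 1
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        incoming = defaultdict(float)
        for source, target in edges:
            incoming[target] += scores[source] / out_degree[source]
        scores = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return scores

def categorical_pagerank(links, categories):
    """Run PageRank once per category, on the subgraph of that category's pages."""
    result = {}
    for category, pages in categories.items():
        # Keep only links whose source and target both belong to the category.
        sub_links = [(s, t) for s, t in links if s in pages and t in pages]
        for page, score in pagerank(pages, sub_links).items():
            # Stored per (category, page), like a relationship property.
            result[(category, page)] = score
    return result

# Hypothetical toy data: four pages, two overlapping categories.
links = [("a", "b"), ("c", "b"), ("b", "d"), ("d", "a")]
categories = {"science": {"a", "b", "c"}, "history": {"b", "d"}}
scores = categorical_pagerank(links, categories)
```

Page "b" belongs to both categories, so it ends up with one score per category, which is exactly the breakdown we are after.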

 

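Requirements:

Neo4j
APOC (for the apoc.* procedures used in the import query)
neo4j-graph-algorithms (for the algo.pageRank procedure)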

Graph Model:

We will use the dataset made available by Joakim Skoog through his project, An API of Ice and Fire.

I first encountered this dataset when Michael Hunger showed how to import the data in his game of data blog post. I thought the dataset was pretty nice, and since all I had to do was copy and paste the import queries, I decided to play around with it and wrote a Neo4j GoT Graph Analysis post.

Michael’s import query of house data:

// create Houses and their relationships
call apoc.load.jsonArray('https://raw.githubusercontent.com/joakimskoog/AnApiOfIceAndFire/master/data/houses.json') yield value
// cleanup
with apoc.map.clean(apoc.convert.toMap(value), [],['',[''],[],null]) as data
// lowercase keys
with apoc.map.fromPairs([k in keys(data) | [toLower(substring(k,0,1))+substring(k,1,length(k)), data[k]]]) as data

// create House
MERGE (h:House {id:data.id}) 
// set attributes
SET 
h += apoc.map.clean(data, ['overlord','swornMembers','currentLord','heir','founder','cadetBranches'],[])

// create relationships to people or other houses
FOREACH (id in data.swornMembers | MERGE (o:Person {id:id}) MERGE (o)-[:ALLIED_WITH]->(h))
FOREACH (s in data.seats | MERGE (seat:Seat {name:s}) MERGE (seat)-[:SEAT_OF]->(h))
FOREACH (id in data.cadetBranches | MERGE (b:House {id:id}) MERGE (b)-[:BRANCH_OF]->(h))
FOREACH (id in case data.overlord when null then [] else [data.overlord] end | MERGE (o:House {id:id}) MERGE (h)-[:SWORN_TO]->(o))
FOREACH (id in case data.currentLord when null then [] else [data.currentLord] end | MERGE (o:Person {id:id}) MERGE (h)-[:LED_BY]->(o))
FOREACH (id in case data.founder when null then [] else [data.founder] end | MERGE (o:Person {id:id}) MERGE (h)-[:FOUNDED_BY]->(o))
FOREACH (id in case data.heir when null then [] else [data.heir] end | MERGE (o:Person {id:id}) MERGE (o)-[:HEIR_TO]->(h))
FOREACH (r in case data.region when null then [] else [data.region] end | MERGE (o:Region {name:r}) MERGE (h)-[:IN_REGION]->(o));

After importing the dataset, our graph has the schema shown below. You can always check the schema of your graph using CALL db.schema.

[Figure: graph schema]

Categorical pagerank:

As in my previous blog post, we will use the SWORN_TO network of houses to demonstrate categorical pagerank, this time using regions as categories. This way we will try to understand and break down which regions the houses get their power and support from.

We first match all regions, so that the algorithm runs once per region. In the node-statement of the cypher projection we project only the houses that belong to the given region, by concatenating the region name into the query string. Because the nodes are already filtered by region, we don’t have to filter the relationships: only relationships whose source and target nodes are both returned by the node-statement are projected, and all other relationships are ignored.

We then save the results as a relationship property between the region and the house.

MATCH (r:Region)
CALL algo.pageRank.stream('
    MATCH (h:House)-[:IN_REGION]->(r:Region)
    WHERE r.name ="' + r.name +
    '" RETURN id(h) as id
    ','
    MATCH (h1:House)-[:SWORN_TO]->(h2:House)
    RETURN id(h1) as source,id(h2) as target',
    {graph:'cypher'}) 
YIELD nodeId,score
MATCH (h:House) where id(h) = nodeId
MERGE (r)-[p:PAGERANK]->(h)
ON CREATE SET p.score = score
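
The projection rule described above, where a relationship survives only if both of its end nodes are in the node set, can be sketched in a few lines of Python. The house and region assignments are hypothetical placeholders, and `project` is an illustrative helper, not a library function.

```python
def project(node_ids, relationships):
    """Keep only relationships whose source and target are both in node_ids."""
    node_ids = set(node_ids)
    return [(s, t) for s, t in relationships if s in node_ids and t in node_ids]

# Hypothetical SWORN_TO pairs (source, target).
sworn_to = [("h1", "h2"), ("h2", "h4"), ("h5", "h6")]

# Hypothetical node-statement result: houses of a single region.
region_houses = ["h1", "h2", "h3"]

projected = project(region_houses, sworn_to)
```

Here ("h2", "h4") is dropped because its target lies outside the region, which is why the relationship-statement in the query needs no WHERE clause of its own.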

Let’s first examine The North.

MATCH (r:Region{name:"The North"})-[p:PAGERANK]->(h)
RETURN h.name as house,p.score as pagerank 
ORDER BY pagerank DESC LIMIT 5

House Bolton leads, with House Stark following in second place. This might be disheartening to some fans, as the Starks are more lovable than the Boltons, but we all know how things ended for the Boltons in the TV series.

[Figure: top five houses in The North by pagerank]

The Westerlands is the home region of House Lannister. Let’s see how well they do there.

MATCH (r:Region{name:"The Westerlands"})-[p:PAGERANK]->(h)
RETURN h.name as house,p.score as pagerank
ORDER BY pagerank DESC LIMIT 5

The Lannisters have very strong direct support in their home region. This shows in the fact that House Farman is the only other house in the Westerlands that has the support of at least one house.

[Figure: top five houses in The Westerlands by pagerank]

 

Second version:

As I was writing this post and running the above algorithm, it occurred to me that even though a house might not be in a specific region, it might still have the support of a house in that region, and hence support from that region.

For that reason I turned the projection around a bit: we now project all the houses in our graph and keep only the SWORN_TO relationships whose source node is based in the given region. This directly translates to support from a house in that region.

We filter out pagerank scores below 0.151, as 0.15 is the score of a node with no inbound relationships (1 minus the default damping factor of 0.85), and save the remaining results as a relationship between a region and a house. This way we keep our graph tidy.

MATCH (r:Region)
CALL algo.pageRank.stream('
    MATCH (h:House)
    RETURN id(h) as id
    ','
    MATCH (r:Region)<-[:IN_REGION]-(h1:House)-[:SWORN_TO]->(h2:House)
    WHERE r.name ="' + r.name +
    '" RETURN id(h1) as source,id(h2) as target',
    {graph:'cypher'})
YIELD nodeId,score
WITH nodeId,score,r where score > 0.151
MATCH (h:House) where id(h) = nodeId
MERGE (r)-[p:SUPPORT]->(h)
ON CREATE SET p.score = score
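
As a rough illustration of this second version, the Python sketch below keeps every house as a node, keeps only relationships whose source is in the chosen region, and drops scores that do not exceed the 0.151 cutoff. The houses and regions are hypothetical, and the unnormalized pagerank formula is my assumption; it is chosen because it reproduces the 0.15 score for nodes without inbound relationships mentioned above.

```python
from collections import defaultdict

def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Unnormalized PageRank: PR(n) = (1 - d) + d * sum(PR(m) / outdeg(m))."""
    nodes = list(nodes)
    out_degree = defaultdict(int)
    for source, _ in edges:
        out_degree[source] += 1
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        incoming = defaultdict(float)
        for source, target in edges:
            incoming[target] += scores[source] / out_degree[source]
        scores = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return scores

def regional_support(houses, sworn_to, region_of, region, threshold=0.151):
    """Score all houses using only SWORN_TO pairs whose source lies in `region`."""
    edges = [(s, t) for s, t in sworn_to if region_of.get(s) == region]
    scores = pagerank(houses, edges)
    # Houses with no support from this region sit at 0.15 and are filtered out.
    return {house: score for house, score in scores.items() if score > threshold}

# Hypothetical data: two vassals and their lord in one region, the crown in another.
houses = ["vassal_1", "vassal_2", "lord", "crown"]
sworn_to = [("vassal_1", "lord"), ("vassal_2", "lord"), ("lord", "crown")]
region_of = {"vassal_1": "north", "vassal_2": "north",
             "lord": "north", "crown": "crownlands"}

north_support = regional_support(houses, sworn_to, region_of, "north")
crownlands_support = regional_support(houses, sworn_to, region_of, "crownlands")
```

In this toy graph the crown receives support from the north even though it is not located there, while the crownlands region supports nobody because no house based there is sworn to another house.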

As only 51 relationships are created, we can easily visualize this network in Neo4j Browser. It is pretty obvious that House Baratheon of King’s Landing has support from most regions, lacking only The Neck and Beyond the Wall.

[Figure: SUPPORT relationships between regions and houses]

Let’s check the top 20 individual regional pagerank scores.

MATCH (h:House)<-[s:SUPPORT]-(r:Region)
RETURN r.name as region,h.name as house,s.score as score 
ORDER BY score DESC LIMIT 20

Both Baratheon houses are very dominant in the Crownlands. House Tyrell comes in third by regional pagerank score, with its support coming from The Reach. House Tyrell is sworn to House Baratheon of King’s Landing, and solely because of this relationship House Baratheon comes in immediately after House Tyrell in support from The Reach. This pattern occurs through most of the graph, except for The North, where House Baratheon comes in ahead of the Starks and the Boltons, having support from both of them.

[Figure: table of the top 20 regional pagerank scores]

Conclusion:

With cypher projections we get all the freedom the cypher query language provides. We can even parametrize graph algorithms to run only on specific subgraphs, as shown in this blog post. Cypher projections are a very powerful tool for extracting useful insights from our graphs and, if you are familiar with cypher, also quite easy to use.
