Neo4j Game of Thrones part 1

We are only a month away from the seventh Game of Thrones season, so the time is right to import GoT data into our favourite graph database Neo4j. Data is from the books, so some things that happened on the TV show, haven’t happened yet in the books. The data that is available on kaggle is a good material to get started as I can show you all the queries you will need, when importing and linking different sources of data together. This will probably end up 3 to 4 part series, so that we can cover all the material available.

Data:

We will be using kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. I cleaned up the names of the persons a bit, so that they match better. There are 3 CSVs available.

  • battles.csv — contains information about all of the battles in game of thrones.
  • character-deaths.csv — contains information about characters and when they died.
  • character-predictions.csv –This csv takes an expanded view on character deaths, including predictions of how likely they are to die.

In this first part we will import battles.csv.

Graph model:

Screen Shot 2017-06-23 at 21.35.53

We will start with a simple graph model containing Person,House,Battle nodes and a very simple location tree.Unfortunately we cannot link Person to House nodes as they are not explicity marked. We will get this connections from the other two CSVs.

Import:

First we will create constraints and indexes, so that our queries will be faster.

CALL apoc.schema.assert(
{Location:['name'],Region:['name']},
{Battle:['name'],Person:['name'],House:['name']});

We will split the import into 4 different queries for easier understanding.

Battle:

First we create :Battle nodes using MERGE.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
//merge node labeled Battle 
MERGE (b:Battle{name:row.name})
ON CREATE SET b.year = toINT(row.year),
              b.summer = row.summer,
              b.major_death = row.major_death,
              b.major_capture = row.major_capture,
              b.note = row.note,
              b.battle_type = row.battle_type,
              b.attacker_size = toINT(row.attacker_size),
              b.defender_size = toINT(row.defender_size)

Foreach trick:

If we check the data, we can see that some battles have one attacker and some have four. They are stored in separate columns, so when there is only one attacker, there are 3 columns with a null value. When there are null values in a column and we want to skip them (ignore them) when using MERGE on that property, we can use the following foreach trick.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// there is only attacker_outcome in the data, 
// so we do a CASE statement for defender_outcome
WITH row,
case when row.attacker_outcome = "win" THEN "loss" ELSE "win" END as defender_outcome
// match the battle
MATCH (b:Battle{name:row.name})
// all battles have atleast one attacker so we don't have to use foreach trick
MERGE (attacker1:House{name:row.attacker_1}) 
MERGE (attacker1)-[a1:ATTACKER]->(b) 
ON CREATE SET a1.outcome = row.attacker_outcome

// When we want to skip null values we can use foreach trick
FOREACH
  (ignoreMe IN CASE WHEN row.defender_1 is not null THEN [1] ELSE [] END | 
    MERGE (defender1:House{name:row.defender_1})
    MERGE (defender1)-[d1:DEFENDER]->(b)
    ON CREATE SET d1.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.defender_2 is not null THEN [1] ELSE [] END | 
    MERGE (defender2:House{name:row.defender_2})
    MERGE (defender2)-[d2:DEFENDER]->(b)
    ON CREATE SET d2.outcome = defender_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_2 is not null THEN [1] ELSE [] END | 
    MERGE (attacker2:House{name:row.attacker_2})
    MERGE (attacker2)-[a2:ATTACKER]->(b)
    ON CREATE SET a2.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_3 is not null THEN [1] ELSE [] END | 
    MERGE (attacker2:House{name:row.attacker_3})
    MERGE (attacker3)-[a3:ATTACKER]->(b)
    ON CREATE SET a3.outcome = row.attacker_outcome)
FOREACH
  (ignoreMe IN CASE WHEN row.attacker_4 is not null THEN [1] ELSE [] END | 
    MERGE (attacker4:House{name:row.attacker_4})
    MERGE (attacker4)-[a4:ATTACKER]->(b)
    ON CREATE SET a4.outcome = row.attacker_outcome)

Coalesce:

What if we didn’t want to skip null values, but mark them with a default value “Unknown”. In this case we can use coalesce, to set a default value to all null values. Because there is only one location with null value, we do not have to use local merges, but you can get more information in Neo4j Location Trees.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
MATCH (b:Battle{name:row.name})
// We use coalesce, so that null values are replaced with "Unknown" 
MERGE (location:Location{name:coalesce(row.location,"Unknown")})
MERGE (b)-[:IS_IN]->(location)
MERGE (region:Region{name:row.region})
MERGE (location)-[:IS_IN]->(region)

Multiple unwinds:

Using multiple unwinds in a single query, we need to take of the cardinality (number of rows), as UNWIND transforms any list back into individual rows. We need to separate them using any aggregation function, so they do not become nested (like a nested .map function in JS).

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv" as row
// We split the columns that may contain more than one person
WITH row,
     split(row.attacker_commander,",") as att_commanders,
     split(row.defender_commander,",") as def_commanders,
     split(row.attacker_king,"/") as att_kings,
     split(row.defender_king,"/") as def_kings,
     row.attacker_outcome as att_outcome,
     CASE when row.attacker_outcome = "win" THEN "loss" 
     ELSE "win" END as def_outcome
MATCH (b:Battle{name:row.name})
// we unwind a list
UNWIND att_commanders as att_commander
MERGE (p:Person{name:trim(att_commander)})
MERGE (p)-[ac:ATTACKER_COMMANDER]->(b)
ON CREATE SET ac.outcome=att_outcome
// to end the unwind and correct cardinality(number of rows)
// we use any aggregation function ( e.g. count(*))
WITH b,def_commanders,def_kings,att_kings,att_outcome,def_outcome,count(*) as c1
UNWIND def_commanders as def_commander
MERGE (p:Person{name:trim(def_commander)})
MERGE (p)-[dc:DEFENDER_COMMANDER]->(b)
ON CREATE SET dc.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,def_kings,att_kings,att_outcome,def_outcome,count(*) as c2
UNWIND def_kings as def_king
MERGE (p:Person{name:trim(def_king)})
MERGE (p)-[dk:DEFENDER_KING]->(b)
ON CREATE SET dk.outcome = def_outcome
// reset cardinality with an aggregation function (end the unwind)
WITH b,att_kings,att_outcome,count(*) as c3
UNWIND att_kings as att_king
MERGE (p:Person{name:trim(att_king)})
MERGE (p)-[ak:ATTACKER_KING]->(b)
ON CREATE SET ak.outcome = att_outcome

Conclusion:

As we have to import two more CSVs I will not do any queries on the data yet, but focus first on the importing and linking part of the process and then do an analysis on the data from all three CSVs. Second part should be coming in a few days so stay tuned!

Advertisements

2 thoughts on “Neo4j Game of Thrones part 1

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s