Neo4j Game of Thrones part 3

This is part 3 of GoT series, so if you haven’t seen previous posts i suggest you check them out.

In this part we will import the last CSV, so that we will be able to use data from all three sources in our analysis/visualizations.

Data:

We will be using kaggle dataset of Game of Thrones, which I saved on my github repo for easier access. In this second part we will import character_predictions.csv. This CSV takes an expanded view on character deaths, including predictions of how likely they are to die.

Graph model:

Screen Shot 2017-06-25 at 14.53.14

Our graph model is slowly evolving with every new CSV we import. We add a new Culture node and RELATED_TO relationship between persons, that creates a social network of persons. By the time we are done with importing, we will have 2091 persons and 352 House nodes in our knowledge graph, which is not bad for GoT dataset.

Import:

As you will see importing character_predictions.csv , is doing 10 FOREACH tricks, because the data has lots of null values in it, that we need to skip in order for the MERGE statement to work as needed.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
MERGE (p:Person{name:row.name})
// set properties on the person node
SET p.title = row.title,
    p.death_year = toINT(row.DateoFdeath),
    p.birth_year = toINT(row.dateOfBirth),
    p.age = toINT(row.age),
    p.gender = case when row.male = "1" then "male" else "female" end
// doing the foreach trick to skip null values
FOREACH
  (ignoreMe IN CASE WHEN row.mother is not null THEN [1] ELSE [] END |
    MERGE (mother:Person{name:row.mother})
    MERGE (p)-[:RELATED_TO{name:"mother"}]->(mother)
)
FOREACH
  (ignoreMe IN CASE WHEN row.spouse is not null THEN [1] ELSE [] END |
    MERGE (spouse:Person{name:row.spouse})
    MERGE (p)-[:RELATED_TO{name:"spouse"}]->(spouse)
)
FOREACH
  (ignoreMe IN CASE WHEN row.father is not null THEN [1] ELSE [] END |
    MERGE (father:Person{name:row.father})
    MERGE (p)-[:RELATED_TO{name:"father"}]->(father)
)
FOREACH
  (ignoreMe IN CASE WHEN row.heir is not null THEN [1] ELSE [] END |
    MERGE (heir:Person{name:row.heir})
    MERGE (p)-[:RELATED_TO{name:"heir"}]->(heir)
)
// we remove "House " from the value for better linking of data
FOREACH 
  (ignoreMe IN CASE WHEN row.house is not null THEN [1] ELSE [] END | 
    MERGE (house:House{name:replace(row.house,"House ","")}) 
    MERGE (p)-[:BELONGS_TO]->(house) 
)

I splited the query into two parts for more clarity, but as you see there is nothing fancy about it, just using the FOREACH trick over and over again.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
// match person
MERGE (p:Person{name:row.name})
// doing the foreach trick... we lower row.culture for better linking
FOREACH
  (ignoreMe IN CASE WHEN row.culture is not null THEN [1] ELSE [] END |
    MERGE (culture:Culture{name:lower(row.culture)})
    MERGE (p)-[:MEMBER_OF_CULTURE]->(culture)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book1 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:1})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book2 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:2})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book3 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:3})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book4 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:4})
    MERGE (p)-[:APPEARED_IN]->(book)
)
FOREACH
  (ignoreMe IN CASE WHEN row.book5 = "1" THEN [1] ELSE [] END |
    MERGE (book:Book{sequence:5})
    MERGE (p)-[:APPEARED_IN]->(book)
)

Multi-labeled nodes:

Nodes can have none, one or more labels. Think of labels as a hiearchy of rules, where first level labels group entities together (:Person),(:Company),(:House) and additional, second level labels as a preprocessed way of filtering nodes faster/easier (:Person:Dead),(:Person:King),(:Company:Reseller) etc…

Check also AirBnB talk at graphconnect.

Lets mark all the dead people in our database first, since this is GoT database. Next we will mark all Kings and Knights with additional labels.

Dead persons:

Query 1:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv" as row
// do CASE statements
with row,
case when row.isAlive = "0" THEN [1] ELSE [] END as dead_person,
case when row.isAliveMother = "0" THEN [1] ELSE [] END as dead_mother,
case when row.isAliveFather = "0" THEN [1] ELSE [] END as dead_father,
case when row.isAliveHeir = "0" THEN [1] ELSE [] END as dead_heir,
case when row.isAliveSpouse = "0" THEN [1] ELSE [] END as dead_spouse
// MATCH all the persons
MATCH (p:Person{name:row.name})
// We use optional match so that it doesnt stop the query if not found
OPTIONAL MATCH (mother:Person{name:row.mother})
OPTIONAL MATCH (father:Person{name:row.father})
OPTIONAL MATCH (heir:Person{name:row.heir})
OPTIONAL MATCH (spouse:Spouse{name:row.spouse})
// Set the label of the dead persons
FOREACH (d in dead_person | set p:Dead)
FOREACH (d in dead_mother | set mother:Dead)
FOREACH (d in dead_father | set father:Dead)
FOREACH (d in dead_heir | set heir:Dead)
FOREACH (d in dead_spouse | set spouse:Dead)

Query 2:

MATCH (p:Person) where exists (p.death_year)
SET p:Dead

Kings:

Query 1:

MATCH (p:Person)-[:DEFENDER_KING|ATTACKER_KING]-()
SET p:King

Query 2:

MATCH (p:Person) where lower(p.title) contains "king"
SET p:King

Knight:

Query 1:

MATCH (p:Person) where p.title = "Ser"
SET p:Knight

Refactor:

After importing you can check the cleanliness of your data how well the Persons in your database match or are there many typos etc…with ordering the return by alphabet, so you can quickly spot mistakes if they exist.

Example query:

MATCH (n:Person)
RETURN n.name as name order by name

I cleaned up the culture row in character_predictions.csv, but left one mistake, so that I can show you how to clean up node duplication caused by a typo in the data.

Screen Shot 2017-06-25 at 22.18.53.png

You can use apoc plugin, to help you with refactoring, so that you can merge the two nodes into a single while retaining all the connections from both nodes.

match (c1:Culture{name:"wildlings"}),(c2:Culture{name:"wildlingss"}) 
call apoc.refactor.mergeNodes([c2,c1]) yield node 
return distinct "done"

Conclusion:

I hope you got a feeling how it looks like when importing different CSVs. If you manage to understand most of what I used in this series, you are probably capable of importing almost everything into Neo4j. We are done with importing data, so next time we can start analysing our GoT dataset.

You can grab the Neo4j database and just copy it into /data/databases folder.

Got for Neo4j 3.2.x

Got for Neo4j 3.1.x

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s