Neo4j Location Trees

Lately I have been trying to define a few rules for graph models that apply across many use cases. That is quite a challenge, considering that Neo4j lets us store data however we like, which leaves us with lots of decisions to make along our graph journey. I found that when dealing with hierarchical trees, such as location trees or time trees, one can define a set of rules that make the queries we want easy to write and fast to run.

Requirements:

Guide:

You can follow the guide from Neo4j Browser or Sandbox using

:play http://guides.neo4j.com/contrib/hospital.html

Data:

Let's grab some public data that will be useful for the example. Download the CSV flat files from Medicare.gov. We will use Hospital General Information.csv, because it holds the location information of hospitals in the USA along with some metadata about their mortality/safety/experience level compared to the national average. Copy it to the neo4j/import folder, so we can access it easily with Cypher from Neo4j Browser.

Graph model:

[Image: hospitalmeta]

As you can see, I created one big hierarchical location tree. Notice that all relationships within the tree are of the same type, which allows for optimized queries: we can easily traverse up the location levels.

Example:

// Hospital -> Address -> ZipCode -> City is three hops
MATCH (h:Hospital)-[:IS_IN*3]->(city:City)
RETURN city, count(h) AS numberOfHospitals
ORDER BY numberOfHospitals DESC
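
Because every relationship in the tree has the same type, the same pattern can reach any level by changing only the target label. A minimal sketch, assuming the labels and the name property used by the import later in this post; the upper bound of 5 hops is my assumption based on the Hospital, Address, ZipCode, City, County, State chain:

```cypher
// Count hospitals per state by walking up the tree until a :State node is reached.
MATCH (h:Hospital)-[:IS_IN*..5]->(state:State)
RETURN state.name AS state, count(h) AS numberOfHospitals
ORDER BY numberOfHospitals DESC
```

Filtering on the label rather than on an exact hop count keeps the query correct even if a level is later added to or removed from the tree.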

Importing:

In most tutorials you will see the standard procedure, where you merge all nodes separately and then merge the relationships between them.

Standard approach:

MERGE (state:State{name:row.State})
MERGE (county:County{name:row.`County Name`})
MERGE (city:City{name:row.City})
MERGE (zip:ZipCode{name:row.`ZIP Code`})
MERGE (address:Address{name:row.Address})
MERGE (state)<-[:IS_IN]-(county)
MERGE (county)<-[:IS_IN]-(city)
MERGE (city)<-[:IS_IN]-(zip)
MERGE (zip)<-[:IS_IN]-(address)

With this approach we run into problems, because some cities, addresses or hospitals share the same name, which in turn ruins our location tree structure. Each standalone MERGE matches on the name alone, so every 100 HOSPITAL DRIVE in the data collapses into a single node. Our results are then corrupted by anomalies such as one address sitting in 7 different zip codes with 7 hospitals on it. That does not reflect reality: each zip code should have its own 100 HOSPITAL DRIVE address, with only one hospital per address.

[Image: hospital]
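
One way to surface this anomaly after running the standard import is to count the distinct zip codes each address points to. This is only a sketch, assuming the labels, the name property and the IS_IN relationship type from the import above:

```cypher
// Addresses that ended up attached to more than one zip code.
MATCH (a:Address)-[:IS_IN]->(z:ZipCode)
WITH a, count(DISTINCT z) AS zipCodes
WHERE zipCodes > 1
RETURN a.name AS address, zipCodes
ORDER BY zipCodes DESC
```

Any row returned here is an address node shared between zip codes, which is exactly the situation the rules below are meant to prevent.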

Let's define some basic rules so that our graph models reality and returns correct results.

We like to have the relationships directed upwards through the hierarchical levels, because we usually store the context of our graph at the lowest location level (the most detailed location info). Our queries will therefore start from the bottom, in our case Hospital, and go upwards as many levels as desired. The second rule is that each hospital has only one address, and for that reason an address cannot be in more than one zip code and/or city. To put it more technically:

Location trees rules:

  • All relationships are directed from children to parents, going up the hierarchy.
  • We have a single type for all relationships (e.g. PARENT, FROM or IS_IN).
  • Every node has a single outgoing relationship to its parent (the top-level node has none).
  • Every node can have one or multiple incoming relationships from its children.
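
After an import, the "single outgoing relationship" rule can be checked for all levels at once with a query along these lines. It is a sketch that assumes IS_IN is the single relationship type and that tree nodes carry a name property, as in the import used in this post:

```cypher
// List every node that has more than one parent in the location tree.
// An empty result means the "single outgoing relationship" rule holds.
MATCH (n)-[r:IS_IN]->()
WITH n, count(r) AS parents
WHERE parents > 1
RETURN labels(n) AS labels, n.name AS name, parents
ORDER BY parents DESC
```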

Now we can upgrade our import query to follow those rules. Notice that we do not merge every node separately. Instead we start from the level where the entity's name is still a unique identifier (a country name, for example) and then merge the child entities by pattern onto their parents.

MERGE (state:State{name:row.State})
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})

For Hospital General Information.csv we start from the state and then merge all children by pattern, because some counties/cities/zip codes/addresses share the same name.

LOAD CSV WITH HEADERS FROM "file:///Hospital%20General%20Information.csv" as row
// state name is unique
MERGE (state:State{name:row.State})
// merge by pattern with their parents
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})
// for entities it is best to have an id system
MERGE (h:Hospital{id:row.`Provider ID`})
ON CREATE SET h.phone = row.`Phone Number`,
  h.emergency_services = row.`Emergency Services`,
  h.name = row.`Hospital Name`,
  h.mortality = row.`Mortality national comparison`,
  h.safety = row.`Safety of care national comparison`,
  h.timeliness = row.`Timeliness of care national comparison`,
  h.experience = row.`Patient experience national comparison`,
  h.effectiveness = row.`Effectiveness of care national comparison`
MERGE (h)-[:IS_IN]->(address)
//Some metadata about hospitals
MERGE (type:HospitalType{name:row.`Hospital Type`})
MERGE (h)-[:HAS_TYPE]->(type)
MERGE (ownership:Ownership{name: row.`Hospital Ownership`})
MERGE (h)-[:HAS_OWNERSHIP]->(ownership)
MERGE (rating:Rating{name:row.`Hospital overall rating`})
MERGE (h)-[:HAS_RATING]->(rating)
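
Since hospitals are merged on their id, it may help to back that id with a uniqueness constraint, created before running the import. The following is a sketch rather than part of the original import; the constraint name hospital_id is my own choice, and the syntax shown is the one used by Neo4j 4.4 and later:

```cypher
// Enforce one :Hospital node per id and give MERGE an index to look ids up with.
// On older versions (3.x / early 4.x) the equivalent statement is:
//   CREATE CONSTRAINT ON (h:Hospital) ASSERT h.id IS UNIQUE
CREATE CONSTRAINT hospital_id IF NOT EXISTS
FOR (h:Hospital) REQUIRE h.id IS UNIQUE
```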

After importing we can check if our rules are being met for creating a location tree.

Check whether any :Address has more than one relationship going up the hierarchy:

// size() over a pattern works in Neo4j 3.x/4.x; in 5.x use COUNT { (a)-[:IS_IN]->() } instead
MATCH (a:Address)
WITH a WHERE size((a)-[:IS_IN]->()) > 1
RETURN a

We can also check the length of all the paths in the location tree.

MATCH path = (h:Hospital)-[:IS_IN*..10]->(location)
WHERE NOT (location)-[:IS_IN]->()
RETURN length(path) AS length, count(*) AS count

Every hospital should have exactly one location path, so the result must be a single row: one path length, with a count equal to the number of hospitals. In our case we have 4807 hospitals.

[Image: check]
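
The aggregate above would not reveal which hospital is at fault if the counts were off. A possible follow-up, sketched under the same assumptions (IS_IN as the only tree relationship, at most 10 hops, and hospitals carrying the id and name properties set during the import):

```cypher
// Report hospitals whose number of paths to a top-level node is not exactly 1.
// An empty result means every hospital has a single location path.
MATCH (h:Hospital)
OPTIONAL MATCH path = (h)-[:IS_IN*..10]->(top)
WHERE NOT (top)-[:IS_IN]->()
WITH h, count(path) AS paths
WHERE paths <> 1
RETURN h.id AS id, h.name AS name, paths
ORDER BY paths DESC
```

A hospital with 0 paths is detached from the tree, while one with 2 or more paths sits below a node that has several parents.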

If we check how the 100 HOSPITAL DRIVE addresses were imported, we can see that there are now several address nodes, each within its own zip code.
[Image: hospital 3]
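
The same check can be done in tabular form. This sketch assumes the address is stored exactly as '100 HOSPITAL DRIVE' in the name property, which depends on how the CSV spells it:

```cypher
// Each returned row is a separate 100 HOSPITAL DRIVE address node with its own zip code.
MATCH (a:Address {name: '100 HOSPITAL DRIVE'})-[:IS_IN]->(z:ZipCode)
RETURN id(a) AS addressNode, a.name AS address, z.name AS zipCode
ORDER BY zipCode
```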

Thanks for taking the time to read through. I plan to turn this into a multi-part blog post, so stay tuned for more. Leave some feedback with your thoughts about these rules.
