Finding alternative routes in California road network with Neo4j

The focus of this blog post is to introduce Yen’s k-shortest path algorithm that was recently added to Neo4j graph algorithms. I will use Californian road network dataset made available by Feifei Li.

Next we will enrich our graph using Google’s reverse geocoding API and then proceed to find the shortest path with Dijkstra algorithm and the second shortest path using Yen’s k-shortest algorithm.

Graph schema

Our graph has a simple schema of nodes labeled Intersection connected with relationship CONNECTION to other intersections.

cas.png

Import

Lets first define the constraint in our graph schema.

CREATE CONSTRAINT ON (i:Intersection) ASSERT i.id IS UNIQUE;

Dataset is split into nodes and relationship files. Lets import them both to get all the data we need.

Import nodes

USING PERIODIC COMMIT
LOAD CSV FROM
"https://www.cs.utah.edu/~lifeifei/research/tpq/cal.cnode" 
as row fieldterminator " "
MERGE (i:Intersection{id:row[0]})
ON CREATE SET i.longitude = toFloat(row[1]),
              i.latitude = toFloat(row[2])

Import relationships

USING PERIODIC COMMIT
LOAD CSV FROM
"https://www.cs.utah.edu/~lifeifei/research/tpq/cal.cedge" 
as row fieldterminator " "
MATCH (start:Intersection{id:row[1]})
MATCH (end:Intersection{id:row[2]})
MERGE (start)-[c:CONNECTION{id:row[0]}]->(end)
ON CREATE SET c.length = toFloat(row[3])

Reverse geocode API

For every intersection in our graph we can get the address based on the GPS location with the help of Google’s reverse geocoding API . I used apoc.util.sleep(50) to throttle and wait 50 ms between each API call. It cost me around €7 to get this data as I couldn’t find a free version of the API :/.

MATCH (i:Intersection)
CALL apoc.util.sleep(50)
WITH "https://maps.googleapis.com/maps/api/geocode/json?latlng=" + 
toString(i.latitude) + "," + toString(i.longitude) + "&key={google_api_key}" as json,i
CALL apoc.load.json(json) yield value
SET i.name = value.results[0].formatted_address

Analysis

Lets start with visualizing Santa Barbara’s part of the road network with neovis.js.

 

santa_barbara.png

Neovis configuration
var config = {
   container_id: "viz",
   server_url: "bolt://localhost:7687",
   server_user: "neo4j",
   server_password: "neo4j",
   labels: {
     "Intersection": {
      "caption": "title"
      }
    },
   relationships: {
     "CONNECTION": {
      "caption": false
      }
    },
   initial_cypher: "MATCH p=(i1:Intersection)-[:CONNECTION]->(i2:Intersection)" +
     "WHERE i1.name contains 'Santa Barbara' AND i2.name contains 'Santa Barbara'" +
     "RETURN p limit 500"
 };

Shortest path

We use algo.shortestPath, that uses Dijkstra algorithm,  to find the shortest path between “Cathedral Peak Trail” and “5750 Stagecoach Rd”. We set direction:BOTH to treat our graph as undirected.

MATCH (start:Intersection{title:"Cathedral Peak Trail"}),
      (end:Intersection{title:"5750 Stagecoach Rd"})
CALL algo.shortestPath.stream(start,end,'length',{direction:'BOTH'}) YIELD nodeId,cost
MATCH (n) where id(n)= nodeId
RETURN n.title,cost

Visualization made with neovis.js.

shortest_path.png

Neovis configuration
var config = {
   container_id: "viz",
   server_url: "bolt://localhost:7687",
   server_user: "neo4j",
   server_password: "neo4j",
   labels: {
     "Intersection": {
       "caption": "title"
     }
   },
   relationships: {
     "CONNECTION": {
       "thickness": "length",
       "caption": false
     }
   },
   initial_cypher: "MATCH (start:Intersection{title:'Cathedral Peak Trail'}),(end:Intersection{title:'5750 Stagecoach Rd'}) " +
     "CALL algo.shortestPath.stream(start,end,'length',{direction:'BOTH',nodeQuery:'Intersection',relationshipQuery:'CONNECTION'}) YIELD nodeId,cost " +
     "MATCH (n) where id(n)=nodeId " + 
     "WITH collect(n) as nodes " +
     "UNWIND range(0, length(nodes)-2) as index " +
     "WITH nodes[index] as from, nodes[index + 1] as to " + 
     "MATCH p=(from)-[:CONNECTION]-(to) " +
     "RETURN p"
 };

Yen’s k-shortest paths

Yen’s k-shortest paths algorithm computes single-source K-shortest loopless paths for a graph with non-negative relationship weights.

It uses Dijkstra algorithm to find the shortest path and then proceeds to find k-1 deviations of the shortest paths. If we want to find the second shortest path we use k=2 as shown below.

MATCH (start:Intersection{title:"Cathedral Peak Trail"}),
(end:Intersection{title:"5750 Stagecoach Rd"})
CALL algo.kShortestPaths(start, end, 2, 'length' ,{})
YIELD resultCount
RETURN resultCount

Results are stored as paths in our graph.

MATCH p=()-[:PATH_0|:PATH_1]->()
RETURN p

Shortest path is visualized in red and second shortest path in yellow. We can easily observe that the paths diverge right at the start and join together 2 hops before the end.

yens-k2.png

Conclusion

With the addition of Yen’s k-shortest algorithm to the Neo4j graph algorithms library we can now search for alternative shortest routes in our graph. This can come in handy in various domains.

Advertisements

Neo4j to Mapbox

Introduction:

This is the part three of my Hospitals Series with Neo4j. If you have not seen the first two, you should check them out to see how we got to here.

Part 1 : Neo4j Location Trees
Part 2: Cypher Location queries

Basically what I have tried to do is to take these two maps

and combine both their view into one visualization as a cool project. Let me show you how you can easily create map visualizations using Neo4j and Mapbox/Leaflet JS. One nice use case can be a map with potencial customers to help you find dense areas etc…

Screen Shot 2017-04-08 at 14.07.21
Data:

I got the data from the National Map website. They have quite a collection of maps for us to examine from govermental boundaries to railroad maps. We just need to transform the Shapefile to WKT and we are good to go. I found out GDAL library does that very simply using the command.

$ ogr2ogr -f CSV polygon_state.csv cb_2015_us_state_20m.shp -nlt POLYGON -lco GEOMETRY=AS_WKT

Note that ideally you would want to use multipolygons, because as I later found out, some states have islands, like Alaska, and if we save them as a polygon, map visualization does not work properly.

Now we have to save our WKT data in Neo4j and set them as properties of states.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/state_polygon.csv" as row
MATCH (s:State) WHERE s.name = row.STUSPS
SET s.state = row.NAME,s.wkt = row.WKT

MapBox:

I have stole most of the code from Legis and Panama map project listed above. Also as you will see I am not experienced with JS, and I welcome anybody that would want to upgrade this project’s JS 🙂

Polygons:

We will need to retrieve WKT polygons and some meta-data for states using. We filter out Alaska intentionally because of its islands, which ruin our vizualization.

var districtQuery =
"MATCH (state:State)<-[:IS_IN*5..5]-(hospital)  WHERE exists (state.wkt) AND not state.name = 'AK'" +  "OPTIONAL MATCH (hospital)-[:HAS_RATING]->(r)" + 
"RETURN state.wkt as wkt,{state:state.state,count : count(hospital),
rating:avg(toINT(r.name))} as text"

Next we need to convert Polygon WKT string to an array of arrays of [x,y] points.

function parseWKTPolygons(wkt) {
   var pointArr = [];
   var points = wkt.slice(10,-3).split(",")
   for(var i = 0; i < points.length;i++){
       var xy =points[i].split(" ")
       if (!(isNaN(xy[0]) || isNaN(xy[1]))){
          pointArr.push({lat:parseFloat(xy[1]),lng:parseFloat(xy[0])})
          }
        }
    return pointArr;
}

Now finally we can draw them as a polygon on the map and bind a popup to them.

function drawPolygons(polygons, text) {
    var bounds = null;
    popuptext = buildStatePopup(text)
    var popup = L.popup({keepInView: true, minWidth: 350, maxWidth: 1000});
    popup.setContent(popuptext)
    var polyline = L.polygon(polygons).addTo(map);
    polyline.on('click', function(e){
           popup.setLatLng(e.latlng).openOn(map)
           this.openPopup();
          })
}

We have now created a map of simple polygons with clickable popup, like shown below in the picture.

Screen Shot 2017-04-08 at 17.07.51

Markers:

Now lets prepare a function that draws markers on our map for the ability to drill down to single hospital. We will define our custom icon and also bind a popup on click.

function addAddress(lat,lon,data) {

var hospitalIcon = L.icon({
iconUrl: 'icons/Hospital.png',

     iconSize: [38, 95], // size of the icon
     shadowSize: [50, 64], // size of the shadow
     iconAnchor: [22, 94], // point of the icon which will correspond to marker's location
     shadowAnchor: [4, 62], // the same for the shadow
     popupAnchor: [-3, -76] // point from which the popup should open relative to the iconAnchor
});

var marker = L.marker([lat,lon],{icon:hospitalIcon}).addTo(map);
popuptext = buildAddressPopup(data);

var popup = L.popup({keepInView: true, minWidth: 350, maxWidth: 1000});
popup.setLatLng({lat:lat,lng:lon})
.setContent(popuptext)
marker.bindPopup(popup);
marker.on('click', function(e){
      this.openPopup();
    })
}

If we zoom in we can drill down on each hospital and click on it to get a popup. This is a very basic visualization as MapBox supports all sorts of cool visualizations. There are all sorts of improvement possible, I will try to update it, and I ask you to join with your ideas/code.
Screen Shot 2017-04-08 at 17.27.16.png

Now all that is left is for us to set the rules of interacting with the map.

 L.mapbox.accessToken = MB_API_TOKEN;
    var map = L.mapbox.map('map', 'mapbox.streets')
    // run getStates when map loads
    map.on('load',function(){
        getStates();
   })
   // set starting view
   map.setView([39.8282, -98.5795], 4); 
   // disable zoom on doubleClick
   map.doubleClickZoom.disable(); 
   // run getAddresses and draw all hospitals within 100km on the map
   // on double click
   map.on('dblclick', function(e) {
        clearMap(map);
        getAddresses(e.latlng, 100)
    });

Conclusion:

I have found out that without much JS skill, you are able to create all sorts of custom map vizualizations, which is pretty cool. Since you can visualize mostly whatever is stored in your Neo4j, options are practically unlimited.

If you made it this far, thanks for reading through, and if you liked it, please share it with the community.

p.s.

Cypher Location queries

This is second part of my Hospital project. If you missed the first one, be sure to check it out, as we will continue, where we left of last time. Today I will show you how to get GPS informations with apoc.spatial and run some cool queries.

You can follow the guide from Neo4j Browser or Sandbox using

:play http://guides.neo4j.com/contrib/hospital.html

GPS information:

You can easily get GPS latitude and latitude using apoc.spatial function, which by default uses Open Street Map API and has 5 seconds throttle time. I personally prefer to use Google Geocoding API, which you can easily configure following the documentation. Not only does it return better results, but you can also query by company name etc… It also allows smaller throttle time like 100 ms or less. The only downside is a daily limit of 2500 requests. Example of spatial request.

MATCH (h:Hospital)-[:IS_IN*..3]->(location)
with h,substring(reduce(s="", name in collect(location.name) | s + "," + name),1) as geoinfo
CALL apoc.spatial.geocodeOnce(geoinfo) YIELD location
set h += location

We have 4807 hospitals and daily limit of 2500 request, so I saved GPS locations to csv file, so that you can easily follow.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/gpsinfo.csv" as row
MATCH (h:Hospital{id:row.id})
SET h.longitude = toFloat(row.longitude),h.latitude=toFloat(row.latitude)

Spatial Query:

I must admit I stole the idea for this query from Michael Hunger and his JS version of MapBox map vizualizations with Neo4j. You can check out the git repo . I have a plan of doing a custom MapBox vizualization of hospitals too.

Lets say we we’re sightseeing on Liberty Island and hurt our knee. Now we want to find the nearest 10 hospitals to our position.

with "1 Liberty Island, New York" as myLocation
call apoc.spatial.geocodeOnce(myLocation) yield location
WITH point({longitude: location.longitude, latitude: location.latitude}) as myPosition,10 as distanceInKm
MATCH (h:Hospital)-->(rating:Rating) where exists(h.latitude) and
distance(myPosition, point({longitude:h.longitude,latitude:h.latitude})) < (distanceInKm * 1000)
RETURN h.name,rating.name as rating,distance(myPosition, point({longitude:h.longitude,latitude:h.latitude})) as distance order by distance limit 10

Unfortunately this query will not work in Sandbox

The results shows us, that there are a couple of hospitals within 10 km, but unfortunately only have a score of 2 out of 5 tops.

libertyisland

Other Queries:

Average rating by ownership:

MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:HAS_OWNERSHIP]->(o)
return o.name as ownership,avg(toINT(r.name)) as averageRating order by averageRating desc limit 15

Number of hospitals per city:

MATCH (h:Hospital)-[:IS_IN*3..3]->(city)
return city.name as city,count(h) as NumberOfHospitals order by NumberOfHospitals desc limit 15

Top 10 states by average rating:

MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:IS_IN*5..5]->(state)
where not r.name="Not Available"
return state.name as state,avg(toINT(r.name)) as averageRating,count(h) as numberOfHospitals order by averageRating desc limit 15

Which states has the most “above the average” hospitals in effectivness ?

MATCH (h:Hospital)-[:IS_IN*5..5]->(state) where h.effectiveness = "Above the National average"
return state.name as state,h.effectiveness,count(h) as numberOfHospitals order by numberOfHospitals desc limit 15

Which states has the most “below the average” hospitals in mortality ?

MATCH (h:Hospital)-[:IS_IN*5..5]->(state) where h.mortality = "Below the National average"
return state.name as state,h.mortality,count(h) as numberOfHospitals order by numberOfHospitals desc limit 15

P.s:

  • you can try querying this dataset in my Neo4j Sandbox example. Username: Hospital, Password: hospital
  • or you can create your own Neo4j Sandbox
  • I also created a guide, that you can find and host as shown in my git repo.

Neo4j to Gephi

I was inspired by this cool visualization from the Network of thrones analysis to try and recreate it. Thanks to Michael Hunger and William Lyon, I achieved it using Neo4j and Gephi together with Apoc library. Note that I am not a graph theory expert, so i will only focus on visualization aspect.

Requirements:

Download the latest APOC release.

Dataset:

We will use the standard movies dataset, which you can get with :play movies Cypher command. First we will create a social network by creating weighted relationships between persons. The assumption is that the more movies people worked at together the more they know each other. We do this by finding common movies and adding a point for each occasion with the following query.


MATCH (p1:Person)-->(:Movie)<--(p2:Person) where id(p1) < id(p2)
MERGE (p1)-[r:KNOWS]-(p2)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = r.weight + 1

Setup:

After installing Gephi we need to install Graph streaming plugin, which can also be easily installed using Tools --> Plugins --> Available Plugins tab in Gephi. Start a new project and turn on the streaming server as shown below.

startgephi

Export to Gephi:

We use a custom Apoc procedure, which works like this:

apoc.gephi.add(ip,'workspace1',path,'weightproperty') where ‘weightproperty’ is a property of the relationship that holds the weight value.  I named it weight before in cypher statement, but it can be any key of the relationship you want. If ip is set to null it will use default localhost . Our specific cypher query will look like this:

MATCH path = (:Person)-[:KNOWS]->(:Person)
CALL apoc.gephi.add(null,'workspace1',path,'weight') yield nodes
return *

Visualization:

Gephi offers lots of cool options, but it can have some learning curve, because it has so many features and you can get lost when first using it. I will show you some steps how to get to do cool visualizations as seen before in the network of thrones visualizations. This is a map of features I learned to use so far.

guideGephi

Steps:

  • Choose a layout and play around with options to see what best fits your need. I chose Force Atlas 2 layout with dissuade hubs and prevent overlap settings. Picture below:
  • forceatlas
  • Second step is to run graph algorithms and get a bunch of  graph metrics such as pagerank,centralities,etc….
  • Last step is to design the visualization. Gephi offers lots of custom details in how you design your graph visualizations and allows manualy dragging nodes aswell. I recommend this article to get some basic understanding and get a feeling what Gephi is capable of.

You can set color,size,label color and label size based on the calculated metrics from graph algorithms as shown below.

appearances

After 10 minutes of playing around with all the options, I came up with this graph visualization of actors related to how many movies they worked at together.

gepho

If you are still here thank you for reading through and please share some feedback 🙂

Neo4j Location Trees

Lately I have been trying to set a few rules for graph models, that should be applied in many use cases. Now that is quite a challenge, considering that Neo4j allows us to store the data however we feel like, leaving us with lots of decision we make along our graph journey. I found out that when dealing with hiearchical trees, such as location trees or time trees, one can be define a set of rules in order for us to easily and fast run the queries we want.

Requirements:

Guide:

You can follow the guide from Neo4j Browser or Sandbox using

:play http://guides.neo4j.com/contrib/hospital.html

Data:

Lets grab some public data, that will be useful for the example. Download the csv flat files at  Medicare.gov . We will use the Hospital General Information.csv in our case. because it holds the location information of  hospitals in USA and also some meta-date about their mortality/safety/experience level compared to national average. Copy it to neo4j/import folder, so we can access it easily with cypher from Neo4j Browser.

Graph model:

hospitalmeta

As you can see I created a big hiearchical location tree. You can notice that the relationship within the tree are of the same type, which will allow for optimized queries.We can easily traverse up the location levels.

Example:

MATCH (h:Hospital)-[:IS_IN*4..4]->(city)
return city,count(h) as numberOfHospitals order by numberOfHospitals desc

Importing:

In most tutorial you will see the standard procedure, where you merge all nodes separately and then merge relationships on them.

Standard approach:

MERGE (state:State{name:row.State})
MERGE (county:County{name:row.`County Name`})
MERGE (city:City{name:row.City})
MERGE (zip:ZipCode{name:row.`ZIP Code`})
MERGE (address:Address{name:row.Address})
MERGE (state)<-[:IS_IN]-(county)
MERGE (county)<-[:IS_IN]-(city)
MERGE (city)<-[:IS_IN]-(zip)
MERGE (zip)<-[:IS_IN]-(address)

We will run into some problems as some cities,addresses or hospitals share the same name, which in turn ruins our location tree structure. And so our results are corrupted because of these anomalies, where you have an address in 7 different zip codes and 7 hospitals on that address. That does not reflect the truth, because obviously each zip code should have its own 100 HOSPITAL DRIVE address and then there will be only one hospital per address, which is also what is in reality.

hospital

Lets define some basic rules in order for our graph to model reality and return correct results.

We like to have the relationship directed upwards the hiearchical levels, because usually we store context of our graph in the lowest location level(more detailed location info). So our queries will start from the bottom, in our case Hospital and go upwards as many levels as desired. The second rule is that each hospital has only one address, and for that reason an address cannot be in more zip codes and/or cities. To put it more technically.

Location trees rules:

  • All relationships are directed from children to parents, going up the hiearchy.
  • We have a single type for all relationships. (PARENT;FROM;IS_IN)
  • Every node has a single outgoing relationship to it’s parent.
  • Every node can have one or multiple incoming relationships from its children.

Now we can upgrade our query to import our graph by those rules. Notice that we do not merge every node separately, but we start from the level where the entities name like country name is still an unique identifier and then merge children entities by pattern to the parents.

MERGE (state:State{name:row.State})
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})

For Hospital General Information.csv we start from the state and then merge all children by pattern because some share counties/cities/zipcodes/addresses share the same name

LOAD CSV WITH HEADERS FROM "file:///Hospital%20General%20Information.csv" as row
// state name is unique
MERGE (state:State{name:row.State})
// merge by pattern with their parents
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})
// for entities it is best to have an id system
MERGE (h:Hospital{id:row.`Provider ID`})
ON CREATE SET h.phone=row.`Phone Number`,
h.emergency_services = row.`Emergency Services`, h.name= row.`Hospital Name`, h.mortality = row.`Mortality national comparison`, h.safety = row.`Safety of care national comparison`,h.timeliness = row.`Timeliness of care national comparison`,h.experience = row.`Patient experience national comparison`,h.effectiveness = row.`Effectiveness of care national comparison`
MERGE (h)-[:IS_IN]->(address)
//Some metadata about hospitals
MERGE (type:HospitalType{name:row.`Hospital Type`})
MERGE (h)-[:HAS_TYPE]->(type)
MERGE (ownership:Ownership{name: row.`Hospital Ownership`})
MERGE (h)-[:HAS_OWNERSHIP]->(ownership)
MERGE (rating:Rating{name:row.`Hospital overall rating`})
MERGE (h)-[:HAS_RATING]->(rating)

After importing we can check if our rules are being met for creating a location tree.

Check if any :Address have more than one relationship going upwards the hiearchy

match (a:Address)
with a where size((a)-[:IS_IN]->()) > 1
return a

We can also check the length of all the paths in location tree.

MATCH path=(h:Hospital)-[:IS_IN*..10]->(location) where not (location)-[:IS_IN]->()
return distinct(length(path)) as length,count(*) as count

Every hospital should have exactly one location path, so the results must be one path per hospital with all having the same length. In our case we have 4807 hospitals.

check

If we check how we imported the 100 HOSPITAL DRIVE we can see that now there are several addresses all within each own zip code.
hospital 3

Thanks for taking your time and reading through. I have a plan to turn this into a few parts blog post, so stay tuned for more. Leave some feedback with your thoughts about these rules.