Neo4j London tube system analysis

I recently came across London tube dataset, that was uploaded by Nicola Greco. I thought it would a cool example to show some algorithms from the new Neo4j graph algorithms plugin. They are really useful as they allow network analysis without using external services to run the algorithms on.

Requirements:

Graph model:

We will create a very simple graph model, with stations and connections between them as shown below. You can check the always check the schema of your graph with db.schema() procedure.

london

Lets define the schema for our graph model with a constraint and an index, so that all queries will run faster.

CREATE CONSTRAINT ON (s:Station) ASSERT s.id is unique;

CREATE INDEX ON :Station(name);

Import:

LOAD CSV can import data from local files or from the internet, which is really cool, so that we can access data on github easily without the need of any intermediate steps.

Import stations

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row
MERGE (s:Station{id:row.id})
ON CREATE SET s.name = row.name,
              s.latitude=row.latitude,
              s.longitude=row.longitude,
              s.zone=row.zone,
              s.total_lines=row.total_lines

Import connections between stations.
We create relationships in both directions as trains also travel in both directions. We save the time spent traveling between stations as a relationship property, so that we can use it as a weight in our algorithms.

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row
MATCH (s1:Station{id:row.station1})
MATCH (s2:Station{id:row.station2})
MERGE (s1)-[:CONNECTION{time:row.time,line:row.line}]->(s2)
MERGE (s1)<-[:CONNECTION{time:row.time,line:row.line}]-(s2)

Analysis:

Which stations have the most connections to other stations?

MATCH (n:Station)--(n1)
RETURN n.name as station,
       count(distinct(n1)) as connections 
order by connections desc LIMIT 15

Results:

connections

Find the fastest path based between two stations.

algo.shortestPath is a dijsktra algorithm, that returns shortest path between start and end node. For more info check the documentation.

transfers are not taken into account

MATCH (start:Station{name:"Baker Street"}),(end:Station{name:"Beckton Park"})
CALL algo.shortestPath.stream(start, end, 'time')
YIELD nodeId, cost
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,cost as time

Results:

single_connection

Find out which stations take the longest to get to from Baker Street station.

algo.shortestPaths is a dijsktra algorithm, that return shortest paths from start node to all the other nodes in the network.

transfers are not taken into account

MATCH (n:Station {name:'Baker Street'})
CALL algo.shortestPaths.stream(n, 'time')
YIELD nodeId, distance
MATCH (s:Station) where id(s)=nodeId
RETURN s.name as station,distance as time 
ORDER BY time desc LIMIT 15

Results:

single_many

Find the longest shortest paths between stations in the network.

algo.allShortestPaths returns shortest paths between all pairs of nodes in the network

transfers are not taken into account

CALL algo.allShortestPaths.stream('time',
{nodeQuery:'Station',relationshipQuery:'CONNECTION',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance 
ORDER BY distance desc LIMIT 20
// We filter out duplicates
WHERE sourceNodeId > targetNodeId
MATCH (s1:Station),(s2:Station) 
WHERE id(s1)=sourceNodeId AND id(s2)=targetNodeId
RETURN s1.name as station1,s2.name as station2,distance as time

Results:

all_paths.png

Find bridge stations in network.

We can use betweenness centrality to help us identify, which are the stations, where people will be passing by the most, based on the structure of the network.

algo.betweenness is a betweenness algorithm, that helps us find bridges in the network. For more info check documentation.

CALL algo.betweenness.exp1.stream('Station','CONNECTION')
YIELD nodeId, centrality
MATCH (n:Station) where id(n)=nodeId
RETURN n.name,centrality 
ORDER BY centrality desc
LIMIT 15

Results:

bett

Conclusion:

It is very easy to enhance Neo4j with graph algorithms. You just copy the .jar file to /plugins folder, restart Neo4j and you are good to go. They offer lots of cool algorithms, that can help you analyse your graph in Neo4j. You should try them out!

Advertisements