Paradise papers analysis with Neo4j

I haven’t used a real-world dataset on this blog yet, so I decided it’s time to try out a real-world network. I will be using the Paradise Papers dataset, which we can explore thanks to ICIJ.

The Paradise Papers are a leak of millions of documents about companies and trusts in offshore tax havens. They revealed information on tens of thousands of companies and people, including some high-profile figures such as the Queen of England.

We can get this dataset, and any of the other leaks, from the official ICIJ site in CSV or Neo4j Desktop form. If you are lazy like me, you can just use the Neo4j online sandbox, where you get your own Neo4j instance with APOC and the graph algorithms library set up within a matter of seconds.

Graph model:

We have a graph of officers, who are persons or banks in the real world, and entities, which are companies. Entities can also have intermediaries, and they all have one or more registered addresses that we know of.

Today I will focus more on the use of algorithms, but if you want to learn more about the dataset itself and Neo4j, you should check out analysing paradise papers and an in-depth analysis of paradise papers by Michael Hunger and William Lyon.

Michael Hunger & William Lyon, An In-Depth Graph Analysis of the Paradise Papers, https://neo4j.com/blog/depth-graph-analysis-paradise-papers/

Infer the network

As mentioned, we will focus on graph algorithm use cases for the Paradise Papers.

We will assume that officers who are related to the same entity might know each other, or at least have some contact with one another. Under this assumption we create a social network of officers who are related to the same entities, restricted to officers with a registered address in Switzerland. I filtered for Switzerland so we might get a better understanding of its local investment network.

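// Pair up officers that are related to the same entity
MATCH (o1:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(o2:Officer)
// id(o1) > id(o2) keeps each pair only once; both officers must list Switzerland
WHERE id(o1) > id(o2) AND o1.countries CONTAINS "Switzerland"
AND o2.countries CONTAINS "Switzerland"
// The number of matched paths per pair becomes the relationship weight
WITH o1, o2, count(*) AS common_investments
MERGE (o1)-[c:COMMON_INVESTMENTS]-(o2)
ON CREATE SET c.weight = common_investments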
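
To make the projection easier to follow outside the database, here is a minimal Python sketch of the same idea on made-up data. The officer and entity names are hypothetical, the Switzerland filter is left out, and unlike count(*) in the query the sketch counts each shared entity only once per pair, even if an officer holds several roles in it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (officer, entity) pairs standing in for OFFICER_OF relationships.
officer_of = [
    ("Alice", "Entity A"),
    ("Bob", "Entity A"),
    ("Alice", "Entity B"),
    ("Bob", "Entity B"),
    ("Carol", "Entity B"),
]


def common_investments(pairs):
    """Count, for every unordered pair of officers, the entities they share."""
    officers_by_entity = {}
    for officer, entity in pairs:
        officers_by_entity.setdefault(entity, set()).add(officer)

    weights = Counter()
    for officers in officers_by_entity.values():
        # sorted() gives every pair one canonical order,
        # playing the role of id(o1) > id(o2) in the Cypher query.
        for o1, o2 in combinations(sorted(officers), 2):
            weights[(o1, o2)] += 1
    return weights


weights = common_investments(officer_of)
for (o1, o2), weight in weights.most_common():
    print(o1, o2, weight)
```

Each key of weights corresponds to one COMMON_INVESTMENTS relationship and its value to the weight property.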

 

Analysis

We start by analyzing the degree of nodes in our network. There are 1130 officers with a registered address in Switzerland. Each officer has on average 6 contacts with other officers through their entities.

MATCH (o:Officer)
WHERE (o)-[:COMMON_INVESTMENTS]-()
WITH o,size((o)-[:COMMON_INVESTMENTS]-()) as degree
RETURN count(o) as officers,
       avg(degree) as average_degree,
       stdev(degree) as stdev_degree,
       max(degree) as max_degree,
       min(degree) as min_degree

[Image: query result with the degree statistics]

Since we stored the number of common investments as a relationship property, we can search for the pairs of officers with the most common investments.

MATCH (o1:Officer)-[w:COMMON_INVESTMENTS]->(o2)
RETURN o1.name AS officer1,
       o2.name AS officer2,
       w.weight AS common_investments
ORDER BY common_investments DESC LIMIT 10

Barnett – Kevin Alan David seems to be closely intertwined with the Mackies: he has 322 common investments with Thomas Louis and 233 with Jacqueline Anne. In fact, eight of the top ten places belong to Barnett – Kevin Alan David, Hartland – Georgina Louise and the Mackies, which suggests that they cooperate on a large scale.

[Image: top 10 officer pairs by common investments]

Weakly connected components

With the weakly connected components algorithm we search for so-called “islands” in our graph. An island, or connected component, is a subgraph in which every node is reachable from every other node; any disconnected part of the global graph is its own component.

In our scenario it is a useful algorithm for finding groups of people who have common investments in companies and might know each other, or at least find it easier to get in touch with one another.
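
The procedure name unionFind hints at the data structure behind it. Here is a minimal Python sketch of that idea on a hypothetical toy graph; it illustrates the technique and is not the library's actual implementation.

```python
from collections import Counter


def find(parent, node):
    """Follow parent pointers to the root, shortening the path on the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def connected_components(nodes, edges):
    """Map every node to a representative node of its component."""
    parent = {node: node for node in nodes}
    for a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # union: merge the two islands
    return {node: find(parent, node) for node in nodes}


# Hypothetical toy graph: a chain a-b-c, a pair d-e, and a node f with no edges.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

component_of = connected_components(nodes, edges)
sizes = sorted(Counter(component_of.values()).values(), reverse=True)
print(sizes)
```

Nodes that end up with the same representative belong to the same island, which is the role setId plays in the query that follows.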

CALL algo.unionFind.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
    RETURN id(o) as id
   ','
    MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
    RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,setId
RETURN setId as component,count(*) as componentSize
ORDER BY componentSize desc limit 10

As with most real-world graphs I have encountered so far, we get one larger component and some smaller ones. If we wanted, we could dig deeper into the smaller components, check out their members and see if something interesting comes up.

[Image: the 10 largest components by size]

Let’s visualize component 14, for example.

[Image: visualization of component 14]

Studhalter – Alexander Walter seems to be closely interlaced with Gadzhiev – Nariman, as they have 60 common investments. Completing the triangle is Studhalter – Philipp Rudolf, with 15 common investments with Alexander Walter and 12 with Nariman. Alexander Walter sits at the center of this graph with connections to 8 different officers, so we could assume that he holds some influence over this network.

PageRank

PageRank was first used to measure the importance of websites, to help users find better results when searching the internet. In the domain of websites and links, each link is treated as a vote from one website for another, indicating that there is some quality content over there. The calculation also takes into account how important the voting website is, as a link from amazon.com means something completely different than a link from tbgraph.wordpress.com.

In the Paradise Papers domain we can use it to find potential influencers in our inferred “common_investments” network, as officers who have common investments with other important officers in the network will come out on top.
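
To show how the voting works, here is a minimal Python sketch of PageRank by power iteration on a hypothetical toy graph. The damping factor of 0.85 and the 20 iterations are assumptions chosen for the sketch, and the scores are normalized to sum to 1, so the absolute values will not match what the library procedure returns.

```python
def pagerank(neighbors, damping=0.85, iterations=20):
    """Power-iteration PageRank on an undirected graph given as adjacency lists."""
    n = len(neighbors)
    score = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        # Every node starts each round with the same small base score ...
        new_score = {node: (1.0 - damping) / n for node in neighbors}
        for node, nbrs in neighbors.items():
            if not nbrs:
                continue  # an isolated node has nobody to vote for
            # ... and splits its current score evenly among its neighbors.
            share = damping * score[node] / len(nbrs)
            for nbr in nbrs:
                new_score[nbr] += share
        score = new_score
    return score


# Hypothetical toy network: "hub" shares investments with a, b and c; c also with d.
toy = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": ["c"],
}

scores = pagerank(toy)
for officer, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(officer, round(value, 3))
```

Because our relationships are undirected, every relationship acts as a vote in both directions, which is why well-connected officers tend to rise to the top.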

CALL algo.pageRank.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD node,score
WITH node,score order by score desc limit 10
RETURN node.name as officer,score

Cabral – Warren Wilton comes out on top by a large margin. I checked him out, and it turns out he is an officer of 430 different entities and is connected to 116 other officers from Switzerland through his entities. Next comes the Swiss Reinsurance Company, which is a shareholder of 19 different entities. Thanks to ICIJ, you can get a detailed look at both Cabral – Warren Wilton and Swiss Reinsurance.

[Image: top 10 officers by PageRank]

Harmonic closeness centrality

We can interpret closeness centrality as the potential ability of a node to reach all other nodes as quickly as possible. In our example this works both ways, since other nodes can also reach the node quickly along the shortest paths between them. Harmonic centrality is a variation of closeness centrality that deals nicely with disconnected graphs.

In our domain we could interpret it as the potential ability for “insider trading” as having quick access to other nodes in the network could potentially lead to an advantage such as having access to (confidential) information before others.
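
To see why the harmonic variant copes with disconnected graphs, here is a minimal Python sketch on a hypothetical toy graph. Each node sums 1/distance over all the nodes it can reach, so unreachable nodes simply add nothing. Dividing by the number of other nodes is a convention I assume here; the library may scale its results differently.

```python
from collections import deque


def harmonic_centrality(neighbors):
    """Sum of 1/distance to every reachable node, divided by (n - 1)."""
    n = len(neighbors)
    result = {}
    for source in neighbors:
        # Breadth-first search gives shortest-path distances in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nbr in neighbors[current]:
                if nbr not in distance:
                    distance[nbr] = distance[current] + 1
                    queue.append(nbr)
        total = sum(1.0 / d for node, d in distance.items() if node != source)
        result[source] = total / (n - 1) if n > 1 else 0.0
    return result


# Hypothetical toy graph with two islands: a chain a-b-c and a pair d-e.
toy = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["e"],
    "e": ["d"],
}

harmonic = harmonic_centrality(toy)
print(harmonic)
```

Node b reaches two nodes in one step, so it scores highest, while d and e still get a meaningful score even though they cannot reach the other island.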

CALL algo.closeness.harmonic.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,centrality
WITH nodeId,centrality order by centrality desc limit 10
MATCH (n) where id(n)=nodeId
RETURN n.name as officer,centrality

Cabral – Warren Wilton also leads by harmonic centrality. He seems to be a big player in Switzerland. Swiss Reinsurance Company and PricewaterhouseCoopers are the only other two that were also in the PageRank top 10. All the others are new candidates we haven’t seen before. We can take a closer look at Schröder – Stefan and observe that he has connections in Swiss Re.

[Image: harmonic centrality results]

Betweenness centrality

Betweenness centrality is useful in finding nodes that serve as a bridge from one group of users to another in a graph. Betweenness centrality in a social network can be interpreted as a rudimentary measure of the control that a specific node exerts over the information flow throughout the graph.
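
To make the bridge idea concrete, here is a minimal Python sketch of Brandes' algorithm for unweighted graphs on a hypothetical toy network. The score of a node is the number of node pairs whose shortest paths pass through it, counted fractionally when a pair has several shortest paths. Halving the totals, so that each undirected pair is counted once, is a convention I assume here; the library's raw numbers may be scaled differently.

```python
from collections import deque


def betweenness_centrality(neighbors):
    """Brandes' algorithm for an unweighted, undirected graph."""
    centrality = {node: 0.0 for node in neighbors}
    for source in neighbors:
        # Phase 1: BFS from source, counting shortest paths (sigma) to every node.
        visit_order = []
        predecessors = {node: [] for node in neighbors}
        sigma = {node: 0 for node in neighbors}
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in neighbors[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Phase 2: walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in neighbors}
        while visit_order:
            w = visit_order.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected pair was seen from both ends, so halve the totals.
    return {node: value / 2.0 for node, value in centrality.items()}


# Hypothetical toy network: two pairs of officers connected only through "bridge".
toy = {
    "a": ["b", "bridge"],
    "b": ["a", "bridge"],
    "bridge": ["a", "b", "c", "d"],
    "c": ["d", "bridge"],
    "d": ["c", "bridge"],
}

betweenness = betweenness_centrality(toy)
print(betweenness)
```

All four cross-group pairs (a-c, a-d, b-c, b-d) have to pass through the bridge node, while no shortest path runs through any of the other nodes.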

CALL algo.betweenness.stream(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    {graph:'cypher'})
YIELD nodeId,centrality
WITH nodeId,centrality order by centrality desc limit 10
MATCH (n) where id(n)=nodeId
RETURN n.name as officer,centrality

The usual players are in the top 3 spots. We can also spot Schröder – Stefan in fifth place, along with other officers we haven’t come across yet. It’s interesting to see Zulauf – Hans-Kaspar up there, as he is an officer of only two entities; it looks like his position in the network is what makes him stand out.

[Image: top 10 officers by betweenness centrality]

Label propagation algorithm

Label propagation is a community detection algorithm. It divides the network into communities of nodes with dense connections internally and sparse connections between communities.
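
Here is a minimal Python sketch of the idea on a hypothetical toy graph. Every node starts with its own label and repeatedly adopts the label that carries the most weight among its neighbors. Using the relationship weight as the vote strength, breaking ties by the smallest label and visiting nodes in a fixed order are all assumptions made to keep the sketch deterministic; they are not necessarily what the library procedure does.

```python
from collections import defaultdict


def label_propagation(weighted_edges, max_iterations=10):
    """Each node repeatedly adopts the heaviest label among its neighbors."""
    neighbors = defaultdict(dict)
    for a, b, weight in weighted_edges:
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    labels = {node: node for node in neighbors}  # every node starts alone
    for _ in range(max_iterations):
        changed = False
        for node in sorted(neighbors):  # fixed visiting order
            votes = defaultdict(float)
            for nbr, weight in neighbors[node].items():
                votes[labels[nbr]] += weight
            best = max(votes.values())
            # Tie-break by the smallest label so that runs are repeatable.
            new_label = min(l for l, v in votes.items() if v == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # labels are stable, communities have formed
    return labels


# Hypothetical toy graph: two tightly knit triangles joined by one weak relationship.
edges = [
    (0, 1, 5), (0, 2, 5), (1, 2, 5),
    (2, 3, 1),
    (3, 4, 5), (3, 5, 5), (4, 5, 5),
]

communities = label_propagation(edges)
print(communities)
```

The weak relationship between nodes 2 and 3 never outvotes the strong relationships inside each triangle, so the two triangles settle on different labels.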

CALL algo.labelPropagation(
    'MATCH (o:Officer) WHERE (o)-[:COMMON_INVESTMENTS]-()
     RETURN id(o) as id
    ','
     MATCH (o1:Officer)-[q:COMMON_INVESTMENTS]-(o2)
     RETURN id(o1) as source,id(o2) as target',
    'OUT',
    {graph:'cypher',partitionProperty:'community',iterations:10})

To help us with analyzing communities we will use Gephi to visualize our network.

Visualize with Gephi:

I like to use Gephi for visualizing networks. It is a really cool tool that lets you draw nice network visualizations based on centrality and community values.

Check out my previous blog post Neo4j to Gephi for more information.

We would need to save centrality and label propagation results to nodes if we wanted to export them to Gephi. Assuming we have done that we can use the following query to export data from Neo4j to Gephi.

MATCH path = (o:Officer)-[:COMMON_INVESTMENTS]->(:Officer)
// The property names listed here must match the ones the results were saved under
CALL apoc.gephi.add(null, 'workspace1', path, 'weight', ['pagerank','labelpropagation','betweenness'])
YIELD nodes
RETURN DISTINCT "done"

Here we have a visualization of only the biggest component in the graph, with 344 members. There are 10+ communities that we can easily identify just by looking at this picture. I used the label propagation results for node color, the betweenness centrality results for node size and the PageRank results for node title.

We can’t really see much except that Cabral – Warren Wilton is very important in our network and positioned at the center of it.

[Image: Gephi visualization of the biggest component]

Let’s zoom in on the center of the network to get a better understanding of the graph.

As we noticed at the start, Barnett – Kevin Alan David is deeply connected with the Mackies and Hartlands. I have also noticed Hartland Mackie – Thomas Alan on the bottom left, which might explain why the Hartlands and Mackies are so deeply connected. We can also find Barnett – Emma Louise in this community, which would make this (red) community primarily a community of Barnetts, Hartlands and Mackies.

On the bottom right we can find Schröder – Stefan very close to the Swiss Reinsurance Company.

[Image: zoomed-in view of the center of the network]

Conclusion:

I think that an understanding of the graph combined with a proper visualization tool is powerful in the hands of a data explorer. With Neo4j and Gephi we are able to understand the graph and find insights even when we have little prior knowledge about the data and what exactly we are looking for in the first place.
