Neo4j Facebook ego network analysis

Graph algorithms plugin for Neo4j is getting bigger and bigger by the day, so here are some new algorithms, that have been added to the collection. I will use Facebook’s ego network of friends to show triangle count algorithms and betweenness centrality algorithms on an undirected graph in Neo4j.

Requirements:

Graph model:

We will import a friends social network of Facebook’s users. There are 4039 nodes labeled User and 88234 unweighted relationships indicating friendship between users as shown below. You can always get the schema of your graph using CALL dbms.schema().

user

We create a constraint on User labels, so that our queries will run faster.

CREATE CONSTRAINT ON (u:User) ASSERT u.id is unique

Import:

Direct link to data is here. We will model this network as undirected. There is a twist here, that Neo4j supports storing only directed relationships. That is no problem as we can still treat it as undirected in our queries and algorithms.

USING PERIODIC COMMIT 5000
LOAD CSV FROM 
"file:///facebook_combined.txt" as row fieldterminator ' '
MERGE (u:User{id:row[0]})
MERGE (u1:User{id:row[1]})
MERGE (u)-[:FRIEND]-(u1)

We are using MERGE without specifying direction of the relationship as it does not matter in an undirected graph. Using MERGE like this will check for any existing relationships in both direction and if none exists it will choose a random direction to store the new relationship.

Graph Analysis:

Find (weakly) connected components in the graph ?

I always start with checking how many connected components are there in the graph. Most graph algorithms run best on an connected component, so I usually choose the largest component to run further algorithms on. Check documentation for more.

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
return setId as component, count(*) as size

un

As we can see there exists only one connected component in this graph. This is very good as there are no disconnected components. Graph algorithms library also support finding strongly connected components with algo.scc. In undirected graph every weakly connected component is also strongly connected so we can skip it.

 Get average number of friends and standard deviation.

MATCH (u:User)
return avg(size((u)--())) as average_degree,
       stdev(size((u)--())) as stdev

st

A user has on average 43 friends. Standard deviation is quite high so the number of friends is spread out over a wider range of values. We can check now for users with most friends to start looking for influencers in our network.

Degree:

MATCH (u:User)
RETURN u.id as id,size((u)--()) as degree 
order by degree desc limit 10

deg

Most basic metric is degree, that we can get using plain cypher. This directly translates to the number of friends a user has. It can be used as a good starting ground to finding influencers in our network. Top 4 users have more than 10x the average friends with user 107 being really high at 1045 friends (26% of all users).

Calculate triangle count and clustering coefficient for each node and return total triangle count and average clustering coefficient of the network.

CALL algo.triangleCount('User', 'FRIEND',
{concurrency:4, write:true, writeProperty:'triangles', 
clusteringCoefficientProperty:'coefficient'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient,
loadMillis, computeMillis, writeMillis

global

There are 4039 node with 1612010 triangles in our network. Average clustering is quite high at 0.6055. This is expected as evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of relationships. [1]

Check user local triangle count and cluster coefficient.

The local clustering coefficient of a node in a graph quantifies how close its neighbours are to being a complete graph, where complete graph is a simple undirected graph in which every pair of distinct nodes is connected by a unique relationship.[1]

If put simple the clustering coeffient would be 1 when all of my friends knew eachother.
Above algorithms also saved values of individual nodes as properties of nodes, so we can now query the results.

MATCH (u:User)
return u.id,u.triangles as triangles,
            u.coefficient as cluster_coefficient 
order triangles desc limit 10

tri

User 1912 is a member of more triangles then user 107 , who has the most friends. They both have a small clustering coefficient, which indicates, that their friends don’t really know eachother and they have a great potencial for introducing people to one another. User 1912,107 were third and first by the number of friends, so combined with this results they are solidifying their importance in the network. There is quite a drop in triangle count to third place, while on the other hand their neighbour communities are more tightly knit.

Betweenness centrality:

Betweenness centrality is useful in finding nodes that serve as a bridge from one group of users to another in  a graph. Consequently, betweenness is a rudimentary measure of the control, that a specific node exerts over the information flow throughout the full graph.

We use direction:'BOTH'for algorithms on an undirected graph.

Check documentation for more.

CALL algo.betweenness.stream('User', 'FRIEND', 
{direction:'BOTH'}) 
YIELD nodeId,centrality
MATCH (u:User) where id(u) = nodeId
RETURN u.id, centrality order by centrality desc limit 15

betwennness

User 107 tops the betweenness centrality results, which is not surprising as he has the most amount of friends with a very low clustering coeffient. So he in the position to act as a broker of informations between different groups of users. He sure does appear as the main influencer in this network based on the metrics we checked. Based on the combined view of centrality metrics we can create a list with users, who are likely to be important in our network and do some further research on them.

Conclusion:

As you can see graph algorithms plugin greatly improves our ability to analyse the network in Neo4j. You should check them out! If you think some algorithms are still missing or find some bugs please let us know on github or the Neo4j slack community.

 

[1] https://en.wikipedia.org/wiki/Clustering_coefficient

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s