The idea for this blog post is to show how to chain similarity algorithms with community detection algorithms in Neo4j without ever storing any information back to the graph. If, for example, you have only read rights to Neo4j, or you don’t want to store anything in the graph while analyzing it, then chaining algorithms is for you.

Since I’m playing around with a new dataset, the post also shows how to import and preprocess the graph for the analysis.

## Data

We use the Countries of the World dataset, made available by Fernando Lasso on Kaggle. It contains information such as birthrate, GDP, and infant mortality for 227 countries of the world.

Let’s start by researching how many missing values there are in the CSV file.

The **countries of the world.csv** file must be copied to the **Neo4j/import** folder before running the missing values query.

#### Missing values query

```cypher
LOAD CSV WITH HEADERS FROM "file:///countries%20of%20the%20world.csv" as row
UNWIND keys(row) as key
RETURN key,
       sum(CASE WHEN row[key] is null THEN 1 ELSE 0 END) as missing_value
ORDER BY missing_value DESC LIMIT 15
```

#### Missing values results

As expected with any real-world dataset, there are some missing values. Fortunately, only five features have a noticeable number of missing values, while the rest have close to none. Features with more than five percent of their values missing (11+) are not considered in our analysis.
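For intuition, the same per-column counting idea can be sketched in plain Python. The rows below are made up for illustration; they are not the real dataset.

```python
# Count missing values per column, mirroring the Cypher query's
# UNWIND keys(row) approach. The sample rows are invented.
rows = [
    {"Country": "A", "GDP ($ per capita)": "500", "Climate": None},
    {"Country": "B", "GDP ($ per capita)": None,  "Climate": "2"},
    {"Country": "C", "GDP ($ per capita)": "900", "Climate": None},
]

missing = {}
for row in rows:
    for key, value in row.items():
        missing[key] = missing.get(key, 0) + (1 if value is None else 0)

# Sort columns by missing count, descending (like ORDER BY ... DESC).
for key, count in sorted(missing.items(), key=lambda kv: -kv[1]):
    print(key, count)
```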

## Import

### Graph model

The schema of the graph is simple. We have a single type of node representing countries. Countries have some features stored as properties and are connected to the region they belong to with a relationship.

We could define unique constraints for the Country and Region labels, but since our graph is so tiny we’ll skip that today and proceed with the import.

#### Import query

```cypher
LOAD CSV WITH HEADERS FROM "file:///countries%20of%20the%20world.csv" as row
MERGE (c:Country{name:row.Country})
ON CREATE SET c += apoc.map.clean(row,['Country','Region'],[])
MERGE (r:Region{name:row.Region})
MERGE (c)-[:IN]->(r)
```

## Preprocessing

When I first explored this dataset in Neo4j, I got back weird results and didn’t understand why. After a couple of minutes, I found the culprit: if we run the cypher query `RETURN toFloat("0,1")`, it returns a `null` value. I later found out that this is a Java limitation and not specific to Neo4j. The numbers in this dataset use the comma as the decimal separator, which doesn’t work for us, so we need to replace the commas with dots to be able to store the values as floats.

Let’s replace the commas with dots and store the new values as floats.

```cypher
MATCH (c:Country)
UNWIND keys(c) as key
WITH c, key, toFloat(replace(c[key],',','.')) as fixed_float
WHERE NOT key = 'name'
CALL apoc.create.setProperty(c, key, fixed_float) YIELD node
RETURN distinct 'done'
```
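The same comma-to-dot conversion can be sketched in plain Python, where `float("0,1")` likewise fails (with a `ValueError` rather than a `null`). The `parse_decimal` helper below is illustrative, not part of the post’s pipeline.

```python
# Parse a string that uses a comma as the decimal separator.
# Unparseable or missing input yields None, mirroring toFloat()'s null.
def parse_decimal(value):
    if value is None:
        return None
    try:
        return float(value.replace(",", "."))
    except ValueError:
        return None

print(parse_decimal("0,1"))     # 0.1
print(parse_decimal("191,19"))  # 191.19
print(parse_decimal("N/A"))     # None
```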

### Feature selection

We don’t use all of the available country features in our analysis. I cherry-picked a couple of features that have little to no missing values, specifically:

- Birthrate
- Infant mortality
- GDP
- Arable %
- Crops %
- Phones (per 1000)

Let’s check basic statistics about our features with the **apoc.agg.statistics** aggregation function.

```cypher
UNWIND ["Birthrate", "Infant mortality (per 1000 births)", "GDP ($ per capita)",
        "Arable (%)", "Crops (%)", "Phones (per 1000)"] as potential_feature
MATCH (c:Country)
WITH potential_feature,
     apoc.agg.statistics(c[potential_feature],[0.5,0.75,0.9,0.95,0.99]) as stats
RETURN potential_feature,
       apoc.math.round(stats.min,2) as min,
       apoc.math.round(stats.max,2) as max,
       apoc.math.round(stats.mean,2) as mean,
       apoc.math.round(stats.stdev,2) as stdev,
       apoc.math.round(stats.`0.5`,2) as p50,
       apoc.math.round(stats.`0.75`,2) as p75,
       apoc.math.round(stats.`0.95`,2) as p95,
       apoc.math.round(stats.`0.99`,2) as p99
```

#### Results

I found it interesting that Monaco has 1035 phones per 1000 persons, so more phones than people. On the other side of the spectrum, I found it shocking that Angola has (I’m not sure from exactly which year) 191.19 infant deaths per 1000 births.

### MinMax Normalization

MinMax normalization scales all values of a feature to a range between 0 and 1. It is a common form of feature scaling, also known as data normalization.

```cypher
WITH ["Birthrate", "Infant mortality (per 1000 births)", "GDP ($ per capita)",
      "Arable (%)", "Crops (%)", "Phones (per 1000)"] as features
UNWIND features as key
MATCH (c:Country)
WITH max(c[key]) as max, min(c[key]) as min, key
MATCH (c1:Country)
WITH c1, "normalized_" + key AS newKey,
     (toFloat(c1[key]) - min) / (max - min) as normalized_value
CALL apoc.create.setProperty(c1, newKey, normalized_value) YIELD node
RETURN distinct 'normalization done'
```
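Conceptually, the query computes `(value - min) / (max - min)` per feature. A minimal Python sketch of the same scaling, with `None` standing in for a missing value:

```python
# Min-max scaling of one feature column; None (a missing value)
# is passed through untouched, as in the graph.
def min_max_scale(values):
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    return [None if v is None else (v - lo) / (hi - lo) for v in values]

print(min_max_scale([10.0, 20.0, None, 30.0]))  # [0.0, 0.5, None, 1.0]
```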

### Missing values

As we observed at the start, some features of the countries have missing values. We fill them in with a simple average of the country’s region.

```cypher
UNWIND ["normalized_Birthrate", "normalized_Infant mortality (per 1000 births)",
        "normalized_GDP ($ per capita)", "normalized_Arable (%)",
        "normalized_Crops (%)", "normalized_Phones (per 1000)"] as feature
MATCH (c:Country)
WHERE c[feature] IS null
MATCH (c)-[:IN]->(r:Region)<-[:IN]-(other:Country)
WHERE other[feature] IS NOT null
WITH c, feature, avg(other[feature]) as avg_value
CALL apoc.create.setProperty(c, feature, avg_value) YIELD node
RETURN distinct 'missing values populated'
```
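The idea behind the query can be sketched in plain Python: every missing value is replaced by the mean of that feature over the other countries in the same region. The data shapes and names here are illustrative.

```python
# Region-average imputation: fill each missing value with the mean
# of the non-missing values from the same region.
def impute_by_region(countries):
    """countries: list of dicts with 'region' and 'value' (value may be None)."""
    sums, counts = {}, {}
    for c in countries:
        if c["value"] is not None:
            sums[c["region"]] = sums.get(c["region"], 0.0) + c["value"]
            counts[c["region"]] = counts.get(c["region"], 0) + 1
    for c in countries:
        if c["value"] is None:
            c["value"] = sums[c["region"]] / counts[c["region"]]
    return countries

data = [
    {"region": "SUB-SAHARAN AFRICA", "value": 0.2},
    {"region": "SUB-SAHARAN AFRICA", "value": 0.4},
    {"region": "SUB-SAHARAN AFRICA", "value": None},  # filled with region mean
]
print(impute_by_region(data)[2]["value"])
```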

## Analysis

The analysis is split into two parts.

In the first part, we find similar countries using the cosine similarity algorithm.

In the second part, we chain cosine similarity with the Louvain algorithm to find communities of similar countries.

### Cosine similarity algorithm

If you are new to the cosine similarity metric, I suggest you check out the documentation or the Wikipedia article.
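In short, cosine similarity measures the angle between two feature vectors, ignoring their magnitude. A minimal sketch in plain Python:

```python
import math

# Cosine similarity: dot(a, b) / (||a|| * ||b||), ranging from -1 to 1.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score ~1.0 regardless of scale;
# orthogonal vectors score ~0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```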

To get a feeling for how similar our countries are, we run the **algo.similarity.cosine** algorithm with the **write:false** parameter. It returns global statistics about the similarity graph without writing anything back.

```cypher
MATCH (c:Country)
WITH id(c) as id,
     [c["normalized_Birthrate"], c["normalized_Infant mortality (per 1000 births)"],
      c["normalized_GDP ($ per capita)"], c["normalized_Arable (%)"],
      c["normalized_Crops (%)"], c["normalized_Phones (per 1000)"]] as weights
WITH {item:id, weights: weights} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine(data, {similarityCutoff: 0.0, write:false})
YIELD nodes, similarityPairs, min, max, mean, stdDev, p25, p50, p75, p90
RETURN nodes, similarityPairs,
       apoc.math.round(min,2) as min,
       apoc.math.round(max,2) as max,
       apoc.math.round(mean,2) as mean,
       apoc.math.round(stdDev,2) as stdDev,
       apoc.math.round(p25,2) as p25,
       apoc.math.round(p50,2) as p50,
       apoc.math.round(p75,2) as p75,
       apoc.math.round(p90,2) as p90
```

#### Results

Even with **similarityCutoff: 0.0** (cosine similarity ranges between -1 and 1), we get an average of more than 100 similarity relationships per node; each node is connected to about 50% of the other nodes in the graph. Community detection algorithms don’t perform well on such a dense network and would most likely find one giant community spanning most of the graph.

This is not a desired outcome when searching for communities. To avoid it, we use a combination of **similarityCutoff: 0.8** and **topK: 10**. Read more about these parameters in the documentation.

#### Stream cosine similarity query

```cypher
MATCH (c:Country)
WITH id(c) as id,
     [c["normalized_Birthrate"], c["normalized_Infant mortality (per 1000 births)"],
      c["normalized_GDP ($ per capita)"], c["normalized_Arable (%)"],
      c["normalized_Crops (%)"], c["normalized_Phones (per 1000)"]] as weights
WITH {item:id, weights: weights} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine.stream(data, {similarityCutoff: 0.8, topK: 10})
YIELD item1, item2, count1, count2, similarity
WITH item1, item2, similarity
WHERE item1 > item2
RETURN algo.getNodeById(item1).name AS from,
       algo.getNodeById(item2).name AS to,
       similarity
ORDER BY similarity DESC LIMIT 10
```

#### Results

The results show that, judging by our features, African countries are the most similar countries in the world.

### Chaining Cosine Similarity and Louvain algorithm

We have prepared everything, so now we can finally chain cosine similarity with the Louvain algorithm. The Louvain algorithm is run using a cypher projection.

To project a graph with cypher projection, we must provide two cypher statements. The first one, the so-called node-statement, is where we provide the ids of all the nodes the algorithm should consider. In the second statement, also called the relationship-statement, we provide all the relationships of our graph, as pairs of source and target node ids.

The node-statement is fairly trivial, as we just provide all the countries. The cosine similarity algorithm goes into the relationship-statement, streaming all the relationships of the similarity graph to the Louvain algorithm.

```cypher
CALL algo.louvain.stream(
  'MATCH (c:Country) RETURN id(c) as id',
  'MATCH (c:Country)
   WITH id(c) as id,
        [c["normalized_Birthrate"], c["normalized_Infant mortality (per 1000 births)"],
         c["normalized_GDP ($ per capita)"], c["normalized_Arable (%)"],
         c["normalized_Crops (%)"], c["normalized_Phones (per 1000)"]] as weights
   WITH {item:id, weights: weights} as userData
   WITH collect(userData) as data
   CALL algo.similarity.cosine.stream(data, {similarityCutoff: 0.8, topK: 10})
   YIELD item1, item2, similarity
   RETURN item1 AS source, item2 AS target',
  {graph: 'cypher'})
YIELD nodeId, community
WITH algo.getNodeById(nodeId) as node, community
RETURN community,
       count(*) as community_size,
       apoc.math.round(avg(node["Birthrate"]),2) as avg_birthrate,
       apoc.math.round(avg(node["Infant mortality (per 1000 births)"]),2) as infant_mortality,
       apoc.math.round(avg(node["GDP ($ per capita)"]),2) as gdp,
       apoc.math.round(avg(node["Arable (%)"]),2) as arable,
       apoc.math.round(avg(node["Crops (%)"]),2) as crops,
       apoc.math.round(avg(node["Phones (per 1000)"]),2) as phones
```

#### Results

The Louvain algorithm found 7 communities in our graph. The first community is the richest judging by GDP, has the lowest infant mortality rate, and, interestingly, also almost the lowest birthrate.

On the other side, there is the poorest community of countries, with an average GDP of only $1,633. They lead in infant mortality, which is quite high at 88.61 deaths per 1000 births, and also lead in birthrate.

## Conclusion

I was also thinking of showing how to visualize the communities in the graph using only streaming algorithms, but I decided against it, as we would have to run the similarity algorithm twice in a single query, which doesn’t make sense.

Still, I want to end with a visualization of the communities in the graph, made in Gephi.