The latest release of the APOC plugin adds several new graph algorithms. One of them is a kNN algorithm, which is cool and easy to use. A few months ago I wrote my own kNN Euclidean distance algorithm in Cypher. It worked, but it was slow, because you are basically computing a Cartesian product. I was pleasantly surprised by how fast the APOC version is.

Requirements:

- Neo4j (Neo4j site)
- APOC plugin

*Download the latest APOC release.*

## Data:

I will use the standard movies dataset, which I feel is a good benchmark for showing what Cypher combined with APOC is capable of.

`:play movies`

## Features:

We will use age and the number of movies a person worked on as our two dummy features, so we can run some of the new `apoc.algo` functions, which provide us with lots of cool algorithms.

    MATCH (p:Person)
    SET p.age = 2017 - p.born

    MATCH (p:Person)
    WITH p, size((p)-->(:Movie)) AS s
    SET p.count = s
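Before plotting anything, it can help to check that both properties were actually written. The query below is a minimal sketch that only assumes the `age` and `count` properties set above; the column aliases are my own naming.

```cypher
// Summarise the two features.
// Aggregates skip nulls, so people without a birth year do not affect the age columns.
MATCH (p:Person)
RETURN count(p)     AS people,
       count(p.age) AS people_with_age,
       min(p.age)   AS min_age,
       max(p.age)   AS max_age,
       min(p.count) AS min_movies,
       max(p.count) AS max_movies;
```

If `people_with_age` is lower than `people`, some `:Person` nodes have no `born` property and therefore no `age`.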

Let's draw the distributions with spoonJS, which you can easily attach to the Neo4j browser. It adds chart visualization capabilities, which are very useful and easy to use.

## Age distribution:

We run this query and visualize the results.

    // A few :Person nodes have no age property, so we must filter them out
    MATCH (p:Person)
    WHERE exists(p.age)
    RETURN DISTINCT p.age AS age, count(*) AS count
    ORDER BY count DESC

## Number of movies distribution:

    MATCH (p:Person)
    RETURN DISTINCT p.count AS number_of_movies, count(*) AS count
    ORDER BY count DESC

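As we can easily tell, most people are between 40 and 60 years old and have worked on a movie or two. Age spans tens of years while the movie count is mostly one or two, so unscaled age would dominate any distance. If we want to use both together as features, we should apply some sort of normalization.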
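The age range can also be checked without a chart by grouping ages into 10-year buckets. This is only a sketch: it assumes `age` is an integer (it is when `born` is an integer, as in the movies data), and `age_bucket` and `people` are names I made up for the output columns.

```cypher
// p.age / 10 is integer division, so it drops the last digit of the age.
MATCH (p:Person)
WHERE p.age IS NOT NULL
RETURN (p.age / 10) * 10 AS age_bucket, count(*) AS people
ORDER BY age_bucket;
```

Each row then covers one decade, for example `40` stands for ages 40 to 49.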

## Normalization:

### Version 1:

I came up with a simple function that mimics my understanding of min-max scaling: it rescales age so that its span matches the span of the movie count.

    // Filter out outliers
    MATCH (p:Person)
    WHERE p.age > 25 AND p.count < 10
    // Get the span of each feature
    WITH max(p.age) - min(p.age) AS age_span,
         max(p.count) - min(p.count) AS count_span
    WITH toFloat(age_span) / toFloat(count_span) AS coefficient
    MATCH (p1:Person)
    SET p1.age_nor = p1.age / coefficient
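To see what the coefficient does, here is the same arithmetic on a few made-up numbers. The ages and counts below are toy values, not taken from the movies data, and the query touches no nodes.

```cypher
// Toy values for illustration only.
WITH [30, 50, 70] AS ages, [1, 3, 9] AS counts
UNWIND range(0, size(ages) - 1) AS i
WITH max(ages[i]) - min(ages[i])     AS age_span,
     max(counts[i]) - min(counts[i]) AS count_span
WITH toFloat(age_span) / count_span AS coefficient
UNWIND [30, 50, 70] AS age
RETURN age, coefficient, age / coefficient AS age_nor;
```

The age span is 40 and the count span is 8, so the coefficient is 5.0 and the ages become 6.0, 10.0 and 14.0. The rescaled ages now span 8, the same as the counts, although they are not shifted to start at zero.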

### Version 2:

You can also just normalize each feature to the range between 0 and 1. Example for one feature:

    // Filter out outliers
    MATCH (p:Person)
    WHERE p.age > 25 AND p.count < 10
    // Get the max and min value
    WITH max(p.age) AS max, min(p.age) AS min
    MATCH (p1:Person)
    // Normalize
    SET p1.age_nor = (1.0 * p1.age - min) / (max - min)
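The second feature can be scaled the same way. The sketch below repeats the query for the movie count; `count_nor` is a hypothetical property name I picked to mirror `age_nor`, and the outlier filter is copied unchanged from the age example.

```cypher
// count_nor is a hypothetical property name, chosen to mirror age_nor.
MATCH (p:Person)
WHERE p.age > 25 AND p.count < 10
WITH max(p.count) AS max_count, min(p.count) AS min_count
MATCH (p1:Person)
SET p1.count_nor = (1.0 * p1.count - min_count) / (max_count - min_count);
```

Note that the min and max come from the filtered people, while the `SET` runs on every `:Person`, so anyone outside the filter can end up with a value below 0 or above 1.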

I used the first version in all the visualizations below.

## kNN queries:

### Cosine similarity:

    MATCH (p1:Person), (p2:Person)
    WHERE id(p1) < id(p2) AND exists(p1.age) AND exists(p2.age)
    WITH p1, p2, apoc.algo.cosineSimilarity([p1.count, p1.age_nor], [p2.count, p2.age_nor]) AS value
    MERGE (p1)-[s:SIMILARITY]-(p2)
    SET s.cosine = value
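For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes that in plain Cypher on two toy vectors, so it runs without APOC; it is meant as a way to sanity-check values, not as a description of how APOC implements the function.

```cypher
// Cosine similarity of two toy vectors: dot(a, b) / (|a| * |b|).
WITH [1.0, 2.0] AS a, [2.0, 4.0] AS b
WITH reduce(acc = 0.0, i IN range(0, size(a) - 1) | acc + a[i] * b[i]) AS dot_product,
     sqrt(reduce(acc = 0.0, x IN a | acc + x * x)) AS norm_a,
     sqrt(reduce(acc = 0.0, x IN b | acc + x * x)) AS norm_b
RETURN dot_product / (norm_a * norm_b) AS cosine;
```

The two vectors point in the same direction, so the result is close to 1. Since both of our features are non-negative, the similarity between any two people falls between 0 and 1.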

### Query for distribution:

We need to bucket the values a bit for a better visualization.

    // Round to two decimals and count the relationships in each bucket
    MATCH ()-[s:SIMILARITY]->()
    RETURN round(s.cosine * 100) AS bucket, count(*) AS count
    ORDER BY bucket

### Euclidean distance:

    MATCH (p1:Person), (p2:Person)
    WHERE id(p1) < id(p2) AND exists(p1.age) AND exists(p2.age)
    WITH p1, p2, apoc.algo.euclideanDistance([p1.count, p1.age_nor], [p2.count, p2.age_nor]) AS value
    MERGE (p1)-[s:SIMILARITY]-(p2)
    SET s.e_distance = value
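Euclidean distance is the square root of the summed squared differences between the two vectors. A minimal plain-Cypher sketch on toy vectors, handy for checking a stored value by hand:

```cypher
// Differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.0.
WITH [1.0, 2.0] AS a, [4.0, 6.0] AS b
RETURN sqrt(reduce(acc = 0.0, i IN range(0, size(a) - 1) | acc + (a[i] - b[i]) ^ 2)) AS distance;
```

Unlike cosine similarity, this measure is sensitive to the scale of each feature, which is why the normalization step matters here.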

### Euclidean similarity:

    MATCH (p1:Person), (p2:Person)
    WHERE id(p1) < id(p2) AND exists(p1.age) AND exists(p2.age)
    WITH p1, p2, apoc.algo.euclideanSimilarity([p1.count, p1.age_nor], [p2.count, p2.age_nor]) AS value
    MERGE (p1)-[s:SIMILARITY]-(p2)
    SET s.e_similarity = value
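Once the similarities are stored on the relationships, the actual nearest-neighbour lookup is a simple sort. The sketch below lists the five people most similar to one person; the starting person and `k = 5` are arbitrary choices of mine for illustration.

```cypher
// k = 5 and the starting person are arbitrary choices for illustration.
MATCH (p:Person {name: 'Tom Hanks'})-[s:SIMILARITY]-(other:Person)
WHERE s.e_similarity IS NOT NULL
RETURN other.name AS neighbour, s.e_similarity AS similarity
ORDER BY similarity DESC
LIMIT 5;
```

The pattern is undirected because each pair was merged only once, so a person can sit on either end of the relationship. To rank by distance instead, sort on `s.e_distance` in ascending order.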

## Conclusion:

With the help of APOC and spoonJS we can easily run graph algorithms and quickly visualize the results, to get a feeling for what the data looks like. You do not need any external tools for simple chart visualizations, which is pretty amazing. APOC holds more graph algorithms, so the next blog post will probably come soon. Stay tuned.
