The latest release of the APOC plugin adds some new graph algorithms. One of them is a kNN algorithm, which is cool and easy to use. A few months ago I wrote my own kNN Euclidean distance algorithm in Cypher. It worked, but it was slow, because you are basically computing a Cartesian product. I was pleasantly surprised by how fast the APOC version is.
Download the latest APOC release.
I will use the standard movies dataset, which I feel makes a good benchmark, to show what Cypher combined with APOC is capable of.
We will use age and the number of movies a person worked on as our two dummy features, so we can run some of the new apoc.algo functions, which provide us with lots of cool algorithms.
MATCH (p:Person) set p.age = 2017 - p.born
MATCH (p:Person) with p,size((p)-->(:Movie)) as s set p.count = s
Let's draw the distributions with spoonJS, which you can easily attach to the Neo4j browser. It augments the browser with chart visualization capabilities, which are very useful and easy to use.
We run this query and visualize the results.
// A few :Person nodes have no age property, so we must filter them out
MATCH (p:Person) where exists(p.age)
return distinct(p.age) as age,count(*) as count order by count desc
Number of movies distribution:
MATCH (p:Person) return distinct(p.count) as number_of_movies, count(*) as count order by count desc
As we can easily tell, most persons are between 40 and 60 years old and have played in a movie or two. If we want to use both properties together as features, we should apply some sort of normalization.
I came up with a simple function that mimics what I understand min-max scaling to do.
// filter out outliers
MATCH (p:Person) where p.age > 25 and p.count < 10
// get the span
WITH max(p.age) - min(p.age) as age_span,max(p.count) - min(p.count) as count_span
WITH toFloat(age_span) / toFloat(count_span) as coefficient
MATCH (p1:Person)
SET p1.age_nor = p1.age / coefficient
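To make the arithmetic of this rescaling easier to follow outside the database, here is a minimal Python sketch of the same steps. The sample people are made up for illustration, and the helper name `span_coefficient` is hypothetical; it only mirrors the query and is not part of APOC.

```python
def span_coefficient(people):
    """Ratio of the age span to the movie-count span, ignoring outliers."""
    # Same outlier filter as the query: age > 25 and count < 10.
    kept = [p for p in people if p["age"] > 25 and p["count"] < 10]
    ages = [p["age"] for p in kept]
    counts = [p["count"] for p in kept]
    age_span = max(ages) - min(ages)
    count_span = max(counts) - min(counts)
    return age_span / count_span


# Made-up sample data, not taken from the movies dataset.
people = [
    {"age": 30, "count": 1},
    {"age": 50, "count": 3},
    {"age": 70, "count": 5},
    {"age": 20, "count": 12},  # outlier: skipped when computing the spans
]

coefficient = span_coefficient(people)
# Like the query, every person is rescaled, outliers included.
age_nor = [p["age"] / coefficient for p in people]
```

Dividing age by the coefficient shrinks the age span to the size of the count span, so neither feature dominates a distance calculation purely because of its units.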
You can also just normalize each feature to between 0 and 1. Example for one feature:
// filter out outliers
MATCH (p:Person) where p.age > 25 and p.count < 10
// get the max and min value
WITH max(p.age) as max,min(p.age) as min
MATCH (p1:Person)
// normalize
SET p1.age_nor = (1.0 * p1.age - min) / (max - min)
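As a quick illustration of the 0-to-1 normalization, here is a small Python sketch of min-max scaling. The ages are made up, and `min_max` is a hypothetical helper written for this example.

```python
def min_max(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [30, 40, 50, 70]  # made-up ages
scaled = min_max(ages)   # 30 maps to 0.0 and 70 maps to 1.0
```

Note that the query takes min and max from the filtered persons only but writes age_nor on every person, so the outliers that were filtered out can end up slightly below 0 or above 1.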
I used the first version in all the visualizations below.
MATCH (p1:Person),(p2:Person) where id(p1) < id(p2) and exists(p1.age) and exists(p2.age) WITH p1,p2,apoc.algo.cosineSimilarity([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value MERGE (p1)-[s:SIMILARITY]-(p2) SET s.cosine = value
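For intuition about what gets stored on each relationship, here is a Python sketch of the textbook cosine similarity, applied to every unordered pair once (the role `id(p1) < id(p2)` plays in the query). I am assuming apoc.algo.cosineSimilarity follows this standard definition; the vectors are made-up `[count, age_nor]` pairs.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up feature vectors, keyed by a hypothetical person label.
vectors = {"a": [3, 4], "b": [6, 8], "c": [4, 3]}

# combinations() yields each unordered pair exactly once.
scores = {
    (m, n): cosine_similarity(vectors[m], vectors[n])
    for m, n in combinations(sorted(vectors), 2)
}
```

Here "a" and "b" point in the same direction, so their similarity is 1 even though "b" has twice the values. Cosine similarity only compares the ratio between the features, not their magnitude, which is worth keeping in mind when reading the distribution.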
Query for distribution:
We need to bucketize a bit for a better visualization.
MATCH ()-[s:SIMILARITY]->() RETURN round(s.cosine * 100) as bucket,count(*) as count order by bucket
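The bucketing idea itself is simple: multiply by 100, round to a whole number, and count how many values land in each bucket. A minimal Python sketch with made-up cosine values follows; `bucket` is a hypothetical helper.

```python
import math
from collections import Counter


def bucket(value):
    """Map a similarity in [0, 1] to a whole-percent bucket."""
    # Python's built-in round() rounds ties to even, so floor(x + 0.5) is
    # used to round ties up instead. I assume this matches Cypher's round()
    # for the positive values we have here.
    return math.floor(value * 100 + 0.5)


cosines = [0.961, 0.964, 0.98, 1.0, 1.0]  # made-up similarity values
histogram = Counter(bucket(c) for c in cosines)
```

With one-percent buckets the chart has at most 101 bars, which is much easier to read than one bar per distinct floating-point value.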
MATCH (p1:Person),(p2:Person) where id(p1) < id(p2) and exists(p1.age) and exists(p2.age) WITH p1,p2,apoc.algo.euclideanDistance([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value MERGE (p1)-[s:SIMILARITY]-(p2) SET s.e_distance = value
MATCH (p1:Person),(p2:Person) where id(p1) < id(p2) and exists(p1.age) and exists(p2.age) WITH p1,p2,apoc.algo.euclideanSimilarity([p1.count,p1.age_nor],[p2.count,p2.age_nor]) as value MERGE (p1)-[s:SIMILARITY]-(p2) SET s.e_similarity = value
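To show how the two Euclidean measures relate, here is a short Python sketch. The distance is the standard straight-line distance. For the similarity I am assuming the common `1 / (1 + distance)` convention, which is my understanding of what apoc.algo.euclideanSimilarity does; treat that formula as an assumption rather than a statement about the library. The vectors are made up.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a, b):
    """Assumed convention: 1 for identical vectors, approaching 0 as they move apart."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


p1 = [1, 2]  # made-up [count, age_nor] vectors
p2 = [4, 6]

distance = euclidean_distance(p1, p2)
similarity = euclidean_similarity(p1, p2)
```

Unlike cosine similarity, both measures depend on the magnitude of the features, which is why normalizing age against count matters before running them.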
With the help of APOC and spoonJS we can easily run graph algorithms and quickly visualize the results, which helps us get a feel for what the data looks like. You do not need any external tools for simple chart visualizations, which is pretty amazing. APOC holds more graph algorithms, so the next blog post will probably come soon. Stay tuned.