Article recommendation system on a citation network using Personalized Pagerank and Neo4j

Recently personalized pageRank was added to Neo4j graph algorithms library and I was looking for a cool example to demonstrate its applications. I found a great example in an Efficient Algorithms for Personalized PageRank research paper written by Peter Lofgren, in which he states:

As an example of how changing the source of the Personalized PageRank algorithm results in different rankings, we consider personalized search on a citation graph. On a citation graph provided by Citeseer, we created a demo which given a keyword and researcher name, finds all papers containing that keyword, and ranks them from the point of view of the given researcher. For a given researcher, we find all papers by that researcher, and define a source distribution giving equal weight to those papers. We then use Personalized PageRank on the citation graph to rank papers matching the given keyword. As an example, the keyword “entropy” means different things to different researchers, so we compare the top results for keyword “entropy” from different points of view.
Let’s recreate this scenario with Neo4j!

Requirements:

Once you have downloaded all the plugins and put them in the Neo4j plugins folder you have to enable them by adding the following configuration to the Neo4j.conf file.

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
dbms.security.procedures.whitelist=ga.nlp.*,algo.*,apoc.*
dbms.security.procedures.unrestricted=apoc.*,algo.*
apoc.import.file.enabled=true

Graph model

graph_modely.png

We start with a simple graph model consisting of nodes labeled Author, which can have one or more relationships AUTHOR to nodes labeled Article. At the same time Article can have a relationship REFERENCE to other nodes labeled Article, indicating the reference.

Graph schema

To optimize our queries and make full use of indexes, we define the graph schema with unique constraints for label Article on “index” property and label Author on “name” property.

CALL apoc.schema.assert(
{},
{Article:['index'],Author:['name']})

Import data

We use the citation network v10 made available on aminer.org. It is the latest version of the dataset and more importantly first version stored as JSON.

Find more about the dataset in the ArnetMiner: Extraction and Mining of Academic Social Networks research paper[1].

Importing data into Neo4j is done in two steps. First, we import all articles and their authors. In the second step, we import all the references of the articles.
We use apoc.periodic.iterate procedure for batch importing.

Import articles and authors

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file 
   CALL apoc.load.json("file:///neo4j/import/" + file) 
   yield value return value',
  'MERGE (a:Article{index:value.id}) 
   ON CREATE SET a += apoc.map.clean(value,["id","authors","references"],[0]) 
   WITH a,value.authors as authors 
   UNWIND authors as author 
   MERGE (b:Author{name:author}) 
   MERGE (b)-[:AUTHOR]->(a)'
,{batchSize: 10000, iterateList: true})

Import references

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file 
   CALL apoc.load.json("file:///neo4j/import/" + file) 
   yield value return value',
  'MERGE (a:Article{index:value.id}) 
   WITH a,value.references as references 
   UNWIND references as reference 
   MERGE (b:Article{index:reference}) 
   MERGE (a)-[:REFERENCES]->(b)'
,{batchSize: 10000, iterateList: true})

PageRank algorithm

PageRank was designed to analyze the importance of website pages. It takes into account the number of links a website has and also the quality of those links. It’s quite different if for example, a website has a link from the first page of Reddit or for example my blog 🙂

The process can be easily translated to the citation network domain. The citation itself can be viewed as an “upvote” from one article to another. Which article has the most “upvotes” from other quality articles is a question PageRank algorithm can help us answer.

We use pageRank algorithm on the global citation network to find the most important and influential articles within our graph.

Run pageRank and store results as a property of nodes

CALL algo.pageRank('Article', 'REFERENCES')

Most important articles by pagerank

MATCH (a:Article)
RETURN a.title as article,
       a.pagerank as score
ORDER BY score DESC 
LIMIT 10
Results

global_pagerank

Natural language processing

We want to recommend articles by keyword selection, so we need to extract keywords from our graph. Thanks to Graphaware’s NLP plugin this is a simple process even if you have little to no understanding what is going on beneath the hood of NLP algorithms.

The process will enrich our graph schema with additional node labels and relationship types as shown in the below picture.

ciata

Define NLP graph schema

To optimize the NLP process we define a set of unique constraints and indexes neatly available as a single procedure.

CALL ga.nlp.createSchema()

Add pipeline

Define the configuration of our pipeline. Find more details in the documentation.

CALL ga.nlp.processor.addPipeline({
  textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', 
  name: 'defaultPipeline', 
  threadNumber: 4
  processingSteps: {tokenize: true, 
                     ner: true, 
                     dependency: false}})

Set default pipeline

CALL ga.nlp.processor.pipeline.default('defaultPipeline')

Annotate text

The original text is broken down into words, parts of speech, and functions. This analysis of the text acts as a starting point for the later steps.[2]

If you want to learn more about it I suggest you to check out Reverse Engineering Book Stories with Neo4j and GraphAware NLP blog post written by Christophe Willemsen.

CALL apoc.periodic.iterate(
  "MATCH (n:Article) WHERE exists (n.title) RETURN n",
  "CALL ga.nlp.annotate({text: n.title, id: id(n)}) 
   YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)",
{batchSize:1, iterateList:true})

Keyword extraction

The TextRank algorithm is a relatively simple, unsupervised method of text summarization directly applicable to the topic extraction task. Its objective is to retrieve keywords and construct key phrases that are most descriptive of a given document by building a graph of word co-occurrences and ranking the importance of individual words using the PageRank algorithm.

Taken from Efficient unsupervised keywords extraction using graphs[3]

CALL apoc.periodic.iterate(
  "MATCH (a:AnnotatedText) RETURN a",
  "CALL ga.nlp.ml.textRank({annotatedText: a}) YIELD result
   RETURN distinct 'done' ",
{batchSize:1,iterateList:true}

Get the 10 most occurring keywords in the articles titles

Lets check which are the keywords that describe the most articles in our graph.

MATCH (k:Keyword)-[:DESCRIBES]->()
WHERE k.numTerms > 1
RETURN k.value as Keyphrase, 
       count(*) AS n_articles
ORDER BY n_articles DESC
LIMIT 10
Results

keyphrase.png

Basic article recommendation

If you followed this blog post step by step, you are now able to create a basic article recommendation system based on the global pageRank score of articles and keywords extracted from the NLP process.

Recommend top ten articles described by the keyword “social networks“.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
RETURN a.title as title, a.pagerank as p
ORDER BY p DESC
LIMIT 10
Results

basic

Personalized Pagerank algorithm

Personalized PageRank algorithm returns the pageRank score of nodes in a graph from the point of view of one or more given source node.

Let’s calculate the global pageRank again, but this time using articles described by keyword “social networks” as source nodes.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
WITH collect(a) as articles
CALL algo.pageRank.stream('Article', 'REFERENCES', {sourceNodes: articles})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score
Results

pes

The anatomy of a large-scale hypertextual Web search engine research paper by Sergey Brin and Larry Page comes out top. We could conclude that Google’s early work with graphs and pageRank had a big effect on social network research papers. Cool easter egg 🙂

Personalized recommendation system

If you remember the goal of this blog post was to recreate the following example.

As an example, the keyword “entropy” means different things to different researchers, so we compare the top results for keyword “entropy” from different points of view.

We start by matching all articles written by a specific author. We collect them as we will use them as source nodes in the personalized pageRank algorithm. We then run the algorithm and project as nodes only articles that are described by the keyword “entropy”. We also project the REFERENCES relationships between these articles.

We don’t have to filter out any relationships in the relationship statement as the following rule is enforced while projecting with cypher projection.

Relationships described in the relationship-statement will only be projected if both source and target nodes are described in the node-statement. Relationships that don’t have both source and target nodes described in the node-statement will be ignored.

Taken from documentation

Example recommendation

Recommendation of articles described by keyword “entropy” from the point of view of Jose C. Principe.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE author.name="Jose C. Principe"
WITH collect(a) as articles
CALL algo.pageRank.stream(
  'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword) 
   WHERE k.value contains "entropy" RETURN distinct id(a) as id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score
Results

blogy1.png

Recommendation of articles described by keyword “entropy” from the point of view of Hong Wang.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE author.name="Hong Wang"
WITH collect(a) as articles
CALL algo.pageRank.stream(
  'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword) 
   WHERE k.value contains "entropy" RETURN distinct id(a) as id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score
Results

blogy4.png

Conclusion

As expected we get different recommendations for the two authors as their point of view is different.

Neo4j is a powerful tool, but knowing when and which plugins to use in a specific use case, makes it an even greater tool.

Register now for your copy of the O’Reilly book, Graph Algorithms: Practical Examples in Apache Spark and Neo4j by Mark Needham and Amy E. Hodler.

Social-AlgoTwitter2 .png

References

[1] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp.990-998. [PDF] [Slides] [System] [API]

[2] https://github.com/graphaware/neo4j-nlp#pipelines-and-components

[3] https://graphaware.com/neo4j/2017/10/03/efficient-unsupervised-topic-extraction-nlp-neo4j.html

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s