Learn how to import, clean, and analyze the ArXiv dataset in Neo4j. In the last step, you will learn how to create a search and recommendation engine for articles.

In Europe, we are deep in the second wave of Covid lockdown. I’ve seen some motivational speakers talk about using this time to learn a new skillset. As a child, I always liked nuclear experiments, so I decided to build a reactor in my basement and try some experiments. I’ve already got a basement, so now I only need to learn nuclear physics or maybe get some nuclear researchers to help me out.

I got the idea from Estelle Scifo, who imported and analyzed the ArXiv dataset in Neo4j. We’ll take a detailed look at the nuclear experiments category of articles, try to learn some physics, and find some potential collaborators. As we’ll need to reference some papers in the future, we’ll also build a search and recommendation engine for articles to help us in our future endeavors.

As mentioned, we will analyze the ArXiv dataset available on Kaggle. I have prepared a Jupyter Notebook so that you can easily follow along. We will first do a little bit of data preprocessing and then import the data into Neo4j. Next, we will clean the article’s authors so that our exploratory graph and network analysis will be more accurate. Along the way, we will enrich our graph using the internal citation dataset available on GitHub. To finish this blog post, we will create the backend for a search and recommendation application of ArXiv articles.

#### Agenda:

1. Data preprocessing
2. Data cleaning
3. Exploratory graph analysis
4. Co-authorship network analysis
5. Citation network analysis
6. Prepare an article search engine
7. Develop an article recommendation engine

#### Data preprocessing

We will assume the ArXiv dataset has been downloaded and extracted into the analysis folder. First, we will load the articles in the nuclear experiments category into a Pandas data frame. The following script also preprocesses the authors’ names and the publication dates of the papers.

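import json
from datetime import datetime
import pandas as pd

articles = []
category = 'nucl-ex'
# Assumption: the extracted Kaggle file is named arxiv-metadata-oai-snapshot.json
with open('arxiv-metadata-oai-snapshot.json') as f:
    for l in f:
        d = json.loads(l)  # one JSON document per line
        if category in d['categories'].split(' '):
            # get_clean_authors is a helper defined elsewhere (not shown in this post)
            d['clean_authors'] = get_clean_authors(d['authors_parsed'])
            articles.append(d)

articles_df = pd.DataFrame.from_records(articles)
articles_df['created_date'] = [datetime.strptime(date[0]['created'].split(',')[1],' %d %b %Y %H:%M:%S %Z') for date in articles_df['versions']]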
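
The script above relies on `get_clean_authors`, which is not shown in this post. Here is a minimal sketch of what such a helper might look like, assuming each entry of `authors_parsed` is a `[last name, first names, suffix]` list. The function body and the `parse_created` helper are illustrative assumptions, not the notebook’s actual code.

```python
from datetime import datetime


def get_clean_authors(authors_parsed):
    """Hypothetical helper: join the non-empty parts of each parsed author name."""
    clean = []
    for parts in authors_parsed:
        name = ' '.join(p.strip() for p in parts if p and p.strip())
        if name:
            clean.append(name)
    return clean


def parse_created(created):
    """Parse a timestamp such as 'Mon, 2 Apr 2007 19:18:42 GMT' (same format string as above)."""
    return datetime.strptime(created.split(',')[1], ' %d %b %Y %H:%M:%S %Z')


print(get_clean_authors([['Lebedev', 'A.', ''], ['ALICE Collaboration', '', '']]))
print(parse_created('Mon, 2 Apr 2007 19:18:42 GMT').date())
```

Keeping the whole collaboration string as a single author name is exactly what leads to the cleanup work later in the post.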

If we take a look at a random abstract, we can see that the text is in LaTeX format:

This result confirms that dominant contributions to the electric\nand magnetic polarizabilities may be represented in terms of two-photon\ncouplings to the $\\sigma$-meson having the predicted mass $m_\\sigma=666$ MeV\nand two-photon width $\\Gamma_{\\gamma\\gamma}=2.6$ keV.

We will convert both the title and the abstract from LaTeX to plain Unicode text using the pylatexenc library.

from pylatexenc.latex2text import LatexNodes2Text

# LaTeX to Unicode text
clean_abstract = []
clean_title = []
for i, a in articles_df.iterrows():
    # Clean title; fall back to the raw title if the LaTeX cannot be parsed
    try:
        clean_title.append(LatexNodes2Text().latex_to_text(a['title']).replace('\n', ' ').strip())
    except Exception:
        clean_title.append(a['title'].replace('\n', ' ').strip())
    # Clean abstract; fall back to the raw abstract if the LaTeX cannot be parsed
    try:
        clean_abstract.append(LatexNodes2Text().latex_to_text(a['abstract']).replace('\n', ' ').strip())
    except Exception:
        clean_abstract.append(a['abstract'].replace('\n', ' ').strip())
articles_df['clean_abstracts'] = clean_abstract
articles_df['clean_title'] = clean_title

We got rid of the LaTeX formatting and now have the title and abstract as plain Unicode text. The same abstract as before now looks like this:

This result confirms that dominant contributions to the electric and magnetic polarizabilities may be represented in terms of two-photon couplings to theσ-meson having the predicted massm_σ=666MeV and two-photon widthΓ_γγ=2.6keV.

#### Graph import

After some initial data wrangling, we can go ahead and import the data into Neo4j. The graph schema is quite simple. We have Article and Author nodes, and a WROTE relationship between the two, indicating who are the authors of a particular article.

As with most Neo4j projects, it is advisable first to define the unique constraints for nodes. This way, the import will be faster as the unique identifiers of the nodes will be indexed. We will create unique constraints for both the Article and the Author nodes.

CREATE CONSTRAINT article IF NOT EXISTS ON (a:Article) ASSERT a.id IS UNIQUE;
CREATE CONSTRAINT authors IF NOT EXISTS ON (a:Author) ASSERT a.name IS UNIQUE;

The dataset and the unique constraints are ready, so we can go ahead and import the graph. An important thing to note is that when you import data with your favorite scripting language and the Neo4j driver, you never want to run a separate transaction for each row. Instead, batch several rows together into a single transaction. In our case, we batch 2000 rows into each transaction.

import_query = """
UNWIND $data as row
CREATE (a:Article)
SET a += apoc.map.clean(row,['authors'],[])
SET a.date = date(row['date'])
WITH a, row.authors as authors
UNWIND authors as author
MERGE (au:Author{name:author})
MERGE (au)-[:WROTE]->(a)
"""

import_data = []
session = driver.session()
for i, row in articles_df.iterrows():
    import_data.append({'id':row['id'],
                        'title':row['clean_title'],
                        'abstract':row['clean_abstracts'],
                        'date':row['created_date'].strftime('%Y-%m-%d'),
                        'authors':row['clean_authors']})
    # Batch import by 2000 lines
    if ((i % 2000) == 0) and (len(import_data) != 0):
        session.run(import_query, {'data':import_data})
        import_data = []

session.run(import_query, {'data':import_data})
session.close()

#### Data Cleaning

If we take a look at our graph in Neo4j Browser, we should see the same schema as we defined before. We’ll quickly see why this part is titled Data Cleaning. Let’s begin by examining the most active authors in the Nuclear Experiments category.

MATCH (a:Author)
RETURN a.name as author,
       size((a)-[:WROTE]->()) as number_of_articles
ORDER BY number_of_articles DESC
LIMIT 10

Results

I wasn’t aware of this before, but it looks like a collaboration such as ALICE or STAR can be an author of an article. Let’s explore the collaborations a bit more and take a look at the most active collaborations overall.

MATCH (a:Author)
WHERE a.name contains "Collaboration"
RETURN a.name as author,
       size((a)-[:WROTE]->()) as number_of_articles
ORDER BY number_of_articles DESC
LIMIT 10

Results

We already knew that the ALICE and STAR collaborations are the most active, but there are many more collaborations in the nuclear experiment category. Unfortunately, the data appears dirty, as there are two STAR collaboration variations and two CLAS collaboration variations. We’ll do some data cleaning of the Author entities before we continue with the exploratory graph analysis.

Data cleaning is the most fun part of any analysis.
Well, not really, but it is a vital part of the research if we want to rely on the data. We’ll introduce a secondary label for all collaboration nodes. Again, we will define a unique constraint for optimization and indexing.

CREATE CONSTRAINT collaboration IF NOT EXISTS ON (c:Collaboration) ASSERT c.id IS UNIQUE;

Let’s take a look at the potential candidates that could be merged into a single entity. We will remove stop words like “the”, “collaboration”, and “for” from the author’s name and examine how well this simple technique works.

WITH ["The","the","Collaboration", "collaboration", "for", "For", "\n","on behalf of","By "] as remove_words
MATCH (a:Author)
WHERE a.name contains "ollaboration"
WITH trim(reduce(v = a.name, word in remove_words | apoc.text.replace(v, word, ' '))) as author,
     collect(a.name) as nodes
WHERE size(nodes) > 1
RETURN author, nodes
ORDER BY author
LIMIT 10

Results

It appears that this simple technique found many candidates to be merged. It also found examples where both an author and a collaboration are mentioned as a single author. An example would be Adhikari K. P. for the CLAS Collaboration. We will first merge these candidates as presented in the previous query and go from there.

WITH ["The","the","Collaboration", "collaboration", "for", "For", "\n","on behalf of","By "] as remove_words
MATCH (a:Author)
WHERE (a.name contains "ollaboration" OR a.name IN ["INDRA","H1"])
  AND NOT a.name CONTAINS "ollaborations"
WITH trim(reduce(v = a.name, word in remove_words | apoc.text.replace(v, word, ' '))) as collab,
     collect(a) as nodes
CALL apoc.refactor.mergeNodes(nodes) YIELD node
SET node.name = collab, node:Collaboration
RETURN distinct 'done' as result

Next, we can decouple the entities where both the author and the collaboration are present as a single entity. We will start with a simple assumption that the last word in the author’s name represents the collaboration.
We will compare the last word of the author’s name with the existing collaborations in our graph. If the last word matches an existing collaboration, we will decouple the single author node into collaboration and author nodes.

MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
// get the last word of the name
WITH c, split(c.name,' ')[-1] as collab
// Match existing collaboration
MATCH (c1:Collaboration{name:collab})
RETURN trim(replace(c.name, collab, '')) as author,
       c1.name as collab
LIMIT 5

Results

This query only looked at intermediate results but did not store the decoupled relationships. The decoupling results are satisfactory, so we will go ahead and save the decoupled relationships back to the graph.

MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
WITH c, split(c.name,' ')[-1] as collab
MATCH (new_colab:Collaboration{name:collab})
WITH c, new_colab, trim(replace(c.name, collab, '')) as author
MATCH (c)-[:WROTE]->(article:Article)
MERGE (new_author:Author{name:author})
MERGE (new_colab)-[:WROTE]->(article)
MERGE (new_author)-[:WROTE]->(article)
WITH distinct c
DETACH DELETE c
RETURN distinct 'done' as result

Let’s examine the potential collaboration candidates that our previous decoupling technique missed.

MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
RETURN c.name as author
LIMIT 6

Results

Some collaboration names are longer than a single word, like “Jefferson Lab Hall A”, and we have missed those. Some others weren’t tagged as collaborations before because the text does not explicitly label them as such. With a bit of manual work, I have listed a number of other collaborations that need to be decoupled as well.

UNWIND ["Jefferson Lab Hall A", "NA61/SHINE", "Crystal Ball at MAMI", "WA98", "A2 at MAMI",
        "JETSCAPE", "DLS", "WASA-at-COSY", "CERES/NA45", "ALADIN'2000",
        "AMADEUS", "CB-ELSA", "GRAAL", "KEK-PS E559",
        "CELSIUS-WASA","RHIC Spin", "EIC", "LSSS", "COSY-11", "PAX", "Hall A", "LPC-CHARISSA-DEMON",
        "Crystal Ball at MAMI", "KLOE-2", "Graal", "HAL QCD", "MAJORANA.",
        "Daya Bay","UConn-Yale-TUNL-Weizmann-PTB-UCL", "KamLAND-Zen", "COMPASS", "PREX", "ALICE HLT"] as collab
MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
  AND NOT c.name in ["Jefferson Lab Hall A", "Hall A DVCS", "Hall A", "HAL QCD"]
  AND c.name contains collab
WITH c, trim(replace(c.name,collab,'')) as author, collab
MERGE (new_author:Author{name:author})
MERGE (new_collab:Author{name:collab})
SET new_collab:Collaboration
WITH c, new_author, new_collab
MATCH (c)-[:WROTE]->(article:Article)
MERGE (new_author)-[:WROTE]->(article)
MERGE (new_collab)-[:WROTE]->(article)
WITH distinct c
DETACH delete c
RETURN distinct 'done' as result

Last, but not least, “Jefferson Lab Hall A” also shows up as “Hall A DVCS” and “Hall A”, so we’ll merge them into a single entity as well.

MATCH (c:Collaboration)
WHERE c.name contains "Hall A"
WITH c ORDER BY c.name
WITH collect(c) as nodes
CALL apoc.refactor.mergeNodes(nodes, {properties: {name:'overwrite'}}) YIELD node
RETURN distinct 'done' as result

We have cleaned the entities where a single collaboration or a pair of a collaboration and an author appeared. The dataset also has entities where two or more collaborations occur along with an author. Because this is not a data cleaning workshop, we’ll skip cleaning this data.

MATCH (a:Author)
WHERE a.name CONTAINS "Collaborations"
RETURN a.name as author
LIMIT 5

Results

#### Exploratory graph analysis

Finally, we can start our exploratory graph analysis. We will begin by taking a look at the number of articles written by publication year.

MATCH (a:Article)
RETURN a.date.year as year, count(*) as articles
ORDER BY year

Results

The nuclear experiment category on ArXiv was created in 1994. Since then, it steadily rose in popularity until 2011, after which the number of articles written per year has plateaued for the last decade. We will again take a look at the most active authors but disregard any collaborations this time.

MATCH (a:Author)
WHERE NOT a:Collaboration
RETURN a.name as author,
       size((a)-[:WROTE]->()) as number_of_articles
ORDER BY number_of_articles DESC
LIMIT 10

Results

Lebedev A. remains the most active author, while Ma Y. G. and Li X. follow in second and third place. Another interesting statistic to look at is who has collaborated with the highest number of other authors.

MATCH (a:Author)-[:WROTE]->()<-[:WROTE]-(other)
WHERE NOT a:Collaboration AND NOT other:Collaboration
RETURN a.name as author,
       count(distinct other) as number_of_coauthors
ORDER BY number_of_coauthors DESC
LIMIT 10

Results

Wang Y. has collaborated with more than 5000 other authors. When I first observed this data, I thought for sure there was some error in the data. It turns out this is correct. The only explanation is that there are articles on which a great many scientists collaborated. We’ll take a look at the articles with the highest count of authors.

MATCH (a:Author{name:"Wang Y."})-[:WROTE]->(article)
RETURN article.id as article,
       size((article)<-[:WROTE]-()) as number_of_authors
ORDER BY number_of_authors DESC
LIMIT 10

Results

There are four articles where more than 800 researchers have collaborated. I’ve also checked the ArXiv page to verify this data, and it turns out to be valid. I can only applaud the coordinators of these articles, who managed to finish a paper with almost a thousand collaborators. Next, we’ll take a look at pairs of researchers that have collaborated the most.

MATCH (a:Author)-[:WROTE]->()<-[:WROTE]-(other)
WHERE id(a) < id(other) AND NOT a:Collaboration AND NOT other:Collaboration
RETURN a.name + ' with ' + other.name as authors,
       count(*) as number_of_collaborations
ORDER BY number_of_collaborations DESC
LIMIT 10

Results

Lebedev A. has worked on more than 200 articles with Li X. and around 180 articles with Pei H. In third place, we have Li X. and Pei H. We might assume that Lebedev A., Li X., and Pei H. are the powerhouse in the Nuclear Experiments category. We can also examine the most active authors by publication year for the last decade.

MATCH (a:Author)-[:WROTE]->(article:Article)
WHERE NOT a:Collaboration AND article.date.year > 2010
WITH article.date.year as year, a.name as author, count(*) as count
ORDER BY count DESC
RETURN year, collect(author)[..3] as most_active_authors
ORDER BY year

Results

#### Co-authorship network analysis

If we remember, the idea for this article came out of the necessity to find a potential collaborator for my basement project. After some basic graph exploration, it seems that one can’t do nuclear experiments on one’s own but has to join a collaboration or find a study to participate in with 800 other scientists. We will explore the co-authorship network with the help of the Neo4j Graph Data Science library. If you need a quick refresher on how the GDS library works, I would suggest the Introduction to graph algorithm course.

An important thing to note is that our network analysis of the co-authorship network does not consider how influential the papers the authors have written are. It can be seen as more of a social network analysis, where we assume that if two authors collaborated on an article, they know each other. This way, I can find authors with lots of networking influence to collaborate on my project and introduce me to various people and organizations.

We will use the Cypher projection feature to project the co-authorship network.
The co-authorship relationship between the authors does not exist in our stored graph, so it could be said that we will be projecting a virtual network where we reduce the 2-hop relationship to a direct relationship without having to save the direct relationship in Neo4j. We will also disregard nodes with the Collaboration label in our analysis.

CALL gds.graph.create.cypher('coauthorship-network',
  'MATCH (a:Author) WHERE NOT a:Collaboration RETURN id(a) as id',
  'MATCH (s:Author)-[:WROTE]->()<-[:WROTE]-(t:Author) WHERE NOT s:Collaboration AND NOT t:Collaboration RETURN id(s) as source, id(t) as target, count(*) as weight')

We will begin with the Weakly Connected Components algorithm. It is used to find disconnected components or islands within the graph.

CALL gds.wcc.write('coauthorship-network', {writeProperty:'wcc_coauthorship'})
YIELD componentCount, componentDistribution

Results

There is a total of 2281 components in our network. More than half of them consist of a single node, which indicates that the author in question has no co-authors. As with most real-world networks, we have a single super component that contains 85% of all nodes and then many smaller ones.

Next, we will execute the weighted variant of the PageRank algorithm to find the authors with the highest networking potential. We expect that authors who collaborated with other influential networking authors will rank the highest. We will store the results to Neo4j with the write mode of the algorithm.

CALL gds.pageRank.write('coauthorship-network',
  {writeProperty:'pagerank_coauthorship', maxIterations:20, relationshipWeightProperty:'weight'})
YIELD centralityDistribution

Let’s examine the authors with the highest networking potential.

MATCH (a:Author)
WHERE NOT a:Collaboration
RETURN a.name as author,
       a.pagerank_coauthorship as pagerank
ORDER BY pagerank DESC
LIMIT 10

Results

It seems that Zhang Y., Lebedev A., and Li X. have the highest networking potential.
If someone can get me their contact, I would appreciate it. We can also examine authors with the highest PageRank score grouped by connected component for the five largest components.

MATCH (a:Author)
WHERE NOT a:Collaboration
WITH a.wcc_coauthorship as componentId,
     a.pagerank_coauthorship as pagerank,
     a
ORDER BY pagerank DESC
RETURN componentId,
       count(*) as countOfMembers,
       collect(a.name)[..3] as mostInfluentialMembers
ORDER BY countOfMembers DESC
LIMIT 5

Results

More than anything, the WCC results alert us to the author data we haven’t cleaned. I assume the largest component, which consists of 46666 members, is the component with the cleaned author names. The second-largest component consists of author entities in which an author appears together with two collaborations. Last, there are a couple of associations or organizations I have missed in my manual cleanup process.

We will also use the approximate Betweenness centrality to observe who has the most influence over the information flow in the co-authorship network.

CALL gds.betweenness.stream('coauthorship-network', {samplingSize:1000})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as author,
       score as betweenness
ORDER BY betweenness DESC
LIMIT 10

Results

#### Citation network analysis

The time has come to enrich our graph with external data. The internal citation network between ArXiv articles is available on GitHub. You will need to download the internal references file. If you look closely at the Kaggle ArXiv dataset footnotes, this is the same tool they used to scrape the ArXiv articles and authors. It makes sense that the author information is dirty, as it was scraped from the PDFs. After we have downloaded the internal references dataset, we can import it using the following Python script.

import_citations_query = """
UNWIND $data as row
MATCH (source:Article{id:row.source})
WITH source, row.cites as cites
UNWIND cites as cite
MATCH (target:Article{id:cite})
MERGE (source)-[:CITES]->(target)
RETURN distinct 'done' as result
"""

import json

articles_id = set(articles_df['id'])
citations_params = []
session = driver.session()

with open('internal-references-pdftotext.json') as citation_file:
    # Load the mapping of article id -> list of cited article ids
    citations = json.load(citation_file)
    for article in citations:
        # Skip articles that are not in our nucl-ex data frame
        if article not in articles_id:
            continue
        citations_params.append({'source':article, 'cites': citations[article]})
        # Batch import by 1000 articles
        if (len(citations_params) % 1000) == 0:
            session.run(import_citations_query, {'data': citations_params})
            citations_params = []

# Import the remaining articles
session.run(import_citations_query, {'data': citations_params})
session.close()
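
Both import scripts in this post follow the same pattern: collect rows into a list, send the list as one `$data` parameter, and start over. Here is a minimal, driver-independent sketch of that pattern. The `batches` helper is a hypothetical name, and the list comprehension stands in for the `session.run` calls so that the example runs without a database.

```python
def batches(rows, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    # Emit the final, possibly smaller batch (and never an empty one)
    if batch:
        yield batch


# Stand-in for session.run(query, {'data': batch}): record each batch size
sent = [len(batch) for batch in batches(range(4500), 2000)]
print(sent)  # [2000, 2000, 500]
```

Counting the collected rows instead of relying on the data frame index keeps the batch size fixed even when the index is not a clean 0..n range.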

Let’s first count the number of internal citations we have imported.

MATCH ()-[r:CITES]->() RETURN count(*) as count_of_citations

There are 56172 citations between 20874 articles. The dataset was compiled on 30th April 2019, so we don’t have any citations after that date. We can now look at the most cited articles within our network.

MATCH (a:Article)
RETURN a.title as title,
       size((a)<-[:CITES]-()) as citations
ORDER BY citations DESC
LIMIT 10

Results

To my uneducated eye, it seems that the most cited articles talk about nuclear collisions. Next, we will calculate the ArticleRank score for each article in the citation network. ArticleRank is a variation of the PageRank algorithm. Find more details here.

First, we will project the citation network with the Native projection feature.

CALL gds.graph.create('citation-network', 'Article', 'CITES')

Now we can execute the ArticleRank and store the results back to Neo4j.

CALL gds.alpha.articleRank.write('citation-network',
  {writeProperty:'articlerank_citation'})

Let’s examine the ArticleRank algorithm results.

MATCH (a:Article)
RETURN a.title as title,
       a.articlerank_citation as articlerank
ORDER BY articlerank DESC
LIMIT 10

Results

We can observe that this ranking is a bit different from the ranking by direct citation count. The difference arises because ArticleRank takes into account not only the number of citations but also the importance of the citing papers.
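
To make the difference from a plain citation count concrete, here is a toy pure-Python sketch of the ArticleRank idea. It is not the GDS implementation. It assumes the commonly documented formulation in which the score a citing paper passes on is divided by its out-degree plus the average out-degree of the network, with a damping factor of 0.85. The function name and the four-paper example are made up.

```python
def article_rank(edges, damping=0.85, iterations=20):
    """Toy ArticleRank over (citing, cited) pairs; a sketch, not the GDS code."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)
    avg_out = sum(out_degree.values()) / len(nodes)

    scores = {node: 1 - damping for node in nodes}
    for _ in range(iterations):
        # Each citing paper passes on its score, damped by its own
        # out-degree plus the network-wide average out-degree
        scores = {
            node: (1 - damping) + damping * sum(
                scores[src] / (out_degree[src] + avg_out)
                for src in incoming[node]
            )
            for node in nodes
        }
    return scores


# Hypothetical citations: B, C and D cite A; D also cites B
scores = article_rank([("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")])
for paper, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(paper, round(score, 4))
```

Under this formulation, papers without any incoming citation stay at the base score of 1 - damping, which is about 0.15 with the default damping factor.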

We can also look at the authors with the highest sum of ArticleRank scores across their articles.

MATCH (a:Author)-[:WROTE]->(article:Article)
WHERE NOT a:Collaboration
RETURN a.name as author,
       sum(article.articlerank_citation) as rank
ORDER BY rank DESC
LIMIT 10

Results

The usual suspects come out on top like Lebedev A. and Li X. Another option we have is to look at the most influential articles by the year of publication.

MATCH (a:Article)
WHERE a.articlerank_citation > 0.151 AND a.date.year > 2010
WITH a, a.date.year as year, a.articlerank_citation as articlerank
ORDER BY articlerank DESC
RETURN year, collect(a.id)[..3] as most_influential_articles
ORDER BY year

Results

I have returned the article ids, as otherwise the results would not look nice in a blog post. You can take each of these ids and search for it on the ArXiv website.

#### Prepare an article search engine

We have analyzed both the co-authorship and the citation network. I have a good idea who I need to contact for my basement nuclear reactor. Let’s now create a search engine for the articles, where you get the most relevant articles based on the keyword you provide. We will use the Fulltext search index, which is available in Neo4j. I have written a detailed blog post explaining the FTS functionalities in Neo4j.

First, we have to store the publication year of the article as a string property to be able to load it in the Full-Text Search index.

MATCH (a:Article)
SET a.year = toString(a.date.year)

Now we can create the Full-Text Search index, where we index the title, abstract, and year property of articles.

CALL db.index.fulltext.createNodeIndex("titlesAndAbstracts",["Article"],["title", "abstract", "year"])

We can execute a sample FTS query looking for articles containing the LHC keyword in the text.

CALL db.index.fulltext.queryNodes("titlesAndAbstracts", "LHC")
YIELD node, score
RETURN node.title as article, score
LIMIT 5

Results

We can upgrade our query and search for articles containing the LHC keyword that were published in 2015. We can also combine the Lucene score with the ArticleRank score to return more influential papers.

CALL db.index.fulltext.queryNodes("titlesAndAbstracts", "LHC AND year:2015")
YIELD node, score
RETURN node.title as title,
       node.id as id,
       score * node.articlerank_citation as combined_score
ORDER BY combined_score DESC
LIMIT 5

Results

Another cool thing we can do with Fulltext search is to introduce time-decay scoring. I learned this trick from Christophe Willemsen.

WITH apoc.text.join([x in range(0,10) |
       "year:" + toString((date().year - x)) + "^" + toString(10-x)]," ") as time_decay
CALL db.index.fulltext.queryNodes("titlesAndAbstracts", "LHC " + time_decay)
YIELD node, score
RETURN node.title as title,
       node.id as id,
       score * node.articlerank_citation as combined_score
ORDER BY combined_score DESC
LIMIT 5

Results
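
The time-decay query above works by appending one boosted `year:` term per year to the search string, so that more recent years receive a higher Lucene boost. Here is a small Python sketch that builds the same string, which can help when inspecting what is actually sent to the index. The function name and the fixed year are assumptions for illustration. Note that Cypher’s `range(0,10)` includes both ends, so the oldest year gets a boost of 0.

```python
def time_decay_clause(current_year, window=10):
    """Build the boosted year terms, mirroring Cypher's inclusive range(0, window)."""
    return " ".join(
        f"year:{current_year - x}^{window - x}" for x in range(window + 1)
    )


# With a fixed year and a small window the structure is easy to read
print("LHC " + time_decay_clause(2020, window=2))
# LHC year:2020^2 year:2019^1 year:2018^0
```

Because the terms are joined with spaces rather than `AND`, the year terms only influence the score and do not exclude older articles that match the keyword.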

#### Develop an article recommendation engine

If we are already making an application on top of this dataset, we can also add a recommendation engine. We want to be able to recommend similar articles to the one we are currently reading. How do we go about that? We don’t have any data yet in the graph that we could use to group similar articles together. Luckily, we can use many NLP tools to find similar papers given their title and abstract text. We will use the SciBert model to extract embeddings for each article and then use the kNN algorithm to find similar papers based on their embeddings.

First, we have to calculate the embeddings and store them back to Neo4j. We will use the transformers library to load the pre-trained SciBert model and calculate the embeddings based on the title and abstract of an article.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

store_embeddings_query = """
UNWIND $data as row
MATCH (a:Article{id:row.article_id})
SET a.embeddings = row.embeddings
"""
input_embeddings = []
session = driver.session()
for i, row in articles_df.iterrows():
    # Tokenize title and abstract together, truncated to the model's 512-token limit
    inputs = tokenizer(row['clean_title'] + ' ' + row['clean_abstracts'], return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    # outputs[1] is the pooled output of the model
    embeddings = outputs[1].detach().numpy().tolist()[0]
    input_embeddings.append({'embeddings':embeddings, 'article_id': row['id']})

    # Batch the writes by 1000 articles
    if len(input_embeddings) % 1000 == 0:
        session.run(store_embeddings_query, {'data':input_embeddings})
        input_embeddings = []

# Store the remaining embeddings
session.run(store_embeddings_query, {'data':input_embeddings})
session.close()

This process took around an hour on my laptop. With the embeddings stored in Neo4j, we can go ahead and infer the kNN similarity network using the new k-Nearest Neighbour algorithm available in GDS. As always, we first have to project the in-memory graph.

CALL gds.graph.create('article_similarity','Article','*', {nodeProperties:['embeddings']})

We want to be able to recommend at most ten similar articles, so we will choose the topK value of 10 in the configuration option for the kNN algorithm. We will store the results as SIMILAR relationships between articles, where the score property will indicate the cosine similarity between the embeddings of the two articles.

CALL gds.beta.knn.write('article_similarity', {nodeWeightProperty:'embeddings', writeProperty:'score', writeRelationshipType:'SIMILAR', topK:10})
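
To illustrate what the resulting SIMILAR relationships represent, here is a brute-force Python sketch of a top-K cosine similarity search. This is only a conceptual illustration with made-up two-dimensional vectors and hypothetical function names. It compares every pair of vectors, which would be slow on the full dataset, and the GDS algorithm is, as far as I know, an approximate method that avoids comparing all pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(embeddings, k):
    """For every id, return its k most similar other ids with their scores."""
    result = {}
    for article_id, vector in embeddings.items():
        scored = [
            (other_id, cosine(vector, other_vector))
            for other_id, other_vector in embeddings.items()
            if other_id != article_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[article_id] = scored[:k]
    return result


# Made-up 2-dimensional "embeddings" for three hypothetical articles
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
for article_id, neighbours in top_k_similar(toy, k=1).items():
    print(article_id, neighbours)
```

Each `(other_id, score)` pair corresponds to one SIMILAR relationship with its `score` property.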

We can now observe the recommendations for a random article.

MATCH (a:Article{id:"1210.5482"})-[s:SIMILAR]->(similar)
RETURN a.title as original,
       similar.title as recommendation,
       s.score as score
ORDER BY score DESC
LIMIT 5

Results

And just like before, we can combine the similarity score and ArticleRank score to recommend more relevant articles.

MATCH (a:Article{id:"1210.5482"})-[s:SIMILAR]->(similar)
RETURN a.title as original,
       similar.title as recommendation,
       s.score * similar.articlerank_citation as score
ORDER BY score DESC
LIMIT 5

Results
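
Multiplying the two scores is a simple heuristic, and it can be useful to see how it reorders candidates. The sketch below applies the same product to a few made-up candidates in plain Python. The titles, scores, and the `rerank` helper are all hypothetical.

```python
def rerank(candidates, top_n=5):
    """Order candidates by similarity score multiplied by ArticleRank."""
    return sorted(
        candidates,
        key=lambda c: c["score"] * c["articlerank"],
        reverse=True,
    )[:top_n]


# Made-up candidates: X and Z are slightly more similar, Y is far more cited
candidates = [
    {"title": "X", "score": 0.90, "articlerank": 0.20},
    {"title": "Y", "score": 0.80, "articlerank": 1.50},
    {"title": "Z", "score": 0.95, "articlerank": 0.15},
]
print([c["title"] for c in rerank(candidates)])  # ['Y', 'X', 'Z']
```

Because ArticleRank values vary much more than the similarity scores here, the product is dominated by ArticleRank. If that is not the desired behavior, a weighted sum or a logarithm of the rank are possible alternatives to experiment with.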

#### Conclusion

I also wanted to extract scientific concepts using any available pre-trained NLP models, but unfortunately, I did not get any presentable results. I would have to analyze one of the biomedical categories as there are about a thousand pre-trained biomedical NLP models and very few from other categories. From studying the Nuclear Experiments category of articles, I learned that collaboration might be the way to go. I am very open to collaborating with an NLP practitioner to train a model, which would extract scientific concepts.

I encourage you to try Neo4j and load your data as a graph, which will allow you to find any network insights you might have missed till now. If you run into any questions along the way, you can join the Neo4j community, where friendly people can help you with your questions.

As always, the code is available on GitHub.