Learn how to import, clean, and analyze the ArXiv dataset in Neo4j. In the last step, you will learn how to create a search and recommendation engine for articles.

In Europe, we are deep in the second wave of Covid lockdown. I’ve seen some motivational speakers talk about using this time and learning a new skillset. As a child, I’ve always liked nuclear experiments, so I decided to build a reactor in my basement and try some experiments. I’ve already got a basement, so now I only need to learn nuclear physics or maybe get some nuclear researchers to help me out.

I got the idea from Estelle Scifo, who imported and analyzed the ArXiv dataset in Neo4j. We’ll take a detailed look at the nuclear experiments category of articles, try to learn some physics, and find some potential collaborators. As we’ll need to reference some papers in the future, we’ll also build a search and recommendation engine for articles to help us in our future endeavors.

As mentioned, we will analyze the ArXiv dataset available on Kaggle. I have prepared a Jupyter Notebook so that you can easily follow along. We will first do a little bit of data preprocessing and then import the data into Neo4j. Next, we will clean the article’s authors so that our exploratory graph and network analysis will be more accurate. Along the way, we will enrich our graph using the internal citation dataset available on GitHub. To finish this blog post, we will create the backend for a search and recommendation application of ArXiv articles.

Agenda:

  1. Data preprocessing
  2. Data cleaning
  3. Exploratory graph analysis
  4. Co-authorship network analysis
  5. Citation network analysis
  6. Prepare an article search engine
  7. Develop an article recommendation engine

Tools:

Data preprocessing

We will assume the ArXiv dataset has been downloaded and extracted into the analysis folder. First, we will load the articles in the nuclear experiments category into a Pandas data frame. The following script also preprocesses the authors’ names and the published date of the papers.

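import json
from datetime import datetime

import pandas as pd

# get_clean_authors is assumed to be defined in the accompanying notebook (not shown here)
articles = []
category = 'nucl-ex'
with open("arxiv-metadata-oai-snapshot.json", "r") as f:
    # Each line of the snapshot is parsed as one JSON document
    for l in f:
        d = json.loads(l)
        if category in d['categories'].split(' '):
            d['clean_authors'] = get_clean_authors(d['authors_parsed'])
            articles.append(d)

articles_df = pd.DataFrame.from_records(articles)
# Parse the creation timestamp of the first version of each article
articles_df['created_date'] = [datetime.strptime(date[0]['created'].split(',')[1],' %d %b %Y %H:%M:%S %Z') for date in articles_df['versions']]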
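The helper get_clean_authors lives in the notebook and is not shown in this post. As a rough sketch of what this kind of preprocessing could look like, the hypothetical functions below join the parts of a parsed author entry into one name and parse the creation date of the first version. The names join_author_parts and parse_created are mine, and the sketch assumes each authors_parsed entry is a list of name parts and that the timestamp looks like the sample in the comment.

```python
from datetime import datetime


def join_author_parts(authors_parsed):
    """Join the non-empty parts of every parsed author entry into a single name."""
    names = []
    for parts in authors_parsed:
        name = " ".join(p.strip() for p in parts if p and p.strip())
        if name:
            names.append(name)
    return names


def parse_created(versions):
    """Parse the 'created' timestamp of the first version of an article."""
    created = versions[0]["created"]  # assumed shape: "Mon, 2 Apr 2007 19:18:42 GMT"
    return datetime.strptime(created.split(",")[1], " %d %b %Y %H:%M:%S %Z")


print(join_author_parts([["Lebedev", "A.", ""], ["Ma", "Y. G.", ""]]))
print(parse_created([{"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"}]))
```

The real helper may well do more than this, for example handling affiliations or collaboration suffixes.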

If we take a look at a random abstract, we can notice that the text is in LaTeX format:

This result confirms that dominant contributions to the electric\nand magnetic polarizabilities may be represented in terms of two-photon\ncouplings to the $\\sigma$-meson having the predicted mass $m_\\sigma=666$ MeV\nand two-photon width $\\Gamma_{\\gamma\\gamma}=2.6$ keV.

We will transform both the title and abstract text to UTF-8 format using the pylatexenc library.

from pylatexenc.latex2text import LatexNodes2Text

# LaTex to UTF
clean_abstract = []
clean_title = []
for i,a in articles_df.iterrows():
    # Clean title
    try:
        clean_title.append(LatexNodes2Text().latex_to_text(a['title']).replace('\n', ' ').strip())
    except Exception:
        # Fall back to the raw title if the LaTeX conversion fails
        clean_title.append(a['title'].replace('\n', ' ').strip())
    # Clean abstract
    try:
        clean_abstract.append(LatexNodes2Text().latex_to_text(a['abstract']).replace('\n', ' ').strip())
    except Exception:
        # Fall back to the raw abstract if the LaTeX conversion fails
        clean_abstract.append(a['abstract'].replace('\n', ' ').strip())
articles_df['clean_abstracts'] = clean_abstract
articles_df['clean_title'] = clean_title

We got rid of the LaTeX formatting and have the title and abstract text in the UTF-8 format. The same abstract as before now looks like this:

This result confirms that dominant contributions to the electric and magnetic polarizabilities may be represented in terms of two-photon couplings to theσ-meson having the predicted massm_σ=666MeV and two-photon widthΓ_γγ=2.6keV.
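The title and abstract go through the same convert-or-fall-back logic, so it can be factored into one small function. This is only a sketch: to_plain_text and broken_converter are hypothetical names, and the converter is passed in as an argument so the example runs without pylatexenc installed.

```python
def to_plain_text(raw, convert):
    """Apply convert to raw; fall back to the raw text if the conversion raises."""
    try:
        text = convert(raw)
    except Exception:
        text = raw
    # Collapse line breaks the same way the loop above does
    return text.replace("\n", " ").strip()


def broken_converter(text):
    # Stand-in for a converter that fails on malformed input
    raise ValueError("cannot parse")


print(to_plain_text("Glauber\nModeling ", str.upper))
print(to_plain_text("Raw $\\sigma$\ntitle", broken_converter))
```

With pylatexenc available, the call would look like to_plain_text(a['title'], LatexNodes2Text().latex_to_text).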

Graph import

After some initial data wrangling, we can go ahead and import the data into Neo4j. The graph schema is quite simple. We have Article and Author nodes, and a WROTE relationship between the two, indicating who are the authors of a particular article.

Neo4j graph schema, made with Arrow tool

As with most Neo4j projects, it is advisable first to define the unique constraints for nodes. This way, the import will be faster as the unique identifiers of the nodes will be indexed. We will create unique constraints for both the Article and the Author nodes.

CREATE CONSTRAINT article IF NOT EXISTS ON (a:Article) ASSERT a.id IS UNIQUE;
CREATE CONSTRAINT authors IF NOT EXISTS ON (a:Author) ASSERT a.name IS UNIQUE;

The dataset and the unique constraints are ready, so we can go ahead and import the graph. An important thing to note is that when you import data with the Neo4j driver from your favorite scripting language, you never want to run a separate transaction for each line. Instead, batch several lines together into a single transaction. In our case, we batch 2000 lines into each transaction.

import_query = """
UNWIND $data as row
CREATE (a:Article)
SET a += apoc.map.clean(row,['authors'],[])
SET a.date = date(row['date'])
WITH a, row.authors as authors
UNWIND authors as author
MERGE (au:Author{name:author})
MERGE (au)-[:WROTE]->(a)
"""

import_data = []
session = driver.session()
for i, row in articles_df.iterrows():
    import_data.append({'id':row['id'], 'title':row['clean_title'],   
                       'abstract':row['clean_abstracts'], 
                        'date':row['created_date'].strftime('%Y-%m-%d'),  
                        'authors':row['clean_authors']})
    # Batch import by 2000 lines
    if len(import_data) == 2000:
        session.run(import_query, {'data':import_data})
        import_data = []

session.run(import_query, {'data':import_data})
session.close()
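The batching idea is independent of Neo4j, so here is a minimal sketch of it on its own. The generator name chunked is hypothetical. It yields full batches and then whatever is left over, so no separate flush after the loop is needed.

```python
def chunked(rows, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # the final, possibly smaller, batch
        yield batch


for batch in chunked(range(5), 2):
    print(batch)
```

In the import script, each batch would be sent with session.run(import_query, {'data': batch}).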

Data Cleaning

If we take a look at our graph in Neo4j Browser, we should see the same schema as we defined before.

Sample data in Neo4j Browser

We’ll quickly see why this part is titled Data Cleaning. Let’s begin by examining the most active authors in the Nuclear Experiments category.

MATCH (a:Author)
RETURN a.name as author, size((a)-[:WROTE]->()) as number_of_articles
ORDER BY number_of_articles DESC LIMIT 10

Results

I wasn’t aware of this before, but it looks like a collaboration such as ALICE or STAR can be the author of an article. Let’s explore the collaborations a bit more and take a look at the most active collaborations overall.

MATCH (a:Author)
WHERE a.name contains "Collaboration"
RETURN a.name as author,
size((a)-[:WROTE]->()) as number_of_articles
ORDER BY number_of_articles
DESC LIMIT 10

Results

We already knew that the ALICE and STAR collaborations are the most active, but there are many more collaborations in the nuclear experiment category. Unfortunately, the data appears dirty, as there are two STAR collaboration variations and two CLAS collaboration variations. We’ll do some data cleaning of the Author entities before we continue with the exploratory graph analysis.

Original image borrowed from https://www.wikihow.com/Get-Out-of-Washing-the-Dishes and slightly edited

Data cleaning is the most fun part of any analysis. Well, not really, but it is a vital part of the research if we want to rely on the data. We’ll introduce a secondary label for all collaboration nodes. Again, we will define a unique constraint for optimization and indexing. 

CREATE CONSTRAINT collaboration IF NOT EXISTS ON (c:Collaboration) ASSERT c.id IS UNIQUE;

Let’s take a look at the potential candidates that could be merged into a single entity. We will remove stop words like “the”, “collaboration”, and “for” from the author’s name and examine how well this simple technique works.

WITH ["The","the","Collaboration", "collaboration", "for", "For", "\n","on behalf of","By "] as remove_words
MATCH (a:Author)
WHERE a.name contains "ollaboration"
WITH trim(reduce(v = a.name, word in remove_words | apoc.text.replace(v, word, ' '))) as author,
collect(a.name) as nodes
WHERE size(nodes) > 1
RETURN author, nodes
ORDER BY author
LIMIT 10

Results

author | nodes
9405 | [9405 collaboration, collaboration the 9405]
A1 | [A1 Collaboration, Collaboration A1, Collaboration for the A1]
AGATA | [Collaboration The AGATA, collaboration the AGATA]
ALADIN | [Collaboration ALADIN, Collaboration the ALADIN, collaboration ALADIN]
ALADIN2000 | [Collaboration ALADIN2000, The ALADIN2000 Collaboration]
ALICE | [ALICE Collaboration, ALICE collaboration, Collaboration ALICE, Collaboration for ALICE, Collaboration for the ALICE, Collaboration the ALICE, collaboration the ALICE]
ATLAS | [ATLAS Collaboration, Collaboration ATLAS, The ATLAS Collaboration]
Adamczewski-Musch J. HADES | [Adamczewski-Musch J. HADES collaboration, Adamczewski-Musch J. HADES Collaboration]
Adamczyk L. STAR | [Adamczyk L. STAR Collaboration, Adamczyk L. STAR Collaboration]
Adhikari K. P. CLAS | [Adhikari K. P. for the CLAS collaboration, Adhikari K. P. for the CLAS Collaboration, Adhikari K. P. for the CLAS collaboration]

It appears that this simple technique found many candidates to be merged. It also found examples where both an author and a collaboration can be mentioned as a single author. An example would be Adhikari K. P. for the CLAS Collaboration. We will first merge these candidates as presented in the previous query and go from there.
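To see why the stop-word trick groups these variants together, here is the same idea as a small Python sketch. It is not a translation of the Cypher query: it uses plain substring replacement instead of apoc.text.replace, and collapsing repeated whitespace is my addition. The function names are hypothetical.

```python
from collections import defaultdict

REMOVE_WORDS = ["The", "the", "Collaboration", "collaboration",
                "for", "For", "\n", "on behalf of", "By "]


def normalize(name):
    """Strip the stop words as plain substrings, then collapse whitespace."""
    for word in REMOVE_WORDS:
        name = name.replace(word, " ")
    return " ".join(name.split())


def merge_candidates(names):
    """Group raw names by normalized form and keep groups with more than one variant."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


names = ["ALICE Collaboration", "Collaboration for the ALICE",
         "The ATLAS Collaboration", "STAR Collaboration"]
print(merge_candidates(names))
```

Because the match is on substrings, a surname that happens to contain “the” or “for” would be clipped as well, so the output is best treated as a list of candidates to review.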

WITH ["The","the","Collaboration", "collaboration", "for", "For", "\n","on behalf of","By "] as remove_words
MATCH (a:Author)
WHERE (a.name contains "ollaboration" OR a.name IN ["INDRA","H1"]) AND NOT a.name CONTAINS "ollaborations"
WITH trim(reduce(v = a.name, word in remove_words | apoc.text.replace(v, word, ' '))) as collab,
collect(a) as nodes
CALL apoc.refactor.mergeNodes(nodes) YIELD node
SET node.name = collab, node:Collaboration
RETURN distinct 'done' as result

Next, we can decouple the entities where both the author and the collaboration are present as a single entity. We will start with a simple assumption that the last word in the author’s name represents the collaboration. We will compare the last word of the author’s name with the existing collaborations in our graph. If the last word matches an existing collaboration, we will decouple the single author node into collaboration and author nodes.

MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
// get the last word of the name
WITH c, split(c.name,' ')[-1] as collab
// Match existing collaboration
MATCH (c1:Collaboration{name:collab})
RETURN trim(replace(c.name, collab, '')) as author, c1.name as collab
LIMIT 5

Results

author | collab
Rybczynski M. | NA49
Sikler F. | NA49
Sitar B. | NA49
Strabel C. | NA49
Stroebele H. | NA49

This query only looked at intermediate results but did not store the decoupled relationships. The decoupling results are satisfactory, so we will go ahead and save the decoupled relationships back to the graph.
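The last-word assumption can be written down in a few lines of Python. This is a sketch under that same assumption. The function name and the set of known collaborations are made up for illustration.

```python
def split_author_and_collaboration(name, known_collaborations):
    """Return (author, collaboration) if the last word is a known collaboration, else None."""
    words = name.split(" ")
    if len(words) < 2 or words[-1] not in known_collaborations:
        return None
    collab = words[-1]
    # Same string operation as the Cypher query: drop the collaboration name and trim
    author = name.replace(collab, "").strip()
    return author, collab


known = {"NA49", "ALICE", "STAR"}
print(split_author_and_collaboration("Rybczynski M. NA49", known))
print(split_author_and_collaboration("Malace S. Jefferson Lab Hall A", known))
```

The second call returns None because the last word “A” is not a collaboration on its own, which is the kind of multi-word name this approach misses.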

MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
WITH c, split(c.name,' ')[-1] as collab
MATCH (new_colab:Collaboration{name:collab})
WITH c,new_colab, trim(replace(c.name, collab, '')) as author
MATCH (c)-[:WROTE]->(article:Article)
MERGE (new_author:Author{name:author})
MERGE (new_colab)-[:WROTE]->(article)
MERGE (new_author)-[:WROTE]->(article)
WITH distinct c
DETACH DELETE c
RETURN distinct 'done' as result

Let’s examine the potential collaboration candidates that our previous decoupling technique missed.

MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1
RETURN c.name as author
LIMIT 6

Results

author
Aidala C. A. EIC
Malace S. Jefferson Lab Hall A
Paolone M. Jefferson Lab Hall A
Strauch S. Jefferson Lab Hall A
Shitov Yu. SuperNEMO
Bocquet JP. GRAAL

Some collaboration names are longer than a single word, like “Jefferson Lab Hall A”, and we have missed those. Others weren’t identified as collaborations before because they are not explicitly labeled as such in the text. With a bit of manual work, I have listed a bunch of other collaborations that need to be decoupled as well.

UNWIND ["Jefferson Lab Hall A", "NA61/SHINE", "Crystal Ball at MAMI", "WA98", "A2 at MAMI","JETSCAPE", "DLS", "WASA-at-COSY", "CERES/NA45", "ALADIN'2000","AMADEUS", "CB-ELSA", "GRAAL", "KEK-PS E559","CELSIUS-WASA","RHIC Spin", "EIC", "LSSS", "COSY-11", "PAX", "Hall A", "LPC-CHARISSA-DEMON","Crystal Ball at MAMI", "KLOE-2", "Graal", "HAL QCD", "MAJORANA.","Daya Bay","UConn-Yale-TUNL-Weizmann-PTB-UCL", "KamLAND-Zen", "COMPASS", "PREX", "ALICE HLT"] as collab
MATCH (c:Collaboration)
WHERE size(split(c.name,' ')) > 1 AND NOT c.name in ["Jefferson Lab Hall A", "Hall A DVCS", "Hall A", "HAL QCD"] AND
c.name contains collab
WITH c, trim(replace(c.name,collab,'')) as author, collab
MERGE (new_author:Author{name:author})
MERGE (new_collab:Author{name:collab})
SET new_collab:Collaboration
WITH c, new_author, new_collab
MATCH (c)-[:WROTE]->(article:Article)
MERGE (new_author)-[:WROTE]->(article)
MERGE (new_collab)-[:WROTE]->(article)
WITH distinct c
DETACH delete c
RETURN distinct 'done' as result

Last, but not least, the “Jefferson Lab Hall A” shows up also as “Hall A DVCS” and “Hall A”, so we’ll merge them into a single entity as well.

MATCH (c:Collaboration)
WHERE c.name contains "Hall A"
WITH c ORDER BY c.name
WITH collect(c) as nodes
CALL apoc.refactor.mergeNodes(nodes, {properties: {name:'overwrite'}}) YIELD node
RETURN distinct 'done' as result

We have cleaned the entities where a single collaboration, or a collaboration paired with an author, appeared. The dataset also has entities where two or more collaborations occur along with an author. Because this is not a data cleaning workshop, we’ll skip cleaning this data.

MATCH (a:Author)
WHERE a.name CONTAINS "Collaborations"
RETURN a.name as author LIMIT 5

Results

author
Acha A. HKS – JLab E05-115 and E01-001 – Collaborations
Achenbach P. HKS – JLab E05-115 and E01-001 – Collaborations
Adhikari K. P. The CLAS and Hall-A Collaborations
Aghasyan M. The CLAS and Hall-A Collaborations
Aguar-Bartolomé P. The Crystal Ball at MAMI, TAPS, and A2 Collaborations

Exploratory graph analysis

Finally, we can start our exploratory graph analysis. We will begin by taking a look at the number of articles written by publication year.

MATCH (a:Article)
RETURN a.date.year as year, count(*) as articles
ORDER BY year

Results

The nuclear experiment category on ArXiv was created in 1994. It steadily rose in popularity until 2011, and the number of articles written per year has plateaued over the last decade.

We will again take a look at the most active authors but disregard any collaborations this time.

MATCH (a:Author)
WHERE NOT a:Collaboration
RETURN a.name as author,
size((a)-[:WROTE]->()) as number_of_articles
ORDER BY number_of_articles DESC
LIMIT 10

Results

Lebedev A. remains the most active author, while Ma Y. G. and Li X. follow in second and third place. Another interesting statistic to look at is who has collaborated with the largest number of other authors.

MATCH (a:Author)-[:WROTE]->()<-[:WROTE]-(other)
WHERE NOT a:Collaboration AND NOT other:Collaboration
RETURN a.name as author,
count(distinct other) as number_of_coauthors
ORDER BY number_of_coauthors DESC
LIMIT 10

Results

Wang Y. has collaborated with more than 5000 other authors. When I first observed this data, I thought for sure there was some error in the data. It turns out this is correct. The only explanation is that there are articles on which many scientists collaborated. We’ll take a look at the articles with the highest count of authors.
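A toy example makes it easier to see how a handful of papers with long author lists can inflate the co-author count. The sketch below counts distinct co-authors from a plain dictionary of article id to author list. The data and the function name are invented for illustration.

```python
from collections import defaultdict


def coauthor_counts(article_authors):
    """Map each author to the number of distinct people they share an article with."""
    coauthors = defaultdict(set)
    for authors in article_authors.values():
        for author in authors:
            coauthors[author].update(a for a in authors if a != author)
    return {author: len(others) for author, others in coauthors.items()}


toy = {"paper-1": ["A", "B", "C"], "paper-2": ["A", "B"], "paper-3": ["A", "D"]}
print(coauthor_counts(toy))
```

Every author on a paper gains all the other names on that paper at once, so a single article with 800 authors adds up to 799 co-authors to each of them.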

MATCH (a:Author{name:"Wang Y."})-[:WROTE]->(article)
RETURN article.id as article,
size((article)<-[:WROTE]-()) as number_of_authors
ORDER BY number_of_authors DESC
LIMIT 10

Results

There are four articles where more than 800 researchers have collaborated. I’ve also checked the ArXiv page to verify this data, and it turns out to be valid. I can only applaud the coordinators of these articles, who managed to finish a paper with almost a thousand collaborators.

Next, we’ll take a look at pairs of researchers that have collaborated the most.

MATCH (a:Author)-[:WROTE]->()<-[:WROTE]-(other)
WHERE id(a) < id(other) AND NOT a:Collaboration AND NOT other:Collaboration
RETURN a.name + ' with ' + other.name as authors,
count(*) as number_of_collaborations
ORDER BY number_of_collaborations DESC
LIMIT 10

Results

Lebedev A. has worked on more than 200 articles with Li X. and around 180 articles with Pei H. In the third place, we have Li X. and Pei H. We might assume that Lebedev A., Li X., and Pei H. are the powerhouse in the Nuclear Experiments category.

We can also examine the most active authors by publication year for the last decade.

MATCH (a:Author)-[:WROTE]->(article:Article)
WHERE NOT a:Collaboration AND article.date.year > 2010
WITH article.date.year as year, a.name as author, count(*) as count
ORDER BY count DESC
RETURN year,
collect(author)[..3] as most_active_authors
ORDER BY year

Results

year | most_active_authors
2011 | [Ma Y. G., Chung P., Arrington J.]
2012 | [Lebedev A., Dion A., Fleuret F.]
2013 | [Li X., Lebedev A., Masui H.]
2014 | [Lebedev A., Li X., Pei H.]
2015 | [Lebedev A., Li X., Manion A.]
2016 | [Ma Y. G., Zhang Y., Lebedev A.]
2017 | [Zhang J., Lebedev A., Skoby M. J.]
2018 | [Lebedev A., Taranenko A., Zhang J.]
2019 | [Li X., Wang Y., Taranenko A.]
2020 | [Taranenko A., Lebedev A., Ma Y. G.]

Co-authorship network analysis

If we remember, the idea for this article came out of the need to find a potential collaborator for my basement project. After some basic graph exploration, it seems that one can’t do nuclear experiments on one’s own but has to join a collaboration or find a study to participate in with 800 other scientists.

We will explore the co-authorship network with the help of the Neo4j Graph Data Science library. If you need a quick refresher on how the GDS library works, I would suggest the Introduction to graph algorithm course.

An important thing to note is that our network analysis of the co-authorship network does not consider how influential the papers the authors have written are. It can be seen as more of a social network analysis, where we assume that if two authors collaborated on an article, they know each other. This way, I can find authors with lots of networking influence to collaborate on my project and introduce me to various people and organizations.

We will use the Cypher projection feature to project the co-authorship network. The co-authorship relationship between the authors does not exist in our stored graph, so it could be said that we will be projecting a virtual network where we reduce the 2-hop relationship to a direct relationship without having to save the direct relationship in Neo4j. We will also disregard nodes with the Collaboration label in our analysis.

CALL gds.graph.create.cypher('coauthorship-network',
'MATCH (a:Author) WHERE NOT a:Collaboration RETURN id(a) as id',
'MATCH (s:Author)-[:WROTE]->()<-[:WROTE]-(t:Author)
WHERE NOT s:Collaboration AND NOT t:Collaboration
RETURN id(s) as source, id(t) as target, count(*) as weight')
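The projection collapses the two-hop pattern author, article, author into a direct weighted edge, where the weight is the number of shared articles. Here is a rough Python equivalent on toy data, using hypothetical names. Unlike the Cypher projection, which returns each pair in both directions, this sketch stores every unordered pair once.

```python
from collections import Counter
from itertools import combinations


def coauthorship_edges(article_authors):
    """Count shared articles for every unordered pair of authors."""
    weights = Counter()
    for authors in article_authors.values():
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


toy = {"paper-1": ["A", "B", "C"], "paper-2": ["B", "A"]}
print(coauthorship_edges(toy))
```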

We will begin with the Weakly Connected Components algorithm. It is used to find disconnected components or islands within the graph.

CALL gds.wcc.write('coauthorship-network', {writeProperty:'wcc_coauthorship'})
YIELD componentCount, componentDistribution

Results

There is a total of 2281 components in our network. More than half of them consist of a single node, which indicates that the specific author has no collaborations. As with most real-world networks, we have a single super component that contains 85% of all nodes and then many smaller ones.
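For intuition, connected components on an undirected graph can be found with a union-find structure. The sketch below is a textbook version on toy data and makes no claim about how GDS implements WCC internally. An author without co-authors ends up in a component of size one, as described above.

```python
def connected_components(nodes, edges):
    """Label every node with a component id using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, halving the path along the way
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {n: find(n) for n in nodes}


labels = connected_components(
    ["A", "B", "C", "D", "E", "F"],
    [("A", "B"), ("B", "C"), ("D", "E")],
)
print(len(set(labels.values())))
```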

Next, we will execute the weighted variant of the PageRank algorithm to find the authors with the highest networking potential. We expect that authors who collaborated with other influential networking authors will rank the highest. We will store the results to Neo4j with the write mode of the algorithm.

CALL gds.pageRank.write('coauthorship-network', 
{writeProperty:'pagerank_coauthorship', maxIterations:20,
relationshipWeightProperty:'weight'})
YIELD centralityDistribution
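To make the weighted variant concrete, here is a small power-iteration sketch of the textbook PageRank update, where a node passes its score to its neighbors in proportion to the edge weights. The function name and the toy graph are hypothetical, and GDS may differ in details such as normalization and convergence checks.

```python
def weighted_pagerank(edges, damping=0.85, iterations=20):
    """edges maps a directed (source, target) pair to a positive weight."""
    nodes = set()
    out_weight = {}
    for (source, target), weight in edges.items():
        nodes.update((source, target))
        out_weight[source] = out_weight.get(source, 0.0) + weight

    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_scores = {n: 1.0 - damping for n in nodes}
        for (source, target), weight in edges.items():
            # A node hands out its score proportionally to the edge weights
            new_scores[target] += damping * scores[source] * weight / out_weight[source]
        scores = new_scores
    return scores


# Star-shaped toy network: "hub" co-authored once with each of three others
pairs = [("hub", "a"), ("hub", "b"), ("hub", "c")]
edges = {}
for s, t in pairs:
    edges[(s, t)] = 1.0
    edges[(t, s)] = 1.0  # co-authorship is symmetric
scores = weighted_pagerank(edges)
print(max(scores, key=scores.get))
```

The hub comes out on top because every other node sends its whole score there, which matches the intuition of networking potential.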

Let’s examine the authors with the highest networking potential.

MATCH (a:Author)
WHERE NOT a:Collaboration
RETURN a.name as author, a.pagerank_coauthorship as pagerank
ORDER BY pagerank DESC
LIMIT 10

Results

It seems that Zhang Y., Lebedev A., and Li X. have the highest networking potential. If someone can get me their contact, I would appreciate it.

We can also examine authors with the highest PageRank score grouped by connected component for the five largest components.

MATCH (a:Author)
WHERE NOT a:Collaboration
WITH a.wcc_coauthorship as componentId, a.pagerank_coauthorship as pagerank, a
ORDER BY pagerank DESC
RETURN componentId, count(*) as countOfMembers, collect(a.name)[..3] as mostInfluentialMembers
ORDER BY countOfMembers DESC
LIMIT 5

Results

componentId | countOfMembers | mostInfluentialMembers
0 | 46666 | [Zhang Y., Lebedev A., Li X.]
19971 | 200 | [Pomerantz I. The CLAS and Hall-A Collaborations, Ilieva Y. The CLAS and Hall-A Collaborations, Gilman R. The CLAS and Hall-A Collaborations]
14137 | 177 | [Strakovsky Igor GWU, Bellwied Rene Houston U., Briscoe William GWU]
25481 | 133 | [Tang L. HKS – JLab E05-115 and E01-001 – Collaborations, Chen C. HKS – JLab E05-115 and E01-001 – Collaborations, Gogami T. HKS – JLab E05-115 and E01-001 – Collaborations]
17156 | 132 | [Elaasar M. HKS, Ent R. HKS, Han Y. HKS]

More than anything, the WCC results alert us to the author data we haven’t cleaned. I assume the largest component, which consists of 46666 members, is the component with the cleaned author names. The second-largest component consists of author entries where an author appears along with two collaborations. Last, there are a couple of associations or organizations I have missed in my manual cleanup process.

We will also use the approximate Betweenness centrality to observe who has the most influence over the information flow in the co-authorship network.

CALL gds.betweenness.stream('coauthorship-network', 
{samplingSize:1000})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as author, score as betweenness
ORDER BY betweenness DESC LIMIT 10

Results

Citation network analysis

The time has come to enrich our graph with external data. The internal citation network between ArXiv articles is available on GitHub. You will need to download the internal references file. If you look closely at the Kaggle ArXiv dataset footnotes, this is the same tool they used to scrape the ArXiv articles and authors. It makes sense that the author’s information is dirty as it was scraped from the PDFs.

After we have downloaded the internal references dataset, we can import it using the following Python script.

import_citations_query = """
UNWIND $data as row
MATCH (source:Article{id:row.source})
WITH source, row.cites as cites
UNWIND cites as cite
MATCH (target:Article{id:cite})
MERGE (source)-[:CITES]->(target)
RETURN distinct 'done' as result
"""

# Use a set for fast membership checks against the imported articles
articles_id = set(articles_df['id'])
citations_params = []
session = driver.session()

with open('internal-references-pdftotext.json') as citation_file:
    citations = json.load(citation_file)

for article in citations:
    # Skip articles that are not in our nuclear experiments data frame
    if article not in articles_id:
        continue
    citations_params.append({'source':article, 'cites': citations[article]})
    if (len(citations_params) % 1000) == 0:
        session.run(import_citations_query, {'data': citations_params})
        citations_params = []

session.run(import_citations_query, {'data': citations_params})
session.close()        

Let’s first count the number of internal citations we have imported.

MATCH ()-[r:CITES]->() 
RETURN count(*) as count_of_citations

There are 56172 citations between 20874 articles. The dataset was compiled on 30th April 2019, so we don’t have any citations after that date. We can now look at the most cited articles within our network.

MATCH (a:Article)
RETURN a.title as title,
size((a)<-[:CITES]-()) as citations
ORDER BY citations DESC LIMIT 10

Results

title | citations
Formation of dense partonic matter in relativistic nucleus-nucleus collisions at RHIC: Experimental evaluation by the PHENIX collaboration | 256
Experimental and Theoretical Challenges in the Search for the Quark Gluon Plasma: The STAR Collaboration's Critical Assessment of the Evidence from RHIC Collisions | 247
Glauber Modeling in High Energy Nuclear Collisions | 222
Hydrodynamic description of ultrarelativistic heavy-ion collisions | 220
Quark Gluon Plasma an Color Glass Condensate at RHIC? The perspective from the BRAHMS experiment | 202
Collective phenomena in non-central nuclear collisions | 201
The PHOBOS Perspective on Discoveries at RHIC | 194
Performance of the ALICE Experiment at the CERN LHC | 177
Observation and studies of jet quenching in PbPb collisions at nucleon-nucleon center-of-mass energy = 2.76 TeV | 169
Long-range angular correlations on the near and away side in p-Pb collisions at √(s_ NN) = 5.02 TeV | 156

To my uneducated eye, it seems that the most cited articles talk about nuclear collisions. Next, we will calculate the ArticleRank score for each article in the citation network. ArticleRank is a variation of the PageRank algorithm. Find more details here.

First, we will project the citation network with the Native projection feature.

CALL gds.graph.create('citation-network', 'Article', 'CITES')

Now we can execute the ArticleRank and store the results back to Neo4j.

CALL gds.alpha.articleRank.write('citation-network', 
{writeProperty:'articlerank_citation'})
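As I understand it, ArticleRank differs from PageRank in how much score a citing paper passes on: the score is divided by the paper's out-degree plus the average out-degree of the graph, which dampens the influence of papers with very few references. The sketch below implements that formula on toy data. Treat it as my reading of the algorithm with hypothetical names, not as the GDS implementation.

```python
def article_rank(citations, damping=0.85, iterations=20):
    """citations maps an article id to the list of article ids it cites."""
    nodes = set(citations)
    for targets in citations.values():
        nodes.update(targets)

    out_degree = {n: len(set(citations.get(n, ()))) for n in nodes}
    average_out_degree = sum(out_degree.values()) / len(nodes)

    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        new_scores = {n: 1.0 - damping for n in nodes}
        for source, targets in citations.items():
            # Adding the average out-degree is what separates this from plain PageRank
            share = scores[source] / (out_degree[source] + average_out_degree)
            for target in set(targets):
                new_scores[target] += damping * share
        scores = new_scores
    return scores


toy = {"p1": ["p3"], "p2": ["p3"], "p3": []}
ranks = article_rank(toy)
print(max(ranks, key=ranks.get))
```

An article that nobody cites keeps the base score of 1 - damping, which is 0.15 with the default used here.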

Let’s examine the ArticleRank algorithm results.

MATCH (a:Article)
RETURN a.title as title, a.articlerank_citation as articlerank
ORDER BY articlerank
DESC LIMIT 10

Results

title | articlerank
Hydrodynamic description of ultrarelativistic heavy-ion collisions | 7.417194875504284
Collective phenomena in non-central nuclear collisions | 5.009129377282968
Eccentricity fluctuations and its possible effect on elliptic flow measurements | 4.826668283758324
Glauber Modeling in High Energy Nuclear Collisions | 3.8863243274621846
The Color Glass Condensate and High Energy Scattering in QCD | 3.617933058337312
Reconstructing azimuthal distributions in nucleus-nucleus collisions | 3.539150645807085
The PHOBOS Glauber Monte Carlo | 3.293096240782206
Formation of dense partonic matter in relativistic nucleus-nucleus collisions at RHIC: Experimental evaluation by the PHENIX collaboration | 3.2591419976698206
An Experimental Exploration of the QCD Phase Diagram: The Search for the Critical Point and the Onset of De-confinement | 3.168803844159089
Experimental and Theoretical Challenges in the Search for the Quark Gluon Plasma: The STAR Collaboration's Critical Assessment of the Evidence from RHIC Collisions | 3.0016942140627623

We can observe that the ranking differs a bit from the ranking by direct citation count. The difference arises because ArticleRank takes into account the importance of the citing papers as well as the number of citations.

We can also look at the authors with the highest sum of ArticleRank over their articles.

MATCH (a:Author)-[:WROTE]->(article:Article)
WHERE NOT a:Collaboration
RETURN a.name as author, sum(article.articlerank_citation) as rank
ORDER BY rank
DESC LIMIT 10

Results

The usual suspects come out on top like Lebedev A. and Li X. Another option we have is to look at the most influential articles by the year of publication.

MATCH (a:Article)
WHERE a.articlerank_citation > 0.151 AND a.date.year > 2010
WITH a, a.date.year as year, a.articlerank_citation as articlerank
ORDER BY articlerank DESC
RETURN year, collect(a.id)[..3] as most_influential_articles
ORDER BY year

Results

year | most_influential_articles
2011 | [1101.0710, 1101.0988, 1101.1257]
2012 | [1201.0373, 1201.0392, 1201.0699]
2013 | [1301.0099, 1301.0165, 1301.0324]
2014 | [1403.4257, 1403.4455, 1403.4668]
2015 | [1511.02834, 1511.02957, 1511.03338]
2016 | [1601.00040, 1601.00079, 1601.00188]
2017 | [1709.05325, 1709.05618, 1709.05649]
2018 | [1801.01124, 1801.01213, 1801.01277]
2019 | [1901.01319, 1901.04378, 1901.04482]

I have returned the article ids instead of the titles, as otherwise the results would not look nice in a blog post. You can take each of these ids and search for them on the ArXiv website. Note that the ids listed above are in ascending order within each year, so treat them as illustrative rather than as a strict ranking; your own run of the query may return different articles.

Prepare an article search engine

We have analyzed both the co-authorship and the citation network. I have a good idea who I need to contact for my basement nuclear reactor. Let’s now create a search engine for the articles, where you get the most relevant articles based on the keyword you provide. We will use the Fulltext search index, which is available in Neo4j. I have written a detailed blog post explaining the FTS functionalities in Neo4j.

First, we have to store the publication year of the article as a string property to be able to load it in the Full-Text Search index.

MATCH (a:Article)
SET a.year = toString(a.date.year)

Now we can create the Full-Text Search index, where we index the title, abstract, and year property of articles.

CALL db.index.fulltext.createNodeIndex("titlesAndAbstracts",
["Article"],["title", "abstract", "year"])

We can execute a sample FTS query looking for articles containing the LHC keyword in the text.

CALL db.index.fulltext.queryNodes("titlesAndAbstracts", "LHC")
YIELD node, score
RETURN node.title as article, score
LIMIT 5

Results

article | score
Physics perspectives with AFTER@LHC (A Fixed Target ExpeRiment at LHC) | 4.070171356201172
Quarkonium-photoproduction prospects at a fixed-target experiment at the LHC (AFTER@LHC) | 3.9622769355773926
Charmonium production at the LHC | 3.870992422103882
Prospectives for A Fixed-Target ExpeRiment at the LHC: AFTER@LHC | 3.811666250228882
Double-quarkonium production at a fixed-target experiment at the LHC (AFTER@LHC) | 3.7244210243225098

We can upgrade our query and search for articles containing the LHC keyword that were published in 2015. We can also combine the Lucene score with the ArticleRank score to return more influential papers.

CALL db.index.fulltext.queryNodes("titlesAndAbstracts", "LHC AND year:2015")
YIELD node, score
RETURN node.title as title,
node.id as id,
score * node.articlerank_citation as combined_score
ORDER BY combined_score DESC
LIMIT 5

Results

title | id | combined_score
Heavy-flavour and quarkonium production in the LHC era: from proton-proton to heavy-ion collisions | 1506.03981 | 3.9775452998468297
New parton distribution functions from a global analysis of quantum chromodynamics | 1506.07443 | 1.5848195160429013
Studies of Transverse-Momentum-Dependent distributions with A Fixed-Target ExpeRiment using the LHC beams (AFTER@LHC) | 1502.00984 | 1.4173197202469938
Production of light nuclei and anti-nuclei in pp and Pb-Pb collisions at LHC energies | 1506.08951 | 1.3544381748399803
Measurement of charged-particle spectra in Pb+Pb collisions at √(s_𝖭𝖭) = 2.76 TeV with the ATLAS detector at the LHC | 1504.04337 | 1.3431175946645912

Another cool thing we can do with Fulltext search is to introduce time-decay scoring. I learned this trick from Christophe Willemsen.

WITH apoc.text.join([x in range(0,10) | 
"year:" + toString((date().year - x)) + "^" +
toString(10-x)]," ") as time_decay
CALL db.index.fulltext.queryNodes("titlesAndAbstracts", "LHC " + time_decay) YIELD node, score
RETURN node.title as title, node.id as id, score * node.articlerank_citation as combined_score
ORDER BY combined_score DESC
LIMIT 5
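The time-decay part of the query is just a string of boosted year terms, so it can also be assembled on the application side before calling the index. Here is a sketch with a hypothetical helper name. Note that Cypher's range(0,10) includes both ends, so the query above also emits an eleventh term with a boost of 0, whereas this sketch stops at a boost of 1.

```python
def time_decay_clause(current_year, years=10):
    """Build Lucene terms that boost recent publication years more than older ones."""
    terms = []
    for offset in range(years):
        boost = years - offset
        terms.append(f"year:{current_year - offset}^{boost}")
    return " ".join(terms)


print("LHC " + time_decay_clause(2020, years=3))
# prints: LHC year:2020^3 year:2019^2 year:2018^1
```

The resulting string can be passed as the second argument of db.index.fulltext.queryNodes, for example through a query parameter.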

Results

| title | id | combined_score |
| --- | --- | --- |
| Performance of the ALICE Experiment at the CERN LHC | 1402.4476 | 13.141212962671805 |
| Heavy-flavour and quarkonium production in the LHC era: from proton-proton to heavy-ion collisions | 1506.03981 | 9.478550376224753 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | 1210.5482 | 6.864689978179114 |
| Collective flow and viscosity in relativistic heavy-ion collisions | 1301.2826 | 6.646857179646787 |
| Observation and studies of jet quenching in PbPb collisions at nucleon-nucleon center-of-mass energy = 2.76 TeV | 1102.1957 | 6.2870712363435235 |
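
To see what the time-decay query actually sends to Lucene, here is a small Python sketch that builds the same boost string as the apoc.text.join expression above. The function name time_decay_clause and its window parameter are my own illustrative choices. The current year gets the highest boost and each earlier year gets one less.

```python
def time_decay_clause(current_year, window=10):
    """Return Lucene boost terms such as 'year:2020^10 year:2019^9 ...'.

    Mirrors the Cypher expression: for x in 0..window-1, the year
    (current_year - x) is boosted by (window - x).
    """
    return " ".join(
        f"year:{current_year - x}^{window - x}" for x in range(window)
    )


# A three-year window keeps the example short
lucene_query = "LHC " + time_decay_clause(2020, window=3)
print(lucene_query)  # LHC year:2020^3 year:2019^2 year:2018^1
```

Because the terms are joined with spaces and no AND, they act as optional clauses: an article from an older year can still match on the keyword, it just does not receive a boost.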

Develop an article recommendation engine

Since we are already building an application on top of this dataset, we can also add a recommendation engine. We want to be able to recommend articles similar to the one we are currently reading. How do we go about that? We don’t have any data in the graph yet that we could use to group similar articles together. Luckily, we can use many NLP tools to find similar papers given their title and abstract text. We will use the SciBERT model to extract an embedding for each article and then use the kNN algorithm to find similar papers based on their embeddings.

First, we have to calculate the embeddings and store them back to Neo4j. We will use the transformers library to load the pre-trained SciBERT model and calculate the embeddings based on the title and abstract of an article.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

store_embeddings_query = """
UNWIND $data as row
MATCH (a:Article{id:row.article_id})
SET a.embeddings = row.embeddings
"""
input_embeddings = []
with driver.session() as session:
    for i, row in articles_df.iterrows():
        # Tokenize title + abstract, truncated to the 512-token model limit
        inputs = tokenizer(row['clean_title'] + ' ' + row['clean_abstracts'], return_tensors="pt", truncation=True)
        # Inference only, so we skip gradient tracking
        with torch.no_grad():
            outputs = model(**inputs)
        # outputs[1] is the pooled output of the model (one vector per article)
        embeddings = outputs[1].numpy().tolist()[0]
        input_embeddings.append({'embeddings':embeddings, 'article_id': row['id']})

        # Write to Neo4j in batches of 1000 articles
        if len(input_embeddings) % 1000 == 0:
            session.run(store_embeddings_query, {'data':input_embeddings})
            input_embeddings = []

    # Store the remaining articles of the last, partial batch
    session.run(store_embeddings_query, {'data':input_embeddings})

This process took around an hour on my laptop. With the embeddings stored in Neo4j, we can go ahead and infer the kNN similarity network using the new k-Nearest Neighbour algorithm available in the GDS library. As always, we first have to project the in-memory graph.

CALL gds.graph.create('article_similarity','Article','*', {nodeProperties:['embeddings']})

We want to be able to recommend at most ten similar articles, so we will choose the topK value of 10 in the configuration option for the kNN algorithm. We will store the results as SIMILAR relationships between articles, where the score property will indicate the cosine similarity between the embeddings of the two articles.

CALL gds.beta.knn.write('article_similarity', {nodeWeightProperty:'embeddings', 
writeProperty:'score', writeRelationshipType:'SIMILAR', topK:10})
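
To make the idea behind the similarity network concrete, here is a brute-force sketch in plain Python: it computes the cosine similarity between every pair of vectors and keeps the top k neighbours for each one. This is only an illustration on made-up two-dimensional vectors, not how the GDS procedure is implemented, and the function names are my own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, k):
    """For each id, return the k most similar other ids with their scores.

    `embeddings` maps an article id to its embedding vector.
    """
    result = {}
    for a, vec_a in embeddings.items():
        scored = [
            (b, cosine(vec_a, vec_b))
            for b, vec_b in embeddings.items()
            if b != a  # an article is never its own neighbour
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[a] = scored[:k]
    return result


# Toy vectors, purely illustrative
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
neighbours = top_k_similar(toy, k=1)
print(neighbours["A"][0][0])  # B
```

Comparing all pairs like this scales quadratically with the number of articles, which is why we leave the real computation to the graph algorithm.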

We can now observe the recommendations for a random article.

MATCH (a:Article{id:"1210.5482"})-[s:SIMILAR]->(similar)
RETURN a.title as original, similar.title as recommendation, s.score as score
ORDER BY score DESC
LIMIT 5

Results

| original | recommendation | score |
| --- | --- | --- |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | Evidence for collectivity in pp collisions at the LHC | 0.9850505489919138 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | Baryon and meson screening masses | 0.9813577205666131 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | High Precision Measurement of the Proton Elastic Form Factor Ratio μ_pG_E/G_M at low Q^2 | 0.9811153799905147 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | Direct Observation of Proton-Neutron Short-Range Correlation Dominance in Heavy Nuclei | 0.9808553706390096 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | In-medium minijet dissipation in Au+Au collisions at √(s_NN) = 130 and 200 GeV studied with charge-independent two-particle number fluctuations and correlations | 0.9795923167214423 |

And just like before, we can combine the similarity score and ArticleRank score to recommend more relevant articles.

MATCH (a:Article{id:"1210.5482"})-[s:SIMILAR]->(similar)
RETURN a.title as original, similar.title as recommendation,
s.score * similar.articlerank_citation as score
ORDER BY score DESC
LIMIT 5

Results

| original | recommendation | score |
| --- | --- | --- |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | Evidence for collectivity in pp collisions at the LHC | 0.55478818226193 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | High Precision Measurement of the Proton Elastic Form Factor Ratio μ_pG_E/G_M at low Q^2 | 0.5520610771884321 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | Beam energy dependence of the expansion dynamics in relativistic heavy ion collisions: Indications for the critical end point? | 0.209210956688751 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | Direct Observation of Proton-Neutron Short-Range Correlation Dominance in Heavy Nuclei | 0.17378764875407884 |
| Observation of long-range near-side angular correlations in proton-lead collisions at the LHC | In-medium minijet dissipation in Au+Au collisions at √(s_NN) = 130 and 200 GeV studied with charge-independent two-particle number fluctuations and correlations | 0.16395744897399406 |
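
The effect of multiplying the two scores is easiest to see on a tiny example. The sketch below re-ranks three candidates by similarity times ArticleRank. The titles and numbers are invented for illustration only and do not come from the dataset, and the rerank helper is hypothetical.

```python
def rerank(candidates):
    """Sort candidates by similarity * articlerank, highest first."""
    return sorted(
        candidates,
        key=lambda c: c["similarity"] * c["articlerank"],
        reverse=True,
    )


# Made-up candidates: paper A is the most similar but barely cited
candidates = [
    {"title": "paper A", "similarity": 0.98, "articlerank": 0.2},
    {"title": "paper B", "similarity": 0.95, "articlerank": 0.6},
    {"title": "paper C", "similarity": 0.90, "articlerank": 0.3},
]

ranked_titles = [c["title"] for c in rerank(candidates)]
print(ranked_titles)  # ['paper B', 'paper C', 'paper A']
```

Since the similarity scores of the top neighbours are all very close to each other, the ArticleRank term tends to dominate the final ordering, which is something to keep in mind when tuning the recommendations.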

Conclusion

I also wanted to extract scientific concepts using any available pre-trained NLP models, but unfortunately, I did not get any presentable results. I would have to analyze one of the biomedical categories, as there are about a thousand pre-trained biomedical NLP models and very few for other categories. From studying the Nuclear Experiments category of articles, I learned that collaboration might be the way to go. I am very open to collaborating with an NLP practitioner to train a model that extracts scientific concepts.

I encourage you to try Neo4j and load your data as a graph, which will allow you to find network insights you might have missed until now. If you have any questions along the way, you can join the Neo4j community, where friendly people can help you out.

As always, the code is available on GitHub.