COVID-19 preprint network

Every node in the network is a preprint, and a link is drawn between preprints based on the semantic similarity of their abstracts.

Data acquisition

The underlying dataset is based on a curated list of COVID-19 related preprints from medRxiv and bioRxiv.

Text preprocessing

Before the network is created, the abstracts are preprocessed with spaCy. English stopwords and punctuation are removed, along with frequently occurring trivial words such as "covid". The abstracts are then lemmatized, and lemmas that are not nouns, proper nouns, adjectives, or verbs are removed from the analysis.
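The filtering steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name extract_lemmas, the TRIVIAL_WORDS set (the text names only "covid" as an example), and the choice to return a set of unique lower-cased lemmas are all assumptions. To keep the sketch runnable without spaCy installed, it uses a small stand-in Token type that carries the same attribute names as a spaCy token (text, lemma_, pos_, is_stop, is_punct).

```python
from collections import namedtuple

# Stand-in for a spaCy token, used only so this sketch runs without spaCy.
# A real spaCy Token exposes the same attribute names.
Token = namedtuple("Token", "text lemma_ pos_ is_stop is_punct")

# Assumption: the source gives "covid" as one example of a trivial word;
# the full list used by the project is not known.
TRIVIAL_WORDS = {"covid"}

# Universal POS tags for nouns, proper nouns, adjectives, and verbs.
KEPT_POS = {"NOUN", "PROPN", "ADJ", "VERB"}


def extract_lemmas(tokens, trivial_words=TRIVIAL_WORDS):
    """Return the set of lemmas that survive the filtering steps."""
    lemmas = set()
    for tok in tokens:
        # Drop stopwords and punctuation.
        if tok.is_stop or tok.is_punct:
            continue
        # Drop frequently occurring trivial words.
        if tok.text.lower() in trivial_words:
            continue
        # Keep only nouns, proper nouns, adjectives, and verbs.
        if tok.pos_ not in KEPT_POS:
            continue
        lemmas.add(tok.lemma_.lower())
    return lemmas


# Hand-built tokens standing in for a parsed abstract.
example = [
    Token("The", "the", "DET", True, False),
    Token("covid", "covid", "NOUN", False, False),
    Token("vaccines", "vaccine", "NOUN", False, False),
    Token("rapidly", "rapidly", "ADV", False, False),
    Token("reduced", "reduce", "VERB", False, False),
    Token("severe", "severe", "ADJ", False, False),
    Token("outcomes", "outcome", "NOUN", False, False),
    Token(".", ".", "PUNCT", False, True),
]
lemmas = extract_lemmas(example)
print(sorted(lemmas))
```

With spaCy and an English model installed, the same function should accept a parsed document directly, for example extract_lemmas(nlp(abstract)) where nlp is a loaded English pipeline. Which model the project uses is not stated in the text.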

Network creation

Every node is an article. Links are drawn between nodes according to the following rule:

Let 𝑀𝑖,𝑗 be the number of lemmas shared by nodes 𝑖 and 𝑗.
A link is created between 𝑖 and 𝑗 if 𝑀𝑖,𝑗 > 𝑡𝑙.

The threshold 𝑡𝑙 was set to 16, a value chosen through exploration.
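The linking rule can be sketched in a few lines. This is an illustrative sketch, assuming each preprint is represented by a set of unique lemmas; the function name build_links and the toy data are hypothetical. Note that the comparison is strict, so two abstracts sharing exactly 16 lemmas are not linked.

```python
from itertools import combinations

T_L = 16  # threshold from the text


def build_links(lemma_sets, threshold=T_L):
    """Link every pair of nodes that shares more than `threshold` lemmas.

    lemma_sets maps a node id to the set of lemmas of its abstract.
    Returns a list of (i, j, shared_count) tuples.
    """
    links = []
    for (i, a), (j, b) in combinations(lemma_sets.items(), 2):
        shared = len(a & b)  # M_ij: number of lemmas in common
        if shared > threshold:
            links.append((i, j, shared))
    return links


# Toy data: placeholder "lemmas" w0, w1, ... chosen to hit the boundary case.
docs = {
    "A": {f"w{k}" for k in range(0, 20)},  # w0..w19
    "B": {f"w{k}" for k in range(3, 25)},  # w3..w24, shares 17 with A
    "C": {f"w{k}" for k in range(4, 26)},  # w4..w25, shares 16 with A
}
links = build_links(docs)
print(links)  # A-B and B-C are linked; A-C (exactly 16 shared) is not
```

Comparing all pairs is quadratic in the number of preprints. That is acceptable for a sketch, but a larger corpus might call for an inverted index from lemma to documents.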
The network is laid out with a force-directed algorithm, which places linked nodes closer to each other, using the force-graph library.
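The force-graph library takes its input as an object with a "nodes" list and a "links" list, where each link names a "source" and a "target" node id. As a hedged sketch of how the link list might be handed over to it, the helper below (to_force_graph, a hypothetical name) serializes nodes and links to that shape. The extra "value" field carrying the shared-lemma count is an illustrative assumption, not something the text describes.

```python
import json


def to_force_graph(node_ids, links):
    """Build a dict in the nodes/links shape that force-graph consumes."""
    return {
        "nodes": [{"id": n} for n in node_ids],
        "links": [
            # "value" is an illustrative extra attribute (shared-lemma count).
            {"source": s, "target": t, "value": w}
            for s, t, w in links
        ],
    }


graph = to_force_graph(["A", "B", "C"], [("A", "B", 17), ("B", "C", 21)])
graph_json = json.dumps(graph, indent=2)
print(graph_json)
```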


This project was created by members of the Odycceus project at the Max Planck Institute for Mathematics in the Sciences, Leipzig, with the help of Alexander Dejaco.

Please contact the developer, Armin Pournaki, if you have ideas for improvements or have found bugs: pournaki[at]