I developed a document retrieval system that returns the titles of scientific papers containing the answer to a given user question, using the first version of the COVID-19 Open Research Dataset (CORD-19).
‼️ Because of memory restrictions, GitHub and web browsers cannot always open large Jupyter notebooks. For this reason, every notebook is also linked through ✔️ Jupyter nbviewer ✔️ in the table below. If you have any problems opening the notebooks, follow those links.
Articles are in the folder `comm_use_subset`.
Question examples | Possible Answers |
---|---|
What are the coronaviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. |
What was discovered in Wuhan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. |
What is Coronavirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. |
What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. |
What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. |
How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. |
Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. |
How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. |
This repository consists of 5 notebooks.
First, I needed to pre-process the CORD-19 dataset (first version), which consists of multiple papers focused on the COVID-19 pandemic and disease. The dataset provides a lot of information for every paper, so I had to choose what to use and what to leave out. I decided to use the body text (corpus) of each paper, after applying the following pre-processing:
Data storage: I use a large dictionary that maps sentence → (paper_id, paper_title). If a sentence belongs to more than one paper, this is not a problem, since the question is still answered by at least one paper. The dictionary is stored as a `.pickle` file and loaded whenever it is needed (a minimal sketch of this step follows).
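Below is a minimal sketch of how such a sentence → (paper_id, paper_title) dictionary could be built and pickled. The folder/file names and the use of NLTK's sentence tokenizer are assumptions for illustration; the exact pre-processing steps are defined in the pre-processing notebook.

```python
import json
import os
import pickle

from nltk.tokenize import sent_tokenize  # assumes nltk and its "punkt" data are installed

# Hypothetical paths, for illustration only
CORPUS_DIR = "comm_use_subset"
OUTPUT_PICKLE = "sentence_to_paper.pickle"

sentence_to_paper = {}  # sentence -> (paper_id, paper_title)

for filename in os.listdir(CORPUS_DIR):
    if not filename.endswith(".json"):
        continue
    with open(os.path.join(CORPUS_DIR, filename), encoding="utf-8") as f:
        paper = json.load(f)

    paper_id = paper["paper_id"]
    paper_title = paper["metadata"]["title"]

    # CORD-19 full-text JSON stores the body as a list of paragraphs under "body_text"
    for paragraph in paper.get("body_text", []):
        for sentence in sent_tokenize(paragraph["text"]):
            # If a sentence appears in several papers, keeping a single
            # (paper_id, title) pair is enough: the question is still
            # answered by at least one paper.
            sentence_to_paper[sentence] = (paper_id, paper_title)

with open(OUTPUT_PICKLE, "wb") as f:
    pickle.dump(sentence_to_paper, f)
```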
For these two tasks I built multiple QA models and compared their performance across different ways of creating the embeddings. I implemented the following embedding approaches (a retrieval sketch follows the list):
SBERT_CORD19_QA_Roberta.ipynb
SBERT_CORD19_QA_CrossEncoders.ipynb
SBERT_CORD19_QA_InferSent.ipynb
SBERT_CORD19_QA_Doc2Vec.ipynb
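As an illustration of the best-performing combination reported below (a Sentence Transformer bi-encoder for retrieval plus a Cross-Encoder for re-ranking), here is a minimal sketch using the `sentence-transformers` library. The specific model names and the top-k values are assumptions; the notebooks define the actual configuration.

```python
import pickle

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Load the sentence -> (paper_id, paper_title) dictionary built during pre-processing
with open("sentence_to_paper.pickle", "rb") as f:
    sentence_to_paper = pickle.load(f)
sentences = list(sentence_to_paper.keys())

# Bi-encoder: embed all corpus sentences once (model name is an assumption)
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_embeddings = bi_encoder.encode(sentences, convert_to_tensor=True, show_progress_bar=True)

# Cross-encoder: re-ranks the bi-encoder's candidates (model name is an assumption)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def answer(question, top_k=32, top_n=5):
    """Return the top_n (paper title, answer sentence) pairs for a question."""
    question_embedding = bi_encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)[0]

    # Re-rank the retrieved sentences with the cross-encoder
    candidates = [sentences[hit["corpus_id"]] for hit in hits]
    scores = cross_encoder.predict([(question, sentence) for sentence in candidates])

    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [(sentence_to_paper[sentence][1], sentence) for _, sentence in ranked[:top_n]]

print(answer("What is COVID-19?"))
```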
Finally, comparing the models on answer quality and on the time needed to create the embeddings, I conclude that the Sentence Transformer combined with Cross-Encoders is the best model, as it returns the best answers; the embedding-creation time is approximately the same across all models. (All of these remarks refer to the 6,000 papers.)
© Konstantinos Nikoletos | 2020 - 2021 |