QA-with-SBERT-for-CORD19


Q&A with SBERT NN to CORD-19

I developed a document retrieval system that returns the titles of scientific papers containing the answer to a given user question. I used the first version of the COVID-19 Open Research Dataset (CORD-19).

Notebook viewer

‼️ Because of memory restrictions, GitHub and browsers can’t always open large Jupyter notebooks. For this reason, every notebook is linked to the ✔️ Jupyter nbviewer ✔️ in the following table. If you have any problems opening the notebooks, follow the links.

| Notebook | Link to Jupyter nbviewer | Link to Colab |
|---|---|---|
| SBERT_CORD19_Preprocess.ipynb | nbviewer | Open In Colab |
| SBERT_CORD19_QA_CrossEncoders.ipynb | nbviewer | Open In Colab |
| SBERT_CORD19_QA_Doc2Vec.ipynb | nbviewer | Open In Colab |
| SBERT_CORD19_QA_InferSent.ipynb | nbviewer | Open In Colab |
| SBERT_CORD19_QA_Roberta.ipynb | nbviewer | Open In Colab |

CORD-19

Articles in the folder comm_use_subset.

Question & Answer examples

| Question examples | Possible answers |
|---|---|
| What are the coronaviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. |
| What was discovered in Wuhan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. |
| What is Coronavirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. |
| What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. |
| What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. |
| How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. |
| Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. |
| How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. |

This repository consists of 5 notebooks

First, I needed to pre-process the CORD-19 (first version) dataset, which consists of many papers focused on the COVID-19 pandemic and disease. The dataset provides a lot of information for every paper, so I had to choose what to use and what to leave out. I decided to use the text corpus of each paper, with the following pre-processing:

Data storage

I use one large dictionary that maps sentence → (paper_id, paper_title). If a sentence appears in more than one paper, that is not a problem, because the question is still answered by at least one paper. I store this dictionary as a .pickle file and load it every time I need it.
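A minimal sketch of this storage scheme, assuming each pre-processed paper is available as a record with a `paper_id`, a `title` and a list of sentences. The record layout, the helper names and the file name `sentence_index.pickle` are hypothetical and are not taken from the notebooks.

```python
import os
import pickle
import tempfile


def build_sentence_index(papers):
    """Map each sentence to the (paper_id, paper_title) it came from."""
    index = {}
    for paper in papers:
        for sentence in paper["sentences"]:
            # A sentence shared by several papers keeps the last paper seen;
            # one source paper is enough to answer the question.
            index[sentence] = (paper["paper_id"], paper["title"])
    return index


def save_index(index, path):
    """Write the dictionary to a .pickle file."""
    with open(path, "wb") as f:
        pickle.dump(index, f)


def load_index(path):
    """Read the dictionary back from a .pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy records standing in for pre-processed CORD-19 papers (hypothetical).
papers = [
    {
        "paper_id": "paper-a",
        "title": "Title A (hypothetical)",
        "sentences": ["COVID-19 is a viral respiratory illness."],
    },
    {
        "paper_id": "paper-b",
        "title": "Title B (hypothetical)",
        "sentences": ["The new coronavirus was reported to spread via droplets."],
    },
]

index = build_sentence_index(papers)
path = os.path.join(tempfile.gettempdir(), "sentence_index.pickle")
save_index(index, path)
restored = load_index(path)
```

Only load pickle files that you created yourself, since unpickling data from an untrusted source can execute arbitrary code.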

Embeddings

For these two tasks, I had to build multiple QA models and compare their performance across different ways of creating the embeddings. I implemented the following embedding approaches:
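Whichever encoder produces the embeddings, the retrieval step has roughly the same shape: embed the question, embed every stored sentence, rank by similarity and return the titles of the best matches. The sketch below illustrates that shape only. `toy_embed` is a bag-of-words stand-in for a real sentence encoder, and the paper ids and titles are hypothetical; none of this is the notebooks' actual code.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_titles(question, index, embed=toy_embed, k=3):
    """Return the k best (score, sentence, title) triples for a question."""
    q = embed(question)
    scored = [
        (cosine(q, embed(sentence)), sentence, title)
        for sentence, (_paper_id, title) in index.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# sentence -> (paper_id, paper_title); ids and titles are hypothetical.
index = {
    "COVID-19 is a viral respiratory illness caused by a new coronavirus "
    "called SARS-CoV-2.": ("paper-a", "Title A (hypothetical)"),
    "The new coronavirus was reported to spread via droplets, contact and "
    "natural aerosols from human-to-human.": ("paper-b", "Title B (hypothetical)"),
}

results = top_k_titles("How does coronavirus spread?", index, k=1)
```

Swapping `toy_embed` for a function that calls a trained model leaves the ranking logic unchanged, which is what makes the different embedding approaches directly comparable.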

Model comparison

Finally, I selected the best model from my experiments based on the following criteria:

I conclude that Sentence Transformer in combination with Cross Encoders is the best model, as it returns the best answers. The time needed to create the embeddings is approximately the same across the models. (All of these remarks apply to the 6000 papers.)
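Combining a sentence encoder with a cross-encoder is usually arranged as two stages: a cheap similarity score shortlists candidates, and a more expensive pair scorer re-ranks only that shortlist. The sketch below shows this arrangement under loud assumptions: `overlap_score` and `jaccard_score` are toy stand-ins for the embedding similarity and the cross-encoder, and the function names and parameters are hypothetical rather than taken from the notebooks.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question, sentence):
    """Cheap first-stage score (stand-in for embedding cosine similarity)."""
    return len(tokens(question) & tokens(sentence))


def jaccard_score(question, sentence):
    """Second-stage pair score (stand-in for a cross-encoder)."""
    q, s = tokens(question), tokens(sentence)
    return len(q & s) / len(q | s) if q | s else 0.0


def retrieve_then_rerank(question, sentences, k_retrieve=2, k_final=1,
                         first_stage=overlap_score, second_stage=jaccard_score):
    """Shortlist with the first-stage score, then re-rank the shortlist."""
    shortlist = sorted(
        sentences, key=lambda s: first_stage(question, s), reverse=True
    )[:k_retrieve]
    reranked = sorted(
        shortlist, key=lambda s: second_stage(question, s), reverse=True
    )
    return reranked[:k_final]


s1 = ("COVID-19 is a viral respiratory illness caused by a new coronavirus "
      "called SARS-CoV-2.")
s2 = ("Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents "
      "the causative agent of a potentially fatal disease that is of great "
      "global public health concern.")
s3 = ("The new coronavirus was reported to spread via droplets, contact and "
      "natural aerosols from human-to-human.")

best = retrieve_then_rerank("What is COVID-19?", [s2, s1, s3])
```

In this toy run the first stage scores `s1` and `s2` equally, and the second stage is what separates them. The same division of labour keeps the expensive scorer away from the full sentence dictionary.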


© Konstantinos Nikoletos 2020 - 2021