CORD-19-ANN: Semantic Search Engine Using S-BERT

Sean Narenthiran
Apr 20, 2020


As part of its open research efforts, the Allen Institute released a data dump of scholarly articles as an initiative to aid efforts in tackling COVID-19. The dataset contains 51,000 articles as of the time of writing and is growing in size.

When searching the data, keyword search is likely to be effective; however, supplementing it with semantic sentence embeddings would provide valuable insight into the data, either through clustering or a semantic search engine. Semantic search does have its issues though, described nicely in this issue here.

The search engine is available on GitHub Pages here.

Efforts have already popped up, such as this and covidex, giving us a great starting point towards integrating language models for CORD-19 search. However, current approaches have usability limitations: they tend to be trained on paragraphs, or only search the abstracts and titles of the articles. Building on this, we worked towards training different sentence embedding language models on medical data, evaluating these models with appropriate benchmarks, and finally providing an efficient way to index the data.

I’ll start with a disclaimer: I am not a medical expert and have tried to take care not to overstep this boundary. I was privileged enough to get high-level feedback from members of the healthcare team at the company I work for, and I’m looking for more feedback from others. All code and models are available; please raise issues/suggestions at the CORD-19-ANN repo here!

Approach

In short, the high-level approach is as follows:

  1. Tokenise CORD-19 into sentences using SciSpacy
  2. Fine-tune BioBERT & BlueBERT using medNLI/SNLI/MultiNLI dataset via sentence-transformers
  3. Generate sentence embeddings for CORD-19 using S-BioBERT/S-BlueBERT
  4. Create search index using FAISS with reasonable compression
  5. Create visual clustering using UMAP

Tokenisation

For most search engines, the input query is a short span of text. The CORD-19 dataset is formatted as individual articles whose paragraphs have been split into sections. Paragraphs probably don’t model the input query well; sentences are a closer match. So, to format the raw CORD-19 paragraphs into sentences, I used SciSpacy’s tokeniser. With multiprocessing, tokenisation completes in reasonable time.
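As a rough sketch, the sentence splitting might look like the following (the SciSpacy model name and multiprocessing settings here are illustrative, not necessarily the repo’s exact choices):

```python
# Minimal sketch: splitting CORD-19 paragraphs into sentences with SciSpacy.
# Requires: pip install scispacy, plus the en_core_sci_sm model.
import spacy

nlp = spacy.load("en_core_sci_sm", disable=["ner"])  # keep the parser for sentence boundaries

def paragraphs_to_sentences(paragraphs):
    """Yield individual sentences from an iterable of paragraph strings."""
    # nlp.pipe batches documents; n_process enables multiprocessing
    for doc in nlp.pipe(paragraphs, n_process=4, batch_size=100):
        for sent in doc.sents:
            yield sent.text.strip()

sentences = list(paragraphs_to_sentences([
    "SARS-CoV-2 spreads rapidly. Transmission dynamics remain under study.",
]))
```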

Training Sentence Embedding Models

When evaluating a set of language models, I picked standard BERT as well as BioBERT and BlueBERT, BERT models trained on biomedical data. They show SOTA results across different medical tasks and should hopefully provide higher-quality sentence embeddings on the CORD-19 dataset. It is important to note that the latest pre-trained BlueBERT models are uncased only, whereas BioBERT is a cased model. It would be interesting to determine whether case plays an important part in the quality of embeddings for CORD-19, but that is saved for future work.

To fine-tune the sentence embedding models and generate embeddings, I used the UKPLab sentence-transformers repo. The current state of the repo made it difficult to try different language models that have appeared in the HuggingFace transformers repo, such as BlueBERT/BioBERT. Modifications had to be made to make the package language-model agnostic, as well as to expose the necessary flags for different LM configurations. The changes can be seen in a fork here.
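For illustration, NLI fine-tuning with sentence-transformers looks roughly like this (this uses the current upstream API rather than the fork, and the BioBERT checkpoint name is a placeholder):

```python
# Hedged sketch of NLI fine-tuning with sentence-transformers: a softmax
# classification head over sentence-pair embeddings.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

word_model = models.Transformer("dmis-lab/biobert-v1.1")  # placeholder HF checkpoint
pooling = models.Pooling(word_model.get_word_embedding_dimension())  # mean pooling
model = SentenceTransformer(modules=[word_model, pooling])

# Integer labels for the three NLI classes (entailment/neutral/contradiction)
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating."], label=0),
    # ... MedNLI / SNLI / MultiNLI pairs ...
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
```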

To fine-tune the language models, I paired the MedNLI clinical dataset with the SNLI/MultiNLI datasets. All provide English sentence pairs, but MedNLI’s are clinical, making them particularly valuable training data when fine-tuning the models. Results are below:

Test Results across MedNLI and the STS Benchmark

Above we see BioBERT/BlueBERT outperforming pre-trained BERT on MedNLI medical sentence pairs, albeit with a drop in accuracy on the STS benchmark, a general text similarity dataset. These models were then used to generate embeddings across the CORD-19 sentences, allowing us to carry out approximate nearest neighbour searches as well as clustering.
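Generating the embeddings is then a single call (the model path below is a placeholder for the fine-tuned S-BioBERT directory):

```python
# Minimal sketch: embedding CORD-19 sentences with a fine-tuned model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/s-biobert")  # placeholder path
sentences = ["The spike protein binds to the ACE2 receptor.",
             "Hand hygiene reduces transmission risk."]
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
# embeddings: (num_sentences, 768) float array for BERT-base models
```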

Using embeddings generated by BlueBERT and BioBERT, I inspected nearest neighbours by jumping a few steps ahead and creating a search index. I noticed that results were poorer with BlueBERT, particularly for abbreviations, which makes sense given that BlueBERT was trained on uncased text. As a result, I opted to use BioBERT embeddings downstream.

Creating The Search Index

When creating the index for similarity search, I opted for FAISS. Initial work was done using nmslib; however, given the flexibility of FAISS and its extensive documentation at different scales, I ended up migrating. The embeddings took around 20 GB of disk space, which would mean 20 GB of RAM with a Flat index. This was not viable for serving, so I investigated different FAISS index configurations, where query time and recall can be traded off against memory. Below is an analysis across a subset of the CORD-19 dataset calculating the 1-recall@R score described in this FAISS article: the recall of finding the ground-truth nearest neighbour within the first R nearest neighbours. In this case the ground truth came from the Flat index configuration.

Recall@R for FAISS Configurations

Most of the indices contain the ground-truth nearest neighbour within the top ten neighbours; however, there is some variability in the top result (Recall@1). To pick the appropriate compression, I also compared Recall@1 with the size of the index.

Recall@R/Index Size comparison for FAISS Configurations

Based on recall and memory restrictions, I opted for the configuration “PCAR128,SQ8” as a compromise between the two. I omitted HNSW from the above plot as its indices were significantly larger; however, it had the best search times by a significant margin. This would be useful in situations where search time should be optimised and memory is less of a restriction.
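As a sketch, building the chosen index and measuring 1-recall@R against the Flat ground truth looks roughly like this (vectors and sizes below are stand-ins, not the real CORD-19 embeddings):

```python
# Hedged sketch: build a compressed FAISS index ("PCAR128,SQ8") and compute
# 1-recall@R against the exact Flat index; data here is randomly generated.
import numpy as np
import faiss

d = 768
xb = np.random.rand(50_000, d).astype("float32")  # database embeddings (stand-in)
xq = np.random.rand(1_000, d).astype("float32")   # query embeddings (stand-in)

flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, 1)  # exact nearest neighbour per query

index = faiss.index_factory(d, "PCAR128,SQ8")  # PCA to 128 dims + 8-bit scalar quantiser
index.train(xb)  # learn the PCA projection and quantiser parameters
index.add(xb)

for R in (1, 10, 100):
    _, ids = index.search(xq, R)
    recall = (ids == gt).any(axis=1).mean()  # gt broadcasts against the top-R ids
    print(f"1-recall@{R}: {recall:.3f}")
```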

Finally, for visualisation, a frontend site was built on top of an HTTP search API, as seen at the top of this article. The API can be used independently of the website and may be preferable for certain tasks.
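Purely as an illustration, a programmatic query might look like the following (the endpoint URL and payload shape here are hypothetical, not the repo’s actual contract; see the CORD-19-ANN repo for the real API):

```python
# Hypothetical example only: the endpoint and JSON fields are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/search",  # hypothetical endpoint
    json={"query": "incubation period of SARS-CoV-2", "top_k": 10},
)
for hit in resp.json().get("results", []):
    print(hit)
```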

Clustering using UMAP

Clustering can be useful for qualitatively verifying that the embeddings work, as well as for gaining insight into the data. Using UMAP, I was able to cluster a sample of the data in a bokeh plot:
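A minimal sketch of the projection and plotting step (parameter values are illustrative, not the notebook’s exact settings):

```python
# Hedged sketch: project sentence embeddings to 2D with UMAP, plot with bokeh.
import numpy as np
import umap
from bokeh.plotting import figure, show

embeddings = np.random.rand(5_000, 768).astype("float32")  # stand-in sample

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine")
coords = reducer.fit_transform(embeddings)  # shape (5000, 2)

p = figure(title="CORD-19 sentence embeddings (UMAP)")
p.scatter(coords[:, 0], coords[:, 1], size=2, alpha=0.4)
show(p)
```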

The notebook can be run on Google Colab here.

Future Work

I plan to continue making improvements and taking feedback; all code/models can be seen at the CORD-19-ANN repo. The toolset is general enough to apply to any dataset that contains text and could provide value beyond CORD-19. If you’re interested in doing so, please leave an issue and I’m more than happy to help.

A few future endeavours I’d like to implement:

  • Automate ingestion to keep the index up to date as the dataset grows and more articles are added
  • Evaluate adding the COVIDQA training data when training sentence embedding models
  • Improve UI interaction based on user feedback

Acknowledgements

Thanks to @maheratashfaraz for helping me build the frontend as well as the authors of the repos highlighted above!
