Extracting Bibliographic References from Grey Literature

Techniques:

Entity extraction, content categorization, content clustering

Client Website:

https://wellcome.org/

Objectives

The aim of this project was to allow The Wellcome Trust to monitor where academic research funded by its grants was being cited in policy documents.

Problem

Extracting bibliographic references from non-academic literature (so-called ‘grey’ literature) sounds like a simple task, but it becomes difficult when documents don’t follow a standard referencing format or use complex formatting. Nonetheless, funders such as the Wellcome Trust want to track when scientific research that they have funded appears in policy documents (such as those published by national governments, intergovernmental bodies, and NGOs), as this can be evidence that their funding is having an impact on policy making.

Solution

A number of solutions exist for extracting bibliographic references from documents, but few well-developed examples make use of the most recent advances in deep neural networks and natural language processing. After a literature review, we took a multitask Recurrent Neural Network (RNN) described in the literature and adapted it for the policy document use case.
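To illustrate what a multitask tagger can produce, here is a minimal Python sketch of a post-processing step. It assumes a model that emits two tags per token: one marking reference boundaries and one marking the component within a reference. The tag names (`B-REF`, `I-REF`, `O`, `author`, `year`, `title`), the function `group_references`, and the sample tokens are all hypothetical, chosen for illustration only; they are not the output format of the model used in this project.

```python
from typing import Dict, List


def group_references(
    tokens: List[str],
    span_tags: List[str],
    component_tags: List[str],
) -> List[Dict[str, str]]:
    """Group tokens into structured references.

    span_tags (hypothetical): "B-REF" starts a reference, "I-REF"
    continues it, "O" is outside any reference.
    component_tags (hypothetical): e.g. "author", "year", "title",
    or "O" for tokens that belong to no component.
    """
    references: List[Dict[str, List[str]]] = []
    current = None
    for token, span, comp in zip(tokens, span_tags, component_tags):
        if span == "B-REF":
            # Start a new reference.
            current = {}
            references.append(current)
        elif span == "O":
            # Leave the current reference, if any.
            current = None
        # "I-REF" keeps appending to the current reference.
        if current is not None and comp != "O":
            current.setdefault(comp, []).append(token)
    # Join the tokens of each component into a single string.
    return [{k: " ".join(v) for k, v in ref.items()} for ref in references]


# Made-up example input, standing in for tagged text from a policy document.
tokens = ["See", "Smith", "J.", "2015", "Malaria", "policy", ".",
          "Jones", "A.", "2018", "Vaccine", "uptake"]
span_tags = ["O", "B-REF", "I-REF", "I-REF", "I-REF", "I-REF", "O",
             "B-REF", "I-REF", "I-REF", "I-REF", "I-REF"]
component_tags = ["O", "author", "author", "year", "title", "title", "O",
                  "author", "author", "year", "title", "title"]

refs = group_references(tokens, span_tags, component_tags)
print(refs)
```

Predicting both tag sequences from a shared network is one way to read "multitask" here: the boundary task and the component task can share what they learn about citation text, which may help when labelled data is limited.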

Impact

We found that the neural model performed well even when trained on a relatively modest dataset that we labelled manually. The model has now been incorporated as a key component of Wellcome’s Reach product, which allows academics to track where their work has been cited and allows funders to track the impact of their grants.
