Extracting bibliographic references from grey literature

Deep reference parser

CLIENT: Wellcome Trust

NLP: Content categorization, entity extraction

MODEL: CRF, BiLSTM


Objective

The Wellcome Trust needed to know when research it had funded was influencing policy development at the national or international level, to help it decide where best to allocate funding in future.

Problem

Extracting bibliographic references from non-academic literature (so-called ‘grey’ literature) should be a fairly simple task, but it becomes difficult when documents do not follow a standard referencing format or use complex formatting. Nonetheless, funders such as the Wellcome Trust want to track when scientific research they have funded appears in policy documents (such as those published by national governments, intergovernmental bodies, and NGOs), as this can be evidence that their funding is having an impact on policy making.

Solution

A number of solutions exist for extracting bibliographic references from documents, but few well-developed examples make use of the most recent advances in deep neural networks and natural language processing. After a literature review, we took a multitask recurrent neural network (RNN) described in the literature and adapted it for the policy document use case.
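To give a feel for what a multitask tagger's output might look like downstream, here is a minimal sketch in Python. It assumes the model emits two parallel label sequences per token: one marking where each reference starts and continues, and one naming the component each token belongs to. The label names ("B", "I", "O", "author", "year", "title") and the function group_references are illustrative assumptions, not the labels or API of the model described here.

```python
from typing import Dict, List, Optional


def group_references(
    tokens: List[str],
    split_labels: List[str],
    component_labels: List[str],
) -> List[Dict[str, str]]:
    """Group tokens into references using two parallel label sequences.

    split_labels uses a hypothetical scheme: "B" starts a reference,
    "I" continues it, and "O" marks tokens outside any reference.
    component_labels names the field each token belongs to (e.g. "author").
    """
    if not (len(tokens) == len(split_labels) == len(component_labels)):
        raise ValueError("tokens and label sequences must have equal length")

    references: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None

    for token, split, component in zip(tokens, split_labels, component_labels):
        if split == "O":
            # Token is not part of a reference: close any open one.
            if current is not None:
                references.append(current)
                current = None
            continue
        if split == "B" or current is None:
            # Start a new reference (also tolerates a stray "I" with no "B").
            if current is not None:
                references.append(current)
            current = {}
        current.setdefault(component, []).append(token)

    if current is not None:
        references.append(current)

    # Join the tokens of each component back into a single string.
    return [
        {field: " ".join(words) for field, words in ref.items()}
        for ref in references
    ]


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    tokens = ["See", "Smith", "J", "2015", "Malaria", "control",
              "Jones", "A", "2017", "Vaccine", "uptake"]
    split = ["O", "B", "I", "I", "I", "I", "B", "I", "I", "I", "I"]
    component = ["o", "author", "author", "year", "title", "title",
                 "author", "author", "year", "title", "title"]
    for reference in group_references(tokens, split, component):
        print(reference)
```

Predicting both label sequences from one shared network is what makes such a setup multitask; the grouping step above is plain post-processing and would work with any tagger that produces labels in this shape.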

Impact

We found that the neural model performed well even when trained on a relatively modest dataset that we labelled manually. The model has now been incorporated as a key component of Wellcome’s Reach product, which allows academics to track where their work has been cited and funders to track the impact of their grants.
