Extracting bibliographic references from grey literature

Deep reference parser


Entity extraction, Content categorization, Content clustering


The Wellcome Trust wanted to understand whether research grants that it had funded were influencing the development of policy at the national or international level, to help it decide where best to allocate funding in future.


Extracting bibliographic references from non-academic literature (so-called ‘grey’ literature) sounds like a fairly simple task, but it becomes difficult when documents don’t follow a standard referencing format or use complex formatting. Nonetheless, funders such as the Wellcome Trust want to track when scientific research that they have funded appears in policy documents (such as those published by national governments, intergovernmental bodies, and NGOs), as this can be evidence that their funding is having an impact on policy making.


A number of solutions exist for extracting bibliographic references from documents; however, there are few well-developed examples that make use of the most recent advances in deep neural networks and natural language processing. After a literature review, we took a multitask recurrent neural network (RNN) described in the literature and adapted it for the policy document use case.
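To give a feel for what a multitask tagging model of this kind hands back, here is a minimal sketch of the post-processing step only. It assumes the model labels every token twice: once with a span tag marking where a reference begins and continues, and once with a component tag such as author, year or title. The tag names ("b-r", "i-r", "o"), the component names and the function group_references are hypothetical and chosen for illustration; they are not taken from the project itself.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-token model output: (token, span_tag, component_tag).
# Span tags (assumed): "b-r" begins a reference, "i-r" continues it,
# "o" marks a token outside any reference.
TaggedToken = Tuple[str, str, str]


def group_references(tagged: List[TaggedToken]) -> List[Dict[str, str]]:
    """Group tagged tokens into one dict of components per reference."""
    references: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None
    for token, span_tag, component in tagged:
        if span_tag == "b-r":
            # A new reference starts here.
            current = {}
            references.append(current)
        elif span_tag == "o":
            # Leaving the reference; ignore tokens until the next "b-r".
            current = None
        if current is not None and span_tag in ("b-r", "i-r"):
            current.setdefault(component, []).append(token)
    # Join the tokens of each component back into a single string.
    return [
        {name: " ".join(tokens) for name, tokens in ref.items()}
        for ref in references
    ]


example = [
    ("See", "o", "o"),
    ("Smith", "b-r", "author"),
    ("J.", "i-r", "author"),
    ("2019", "i-r", "year"),
    ("Health", "i-r", "title"),
    ("policy", "i-r", "title"),
    ("Jones", "b-r", "author"),
    ("2020", "i-r", "year"),
]
parsed = group_references(example)
```

Under these assumptions, the two tasks complement each other: the span tags split a page of text into individual references, and the component tags turn each reference into fields that can be matched against a database of publications.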


We found that the neural model performed well even when trained on a relatively modest dataset that we labelled manually. The model has now been incorporated as a key component of Wellcome’s Reach product, which allows academics to track where their work has been cited and allows funders to track the impact of their grants.

