Extracting Bibliographic References from Grey Literature
Objectives
The aim of this project was to allow the Wellcome Trust to monitor where academic research funded by its grants is cited in policy documents.
Problem
Extracting bibliographic references from non-academic literature (so-called ‘grey’ literature) should be a fairly simple task, but it becomes difficult when documents do not follow a standard referencing format or use complex formatting. Nonetheless, funders such as the Wellcome Trust want to track when scientific research that they have funded appears in policy documents (such as those published by national governments, intergovernmental bodies, and NGOs), as this can be evidence that their funding is having an impact on policy making.
Solution
A number of solutions exist for extracting bibliographic references from documents; however, there are few well-developed examples that make use of the most recent advances in deep neural networks and natural language processing. After a literature review, we took a multitask Recurrent Neural Network (RNN) described in the literature and adapted it for the policy document use case.
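To make the multitask idea more concrete, here is a minimal Python sketch of how the outputs of such a model might be consumed. It is not the project's actual code. It assumes, purely for illustration, that one task labels where each reference starts and continues, and a second task labels which component each token belongs to. The label names ("B", "I", "O", "author", "year", "title"), the function group_references, and the example tokens are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical label sets, chosen for illustration only.
# Task 1 (splitting): "B" = first token of a reference,
#                     "I" = inside a reference, "O" = outside any reference.
# Task 2 (parsing):   the component each token belongs to, e.g. "author".

Reference = List[Tuple[str, str]]  # list of (token, component label) pairs


def group_references(
    tokens: List[str],
    split_labels: List[str],
    part_labels: List[str],
) -> List[Reference]:
    """Group tokens into references using the split labels, keeping
    each token's component label alongside it."""
    references: List[Reference] = []
    current: Reference = []
    for token, split, part in zip(tokens, split_labels, part_labels):
        if split == "B":
            # A new reference starts: close any reference still open.
            if current:
                references.append(current)
            current = [(token, part)]
        elif split == "I" and current:
            current.append((token, part))
        else:
            # "O", or an "I" with no open reference: close and skip.
            if current:
                references.append(current)
            current = []
    if current:
        references.append(current)
    return references


# Hand-written labels standing in for model predictions.
tokens = ["See", "Smith", "2019", "Malaria", "trends",
          "Jones", "2020", "Vaccine", "uptake"]
split_labels = ["O", "B", "I", "I", "I", "B", "I", "I", "I"]
part_labels = ["o", "author", "year", "title", "title",
               "author", "year", "title", "title"]

refs = group_references(tokens, split_labels, part_labels)
for ref in refs:
    print(ref)
```

Framing extraction as per-token labelling, rather than pattern matching, is one way to cope with documents that do not follow a standard referencing format, since the labels are predicted from context rather than from fixed punctuation rules.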
Impact
We found that the neural model performed well even when trained on a relatively modest dataset that we labelled manually. The model has now been incorporated as a key component of Wellcome’s Reach product, which allows academics to track where their work has been cited and allows funders to track the impact of their grants.