

Overton is a platform that collates policy documents, guidelines, and think tank publications and extracts key information such as references, author names, and affiliated organizations. Their initial approach to this problem was to exactly match author name and organization pairs against a manually curated “dictionary”. This approach worked, but had two problems: a large number of false positives, and the need to constantly update the “dictionary” with new names and organizations. As such, Overton wanted to experiment with developing a statistical model that could automate the extraction process while reducing false positives.
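To make the dictionary approach concrete, here is a minimal, hypothetical sketch of exact matching against curated (author, organization) pairs; the names, text, and helper function are invented for illustration, and it shows why co-occurrence alone produces false positives:

```python
# Hypothetical curated "dictionary" of (author, organization) pairs.
CURATED_PAIRS = {
    ("J. Smith", "World Health Organization"),
    ("A. Jones", "Brookings Institution"),
}

def extract_pairs(text):
    """Return every curated pair whose author and organization
    both appear verbatim in the text."""
    return [
        (author, org)
        for author, org in sorted(CURATED_PAIRS)
        if author in text and org in text
    ]

# A false positive: both strings co-occur, but J. Smith is not
# the author of this document and the WHO is not their affiliation.
text = ("This report cites earlier work by J. Smith and discusses "
        "funding from the World Health Organization.")
matches = extract_pairs(text)
```

Because matching is purely lexical, any document that merely mentions both strings triggers a hit, which is one source of the false positives described above.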


Extracting entities such as people and organization names from a document is a well-studied problem in natural language processing (NLP), generally referred to as Named Entity Recognition (NER). There are a number of pretrained, open source models that can be used to get started, and these generally perform reasonably well on common problems. Such models are typically trained on a generic corpus of documents, for example Wikipedia or collections of news articles. Achieving high performance, however, usually requires training a ‘domain-specific’ model on a corpus of documents that matches those we expect to send to the model once deployed.

For this project we chose to use spaCy as a starting point because it offers pretrained models for people, organizations, and other entities with near state-of-the-art performance, 🚀 whilst being extremely fast 🏎. Our first step was to annotate some data in the policy domain so that we could evaluate the performance of both off-the-shelf models and any custom models that we later trained. We used Prodigy, an annotation tool from the same team behind spaCy, to annotate examples from policy documents provided by Overton. Apart from the excellent user interface, Prodigy integrates well with spaCy, and speeds up the annotation process by prioritizing ambiguous examples for annotation. Prodigy continually trains an underlying machine learning model to refine what it considers ‘ambiguous’ as it receives new annotations from a human annotator. This process is called ‘active learning’.
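The prioritization idea behind active learning can be illustrated with a toy uncertainty-sampling sketch; this is not Prodigy's actual implementation, and the sentences and probabilities below are invented:

```python
# Uncertainty sampling: annotate the examples the model is least sure
# about first. Each candidate maps to a made-up model probability that
# the sentence contains an entity of interest.
candidates = {
    "Dr. Smith joined the WHO.": 0.51,   # model unsure -> high priority
    "Barack Obama gave a speech.": 0.98, # confident -> low priority
    "The table has four legs.": 0.03,    # confident -> low priority
}

def by_ambiguity(scored):
    """Sort examples so the most ambiguous ones (probability closest
    to 0.5) come first in the annotation queue."""
    return sorted(scored, key=lambda text: abs(scored[text] - 0.5))

queue = by_ambiguity(candidates)
```

As the human annotates and the underlying model is retrained, these scores shift, so the queue keeps surfacing whatever the current model finds ambiguous.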

Once we had a sufficiently large dataset for evaluation, we evaluated the performance of the pretrained models that spaCy provides to understand how well they “transfer” to the policy domain. These models proved to be correct only about 48% of the time, well below the 90%+ that we would expect for such problems.
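A common way to score NER output, and a reasonable reading of “correct” here, is span-level matching: a predicted entity counts only if its boundaries and label both exactly match the gold annotation. A minimal sketch, with invented spans:

```python
# Span-level evaluation over (start, end, label) entity spans.
def prf(gold, predicted):
    """Return (precision, recall, f1) for sets of entity spans."""
    tp = len(gold & predicted)  # exact boundary + label matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 10, "PERSON"), (23, 26, "ORG")}
predicted = {(0, 10, "PERSON"), (23, 26, "GPE")}  # wrong label on one span

p, r, f = prf(gold, predicted)  # p == r == f == 0.5
```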

To develop our own custom model specific to the policy document domain, we took an iterative approach, annotating batches of 1k sentences selected from policy documents and prioritized by ambiguity. After each batch of 1k sentences, we trained a new model and evaluated it. After annotating around 10k sentences we had a model that correctly identified authors and organizations about 85% of the time. This was good enough for Overton to test the model with its users in production.
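In spaCy v3, annotated sentences are typically converted into the library's binary `DocBin` format before running `spacy train`. A minimal sketch, assuming annotations stored as character-offset spans; the sentence and offsets are invented:

```python
import spacy
from spacy.tokens import DocBin

# A blank pipeline provides the tokenizer; no model download needed.
nlp = spacy.blank("en")

# Invented example: (text, [(start_char, end_char, label), ...])
annotations = [
    ("Jane Smith chaired the WHO panel.",
     [(0, 10, "PERSON"), (23, 26, "ORG")]),
]

db = DocBin()
for text, spans in annotations:
    doc = nlp(text)
    ents = [doc.char_span(start, end, label=label)
            for start, end, label in spans]
    # char_span returns None for spans that don't align with tokens.
    doc.ents = [e for e in ents if e is not None]
    db.add(doc)

# db.to_disk("train.spacy") would produce the file passed to `spacy train`.
data = db.to_bytes()
```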


Overton was able to incorporate this new model into their platform easily thanks to the simple packaging options that spaCy provides. The problem of too many false positives was resolved, and, more importantly, the new approach gave them a clear path to scale: continuing to iterate on the model with more annotated data.

Extracting bibliographic references from grey literature


Are you interested in working with us?