Extracting references to authors and organizations from policy documents
Objectives
Overton needed a way to extract the authors of academic publications, and their affiliated organizations, from policy documents.
Problem
Extracting entities such as people and organization names from a document is a well-studied problem in natural language processing (NLP), generally referred to as Named Entity Recognition (NER). There are a number of pre-trained, open-source models that can be used to get started, and these generally perform reasonably well on common problems. Such models are typically trained on a generic corpus of documents, for example Wikipedia or collections of news articles. Achieving high performance, however, usually requires training a ‘domain-specific’ model on a corpus of documents that matches those we expect to send to the model once it is deployed.
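As a rough illustration of what an off-the-shelf NER model returns, here is a minimal sketch using spaCy’s small pretrained English pipeline (spaCy is the library used later in this project); the example sentence is invented and the exact spans the model predicts are not guaranteed.

```python
import spacy

# A general-purpose pretrained English pipeline, trained on web and news text
# rather than policy documents (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("The report cites work by Prof. Ada Lovelace at the University of Oxford.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# A pipeline like this will typically tag "Ada Lovelace" as PERSON and
# "the University of Oxford" as ORG, but the predictions are not guaranteed.
```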
Solution
For this project we chose spaCy as a starting point because it offers pretrained models for people, organizations, and other entities with near state-of-the-art performance 🚀 whilst being extremely fast 🏎. Our first step was to annotate some data in the policy domain so that we could evaluate the performance of both off-the-shelf models and any custom models we later trained. We used Prodigy, an annotation tool from the team behind spaCy, to annotate examples from policy documents provided by Overton. Apart from its excellent user interface, Prodigy integrates well with spaCy and speeds up the annotation process by prioritizing ambiguous examples for annotation. Prodigy continually trains an underlying machine learning model to refine what it considers ‘ambiguous’ as it receives new annotations from a human annotator, a process known as ‘active learning’.
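Prodigy stores annotations in its own datasets and can export them as JSONL records containing the text and the accepted entity spans. The sketch below shows one way such annotations could be converted into spaCy’s binary format for training and evaluation; the file name policy_annotations.jsonl is a hypothetical export (for example produced with Prodigy’s db-out command), and the record fields assumed here follow Prodigy’s usual span format.

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")          # tokenizer only; no statistical components needed here
doc_bin = DocBin()

# Hypothetical export of annotations, e.g. from `prodigy db-out <dataset>`
with open("policy_annotations.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue                             # keep only accepted annotations
        doc = nlp.make_doc(record["text"])
        spans = [
            doc.char_span(span["start"], span["end"], label=span["label"])
            for span in record.get("spans", [])
        ]
        doc.ents = [span for span in spans if span is not None]
        doc_bin.add(doc)

doc_bin.to_disk("./eval.spacy")                  # gold-standard docs for evaluation
```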
Once we had a sufficiently large dataset for evaluation, we evaluated the performance of the pretrained models that spaCy provides to understand how well they “transfer” to the policy domain. These models proved to be correct about 48% of the time, far below the 90%+ you would expect for this kind of problem.
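A sketch of how such an evaluation can be run with spaCy’s Scorer, assuming the gold-standard annotations have been saved to a DocBin file like the eval.spacy file in the previous sketch:

```python
import spacy
from spacy.scorer import Scorer
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")               # off-the-shelf pretrained pipeline

# Gold-standard docs saved earlier (file name assumed from the sketch above)
gold_docs = list(DocBin().from_disk("./eval.spacy").get_docs(nlp.vocab))

examples = []
for gold in gold_docs:
    predicted = nlp(gold.text)                   # model predictions on the raw text
    examples.append(Example(predicted, gold))    # pair predictions with gold annotations

scores = Scorer().score(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])   # precision / recall / F1
print(scores["ents_per_type"])                   # per-label breakdown (PERSON, ORG, ...)
```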
To develop our own custom model specific to the policy document domain, we took an iterative approach, annotating batches of 1k sentences selected from policy documents and prioritized according to ambiguity. After each batch of 1k sentences, we trained a new model and evaluated it. After annotating around 10k sentences we had a model that correctly identified authors and organizations about 85% of the time. This was good enough for Overton to test the model with its users in production.
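As a minimal sketch of what one such training round might look like with spaCy’s Python API (in practice the spacy train CLI with a config file does the same job), with an invented training sentence and entity offsets for illustration:

```python
import random

import spacy
from spacy.training import Example

# Illustrative annotated data: (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    ("Dr. Jane Smith of the World Health Organization co-authored the study.",
     {"entities": [(4, 14, "PERSON"), (22, 47, "ORG")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("PERSON", "ORG"):
    ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)

nlp.to_disk("./policy_ner_model")                # snapshot to evaluate before the next batch
```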
Impact
Overton was able to incorporate the new model into their platform easily thanks to the simple packaging options that spaCy provides. The problem of too many false positives was resolved, and more importantly, the new approach gave them a better way to scale by continuing to iterate on the model with more annotated data.
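For illustration, once a trained pipeline has been packaged and installed (for example with spaCy’s spacy package command), it can be loaded by name like any other spaCy model; the package name en_policy_ner below is hypothetical.

```python
import spacy

# Hypothetical package name for the custom policy-domain pipeline,
# e.g. built with `python -m spacy package ./policy_ner_model ./packages`
# and installed into the platform's environment with pip.
nlp = spacy.load("en_policy_ner")

doc = nlp("The briefing cites Dr. Maria Garcia of the OECD.")
authors = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
```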