Extracting Complex Medical Information from PDF documents
Objectives
We worked with an NGO to develop an automated pipeline that extracts complex and varied medical information from PDF documents in a standardised way.
Problem
The World Health Organisation (WHO) publishes Target Product Profiles (TPPs), which describe the desired characteristics of a product aimed at a particular disease. A TPP may describe the acceptable characteristics for treatments, diagnostic tests, or medical devices. TPPs are also published by other stakeholders such as regulatory agencies, pharmaceutical companies, non-governmental organisations, and academic institutions.
We worked with a large non-profit organisation that wanted to develop a tool to automatically extract these desired characteristics from TPPs, which are typically PDF documents. TPPs can vary considerably in their content and structure, meaning that any solution needed to be highly flexible.
Each of these PDFs is a natural-language document, with information stored in a semi-structured way in a mix of tables and text, which makes any automatic parsing of the data within them difficult. Each TPP can also contain upwards of 66 different characteristics to extract.
Solution
We initially tackled the problem using Large Language Models (LLMs) like GPT-4, which are powerful tools for extracting structured information from unstructured natural language. The model performed well in some cases, but given the specialised nature of the task and the complexity of the data, there were also recurring problems with data being missed or not being output in a suitable structure.
We explored various methods to deal with this, including prompt tuning, focused Retrieval-Augmented Generation (RAG), which provides specific snippets of the TPP as context, and in-context learning, which provides examples of what the output should look like.
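As a rough illustration of how focused retrieval and in-context examples can be combined in one prompt, here is a minimal Python sketch. It is not the project's actual code: the function names, the keyword-count scoring, and the 'minimal'/'optimal' JSON keys are all assumptions made for the example, and no model is called.

```python
from typing import List, Tuple


def select_snippets(chunks: List[str], keywords: List[str],
                    max_snippets: int = 3) -> List[str]:
    """Hypothetical focused retrieval: keep the chunks that mention the
    keywords most often, so the prompt only carries relevant context."""
    scored = []
    for chunk in chunks:
        lowered = chunk.lower()
        score = sum(lowered.count(k.lower()) for k in keywords)
        if score > 0:
            scored.append((score, chunk))
    # Highest keyword count first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:max_snippets]]


def build_prompt(characteristic: str, snippets: List[str],
                 examples: List[Tuple[str, str]]) -> str:
    """Assemble a prompt with worked examples (in-context learning)
    followed by the retrieved snippets. The output keys are assumed."""
    lines = [
        f"Extract the '{characteristic}' characteristic from the context.",
        "Answer with JSON containing the keys 'minimal' and 'optimal'.",
        "",
    ]
    for example_context, example_answer in examples:
        lines += ["Example context:", example_context,
                  "Example answer:", example_answer, ""]
    lines.append("Context:")
    lines += snippets
    lines.append("Answer:")
    return "\n".join(lines)
```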
Each of these methods was useful, but we had the most success when we combined our LLM-based approach with a rule-based table extractor. Since the TPP information was primarily stored in tables, we could first go through the tables and extract any information from them, and then use the LLM as a fallback to capture any information that needed more flexibility.
By combining these approaches, we were able to make the overall solution much faster and more reliable.
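The table-first, LLM-fallback idea can be sketched as follows. This is an illustrative outline under stated assumptions rather than the delivered implementation: tables are assumed to arrive as lists of rows of cell strings (the shape returned by, for example, pdfplumber's `extract_tables()`), the first cell of each row is assumed to hold the characteristic name, and `llm_extract` is a hypothetical callable standing in for whatever model call is used.

```python
from typing import Callable, Dict, List, Optional

Table = List[List[Optional[str]]]


def normalise(label: str) -> str:
    """Lower-case and collapse whitespace so labels compare reliably."""
    return " ".join(label.lower().split())


def extract_from_tables(tables: List[Table],
                        wanted: List[str]) -> Dict[str, List[str]]:
    """Rule-based pass: match the first cell of each row against the
    wanted characteristic names and keep the remaining cells as values."""
    wanted_keys = {normalise(name): name for name in wanted}
    found: Dict[str, List[str]] = {}
    for table in tables:
        for row in table:
            cells = [c.strip() for c in row if c and c.strip()]
            if len(cells) < 2:
                continue  # no label/value pair in this row
            name = wanted_keys.get(normalise(cells[0]))
            if name is not None and name not in found:
                found[name] = cells[1:]
    return found


def extract_characteristics(
    tables: List[Table],
    full_text: str,
    wanted: List[str],
    llm_extract: Callable[[str, str], Optional[List[str]]],
) -> Dict[str, List[str]]:
    """Use the tables first; call the (hypothetical) LLM helper only for
    characteristics the rule-based pass could not find."""
    results = extract_from_tables(tables, wanted)
    for name in wanted:
        if name not in results:
            answer = llm_extract(name, full_text)
            if answer is not None:
                results[name] = answer
    return results
```

Because the model is only called for characteristics missing from the tables, the number of LLM requests per document shrinks, which is one plausible reason a combination like this runs faster and costs less than an LLM-only pass.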
Impact
The client now has an automated solution for extracting information from TPPs, which they can use to gain insights into the data, filter TPPs, and perform other downstream analysis.
The solution runs quickly and cheaply, so it can be applied to large datasets, and it can be integrated flexibly into their existing workflows without disruption.