Abstract
In recent years, state-of-the-art clinical Natural Language Processing (NLP), as in other domains, has been dominated by neural networks and other statistical models. In contrast to the unstructured nature of Electronic Health Record (EHR) text, biomedical knowledge is increasingly available in structured and codified forms, underpinned by curated databases, machine-readable clinical guidelines, and logically defined terminologies. This thesis examines the incorporation of external medical knowledge into clinical NLP and tests these methods on a use case of ascertaining physical pain in clinical notes of mental health records.Pain is a common reason for accessing healthcare resources and has been a growing area of research, especially its impact on mental health. Pain also presents a unique NLP problem due to its ambiguous nature and the varying circumstances in which it can be used. For these reasons, pain has been chosen as a use case, making it a good case study for the application of the methods explored in this thesis. Models are built by assimilating both structured medical knowledge and clinical NLP and leveraging the inherent relations that exist within medical ontologies. The data source used in this project is a mental health EHR database called CRIS, which contains de-identified patient records from the South London and Maudsley NHS Foundation Trust, one of the largest mental health providers in Western Europe.
A lexicon of pain terms was developed to identify documents within CRIS mentioning painrelated terms. Gold standard annotations were created by conducting manual annotations on these documents. These gold standard annotations were used to build models for a binary classification task, with the objective of classifying sentences from the clinical text as “relevant”, which indicates the sentence contains relevant mentions of pain, i.e., physical pain affecting the patient, or “not relevant”, which indicates the sentence does not contain mentions of physical pain, or the mention does not relate to the patient (ex: someone else in physical pain). Two models incorporating structured medical knowledge were built:
1. a transformer-based model, SapBERT, that utilises a knowledge graph of the UMLS ontology, and
2. a knowledge graph embedding model that utilises embeddings from SNOMED CT, which was then used to build a random forest classifier. This was achieved by modelling the clinical pain terms and their relations from SNOMED CT into knowledge graph embeddings, thus combining the data-driven view of clinical language, with the logical view of medical knowledge.
These models have been compared with NLP models (binary classifiers) that do not incorporate such structured medical knowledge:
1. a transformer-based model, BERT_base, and
2. a random forest classifier model.
Amongst the two transformer-based models, SapBERT performed better at the classification task (F1-score: 0.98), and amongst the random forest models, the one incorporating knowledge graph embeddings performed better (F1-score: 0.94). The SapBERT model was run on sentences from a cohort of patients within CRIS, with the objective of conducting a prevalence study to understand the distribution of pain based on sociodemographic and diagnostic factors.
The contribution of this research is both methodological and practical, showing the difference between a conventional NLP approach of binary classification and one that incorporates external knowledge, and further utilising the models obtained from both these approaches in
a prevalence study which was designed based on inputs from clinicians and a patient and public involvement group. The results emphasise the significance of going beyond the conventional approach to NLP when addressing complex issues such as pain.
Date of Award | 2024 |
---|---|
Original language | English |
Awarding Institution |
|
Supervisor | Sumithra Velupillai (Supervisor), Robert Stewart (Supervisor) & Angus Roberts (Supervisor) |