Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit

Zeljko Kraljevic; Thomas Searle; Anthony Shek; Lukasz Roguski; Kawsar Noor; Daniel Bean; Aurelie Mascio; Leilei Zhu; Amos A. Folarin; Angus Roberts; Rebecca Bendayan; Mark P. Richardson; Robert Stewart; Anoop D. Shah; Wai Keong Wong; Zina Ibrahim; James T. Teo; Richard J. B. Dobson

doi:10.1016/j.artmed.2021.102083

Research output: Contribution to journal › Article › peer-review

107 Citations (Scopus)

Abstract

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448–0.738 vs 0.429–0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.

Original language	English
Article number	102083
Journal	Artificial Intelligence in Medicine
Volume	117
DOIs	https://doi.org/10.1016/j.artmed.2021.102083
Publication status	Published - Jul 2021

Keywords

Clinical concept embeddings
Clinical natural language processing
Clinical ontology embeddings
Electronic health record information extraction

Access to Document

10.1016/j.artmed.2021.102083

Cite this

Kraljevic, Z., Searle, T., Shek, A., Roguski, L., Noor, K., Bean, D., Mascio, A., Zhu, L., Folarin, A. A., Roberts, A., Bendayan, R., Richardson, M. P., Stewart, R., Shah, A. D., Wong, W. K., Ibrahim, Z., Teo, J. T., & Dobson, R. J. B. (2021). Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artificial Intelligence in Medicine, 117, Article 102083. https://doi.org/10.1016/j.artmed.2021.102083

@article{6cc30e42af924ac387bac749a4a0c04a,

title = "Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit",

abstract = "Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448–0.738 vs 0.429–0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.",

keywords = "Clinical concept embeddings, Clinical natural language processing, Clinical ontology embeddings, Electronic health record information extraction",

author = "Zeljko Kraljevic and Thomas Searle and Anthony Shek and Lukasz Roguski and Kawsar Noor and Daniel Bean and Aurelie Mascio and Leilei Zhu and Folarin, {Amos A.} and Angus Roberts and Rebecca Bendayan and Richardson, {Mark P.} and Robert Stewart and Shah, {Anoop D.} and Wong, {Wai Keong} and Zina Ibrahim and Teo, {James T.} and Dobson, {Richard J. B.}",

note = "Funding Information: We would like to thank all the clinicians who provided annotation training for MedCAT; this includes Rosita Zakeri, Kevin O'Gallagher, Rosemary Barker, David Nicholson Thomas, Rhian Raftopoulos, Pedro Viana, Elisa Bruno, Eugenio Abela, Mark Richardson, Naoko Skiada, Luwaiza Mirza, Natalia Chance, Jaya Chaturvedi, Tao Wang, Matt Solomon, Charlotte Ramsey and James Teo. RD's work is supported by (1) National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. (2) Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome Trust. (3) The National Institute for Health Research University College London Hospitals Biomedical Research Centre. DMB is funded by a UKRI Innovation Fellowship as part of Health Data Research UK MR/S00310X/1 (https://www.hdruk.ac.uk). RB is funded in part by grant MR/R016372/1 for the King's College London MRC Skills Development Fellowship programme funded by the UK Medical Research Council (MRC, https://mrc.ukri.org) and by grant IS-BRC-1215-20018 for the National Institute for Health Research (NIHR, https://www.nihr.ac.uk) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. ADS is supported by a postdoctoral fellowship from THIS Institute. AS is supported by a King's Medical Research Trust studentship. RS is part-funded by: (i) the National Institute for Health Research (NIHR) Biomedical Research Centre at the South London and Maudsley NHS Foundation Trust and King's College London; (ii) a Medical Research Council (MRC) Mental Health Data Pathfinder Award to King's College London; (iii) an NIHR Senior Investigator Award; (iv) the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) at King's College Hospital NHS Foundation Trust. This paper represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust, The UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare (AI4VBH); the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) and King's College London. The views expressed are those of the author(s) and not necessarily those of the NHS, MRC, NIHR or the Department of Health and Social Care. We thank the patient experts of the KERRI committee, Professor Irene Higginson, Professor Alastair Baker, Professor Jules Wendon, Professor Ajay Shah, Dan Persson and Damian Lewsley for their support. Funding Information: JTHT received research support and funding from InnovateUK, Bristol-Myers-Squibb, iRhythm Technologies, and holds shares 5000 in Glaxo Smithkline and Biogen. Publisher Copyright: {\textcopyright} 2021 Elsevier B.V. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.",

year = "2021",

month = jul,

doi = "10.1016/j.artmed.2021.102083",

language = "English",

volume = "117",

journal = "Artificial Intelligence in Medicine",

issn = "0933-3657",

publisher = "Elsevier",

}

Kraljevic, Z, Searle, T, Shek, A, Roguski, L, Noor, K, Bean, D, Mascio, A, Zhu, L, Folarin, AA, Roberts, A, Bendayan, R, Richardson, MP, Stewart, R, Shah, AD, Wong, WK, Ibrahim, Z, Teo, JT & Dobson, RJB 2021, 'Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit', Artificial Intelligence in Medicine, vol. 117, 102083. https://doi.org/10.1016/j.artmed.2021.102083

TY - JOUR

T1 - Multi-domain clinical natural language processing with MedCAT

T2 - The Medical Concept Annotation Toolkit

AU - Kraljevic, Zeljko

AU - Searle, Thomas

AU - Shek, Anthony

AU - Roguski, Lukasz

AU - Noor, Kawsar

AU - Bean, Daniel

AU - Mascio, Aurelie

AU - Zhu, Leilei

AU - Folarin, Amos A.

AU - Roberts, Angus

AU - Bendayan, Rebecca

AU - Richardson, Mark P.

AU - Stewart, Robert

AU - Shah, Anoop D.

AU - Wong, Wai Keong

AU - Ibrahim, Zina

AU - Teo, James T.

AU - Dobson, Richard J. B.

N1 - Funding Information: We would like to thank all the clinicians who provided annotation training for MedCAT; this includes Rosita Zakeri, Kevin O'Gallagher, Rosemary Barker, David Nicholson Thomas, Rhian Raftopoulos, Pedro Viana, Elisa Bruno, Eugenio Abela, Mark Richardson, Naoko Skiada, Luwaiza Mirza, Natalia Chance, Jaya Chaturvedi, Tao Wang, Matt Solomon, Charlotte Ramsey and James Teo. RD's work is supported by (1) National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. (2) Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome Trust. (3) The National Institute for Health Research University College London Hospitals Biomedical Research Centre. DMB is funded by a UKRI Innovation Fellowship as part of Health Data Research UK MR/S00310X/1 (https://www.hdruk.ac.uk). RB is funded in part by grant MR/R016372/1 for the King's College London MRC Skills Development Fellowship programme funded by the UK Medical Research Council (MRC, https://mrc.ukri.org) and by grant IS-BRC-1215-20018 for the National Institute for Health Research (NIHR, https://www.nihr.ac.uk) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. ADS is supported by a postdoctoral fellowship from THIS Institute. AS is supported by a King's Medical Research Trust studentship. RS is part-funded by: (i) the National Institute for Health Research (NIHR) Biomedical Research Centre at the South London and Maudsley NHS Foundation Trust and King's College London; (ii) a Medical Research Council (MRC) Mental Health Data Pathfinder Award to King's College London; (iii) an NIHR Senior Investigator Award; (iv) the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) at King's College Hospital NHS Foundation Trust. This paper represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust, The UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare (AI4VBH); the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) and King's College London. The views expressed are those of the author(s) and not necessarily those of the NHS, MRC, NIHR or the Department of Health and Social Care. We thank the patient experts of the KERRI committee, Professor Irene Higginson, Professor Alastair Baker, Professor Jules Wendon, Professor Ajay Shah, Dan Persson and Damian Lewsley for their support. Funding Information: JTHT received research support and funding from InnovateUK, Bristol-Myers-Squibb, iRhythm Technologies, and holds shares 5000 in Glaxo Smithkline and Biogen. Publisher Copyright: © 2021 Elsevier B.V. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.

PY - 2021/7

Y1 - 2021/7

N2 - Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448–0.738 vs 0.429–0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.

AB - Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448–0.738 vs 0.429–0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.

KW - Clinical concept embeddings

KW - Clinical natural language processing

KW - Clinical ontology embeddings

KW - Electronic health record information extraction

UR - http://www.scopus.com/inward/record.url?scp=85106551455&partnerID=8YFLogxK

U2 - 10.1016/j.artmed.2021.102083

DO - 10.1016/j.artmed.2021.102083

M3 - Article

AN - SCOPUS:85106551455

SN - 0933-3657

VL - 117

JO - Artificial Intelligence in Medicine

JF - Artificial Intelligence in Medicine

M1 - 102083

ER -
