Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter

Meng Wei; Miaojing Shi; Tom Vercauteren

doi:10.1007/s11548-024-03140-z

Research output: Contribution to journal › Article › peer-review

Abstract

PURPOSE: In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios.

METHODS: We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multiscale features derived from the CNN encoder with feature embeddings from ViT, ensuring integration of the global insights from ViT along with local information from CNN.

RESULTS: Extensive experiments demonstrate our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multiclass segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also showcases remarkable robustness through cross-dataset validation across these 3 datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies based on the datasets prove the efficacy of our novel adapter module.

CONCLUSION: In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: https://github.com/weimengmeng1999/AdapterSIS.git .
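The cross-attention fusion described under METHODS can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that CNN feature vectors act as queries and frozen ViT token embeddings act as keys and values, that there is a single attention head with no learned projections, and that a residual connection keeps the local CNN information. The names `cross_attention`, `cnn_features`, and `vit_tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention with a residual add.

    Assumption: queries come from the CNN encoder (local features) and
    keys/values come from the frozen ViT (global token embeddings).
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this CNN position to every ViT token.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of ViT values: the "global" context for q.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        # Residual connection preserves the local CNN feature.
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


# Toy inputs (hypothetical): 3 CNN positions and 2 ViT tokens, both 2-D.
cnn_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vit_tokens = [[0.5, 0.5], [2.0, -1.0]]

fused = cross_attention(cnn_features, vit_tokens, vit_tokens)
```

In a multiscale setting, one plausible arrangement would be to apply a fusion step like this once per CNN feature scale, with the ViT weights left untouched so that only the CNN adapter and the attention modules are trained.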

Original language	English
Pages (from-to)	1313-1320
Number of pages	8
Journal	International Journal of Computer Assisted Radiology and Surgery
Volume	19
Issue number	7
Early online date	8 May 2024
DOIs	https://doi.org/10.1007/s11548-024-03140-z
Publication status	Published - Jul 2024

Access to Document

10.1007/s11548-024-03140-z (Licence: CC BY)

Enhancing surgical instrument segmentation_WEI_Publishedonline8May2024_GOLD VoR (CC BY): Final published version, 747 KB

Cite this

@article{233938a709db40d080ee73d101b918a5,

title = "Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter",

abstract = "PURPOSE: In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios. METHODS: We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multiscale features derived from the CNN encoder with feature embeddings from ViT, ensuring integration of the global insights from ViT along with local information from CNN. RESULTS: Extensive experiments demonstrate our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multiclass segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also showcases remarkable robustness through cross-dataset validation across these 3 datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies based on the datasets prove the efficacy of our novel adapter module. CONCLUSION: In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: https://github.com/weimengmeng1999/AdapterSIS.git .",

author = "Meng Wei and Miaojing Shi and Tom Vercauteren",

note = "{\textcopyright} 2024. The Author(s).",

year = "2024",

month = jul,

doi = "10.1007/s11548-024-03140-z",

language = "English",

volume = "19",

pages = "1313--1320",

journal = "International Journal of Computer Assisted Radiology and Surgery",

issn = "1861-6410",

publisher = "Springer Verlag",

number = "7",

}

TY - JOUR

T1 - Enhancing surgical instrument segmentation

T2 - integrating vision transformer insights with adapter

AU - Wei, Meng

AU - Shi, Miaojing

AU - Vercauteren, Tom

PY - 2024/7

Y1 - 2024/7

N2 - PURPOSE: In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios. METHODS: We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multiscale features derived from the CNN encoder with feature embeddings from ViT, ensuring integration of the global insights from ViT along with local information from CNN. RESULTS: Extensive experiments demonstrate our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multiclass segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also showcases remarkable robustness through cross-dataset validation across these 3 datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies based on the datasets prove the efficacy of our novel adapter module. CONCLUSION: In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: https://github.com/weimengmeng1999/AdapterSIS.git .

AB - PURPOSE: In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios. METHODS: We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multiscale features derived from the CNN encoder with feature embeddings from ViT, ensuring integration of the global insights from ViT along with local information from CNN. RESULTS: Extensive experiments demonstrate our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multiclass segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also showcases remarkable robustness through cross-dataset validation across these 3 datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies based on the datasets prove the efficacy of our novel adapter module. CONCLUSION: In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: https://github.com/weimengmeng1999/AdapterSIS.git .

UR - http://www.scopus.com/inward/record.url?scp=85192704620&partnerID=8YFLogxK

U2 - 10.1007/s11548-024-03140-z

DO - 10.1007/s11548-024-03140-z

M3 - Article

C2 - 38717737

SN - 1861-6410

VL - 19

SP - 1313

EP - 1320

JO - International Journal of Computer Assisted Radiology and Surgery

JF - International Journal of Computer Assisted Radiology and Surgery

IS - 7

ER -
