TY - JOUR
T1 - A Fine-grained Network for Joint Multimodal Entity-Relation Extraction
AU - Li, Yuan
AU - Cai, Yi
AU - Xu, Jingyu
AU - Li, Qing
AU - Wang, Tao
PY - 2024/10/17
Y1 - 2024/10/17
N2 - Joint multimodal entity-relation extraction (JMERE) is a challenging task that involves two joint subtasks, i.e., named entity recognition and relation extraction, from multimodal data such as text sentences with associated images. Previous JMERE methods have primarily employed either pipeline models, which apply pre-trained unimodal models separately and ignore the interaction between modalities, or word-pair relation tagging methods, which neglect neighboring word pairs. To address these limitations, we propose a fine-grained network for JMERE. Specifically, we introduce a fine-grained alignment module that utilizes phrase-patch alignment to establish connections between text phrases and visual objects. This module can learn consistent multimodal representations from multimodal data. Furthermore, we address the issue of task-irrelevant image information by proposing a gate fusion module, which mitigates the impact of image noise and ensures a balanced representation between image objects and text representations. In addition, we design a multi-word decoder that enables ensemble prediction of tags for each word pair. This approach leverages the predicted results of neighboring word pairs, improving the ability to extract multi-word entities. Experimental results on a benchmark dataset demonstrate the superiority of our proposed model over state-of-the-art models in JMERE.
AB - Joint multimodal entity-relation extraction (JMERE) is a challenging task that involves two joint subtasks, i.e., named entity recognition and relation extraction, from multimodal data such as text sentences with associated images. Previous JMERE methods have primarily employed either pipeline models, which apply pre-trained unimodal models separately and ignore the interaction between modalities, or word-pair relation tagging methods, which neglect neighboring word pairs. To address these limitations, we propose a fine-grained network for JMERE. Specifically, we introduce a fine-grained alignment module that utilizes phrase-patch alignment to establish connections between text phrases and visual objects. This module can learn consistent multimodal representations from multimodal data. Furthermore, we address the issue of task-irrelevant image information by proposing a gate fusion module, which mitigates the impact of image noise and ensures a balanced representation between image objects and text representations. In addition, we design a multi-word decoder that enables ensemble prediction of tags for each word pair. This approach leverages the predicted results of neighboring word pairs, improving the ability to extract multi-word entities. Experimental results on a benchmark dataset demonstrate the superiority of our proposed model over state-of-the-art models in JMERE.
M3 - Article
SN - 1041-4347
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
ER -