Relation-ontology driven topic classification

Qi Hao

Informatics

Student thesis: Doctoral Thesis › Doctor of Philosophy

Abstract

Conventional topic models have been extensively used to extract topics in documents for topic classification. For example, Latent Dirichlet Allocation (LDA) is a topic modelling technique that produces a probabilistic model based on word co-occurrence, for the purpose of text classification. However, it is very challenging for these topic models to accurately capture the semantic information of the topics because they focus on the occurrences of words in text rather than their meanings and context. They ignore the fact that words may have multiple meanings and that different words may have the same meaning. In addition, they ignore the semantic structures in text, such as the relationship between words in their context. This PhD thesis proposes a novel topic classification technique that uses ontological information and the relationship between words to provide a more accurate topic model for topic classification. Firstly, LDA is extended to use the semantic concepts from an ontology to help capture some of the possible meanings of the words appearing in the documents. The topic model allows topics to be defined more generally in terms of ontological concepts rather than words, and this captures the meaning of the words more accurately. In order to capture the relationship between words in context, this work also introduces a new entity-based algorithm for multiple-relation extraction from unstructured text. The new algorithm uses standard Natural Language Processing (NLP) techniques to analyse unstructured text. The algorithm offers clear performance advantages over conventional single-relation extraction techniques and verb-based techniques. Finally, the extracted structured relations were incorporated into the ontology-driven topic model, resulting in what we call a relation-ontology driven topic classification technique.
This topic model allows the topics to be defined more accurately in terms of relations between ontological concepts rather than word co-occurrence. This captures both the semantic meaning and the semantic structures in text. Our classification approach can be combined with a self-training procedure to reduce the amount of manually classified data required. The classification performance of these topic models was compared against several variations of existing techniques on four widely used datasets. The results show that the inclusion of the ontology component and the contextual relationships helps to reduce the training time by nearly a quarter whilst achieving the highest overall classification accuracy.

Date of Award	1 Nov 2020
Original language	English
Awarding Institution	King's College London
Supervisor	Jeroen Keppens (Supervisor) & Odinaldo Rodrigues (Supervisor)

Documents

2020_Hao_Qi_1624617_ethesis
File: application/pdf, 1.14 MB
Type: Thesis