TY - CHAP
T1 - Automatic Topic Hierarchy Generation Using WordNet
AU - Monteiro Viera, Jose Miguel
AU - Brey, Gerhard Andreas
PY - 2012
Y1 - 2012
N2 - In order to make full use of the rich content of large text collections various finding aids are needed. One very effective way of accessing this kind of collection is via a subject taxonomy or a topic hierarchy. Most subject classification techniques [Sebastiani 2002] are based on supervised methods and need a substantial amount of training data that are used by the various machine learning algorithms on which they are based. In many cases this constitutes a significant problem if the resources to create these training data are not available. Unsupervised methods such as clustering algorithms, though not requiring the same resources in data preparation as machine-learning based methods, need considerable attention after the techniques have been applied in order to make the clusters meaningful to users. The use of existing powerful tools such as the semantic tagger developed at Lancaster University avoids these problems, providing semantic tags for each document, but often these semantic tags are very general and therefore not ideal for a user who searches for more concrete subject terms. The aim of the research described here is the automatic generation of a topic hierarchy, using WordNet [Miller 1995, Fellbaum 1998] as the basis for a faceted browse interface, with a collection of 19th-century periodical texts as the test corpus. Our research was motivated by the Castanet algorithm, a technique developed by Marti Hearst and Emilia Stoica [Stoica 2004, Stoica 2007] to automatically generate metadata topic hierarchies. Castanet was developed and successfully applied to short descriptions of documents. In our research we attempt to adapt and extend the Castanet algorithm so that it can be applied to the text of the actual documents for the many collections for which no abstracts or summaries are available. It should also be a viable alternative to the other techniques mentioned above.
AB - In order to make full use of the rich content of large text collections various finding aids are needed. One very effective way of accessing this kind of collection is via a subject taxonomy or a topic hierarchy. Most subject classification techniques [Sebastiani 2002] are based on supervised methods and need a substantial amount of training data that are used by the various machine learning algorithms on which they are based. In many cases this constitutes a significant problem if the resources to create these training data are not available. Unsupervised methods such as clustering algorithms, though not requiring the same resources in data preparation as machine-learning based methods, need considerable attention after the techniques have been applied in order to make the clusters meaningful to users. The use of existing powerful tools such as the semantic tagger developed at Lancaster University avoids these problems, providing semantic tags for each document, but often these semantic tags are very general and therefore not ideal for a user who searches for more concrete subject terms. The aim of the research described here is the automatic generation of a topic hierarchy, using WordNet [Miller 1995, Fellbaum 1998] as the basis for a faceted browse interface, with a collection of 19th-century periodical texts as the test corpus. Our research was motivated by the Castanet algorithm, a technique developed by Marti Hearst and Emilia Stoica [Stoica 2004, Stoica 2007] to automatically generate metadata topic hierarchies. Castanet was developed and successfully applied to short descriptions of documents. In our research we attempt to adapt and extend the Castanet algorithm so that it can be applied to the text of the actual documents for the many collections for which no abstracts or summaries are available. It should also be a viable alternative to the other techniques mentioned above.
M3 - Conference paper
BT - Digital Humanities
ER -