Phonetic Error Analysis Beyond Phone Error Rate

Erfan Loweimi*, Andrea Carmantini, Peter Bell, Steve Renals, Zoran Cvetkovic

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we analyse the performance of TIMIT-based phone recognition systems beyond the overall phone error rate (PER) metric. We consider three broad phonetic classes (BPCs): {affricate, diphthong, fricative, nasal, plosive, semi-vowel, vowel, silence}, {consonant, vowel, silence} and {voiced, unvoiced, silence}, and calculate the contribution of each phonetic class in terms of substitutions, deletions, insertions and PER. Furthermore, for each BPC we investigate the following: the evolution of PER during training, the effect of noise (NTIMIT), the importance of different spectral subbands (1, 2, 4 and 8 kHz), the usefulness of bidirectional versus unidirectional sequential modelling, transfer learning from WSJ, and regularisation via monophones. In addition, we construct a confusion matrix for each BPC and analyse the confusions via dimensionality reduction to 2D at the input (acoustic features) and output (logits) levels of the acoustic model. We also compare the performance and confusion matrices of the BLSTM-based hybrid baseline system with those of the GMM-HMM based hybrid, Conformer and wav2vec 2.0 based end-to-end phone recognisers. Finally, the relationship of the unweighted and weighted PERs with the broad phonetic class priors is studied for both the hybrid and end-to-end systems.
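
As a rough illustration of the per-class breakdown described above, the following minimal sketch aligns a reference and a hypothesis phone sequence with a standard Levenshtein alignment and reports how much each broad class contributes to the PER. Several things here are assumptions made for illustration and are not taken from the paper: the abbreviated `PHONE_TO_BPC` map, the function names `align` and `bpc_contributions`, the convention of attributing substitutions and deletions to the class of the reference phone and insertions to the class of the inserted phone, and the normalisation by the total number of reference phones (chosen so that the contributions sum to the overall PER).

```python
from collections import Counter

# Hypothetical, heavily abbreviated phone-to-class map (illustration only).
PHONE_TO_BPC = {
    "aa": "vowel", "iy": "vowel", "eh": "vowel",
    "b": "consonant", "s": "consonant", "n": "consonant", "t": "consonant",
    "sil": "silence",
}


def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_phone, hyp_phone)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def bpc_contributions(ref, hyp, phone_to_bpc=PHONE_TO_BPC):
    """Percentage contribution of each (class, error type) pair to the PER."""
    counts = Counter()
    for op, ref_phone, hyp_phone in align(ref, hyp):
        if op == "ok":
            continue
        # Assumed convention: insertions are charged to the inserted phone's
        # class, substitutions and deletions to the reference phone's class.
        phone = hyp_phone if op == "ins" else ref_phone
        counts[(phone_to_bpc[phone], op)] += 1
    return {key: 100.0 * c / len(ref) for key, c in counts.items()}


reference = ["sil", "b", "aa", "t", "sil"]
hypothesis = ["sil", "b", "iy", "t"]
print(bpc_contributions(reference, hypothesis))
```

In this toy example the alignment contains one vowel substitution (aa to iy) and one silence deletion, so each contributes 20 percentage points and together they account for the overall PER of 40%. When several alignments share the minimum edit distance, the attribution depends on the tie-breaking order in the backtrace, so per-class figures from a sketch like this should be read with that caveat.
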
Original language: English
Number of pages: 15
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication status: Accepted/In press - 5 Sept 2023
