Comparison of pruning strategies for segmental HMMs

Y. Shiga and P. J. B. Jackson

Abstract: The capability of segment models (SMs) in generating a feature-vector trajectory for each state gives SMs the potential not only for automatic speech recognition, but also for parametric speech synthesis. However, expanding the state space for the trajectory makes SMs computationally costly; therefore, as pointed out in [1], efficient search is essential for a recognizer to perform decoding within a reasonable time, for training and recognition. In this paper, we detail our pruning strategy for segmental HMMs (SHMMs), together with a summary of the underlying decoding algorithm, derived from the Viterbi algorithm for HMMs.
For the SHMM decoder to compute the best path efficiently, the algorithm uses two probabilities for each node in the trellis, before and after the segment: starting node probability (SNP) and ending node probability (ENP). SNP is the maximum product of ENP and transition probability for all predecessors. ENP is the maximum product of SNP and output probability over the speech segment for all durations.
In the framework of this decoding algorithm, the pruning strategy contains four types of pruning: (1) pre-search, (2,3) two levels of beam pruning, and (4) duration pruning. Prior to the main decoding process, the pre-search traces back all possible paths in the trellis, and eliminates nodes that do not occur. The elimination is done separately for invalid starting and ending nodes, at which SNPs and ENPs are not computed. Then, in the main decoding stage, beam pruning is applied independently to SNPs and ENPs at each time frame. Duration pruning is embedded within computation of ENPs by omitting output probability computation for segments whose duration probability is below the threshold.
The pre-search does not cause any loss of accuracy theoretically, but avoids unnecessary computation in the case of phone-level/non-embedded training or classification. On the other hand, the beam search methods are ‘lossy’ and more effective at reducing the computational load for recognition. Duration pruning is also ‘lossy’, and effective for both training and recognition. Their efficiency is measured experimentally on the TIMIT speech corpus, and performance evaluated as  phone recognition accuracy. The results show the sensitivity of each pruning method to introducing errors, for a given computational saving. While we plan to conduct further experiments with the segmental recognizer,  the concept of SNP and ENP pruning could be applied to standard HMMs.

[1] M.J.Russell, Reducing computational load in segmental hidden Markov model decoding for speech recognition, Electronics Letters, 41(25), pp.1408-1409, December 2005.