UK Speech - One Day Meeting 2003

Abstracts of papers presented at the UK Speech one-day meeting for young speech researchers held at UCL on 24 April 2003.



Product of Gaussians and Multiple Stream Systems

S. S. Airey and M. J. F. Gales
(Cambridge University Engineering Department)

Recently there has been interest in the use of product of experts (PoE) for machine learning classification problems. PoEs, in which decisions are based on the normalised product of component experts rather than a mixture, offer an alternative to the standard mixture of experts model. This work presents a version of PoE, called product of Gaussians (PoG). Here the individual experts are based on Gaussian mixture models (GMMs). PoG systems are of particular interest since the product of two Gaussians, when appropriately normalised, is itself Gaussian distributed. This work uses PoGs within a HMM framework. Each state of the HMM is modeled by the product of state specific GMM experts.
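For reference, the closure property that the abstract relies on can be written out explicitly for the univariate case (a standard result, not taken from the paper): the product of two Gaussian densities is, up to a normalisation constant that can be computed in closed form, another Gaussian.

    \mathcal{N}(x;\mu_1,\sigma_1^2)\,\mathcal{N}(x;\mu_2,\sigma_2^2)
    \;\propto\; \mathcal{N}(x;\mu,\sigma^2),
    \qquad
    \frac{1}{\sigma^2}=\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2},
    \qquad
    \mu=\sigma^2\!\left(\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}\right).

The multivariate case is analogous, with the product precision being the sum of the component precisions; it is this closed form that makes the normalisation term of a PoG tractable.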

When using the PoE framework, one important issue is the calculation of the normalisation term. In contrast to many PoE systems, using PoG it is possible to explicitly compute the normalisation term given the expert parameters. This normalisation may be applied at the state level or, as investigated in this work, at the individual component level. In this case the PoG system can be viewed as a generalisation of the standard multiple stream systems used, for example, in HTK. An interesting aspect of the PoG system is that the individual experts need not themselves be valid PDFs, as long as the final product distribution is. This allows additional flexibility when estimating the model parameters.

This work will describe the use of PoG for speech recognition. Initialisation and maximum likelihood parameter estimation will be described. The performance of the PoG system on a state-of-the-art speech recognition task is evaluated and compared to standard GMM systems and multiple stream systems. Possible future directions for using PoEs, such as products of HMMs, for speech recognition will also be discussed.

Devising a system of computerised metrics for the Frenchay Dysarthria Assessment Intelligibility Tests

James Carmichael
(University of Sheffield)

This study reports on the development of a computerised system of isolated-word intelligibility metrics designed to improve the scoring consistency of the intelligibility assessment component of the Frenchay Dysarthria Assessment Test (FDA). The proposed intelligibility measurements are based on the goodness-of-fit probability scores derived from the forced alignment of the dysarthric speech to corresponding automatic speech recognition hidden Markov models (HMMs) trained on data from a variety of normal speakers. We conjecture that since these HMMs attempt to model variability in normal speech, the extent to which dysarthric speech matches them will be related to its intelligibility. Baseline results reveal a distinct separation of probability score clusters for normal and dysarthric data along with a definite correlation between the recogniser's and the human listeners' assessment of a given speaker's intelligibility. This correlation holds true both in terms of 'raw' scores (the percentage of correct interpretations) and the naïve listener's perception of the degree of effort taken to successfully decode a particular utterance. Furthermore, the HMM-based recogniser performs consistently over repeated sessions, unlike human listeners whose assessment - as one of the experiments demonstrates - is often affected by their subconscious adjustments to the dysarthric speaker's initially unfamiliar speech patterns (a learning effect).

Statistical feature extraction from acoustic images for automatic database retrieval

Ioannis Paraskevas
(University of Surrey)

Fine classification of audio utterances is a demanding problem because the extracted features must be highly discriminative for the classification to be effective. In this paper, results are presented for a fine classification problem: the classification of two groups of different kinds of gunshots. The problem of accurate classification can be divided into two parts: i) feature extraction and ii) classification. The more effective the feature extraction, the more effectively the classifier will be able to categorize the various audio samples. In this paper, a novel method for the automatic recognition of acoustic utterances is presented, using acoustic images as the basis for the feature extraction. The feature extraction process is based on the time-frequency distribution of an acoustic unit. A novel feature extraction technique based on the statistical analysis of the spectrogram, Hartley and Choi-Williams distributions of the data is reported, as well as a brief discussion of the classifier used. The image is compressed using a statistical analysis of the acoustic image formed from the time-frequency distributions of acoustic data. The kurtosis, L-moments and entropy of the distributions, as well as the energy, contrast and other properties of the corresponding co-occurrence matrices of the distributions, are calculated and then combined into a feature matrix. These features are then presented to the classifier. Initial results indicate that the method is capable of accurate discrimination for fine classification.
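As an illustration only (the window lengths, quantisation and feature subset below are our assumptions, not the authors' configuration), a minimal Python sketch of extracting image-style statistics such as kurtosis, entropy and co-occurrence energy/contrast from a time-frequency representation might look like this:

    import numpy as np
    from scipy.signal import spectrogram
    from scipy.stats import kurtosis, entropy

    def acoustic_image_features(x, fs, levels=32):
        """Statistical features of an acoustic image (here a log spectrogram)."""
        _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
        img = np.log(S + 1e-10)
        # Quantise the image to a small number of grey levels for the co-occurrence analysis.
        q = np.digitize(img, np.linspace(img.min(), img.max(), levels))
        q = np.clip(q, 1, levels) - 1
        # Co-occurrence matrix over horizontally adjacent time-frequency cells.
        C = np.zeros((levels, levels))
        for i, j in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
            C[i, j] += 1
        C /= C.sum()
        row, col = np.indices(C.shape)
        return np.array([
            kurtosis(img.ravel()),                   # peakedness of the value distribution
            entropy(np.histogram(img, bins=64)[0]),  # entropy of the value histogram
            (C ** 2).sum(),                          # co-occurrence energy
            ((row - col) ** 2 * C).sum(),            # co-occurrence contrast
        ])

The resulting feature vector would then be passed to whatever classifier is used; the L-moments and further textural measures mentioned in the abstract could be appended in the same way.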

Hidden Markov Models applied to Eye Movement

Neil Cooke
(Birmingham University)

Speech recognition systems used in multi-modal human computer interfaces present an opportunity to explore the integration of the speech stream with other modalities. Two possible uses of integration are to improve recognition of the individual streams and to uncover a hidden process that has a different manifestation in each modality.

To explore stream integration it is desirable to provide analogous formalisations of individual streams, so that the combined state-space between streams can be investigated within a mathematically rigorous framework.

The eye stream can uncover the focus of user attention. Attentive-aware interfaces may be required in future pervasive computing environments. As computational artefacts are merged into the real world, means of input requiring explicit user requests will be replaced by implicit input, consisting of any number of data streams, sensing modalities and/or contextual variables.

In this research I have used hidden Markov models on the eye stream. As in the speech domain, sensing technologies for eye movements are neither fully reliable nor deterministic, so eye movement may be treated as a stochastic process. I apply a selection of hidden Markov models to eye movement and evaluate their robustness to noisy eye-tracker data streams.

To demonstrate improvement over non-HMM methods of uncovering user attention from eye movement, I compare a standard HMM and a durational HMM with non-HMM methods. In addition I utilise Gaussian mixtures in the observation distributions to further decrease noise sensitivity. The models are implemented within an object-oriented framework using C#.
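As a rough illustration of the modelling step (not the author's C# implementation; the two-state interpretation, the file name and the use of the hmmlearn library are our assumptions), a Gaussian-mixture HMM can be fitted to a two-dimensional gaze stream as follows:

    import numpy as np
    from hmmlearn.hmm import GMMHMM

    # Hypothetical eye-tracker stream: (x, y) gaze coordinates, one row per sample.
    gaze = np.loadtxt("gaze_xy.txt")   # placeholder file name

    # Two hidden states with 3-component Gaussian-mixture observation densities;
    # the mixtures help absorb tracker noise, as discussed above.
    model = GMMHMM(n_components=2, n_mix=3, covariance_type="diag", n_iter=50)
    model.fit(gaze)
    states = model.predict(gaze)       # most likely state sequence (Viterbi)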

The data corpus used in this research is from the Cognition and Language research group in the Department of Psychology at Birmingham University. It provides task-constrained eye data which offers a useful analogue to potential computer-interface task models.

Speech Training and Recognition for Dysarthric Users of Assistive Technology

S. Cunningham, R. Palmer, M. Parker, J. Carmichael, P. Green, P. Enderby, M. Hawley, S. Brownsell, and P. O'Neil.
(University of Sheffield)

People with severe physical disabilities use assistive technology to interact with computers and their home environment. For such people speech offers a potentially attractive alternative to more conventional forms of interaction.

For people with neurological conditions such as cerebral palsy, multiple sclerosis or head injury, physical disabilities may be accompanied by a common neuromuscular speech disorder known as dysarthria. It has been shown that people with mild or moderate dysarthria can use commercially available automatic speech recognition (ASR) systems with some degree of success. It is, however, generally the case that these systems perform very poorly for people with severe dysarthria.

This study reports on the progress of a three-year project investigating the integration of ASR and environmental control systems for people with severe dysarthria. For this purpose a suite of software tools has been developed to facilitate data collection, analysis, recogniser training and evaluation, together with environmental control and communication.

It is often the case that speakers with severe dysarthria show a high level of inconsistency in their productions. An aim of the project is to investigate whether it is possible to train the users of the technology to speak more consistently. In particular we have investigated whether users can improve the consistency of their productions of the words chosen for the recogniser's vocabulary. For this purpose software has been developed to enable a course of training to progress under the control of the clients, in their own homes. We report preliminary results for a field trial of this training software.

Vowel normalization for accent: A comparison of northern and southern British English speakers.

Bronwen G. Evans and Paul Iverson
(University College London)

Vowel perception research has demonstrated that people use both intrinsic (e.g. pitch and formant frequency ratios) and extrinsic (e.g. ranges of vowel formant frequencies in a carrier sentence) acoustic information to adjust their vowel categories for individual talkers. The present investigation examined whether listeners also make use of sociolinguistic factors, in particular accent, to accomplish vowel normalization.

The experiments contrasted two varieties of British English: Sheffield English, a northern variety, and Standard Southern British English (SSBE). The two accents differ in their vowel distributions. In Sheffield English, for example, the vowel in 'buck' is realised with a lower F1 than in SSBE, such that it is a homonym or near-homonym of 'book'. Three groups of listeners were tested in the perception of these accents: southern listeners living in London, northern listeners living in London, and northern listeners living in the north of England. Listeners gave goodness ratings on synthesized vowels embedded in natural carrier sentences that were produced in either Sheffield or SSBE accents by a single male speaker.

The results demonstrated that northern and southern subjects living in London normalize vowels for accent, adjusting their best exemplar locations for vowels such as in the words 'bud' and 'cud' to the accent of the carrier sentence. However, these subjects did not normalise to the same vowel targets; northern subjects chose vowels for 'bud' and 'cud' that had lower F1 frequencies than did southern subjects, regardless of the carrier sentence. Moreover, subjects living in the north of England did not normalize at all. The results will be discussed in terms of the hypothesis that vowel normalization is accomplished through assimilation to the vowel categories of one's native accent.

Speech dereverberation via LP residual processing - a review

Nikolay D. Gaubitch, Patrick A. Naylor, Darren B. Ward
(Imperial College London)

The quality of speech recorded in enclosed spaces is degraded by reverberation due to sound wave reflections from surrounding walls and objects. Moreover, the severity of the quality degradation is magnified as the distance between speaker and microphone increases. Therefore, dereverberation of recorded speech is vital for the enhancement of perceived speech quality and for tasks such as speech recognition and speaker verification in "hands-free" telephony applications.

Recently, several dereverberation algorithms based on the source-filter speech production model have been proposed. The source-filter model describes speech production in terms of an excitation sequence exciting a time-varying all-pole filter. The excitation sequence consists of random noise for unvoiced speech and quasi-periodic pulses for voiced speech, and the filter models the human vocal tract. The all-pole filter coefficients can be estimated through Linear Predictive (LP) analysis on the recorded speech and subsequently, the excitation sequence, or the LP residual, can be obtained by inverse filtering the speech waveform.

It has been observed by various researchers that in reverberant environments, the LP residual contains the original impulses followed by several other peaks due to multi-path reflections. Furthermore, it is assumed that the poles obtained from the LP analysis are unaffected by reverberation. Consequently, dereverberation of speech can be achieved by attenuating the peaks in the excitation sequence due to reverberation and synthesizing the speech waveform with the enhanced LP residual and the time-varying all-pole filter with coefficients calculated from the reverberant speech.
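For concreteness, a minimal numpy sketch of the LP analysis/synthesis machinery the abstract builds on is given below (the frame length, prediction order and regularisation are illustrative choices, not those of the reviewed methods):

    import numpy as np
    from scipy.linalg import toeplitz
    from scipy.signal import lfilter

    def lp_coefficients(frame, order=12):
        """All-pole coefficients A(z) = 1 - sum_k a_k z^-k (autocorrelation method)."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        a = np.linalg.solve(toeplitz(r[:order]) + 1e-8 * np.eye(order), r[1:order + 1])
        return np.concatenate(([1.0], -a))

    def lp_residual(frame, order=12):
        """Inverse-filter the frame with A(z) to obtain the excitation (LP residual)."""
        A = lp_coefficients(frame, order)
        return A, lfilter(A, [1.0], frame)

    def resynthesise(A, residual):
        """Drive the all-pole filter 1/A(z) with a (possibly enhanced) residual."""
        return lfilter([1.0], A, residual)

Dereverberation methods of the kind reviewed operate between the analysis and synthesis steps, attenuating the reverberation-induced peaks in the residual before the speech is reconstructed.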

We review three multi-channel LP residual enhancement methods based on: 1) coarse channel estimation with matched filter residual weighting, 2) maximum-kurtosis subband adaptive filtering, and 3) coherent channel addition for residual weighting. We evaluate their performances against each other as well as against the delay-and-sum beamformer which is one of the simplest enhancement approaches. Finally, we investigate the effects of reverberation on the LP analysis/synthesis using both simulated and real reverberant speech recordings and elaborate on the underlying assumptions for the above methods.

Standards for the Evaluation of Speech Synthesis for Computer-Assisted Language Learning (CALL): Speech Synthesis for Proofreading, a Case Study

Zoë Handley and M-J Hamel
(UMIST)

Speech is the primary mode of human communication. It therefore follows that spoken language activities should be privileged in language training.

Although spoken language activities are beginning to find their way into CALL, CALL programs are still biased towards the presentation of written language (Pennington and Esling, 1996).

Speech synthesis (SS) has the potential to support the provision of many spoken language activities; however, it has been exploited in relatively few applications. We suggest that this is because it has not yet been adequately evaluated. Only two teams of researchers, namely Stratil's (Stratil, Burkhardt et al., 1987; Stratil, Weston et al., 1987) and Santiago-Oriola's (Santiago-Oriola, 1999; Santiago-Oriola and Pérennou, 1999), have formally evaluated SS in CALL. However, these evaluations are inadequate: only two requirements have been evaluated, namely the intelligibility of the speech and user reactions, and the validity of the measures employed is questionable.

Enthusiastic about the potential benefits that SS could bring to CALL, we aim to determine whether SS lives up to the claims that have been made of it. This will be achieved through the development of standards for the evaluation of SS for CALL. The development of standards for all the potential applications of SS in CALL is beyond the scope of this project. We therefore propose to lay the foundations for these standards through the investigation of talking texts, a tool which allows learners to select any section of text from an electronic document and have it read aloud to them by means of Text-To-Speech (TTS) synthesis, to support learners' proofreading of their own productions.

The research presented here constitutes our pilot investigation, a comparison of Anglophone learners' use of the following methods for proofreading French texts: (1) silent reading; (2) reading supported by a native speaker; (3) reading supported by TTS.

Automatic Call-Routing with recognized phonetic sequences

Huang & S.J. Cox
(University of East Anglia)

Since 1996, Gorin, Riccardi, Wright and colleagues have done interesting work on automatic call routing. The aim of this work is to automatically classify calls from customers and route them to the correct destination. To make the approach more flexible, in 1999 they began to investigate classification without transcriptions, training statistical language models for both recognition and understanding from large corpora. This suggests that a method can be found for automatically learning vocabulary, grammar and semantics from a speech corpus without transcriptions.

The database used is from Nuance, comprising about 15,000 spoken utterances covering 61 call types. In order to compare our results with AT&T's, we choose 4511 utterances for training and 3518 for testing, covering 18 call types altogether. In addition, as a baseline for comparison, we transcribe the word sequences using dictionary pronunciations from the CMU dictionary. We use the speech recogniser provided by Nuance for phone recognition, with a 7-gram statistical language model, using only 8000 of the 15,000 spoken utterances as the data corpus. We obtained about 48% phone accuracy, with 27% substitutions, 21% deletions and 4% insertions.

AT&T decide the destination of a call from the posterior probability given the salient morphemes extracted from the received enquiry. They use an iterative procedure to extract salient phone phrases from the training set, calculating mutual information to segment phonetic sequences into phone phrases and taking the salient phone phrases as morphemes. A parser is built to choose the most likely segmentation, and a posterior probability distribution is then calculated to choose the most likely destination for the call. The key step is how to extract the best phone phrases from the training set and apply them to the test set. Many researchers have used counts in memory (Harris), mutual information, entropy and even neural networks; all of these methods are based on the probability of the context of strings.

I also obtained segmented phone phrases from the output of the phone recogniser by building an N-gram statistical language model with a silence symbol inserted between phone transcriptions; this proved a better way of obtaining segmented phonetic sequences. Although the coverage of the segmented phone phrases on the test set is one measure, I also consider the Kullback-Leibler distance between the distribution of the segmented phone phrases and that of the phone transcriptions in the training set. As a baseline, using only tri-phones and quad-phones as terms and building a call-term matrix with mutual information as the weighting scheme, we obtained 63% correct classification on the training set and 57.85% on the test set. Using the segmented phone phrases obtained from the silence-added language model, we obtained 68.8% on the training set; however, because the coverage on the test set of the phone phrases obtained from the training set is insufficient, at about 63.38%, the correct classification rate on the test set falls to only about 47.38%. I am now considering clustering methods and LDA to improve the correct classification rate.
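For reference, the two quantities mentioned above take their standard forms (notation ours): the mutual information used to score a candidate phone pair (a, b), and the Kullback-Leibler distance between the phrase distribution P obtained from the recogniser output and the distribution Q obtained from the phone transcriptions.

    \mathrm{MI}(a,b) = \log\frac{P(a,b)}{P(a)\,P(b)},
    \qquad
    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{s} P(s)\,\log\frac{P(s)}{Q(s)}.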

Statistical Language Modelling of Dialogue Material in the British National Corpus.

Gordon Hunter and Mark Huckvale
(University College London)

The majority of statistical language modelling studies have concentrated on written text material (or read versions thereof). However, it is well-known that dialogue is significantly different from written text in its lexical content and sentence structure. Furthermore, there are expected to be significant connections between successive turns within a dialogue, but "turns" are not generally meaningful in written text.

Here, we describe studies on statistically modelling the large quantity of dialogue material within the British National Corpus (BNC). The dialogue data had a much smaller vocabulary than either the entire BNC text material or a text sample of equivalent size to the dialogue. The proportion of material covered by a lexicon of a specified size was generally higher for dialogue data than for text. The perplexities of simple trigram language models trained on the dialogue material were found to be about half the values for models trained on a similar amount of text. The trend shown by the perplexities of text trigram models as a function of training set size suggests that some 500 million words of text are required to obtain a perplexity as low as that obtained using 7 million words of dialogue.
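The perplexities quoted are the standard test-set perplexity of a trigram model; writing the test material as w_1 ... w_N, this is

    \mathrm{PP} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i \mid w_{i-2}, w_{i-1})\right),

so the halving of perplexity reported for dialogue corresponds to a substantially more predictable word stream.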

Investigations have also been performed on the effectiveness of cache, trigger and cluster-based models for dialogue material. We have focused on pairs of successive, relatively short dialogue turns, expected to be typical of highly-interactive dialogues. Unlike what is often found for text material, cache-based models seemed to perform better than those based on trigger pairs (Hunter & Huckvale, 2002). This is believed to be due to the relatively short length of the turn pairs, and to successive dialogue turns frequently involving repetitions of words for confirmation or clarification.

Several different approaches to clustering turn pairs were employed, including the use of lexically-based and perplexity-based metrics for measuring the similarity of turns. Preliminary results have been quite encouraging. Although, when interpolated with a trigram model, cluster models using lexically-based or perplexity-based metrics led to similar reductions in perplexity over a simple trigram model, the perplexity-based metric gave far superior performance in terms of the computational time required.

Packet loss masking for distributed speech recognition using interleaving.

A.B. James and B.P. Milner
(University of East Anglia)

Distributed Speech Recognition (DSR) is an emergent technology that employs a client-server approach to speech recognition on mobile devices. In this model, the feature extraction is carried out on the device and the resultant feature vectors are transmitted over a channel to a server for recognition.

It is often the case that feature vectors are transmitted over channels that do not guarantee data delivery, such as mobile networks or packet-based networks using real time protocols (such as UDP and RTP). In this case it is possible, and indeed likely, that the feature vector stream will be distorted by packet loss. In addition to this, the presence of packet loss is normally a result of specific network conditions and therefore tends to occur in bursts.

The aim of this work is to improve the accuracy of distributed speech recognition systems operating over channels suffering from burst-like packet loss. First of all, it will be shown that burst-like packet loss has a more severe effect on word recognition accuracy than packet loss that is evenly spread throughout the feature vector stream. Secondly, a simple block interleaver is used to spread out the bursts of packet loss, leading to considerable performance gains when tested on the Aurora connected digits database. Finally, the effect of varying both the interleaving depth and the structure of the interleaver will be examined.
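A minimal sketch of the kind of block interleaver described, operating on a sequence of feature vectors, is given below (the depth parameter and the handling of the stream tail are illustrative assumptions):

    import numpy as np

    def block_interleave(frames, depth):
        """Write frames row-wise into a (depth x width) block and read them out
        column-wise.  A burst of consecutive losses in the transmitted stream
        becomes a set of isolated losses after de-interleaving."""
        frames = np.asarray(frames)
        n = len(frames)
        width = int(np.ceil(n / depth))
        order = np.arange(depth * width).reshape(depth, width).T.ravel()
        order = order[order < n]          # drop padding positions
        return frames[order], order

    def block_deinterleave(interleaved, order):
        """Restore the original frame order at the receiver."""
        out = np.empty_like(interleaved)
        out[order] = interleaved
        return out

Greater depth spreads longer bursts at the cost of additional buffering delay, which is the basic trade-off when the interleaving depth is varied.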

A Partial-Update Adaptive Algorithm for Stereophonic Acoustic Echo Cancellation

Andy Khong, P.A. Naylor
(Imperial College London)

The advent of teleconferencing in recent years has seen an increase in research on acoustic echo cancellation (AEC) in communication (telephony and computer) networks. AEC is required to remove the echoes produced by the coupling between the loudspeaker and the microphone of a hands-free phone. Adaptive filters can be used to estimate the room's impulse response and effectively prevent echo from feeding back into the network. To enhance the realism of the sound, two-channel audio is necessary. However, the two input signals from the two channels are known to be highly correlated, and this coherence reduces the convergence rate of the adaptive filters.

Several algorithms have been developed to reduce the complexity while still achieving a fast convergence rate. One way to reduce the complexity of the classical normalised least mean square (NLMS) algorithm is the MMaxNLMS. The MMaxNLMS belongs to the family of partial-update adaptive filters, in which only a fraction of the weights is updated per input sample. The coefficients to update are chosen such that only the weights corresponding to the highest input energy are updated. It has been shown that the MMaxNLMS maintains the closest performance to the full-update NLMS despite updating only a small number of coefficients.
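A simplified sketch of the M-max selection idea is given below; the filter length, step size and regularisation are illustrative, and the sorting is done naively rather than with the efficient running-order structures used in practice:

    import numpy as np

    def mmax_nlms(x, d, L=128, M=32, mu=0.5, eps=1e-6):
        """NLMS adaptive filter updating only the M taps whose current input
        samples have the largest magnitude (M-max partial update)."""
        w = np.zeros(L)              # filter weights (echo path estimate)
        xbuf = np.zeros(L)           # tap-input buffer
        e = np.zeros(len(x))
        for n in range(len(x)):
            xbuf = np.roll(xbuf, 1)
            xbuf[0] = x[n]
            e[n] = d[n] - w @ xbuf   # a priori error
            sel = np.argsort(np.abs(xbuf))[-M:]   # taps with largest input magnitude
            w[sel] += mu * e[n] * xbuf[sel] / (xbuf @ xbuf + eps)
        return w, e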

The poster presentation will examine the performance of the MMaxNLMS algorithm. The problem of coherence will be explained, and a novel algorithm to improve the convergence rate of the MMaxNLMS and NLMS algorithms will also be presented. The idea is to select a set of update coefficients that minimise the weight input coherence while at the same time having the highest input energy. This effectively exploits the low complexity of MMaxNLMS while minimising the input coherence of the signal that drives the two-channel adaptive filters.

Unit Selection in Concatenative TTS Synthesis Systems Using Mel Filter Bank Amplitudes

Lambert

Concatenative text-to-speech (TTS) synthesis systems generate speech from text by joining together segments of recorded speech taken from large speech corpora. In order to generate more natural sounding speech, unit selection needs to reduce the number of concatenation points in the synthesized speech and make the concatenation joins as smooth as possible.

This research considers unit selection based on non-uniform units, whereby the most appropriate units according to acoustic and phonetic criteria are selected from the many similar candidate speech segments present in the speech database. The unit selection is based on a Viterbi-style algorithm, which dynamically selects the most suitable database units in order to synthesize completely new utterances. The most suitable units are found by considering concatenation and target costs. Mel filter bank amplitudes are used as the acoustic feature for the concatenation costs, whereas target costs are considered in terms of the phonemic and phonetic properties of the required units. The Euclidean distance metric is used to find the minimum local distance between the mel filter bank channels of the adjoining units. The algorithm gives priority to diphone and triphone units, and only in their absence are phone units considered. The motivation behind this method is that units longer than phones are better at accommodating the coarticulation effects associated with a phone's neighbouring context. Based on minimum local distances between successive units, the Viterbi algorithm returns an optimal path showing which units can be concatenated at minimum global cost.
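A schematic version of the join cost and the Viterbi-style search is sketched below; the candidate data structure (each candidate carrying a 'fbank' array of mel filter bank frames) and the split into target_cost and join_cost are our illustrative assumptions, not the authors' implementation:

    import numpy as np

    def join_cost(prev_unit, next_unit):
        """Euclidean distance between the mel filter bank frame at the end of one
        unit and the frame at the start of the next."""
        return np.linalg.norm(prev_unit["fbank"][-1] - next_unit["fbank"][0])

    def select_units(candidates, target_cost):
        """Viterbi-style search over per-position candidate lists; returns the
        index of the chosen candidate at each position."""
        best = [np.array([target_cost(c) for c in candidates[0]])]
        back = []
        for t in range(1, len(candidates)):
            costs = np.array([[best[-1][i] + join_cost(ci, cj) + target_cost(cj)
                               for i, ci in enumerate(candidates[t - 1])]
                              for cj in candidates[t]])
            back.append(costs.argmin(axis=1))   # best predecessor for each candidate
            best.append(costs.min(axis=1))
        path = [int(np.argmin(best[-1]))]
        for bp in reversed(back):               # trace the optimal path backwards
            path.append(int(bp[path[-1]]))
        return path[::-1]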

Within-subjects and between-subjects ANOVAs were used for the experimental evaluation of two TTS systems, one of which uses this method of unit selection. The evaluation of listeners' scores showed that the TTS system with this method of unit selection was preferred in 52% of test sentences, and in 9% of test sentences the two TTS systems were judged the same.

Speech Recognition using an intermediate articulatory layer and non-linear articulatory-to-acoustic mapping.

Lo, Martin Russell
(University of Birmingham)

This paper describes a novel HMM for speech recognition in which the relationship between the symbolic and acoustic descriptions of speech is regulated by an intermediate, formant-based layer. This approach to speech recognition aims to improve performance by modeling the articulatory phenomena which give rise to variability in speech. The speech dynamics in the formant representation are modeled as linear trajectories which characterize the distribution of formants in each segment type. The resulting trajectories are mapped into acoustic features using a nonlinear transformation known as a radial basis function (RBF) network. In the training process, gradient descent is used as a tuning step to optimize the parameters of the model, since the output of the RBF can be differentiated with respect to the adjustable parameters of the trajectory unit. Issues of model parameter estimation, such as the use of different sets of models for initialization, the influence of various learning rates, and the number of iterations of gradient descent, have also been investigated.

Experimental results on the TIMIT database have been produced and compared with a similar system using a linear articulatory-to-acoustic mapping. Some conclusions on the issues of model parameter estimation are also drawn from the results.

Automatic Complexity Control For HLDA Systems

Xunying Liu, Mark Gales and Phil Woodland
(Cambridge University)

Designing a state-of-the-art large vocabulary speech recognition system is a highly complex problem. System generalization, in terms of recognition performance on unseen data, is important, and a wide range of techniques are available which affect both the performance of the system and the number of free parameters. Selecting the appropriate complexity of a system is time-consuming, and only a limited number of possible systems can be examined.

Apart from validation tests using held-out data, most existing likelihood-based model complexity control techniques fall into two groups. The first is based on Bayesian techniques, where the model parameters are treated as random variables to be integrated out; one such scheme is the Bayesian Information Criterion (BIC). The second category comprises information-theoretic approaches, in which the complexity control problem is treated as finding a code length, for example via the minimum description length (MDL) principle. The two approaches are related: asymptotically both tend to the BIC approximation. In this paper we present experimental results showing that these standard likelihood-based criteria in general yield high prediction error on recognition performance and are not suitable for complexity control on LVCSR tasks.

In addition, a tighter, auxiliary-function-based lower bound on the Bayesian evidence is proposed to approximate the evidence integral at relatively low computational cost. Alternative complexity control metrics based on marginalised discriminative training criteria, which are more closely related to the systems' recognition error rates, will also be briefly discussed.

Speaker Verification over the Internet

Malegaonkar

This paper is concerned with the adverse effects of speech encoding on the accuracy of speaker verification over the Internet. Such a process normally involves operating on test utterances which have been encoded for the purpose of transmission. In some applications, the speaker model generation may also involve using speech material which has been subject to a compression process due to a similar encoding method. In other applications, however, the speaker model training may be based on uncompressed speech data. In such cases, there is a severe mismatch between the training and testing data which in turn can adversely affect the verification accuracy. The experimental investigations include both matched and mismatched conditions. The encoding schemes considered for this study are the standard G711 and G723. The study also includes an analysis of the potential usefulness of score normalization for improving the speaker verification accuracy in mismatched conditions.

A front end using periodic and aperiodic streams for ASR

David M. Moreno and Philip J.B. Jackson
(University of Surrey)

Various acoustic mechanisms produce cues in human speech, such as voicing, frication and plosion. Automatic speech recognition (ASR) front ends often treat them alike, although studies demonstrate the dependence of their signal characteristics on the presence or absence of vocal-fold vibration. Typically, Mel-frequency cepstral coefficients (MFCCs) are used to extract features that are not strongly influenced by source characteristics. In contrast, we segregated harmonic and noise-like cues before characterisation, separating the contribution of voicing from that of other acoustic sources to improve feature extraction for both parts. The pitch-scaled harmonic filter (PSHF) divides an input speech signal into two synchronous streams, periodic and aperiodic, which are respective estimates of the voiced and unvoiced components of the signal at any time. In digit-recognition experiments with the Aurora 2.0 database (clean and noisy conditions, 4 kHz bandwidth), features were extracted from each of the decomposed streams, then combined (by concatenation or further manipulation) into an extended feature vector. Thus, the noise robustness of our parameterisation was compared against a conventional one (39 MFCCs, deltas, delta-deltas). Each separate stream reduced recognition accuracy by less than 1% absolute, compared to the baseline on the original speech; combined, they increased accuracy under noisy conditions (by 7.8% at 5 dB SNR, after multi-condition training). Voiced regions provided resilience to corruption by noise. However, no significant improvement on the 99.0% baseline accuracy was achieved under clean test conditions. Principal component analysis (PCA) of the concatenated features tended to perform better than PCA of the separate streams, and PCA of static coefficients better than PCA applied after calculation of deltas. With PCA of concatenated static MFCCs, plus deltas, the improvement was 5.6%, implying some redundancy between the complementary streams. Future plans to evaluate the PSHF front end for phoneme recognition at higher bandwidth could help to identify the source of these substantial performance benefits.

Do human listeners build models of environmental noise?

Perez, E., Meyer, G.
(University of Liverpool)

The aim of this study is to explore whether human listeners actively generate representations of background noise to improve speech recognition in noisy situations. We hypothesise that human recognition performance should be highest if the spectral and temporal structure of the interfering noise is regular, so that a good noise model can be generated, and worse if listeners are presented with highly irregular noise.

The speech stimulus used in this study is a vowel-nasal stimulus which is perceived as /en/ if presented in isolation but as /em/ if it is presented with a frequency modulated sinewave in the position where the second formant transition would be expected (Meyer and Barry, 1999; Harding and Meyer, 2003).

As a baseline study we present data showing to what extent a chirp is integrated into the percept as a function of its position relative to the speech signal.

In the main experiment the target signal is presented in a background of repeated noise signals that have the duration of the formant transition. The noise signal is a sequence of signals that can occur regularly or irregularly and can have a fixed or variable spectrotemporal structure. We present data that shows the effect of manipulating the regularity of the noise signal.

Recent work on Discriminative Training

D. Povey & P.C. Woodland
(University of Cambridge)

Discriminative training of HMMs is training which aims to directly minimise the training set error rate rather than maximise the likelihood of the training set. Over the years there has been a lot of research in discriminative training for speech recognition, with techniques such as Maximum Mutual Information (MMI) and Minimum Classification Error (MCE) being introduced, but there has previously been little success in making these techniques work for large-vocabulary tasks. We have solved some of the practical problems in implementing MMI in a large-vocabulary context: our technique includes the use of acoustic scaling and weakened language models for better test-set generalisation; the use of lattices for efficiency; and finding a fast optimisation strategy. We have also developed a new discriminative training criterion, Minimum Phone Error (MPE), which gives better results than MMI across various large-vocabulary corpora. MPE typically gives about 10% relative improvement compared with Maximum Likelihood baselines.
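In the usual notation (given here for orientation rather than as the authors' exact formulation), MMI training maximises the posterior probability of the correct transcriptions w_r of the training utterances O_r, with an acoustic scale kappa of the kind mentioned above:

    \mathcal{F}_{\mathrm{MMI}}(\lambda)
    = \sum_{r} \log
      \frac{p_{\lambda}(O_r \mid \mathcal{M}_{w_r})^{\kappa}\, P(w_r)}
           {\sum_{w} p_{\lambda}(O_r \mid \mathcal{M}_{w})^{\kappa}\, P(w)},

where the sum over competing hypotheses w in the denominator is approximated using the lattices referred to in the abstract. MPE replaces this sentence-level criterion with an average of phone-level accuracies weighted by the posterior probabilities of the competing hypotheses.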

Customizable Text to Speech Synthesis

Michael Pucher, Friedrich Neubarth, Erhard Rank
(Telecommunication Research Center Vienna)

Our poster will describe a speech synthesis system that allows a user to add new utterances of very high quality to a corpus of pre-recorded utterances of a specific domain when the original speaker is not available. Using this system one can select from a list of lists of candidate units from the corpus, optimise prosodic parameters using the prosody of recorded or pre-recorded utterances (copy-synthesis), optimise concatenation points and add the new utterances to the corpus. The unit selection module of the system selects the k-best unit paths from a relational unit database using diphone and word-sized units. Through the combination of these unit types and a well-designed concatenation cost function, one can select high-quality synthesized prompts from the corpus. The use of word-sized units makes the search fast when matching words are found, and through the cost function syllables are also used even though they are not annotated. For the annotation we use the German Festival speech synthesis system.

Switching linear dynamical systems for speech recognition

A-V.I. Rosti and M.J.F Gales
(University of Cambridge)

Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions, some of which are known to be poor; in particular, the assumption that successive speech frames are conditionally independent given the state that generated them. To overcome this shortcoming, segment models have been proposed. These model whole segments of frames rather than individual frames. One form is the stochastic segment model (SSM), which uses a standard linear dynamical system to model the sequence of observations within a segment. Here the dynamics are modelled by a first-order Gauss-Markov process in some low-dimensional state space. The feature vector is a noise-corrupted linear transformation of the state vector. Though the training and recognition algorithms are more complex than for HMMs, it is feasible to use standard techniques for inference with SSMs.
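In generic notation (ours, not necessarily the paper's), the linear dynamical system underlying the SSM has a hidden state x_t evolving as a first-order Gauss-Markov process and a noisy linear observation o_t:

    x_t = A\,x_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q),
    \qquad
    o_t = C\,x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R).

In the switching case described below, the parameters are selected by a discrete state sequence and the posterior over x_t is carried across segment boundaries rather than reset, which is what makes exact inference intractable.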

For the SSM, segments are assumed to be independent. Intuitively, this is not always valid, owing to co-articulation between the modelling units. Switching linear dynamical systems (SLDS) have therefore been proposed. In an SLDS, the posterior distribution of the state vector is propagated between segments. Unfortunately, exact inference in SLDS is not tractable due to the exponential growth of components over time. In this work, approximate methods for inference in SLDSs will be presented. First, there are approximate methods based on a heuristic Viterbi-like algorithm. Alternatively, variational learning may be used. Finally, approaches based on Markov chain Monte Carlo methods can be used, including a training scheme based on stochastic expectation maximisation (SEM). For the SEM scheme, convergence and implementation issues for use with SLDS will be discussed in detail. In addition, initial experiments comparing SSM and SLDS in an N-best classification task given model-aligned data are presented.

Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired

Catherine Siciliano
(University College London)

It is well documented that visual speech information is an important addition to auditory speech communication. For the severely hearing impaired, the visual channel is often the primary means through which phonetic information is transferred. Hearing impaired persons are thus substantially handicapped in telephone communication. The Synface project is developing a multilingual synthetic talking face, driven by an automatic speech recognizer, to assist the hearing impaired with telephone communication.

Previous experiments with a Swedish synthetic talking face have shown that synthesized visual face movements can enhance speech intelligibility in hearing impaired users [1]. The current experiments investigate the gain in intelligibility derived from a multilingual synthetic talking head controlled by hand-annotated speech. Speech materials were simple Swedish, English and Dutch sentences typical of those used in speech audiometry. Speech was degraded to simulate in normal-hearing listeners the information losses that arise in severe-to-profound hearing impairment. Degradation was produced by vocoder-like processing using either two or three frequency bands, each excited by noise. 12 native speakers of each of the three languages took part in intelligibility tests in which the auditory signals were presented alone, with the synthetic face, and with a natural video of the original talker.

Intelligibility in the purely auditory conditions was low (7% for the 2-band vocoder and 30% for the 3-band vocoder). The average intelligibility increase for the synthetic face compared to no face was 20%. This improvement was consistent, statistically reliable, and of sufficient magnitude to be important in everyday communication. The synthetic face fell short of a natural face by an average of 18%. Further experiments with visual identification of English consonants showed that the synthetic face fell short of the natural face on both place and manner of articulation. The results will be used to improve the visual speech synthesis.

Insights into Noise Estimation

Sarah Simpson and Martin Cooke
(University of Sheffield)

One of the major limiting factors in automatic speech recognition (ASR) systems is noise robustness. This investigation explores the limitations of current noise estimation schemes. It demonstrates how noises distort the received signal and indicates where efforts should be concentrated to more accurately model auditory environments.

Six different algorithms were selected, each of which falls into one of three classes of noise estimation technique: speech absence, energy tracking and harmonic selection. As most noise schemes aim to work in 'real auditory environments' (where a noise can be defined as being any undesirable signal) an extended set of noise signals was used in their evaluation. This included the standard noise set from the AURORA 2 corpora as well as a set of everyday intrusive sound sources that are not usually used to test ASR systems (e.g. a telephone ring). These noises were combined with selected digit sequences from the AURORA database, at signal to noise ratios (SNRs) ranging from -10 to 30dB. Each mixture signal was then converted into two different time-frequency representations: an auditory rate-map and a spectrogram. Performance was measured using three different statistical error measures (mean square error, linear cross covariance and gain in SNR) as well as a missing data ASR algorithm.

The results indicate that no algorithm provides a consistent improvement in all SNR and noise conditions. While all techniques do offer some improvement for the AURORA noises, they fail completely when the corrupting signal is more non-stationary (e.g. when the noise signal is a siren, a telephone ring or another speech signal). Not surprisingly, there is a significant (alpha = 0.01, Rs = -0.7892) inverse relationship between an algorithm's performance and the degree of non-stationarity of the noise signal. It appears that current noise estimation schemes are mainly suited for use in slowly varying environments.

Support Vector Machine Speaker Verification Methodology

V. Wan and S. Renals
(University of Sheffield)

Current state-of-the-art speaker verification systems are based on discriminatively trained generative models. In those systems, discrimination is achieved at the frame level: each frame of data is scored separately and the scores are combined to compute the score for the whole utterance. Frame-level discriminative classifiers may discard information that is not seen as useful for frame discrimination. This means that information useful for sequence discrimination may be discarded, so frame discrimination is not ideal. A better way is to apply discrimination to the sequence as a whole rather than to its constituent parts. In this work we develop techniques to make support vector machines work for speaker verification using the sequence discrimination approach. The main focus of attention is the kernel function. For whole-sequence discriminative classification we study the set of score-space kernels, which includes the Fisher kernel, for deriving non-linear transformations from a variable-length sequence to a fixed-length vector. These kernels exploit generative models (Gaussian mixture models) to achieve the non-linear mapping. By representing the entire sequence as a single vector, the SVM can discriminate between whole sequences directly. Experimentally, a support vector machine combined with a score-space kernel and with correct normalisation (normalisation includes whitening and a novel method which we call spherical normalisation) can outperform current state-of-the-art classifiers on the PolyVar speaker verification database. We report equal error rates on the PolyVar database that are 34% lower than a baseline Gaussian mixture model likelihood ratio approach.
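For orientation (standard definitions, not the paper's exact notation), the Fisher kernel maps a variable-length sequence O to the fixed-length score vector of a generative model with parameters theta, and the SVM kernel is an inner product between whitened score vectors:

    \phi(O) = \nabla_{\theta} \log p(O \mid \theta),
    \qquad
    K(O_i, O_j) = \phi(O_i)^{\top} \Sigma^{-1} \phi(O_j),

where Sigma is a normalising matrix such as the covariance of the score vectors; the whitening and spherical normalisation mentioned above operate on this representation.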

Using Wizard-of-Oz simulations to bootstrap Reinforcement-Learning-based dialog management systems

Jason D. Williams and Steve Young
(University of Cambridge)

This paper describes a method for "boot-strapping" a Reinforcement-Learning-based dialog manager using a Wizard-of-Oz trial. The state space and action set are discovered through the annotation, and an initial policy is generated using a Supervised Learning algorithm. The method is tested and shown to create an initial policy which performs significantly better than a random exploration; analysis shows the policy can be generated using a small number of dialogs.

A Neural Oscillator Model of Binaural Auditory Selective Attention

Stuart N Wrigley and Guy J Brown
(University of Sheffield)

It has been proposed that listeners separate an acoustic mixture by auditory scene analysis (ASA), in which a perceptual description of each sound source is formed: a stream (Bregman, 1990). Typically, ASA is seen as a precursor to attentional mechanisms which simply select one stream as the attentional focus. However, recent work by Carlyon et al. (2001) has suggested that attention plays a key role in the formation of streams.

A model of auditory grouping is described in which auditory attention plays a key role. The model is based upon an oscillatory correlation framework (Wang, 1996), in which neural oscillators representing a single perceptual stream are synchronised, and are desynchronised from oscillators representing other streams. The model suggests a mechanism by which attention can be directed to the high or low tones in a repeating sequence of tones with alternating frequencies.

The model accounts for a number of interesting phenomena including the subconscious re-direction of attention by the onset of a new, loud stimulus; the streaming effect of alternating tone sequences (van Noorden, 1975) and the associated build-up effect (Anstis and Saida, 1985); the failure of streaming to occur when attending to a distractor task (Carlyon et al., 2001); the grouping of a mistuned harmonic and complex (e.g. Darwin et al., 1995); and the capture of tones from a complex which demonstrates the old-plus-new heuristic (Bregman, 1990).

References: Anstis, S. and Saida, S. (1985). J. Exp. Psychol. Human 11:257-271. Bregman, A.S. (1990). Auditory Scene Analysis. MIT Press. Carlyon, R.P., Cusack, R., Foxton, J.M. and Robertson, I.H. (2001). J. Exp. Psychol. Human 27(1):115-127. Darwin, C.J., Hukin, R.W. and Al-Khatib, B.Y. (1995). J. Acoust. Soc. Am. 98(2) Pt 1:880-885. van Noorden, L.P.A.S. (1975). Temporal coherence in the perception of tone sequences. Doctoral thesis, Institute for Perceptual Research, Eindhoven, NL. Wang, D.L. (1996). Cognitive Sci. 20:409-456.

Applying a Perceptual Weighting Linear Transformation to Voice Conversion

Hui Ye and Steve Young
(University of Cambridge)

Voice conversion, whose purpose is to transform a source speaker's speech to sound as if it were produced by a target speaker, aims to control speaker identity independently of the message and the environment. Speaker identity is normally determined by the average pitch, the vocal tract characteristics and the formant characteristics. Generally, the vocal tract and formant characteristics can be represented by the overall shape of the spectral envelope, which is also the key feature to transform in many voice conversion systems. Various approaches have been proposed in previous research to transform the spectral envelope, such as codebook mapping and linear transformation. Among these methods, the linear transformation technique has been shown to outperform other approaches in terms of speech quality, as it is designed to minimize the spectral feature errors between the source speaker and the target speaker. However, minimizing the spectral feature errors does not mean that the true spectral distances are minimized, only that they are reduced. Additionally, the least-squares error criterion used to train the linear transformation is not necessarily the best criterion. Since the human auditory system is insensitive to errors near a strong tone, it is reasonable to argue that a better way of training the transformation is to minimize the perceptual error instead of the squared error. In this paper, we present a perceptual weighting linear transformation approach for voice conversion. This technique applies a perceptual weighting filter to the spectral error, from which an optimized linear transformation function is then trained. This approach enables us to directly minimize the perceptual spectral distance between different speakers' voices. Evaluations of performance based on objective and subjective tests are also given in this paper.
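In general terms (our notation; the exact filter design is not specified here), the change is from an ordinary least-squares fit of the linear transformation to a perceptually weighted one:

    \{\hat{A},\hat{b}\}
    = \arg\min_{A,b} \sum_{t} \big\| W_t \big( y_t - (A\,x_t + b) \big) \big\|^{2},

where x_t and y_t are aligned source and target spectral envelope features and W_t is the perceptual weighting filter, which de-emphasises errors near strong spectral components.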


© 2005 Mark Huckvale, University College London, February 2005