UK Speech - One Day Meeting 2004

Abstracts of papers presented at the UK Speech one-day meeting for young speech researchers held at UCL on 22 April 2004.

Capturing variability in the English fricatives

Blacklock, Oliver S and Shadle, Christine H.
(University of Southampton)

The study of fricative production has many important applications in speech and hearing research, as well as in automatic speech recognition systems. The features used to describe fricative spectra have generally been those that separate the fricatives to the greatest degree, and hence maximise classification ability. For example, spectral moments have been used, with some success, to measure the changes in fricative production in subjects whose hearing has been affected by trauma or the introduction of a cochlear implant. However, while spectral moments quantify some aspects of the spectral shape of the fricative, these measures ignore many known features of the underlying production mechanism, and hence supply little information about changes in production.
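
As an illustration of the spectral moments mentioned above, the following is a minimal sketch that treats a power spectrum as a probability distribution over frequency and computes its first four moments. The function name `spectral_moments` and the toy spectrum values are assumptions made for this example; they are not taken from the study.

```python
import math

def spectral_moments(freqs_hz, power):
    """Return (centroid, variance, skewness, excess kurtosis) of a power spectrum.

    The spectrum is normalised to sum to 1 and treated as a probability
    distribution over frequency.
    """
    total = sum(power)
    if total <= 0:
        raise ValueError("power spectrum must have positive total energy")
    p = [v / total for v in power]
    centroid = sum(f * w for f, w in zip(freqs_hz, p))
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs_hz, p))
    sd = math.sqrt(variance)
    skewness = sum((f - centroid) ** 3 * w for f, w in zip(freqs_hz, p)) / sd ** 3
    kurtosis = sum((f - centroid) ** 4 * w for f, w in zip(freqs_hz, p)) / sd ** 4 - 3.0
    return centroid, variance, skewness, kurtosis

# Toy spectrum (made-up values): a symmetric single peak at 4 kHz.
freqs = [2000.0, 3000.0, 4000.0, 5000.0, 6000.0]
power = [1.0, 2.0, 4.0, 2.0, 1.0]
centroid, variance, skewness, kurtosis = spectral_moments(freqs, power)
```

Because the four numbers summarise the whole spectrum, individual peaks and troughs are not represented separately, which is the limitation the abstract points to.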

The large variance associated with spectral estimates of fricative sounds may partly explain why broader measures of the spectral shape have generally been favoured over those that rely upon finer spectral peak and trough details. Since it is known that important production information is present within the finer spectral details of the acoustical signal, these must be examined more carefully. Multitaper analysis is a modern technique which gives greater control over the spectral resolution-bias-variance trade-off, allowing for superior spectral estimation to be implemented. Its applicability in fricative analysis is demonstrated.

A corpus of 46 real words containing the eight English fricatives, each in six different /VCV/ contexts, was read six times by six male and six female subjects, providing a set of 3312 utterances. Multitaper spectral estimates were used to generate spectrograms for each fricative utterance, in many cases providing a much clearer image of the general spectral characteristics. Spectral variation measures across contexts and speakers enabled the localization of invariant spectral features. Spectral covariance measures over time and across tokens provide further insight into typical changes that can occur during and across productions.

Formant estimation from MFCC vectors for robust speech recognition

Ben Milner, Jonathan Darch
(University of East Anglia)

Formants are a useful acoustic parameter for speech processing, but formant estimation is difficult and degrades quickly in the presence of noise. Mel-frequency cepstral coefficients (MFCCs), however, are more easily computed and are used widely in speech recognition. MFCCs can cope with lower signal-to-noise ratios (SNRs) when computed from noisy speech.

This work proposes a technique whereby formants are estimated from mel-frequency cepstral coefficients. It builds on work by Shao and Milner (ICASSP 2004, to be published) which predicts pitch from MFCC vectors using a set of Gaussian mixture models (GMMs), linked together within the framework of a series of hidden Markov models (HMMs). The models were trained on a joint vector comprising MFCCs and pitch estimated from clean speech. For testing, the models were used to predict pitch from the MFCC vectors only. This work applies a similar HMM-GMM strategy but with the joint vector comprising MFCCs and the first four formant frequencies.

It is thought that formant estimation will be at least as successful as pitch estimation. Whereas the process of obtaining MFCCs removes most pitch information, the spectral envelope is retained, although in a transformed representation.

Vowel Classification Using Products of Experts

Paul Dixon, Martin Russell
(University of Birmingham)

Products of Experts (PoEs) are a density estimation technique introduced by Hinton [1]. In a PoE, the outputs of individual probabilistic experts are multiplied together and normalized to specify an overall probability density function (PDF). Hinton also proposes an alternative criterion, known as contrastive divergence, for PoE optimisation.
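
The multiply-and-normalise step can be sketched for a one-dimensional toy case. The two Gaussian experts, the integration range and the helper names below are assumptions chosen for illustration. The scaling factor is found here by brute-force numerical integration, which is only practical in very low dimensions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def product_of_experts(x, experts):
    """Unnormalised PoE score: the product of the experts' densities at x."""
    score = 1.0
    for mean, var in experts:
        score *= gaussian_pdf(x, mean, var)
    return score

def normaliser(experts, lo, hi, steps=20000):
    """Approximate the scaling factor Z by midpoint-rule integration on [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(product_of_experts(lo + (i + 0.5) * dx, experts)
               for i in range(steps)) * dx

# Two hypothetical experts, each given as (mean, variance).
experts = [(0.0, 1.0), (2.0, 1.0)]
Z = normaliser(experts, -10.0, 12.0)

def poe_pdf(x):
    """Normalised PoE density for the toy experts above."""
    return product_of_experts(x, experts) / Z
```

For these two experts the product is itself Gaussian (mean 1, variance 0.5), so the numerical result can be checked against the closed form. In general no such closed form is available.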

We investigated the issues involved when applying Products of Experts to a phonetic classification task. Results are shown for a simple vowel classification task using a standard vowel classification database, the Peterson-Barney data. PoE results are compared with those for a conventional Gaussian mixture model (GMM) with a varying number of components. The results show that the PoE performance depends strongly on the initialisation of the training scheme. For small numbers of components the PoE can outperform the GMMs.

A significant problem with PoEs is how to estimate the scaling factors that ensure their outputs define true PDFs. We introduce a technique for estimating these scaling factors using a genetic algorithm.

Vowel normalization for accent: An investigation of perceptual plasticity in young adults

Bronwen G. Evans and Paul Iverson
(University College London)

Previous work has emphasized the role of early experience in the ability to accurately perceive and produce foreign or foreign-accented speech. This study examines how listeners at a much later stage in language development - early adulthood - adapt to a non-native accent within the same language. A longitudinal study investigated whether listeners who had had no previous experience of living in multidialectal environments adapted their speech perception and production when attending university. Participants were tested before beginning university and then again 3 months later. An acoustic analysis of production was carried out and perceptual tests were used to investigate changes in word intelligibility and vowel categorization. Preliminary results suggest that listeners are able to adjust their phonetic representations and that these patterns of adjustment are linked to the changes in production that speakers typically make due to sociolinguistic factors when living in multidialectal environments.

Study of channel errors in EFR-based speech recognition

Angel M. Gomez, Antonio M. Peinado, Victoria Sanchez
(University of Granada)

Network-based speech recognition (NSR) using the conventional speech channel with the Enhanced Full Rate (EFR) codec performs worse than Distributed Speech Recognition (DSR), owing to both transmission channel errors and the speech encoding process. It is nevertheless a very attractive approach, since no change to existing mobile phones is needed.

In this work we focus on the channel errors in an NSR system and study the different degradations of the speech features caused by them. These degradations are identified as background noise, burst-like noise and memory noise, each with a different impact on speech recognition accuracy. Methods adapted to each degradation can then enhance these features, bringing the performance of an NSR system based on EFR coding closer to that of a DSR-based system.

Phoneme Correction with Speech Segmentation

Qiang Huang and Stephen Cox
(University of East Anglia)

Extracting useful information directly from spontaneous speech for speech understanding has been a popular research topic in recent years, but some difficulties remain. First, spontaneous speech is much faster than read speech and contains many connected pronunciations. Second, different people have different accents and variations in pronunciation, which cause problems for an automatic speech recogniser (ASR). In our work, we find that the phoneme accuracy is just 28.06%, which means that it is very difficult to extract useful information from recognised phoneme sequences without some means of correcting the errors in them. To improve on this, we segment speech into subword-like units, then use clustering and an iterative language model (LM) to correct the errors in the recognised phonemes. The results show that the phoneme correction method gives a substantial improvement.

An analysis of interleavers for Robust Speech Recognition in burst-like packet loss

A. James and B. Milner
(University of East Anglia)

Distributed Speech Recognition (DSR) is an emergent technology that employs a client-server approach to speech recognition on mobile devices in order to avoid both the need to embed the speech recognition engine on the device itself and the degradation in performance associated with low bit-rate speech codecs. In this model, the feature extraction is carried out on the device and the resultant feature vectors are transmitted over a channel to a server for recognition.

It is often the case that feature vectors are transmitted over channels that do not guarantee data delivery, such as mobile networks or packet-based networks using real time protocols (such as UDP and RTP). In this case it is possible, and indeed likely, that the feature vector stream will be distorted by packet loss.

An analysis of the effect of packet loss shows that a speech recogniser is able to tolerate large percentages of packet loss provided that burst lengths are relatively small. This leads to the analysis of three types of interleaver for distributing long bursts of packet loss into a series of shorter bursts. Cubic interpolation is then used to estimate lost feature vectors. Experimental results are presented for a range of channel conditions and demonstrate that interleaving offers significant increases in recognition accuracy under burst-like packet loss. Of the interleavers tested, decorrelated interleaving gives superior recognition performance and has the lowest delay. For example, at a packet loss rate of 50% and an average burst length of 20 packets (40 vectors or 400 ms), performance is increased from 49.6% with no compensation to 86% with interleaving and cubic interpolation.
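
The idea of spreading one long burst into several short ones can be sketched with a plain block interleaver. This is a generic textbook scheme chosen for illustration, not necessarily one of the three interleavers analysed in the talk. The function names and the 4 x 4 block size are assumptions, and the interpolation step is omitted.

```python
def block_interleave_order(n_rows, n_cols):
    """Transmission order for a block interleaver: fill row-wise, send column-wise."""
    return [r * n_cols + c for c in range(n_cols) for r in range(n_rows)]

def received_loss_pattern(order, burst_start, burst_len):
    """Mark which original vector indices are lost when a burst hits the channel.

    order[slot] is the index of the feature vector sent in that channel slot.
    """
    lost = [False] * len(order)
    for slot in range(burst_start, burst_start + burst_len):
        lost[order[slot]] = True
    return lost

def longest_loss_run(lost_flags):
    """Length of the longest run of consecutive lost vectors in playback order."""
    longest = run = 0
    for lost in lost_flags:
        run = run + 1 if lost else 0
        longest = max(longest, run)
    return longest

# A burst of 4 consecutive channel slots, with and without interleaving.
plain_order = list(range(16))
interleaved_order = block_interleave_order(4, 4)
without_interleaving = longest_loss_run(received_loss_pattern(plain_order, 4, 4))
with_interleaving = longest_loss_run(received_loss_pattern(interleaved_order, 4, 4))
```

In this toy case the same channel burst removes four consecutive vectors without interleaving but only isolated single vectors with it. The cost is that the receiver must buffer a whole block before de-interleaving, which adds delay.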

Language Model Adaptation in a Conversational Email Application

Partha Lal
(Vox Generation Ltd)

FASiL is a multimodal interface to email - users are able to send emails and browse through their received messages. This might appear to be a medium vocabulary task but, when all the user's contacts and folder names are added, the vocabulary can in some cases rise to far more than 10000 words. The resulting poor recognition can make the system unusable. Whilst this may not be such a problem in directed dialog, it is a problem here since names can appear at many different points within an utterance. The language model contains classes and placeholders for person and folder names - dynamic grammars are then bound to those placeholders at runtime and, by default, all names are equally likely. To avoid the recognition problems introduced by expanding the vocabulary with thousands of equally likely words, the dynamic grammars are weighted using information in the user's inbox. We also address the problem of new users, LM smoothing and specializing the model to focus groups of users. With adaptation to the application, we achieved a 50% relative drop in word error rate.
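
The abstract does not say which inbox information is used or how the weights are computed. As one hypothetical possibility, the sketch below weights each contact name by how often that contact appears as a sender in the inbox, with additive smoothing so that unseen contacts keep a non-zero weight. The function `name_weights`, the smoothing constant and the sample data are all assumptions.

```python
from collections import Counter

def name_weights(contacts, inbox_senders, smoothing=1.0):
    """Return a probability for each contact name, based on inbox sender counts.

    Additive smoothing keeps contacts that never appear in the inbox recognisable.
    Contact names are assumed to be unique.
    """
    known = set(contacts)
    counts = Counter(sender for sender in inbox_senders if sender in known)
    total = sum(counts.values()) + smoothing * len(contacts)
    return {name: (counts[name] + smoothing) / total for name in contacts}

# Hypothetical address book and inbox ("dave" is not a contact and is ignored).
contacts = ["alice", "bob", "carol"]
inbox = ["alice", "alice", "bob", "alice", "dave"]
weights = name_weights(contacts, inbox)

# A new user with an empty inbox falls back to equally likely names.
new_user_weights = name_weights(contacts, [])
```

With these counts, frequent correspondents receive more probability mass than rarely seen ones, while the empty-inbox case reduces to the uniform weighting described in the abstract.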

Speaker Recognition Using a Trajectory-Based Segmental HMM

Ying Liu, Martin Russell, Michael Carey
(University of Birmingham)

Hidden Markov Models (HMMs) are currently used successfully in speaker recognition systems. HMMs provide a framework which is broadly appropriate for modelling speech patterns, accommodating both variability in time-scale and short-term spectral variability. However, conventional HMMs have some constraints inherent in the speech production process and make certain assumptions that are actually at variance with what is known about the nature of speech production, for example, the assumptions of piecewise stationarity and the state independence of the context.

A segmental HMM is an HMM whose states are associated with sequences of acoustic feature vectors (or segments), rather than individual vectors. By treating segments as homogeneous units it is possible to develop an accurate model of speech dynamics across an utterance in a way which takes into account predictable factors such as speaker continuity.

This type of model is applied to text-dependent speaker recognition. Verification results obtained on YOHO using a simple segmental HMM are presented, which show a 36% reduction in false acceptances compared with a conventional HMM. The presentation describes the use of linear dynamic segmental hidden Markov models (SHMMs) for speaker verification. These models use linear trajectories to describe how feature vectors change over time. Experiments performed on the YOHO speech corpus show better results for SHMMs.

Time frequency distribution for speech recognition using the TDRC analysis

Hossein Marvi and Edward Chilton
(University of Surrey)

Time-frequency distributions are popular front-end features for automatic speech recognition. Research on time-frequency distributions has been motivated by the need to extract detailed information from time-varying, non-stationary signals in both the time and frequency domains. Several distributions, such as the spectrogram and the Wigner-Ville and Choi-Williams distributions, have been proposed. The two-dimensional root cepstrum (TDRC) analysis is an alternative approach which can represent both the instantaneous and the transitional information of the speech waveform.

In this paper the TDRC is suggested as a set of time-frequency distribution features for speaker-independent speech recognition. It represents the features of the speech signal in matrix form, with the coefficients located at the corner of the TDRC matrix being more significant than the others.

In this investigation different types of distance measure have been used for pattern classification. Linear discriminant analysis (LDA) has been applied to the TDRC features extracted from the speech to achieve extra dimensionality reduction while preserving as much of the class-discriminatory information as possible and improving the recognition accuracy. Results of using these features are compared to the performance of a conventional cepstrum. The results demonstrate that the proposed method significantly increased the recognition accuracy of a speaker-independent system when compared to the conventional cepstrum.

Toward a model of compensation

Tim Mills
(University of Edinburgh)

Much research has gone into determining how people turn a thought into speech. The model proposed by Perkell et al (2000) based on acoustic goals and the model proposed by Browman and Goldstein (1992, 2000) based on articulatory goals give us two quite different pictures of how speakers construct an articulation from an idea they wish to communicate. Such theories are generally built around patterns seen in normal speech.

What happens when the normal speech process is disrupted in some way? Speakers experience many types of disruption. They whisper, disrupting the means of transmitting pitch and voicing information, which are important cues to lexical and prosodic contrasts. They experience plugged noses, or talk around obstructions such as food and cigarettes. Researchers have devised further types of distortions, such as the bite-block (interfering with the normal jaw and lip movements) or the artificial source (replacing the normal sound source with a buzzer pressed to the throat).

Because speakers use the same language production faculties in the presence of distortions as they do during unperturbed speaking situations, we can extract testable claims about speaker compensation (or lack of compensation) due to distortions from theories based on normal speech.

I intend to examine speaker behaviour in the context of speech distortions in order to evaluate and possibly refine the two models of speech production noted above. I review existing research that bears on this question, and provide an evaluation of how the above-mentioned models deal with it. I also describe a set of experiments I plan to run in order to further illuminate the nature of speech production.

Quantifying Voicing-Frication Interaction Effects in Voiced and Voiceless Fricatives

Jonathan Pincas and Philip J.B. Jackson
(University of Surrey)

Although speech does not, in general, switch cleanly between periodic and aperiodic noise sources, regions of mixed source sound have received little attention: aerodynamic treatments of source production mechanisms show that interaction will result in decreased amplitude of both sources, and limited previous research has suggested some spectral modification of frication sources by voicing. In this paper, we seek to extend current knowledge of voicing-frication interaction by applying a wider range of measures suitable for quantifying interaction effects to a specially recorded corpus of /VFV/ sequences. We present data for one male and one female subject (from a total of 8). Regions of voicing-frication overlap at the onset of voiceless fricatives often show interaction effects. The extent of such overlapping source regions is investigated with durational data. We have created a measure designed to quantify the magnitude of modulation where overlap does occur, in both these areas and in fully voiced fricatives. We employ high-pass filtering and short-time smoothing to produce an envelope which characterises temporal fluctuation of the aperiodic component. Periodicity at or around the fundamental frequency is interpreted as modulation of frication by voicing, and magnitude of amplitude modulation is computed with spectral analysis of the envelope. Further statistical techniques have been employed to describe the profile of aperiodic sound generation over the course of the fricative. In addition to the above, gradients of f0 contours in VF transitions and total duration of frication are analysed. Results are compared across the voiced/voiceless distinction and place of articulation. Source overlap and interaction effects are often ignored in synthesis systems; thus findings from this paper could potentially be used to improve naturalness of synthetic speech. Planned perceptual experiments will extend the work done by establishing how significant interaction effects are to listeners.

Using syntax to predict prosodic phrase boundaries

Read, S.J. Cox
(University of East Anglia)

With human speech, when a sentence is read aloud, some words seem to naturally group together to form phrases, whereas some words tend to have noticeable pauses between them - representing phrase boundaries. This leads to the theory that a sentence can be described as a hierarchical structure of prosodic phrases. With text-to-speech (TTS) synthesis, the prosodic phrase structure of a sentence is used by a number of different modules - pausing, intonation and duration - therefore, predicting accurate values is highly important to the perceived quality of the speech.

Current prosody prediction techniques utilise various automatically derivable syntactic features, such as the part-of-speech (POS) tags for the words of a sentence, and a shallow syntactic parse. Given an initial POS tag set and an n-gram model for predicting juncture types, it has been shown that iteratively clustering the tags into appropriate classes can significantly increase performance. By reducing the initial 40 tags to 11 classes, 92.27% of junctures (break/non-break) can be correctly identified, with the insertion rate at 1.76%.

As prosody applies to a whole sentence, predictions need to consider the sentence as a complete unit. With prosodic phrases constituting a number of different units - words, intermediate phrases, intonational phrases - a set of models can be created for analysing each level of this hierarchy. Combining models for words, intermediate phrases and intonational phrases, it is possible to make predictions on the sentence as a single unit.

A similar technique is prediction-by-analogy, in which predictions are based on observations of a similar nature. For prosody prediction, given a set of annotated sentences, we can predict new prosodic structures by finding the most syntactically similar sentence, and then aligning its prosodic annotation.

Voice and accent profiles: structured probabilistic models of speaker, style and accent variation

Dimitrios Rentzos, Qin Yan, Saeed Vaseghi
(Brunel University)

Voice and accent profiles are hierarchically structured probability models of the acoustic and intonation features of voice and accent variation. A voice profile is a complete probability model of the voice of an individual or a cluster of speakers sharing a similar voice type. An accent profile is a probabilistic and phonological model of an accent. These models have widespread applications in speech processing, such as the adaptation of speech recognition systems to new speakers, styles and accents; in speech synthesis to generate new voices and accents; in speaker recognition to exploit accent and style characteristics of talkers; in speech coding to reduce the necessary bit rate; and in voice coaching and speech therapy.

This paper introduces the concepts of voice and accent profiles and presents experimental results of an investigation into modelling the acoustic spaces of voices and accents. The paper begins with an introduction to the signal processing methods and tools that are employed for the modelling of formants, intonation and style parameters. Formant tracks are modelled using two-dimensional hidden Markov models. Broad patterns of intonation in accent and voice are modelled using a method based on the rise/fall/connection concept. Variations in style and duration are modelled through analysis of the results of automatic segmentation and labelling of speech.

The experimental results of accent modelling include a comparative analysis of the acoustics and intonations of British, Australian and American accents. The formants are analysed and a formula is introduced for ranking the formants according to their contribution in conveying accents. The experimental results of speaker and voice modelling include the application of voice profiles in speaker identification. The influence of formants, pitch, intonation pattern and speaking style on speaker identification is evaluated and presented.

A software toolkit has been developed for modelling and transformation of voice and accent profiles. Some demonstrations of accent and voice morphing will be included in the presentation.

Improving Speaker Adaptation Accuracy using Confidence Measures

Mark Ritch and Stephen Cox
(University of Bristol)

Speaker Adaptation is a technique used to improve the performance of a Speaker Independent ASR system where it is to be used by more than one speaker and a lengthy enrolment procedure is either not possible or inappropriate. A Confidence Measure is a measure of how certain a system is that the observation made actually matches the hypothesis made. When derived from a recogniser during the recognition of adaptation data, a Confidence Measure can be used to give an indication as to the accuracy of the labelled adaptation data. This information can then be used to minimise the influence of incorrectly labelled data during unsupervised adaptation. It is shown how Confidence Measures can be incorporated into two techniques commonly used for Speaker Adaptation, namely Maximum a Posteriori and Maximum Likelihood. Although shown to be statistically significant, experimental results using an N-best Confidence Measure yielded only a very small increase in performance when applied to unsupervised adaptation. It is shown, however, that even though this type of Confidence Measure is simple to generate, it is inaccurate when derived from a phoneme-based recogniser. Results obtained through simulation, on the other hand, showed that where the Confidence Measure is sufficiently accurate, the Gaussian means of CDHMMs can be favourably adapted.

Pitch prediction from MFCC vectors for speech reconstruction

Xu Shao and Ben Milner
(University of East Anglia)

This work proposes a technique for reconstructing an acoustic speech signal solely from a stream of mel-frequency cepstral coefficients (MFCCs). Previous speech reconstruction methods have required an additional pitch element, but this work proposes two maximum a posteriori (MAP) methods for predicting pitch from the MFCC vectors themselves. The first method is based on a Gaussian mixture model (GMM) while the second scheme utilises the temporal correlation available from a hidden Markov model (HMM) framework. A formal measurement of both frame classification accuracy and RMS pitch error shows that an HMM-based scheme with 5 clusters per state is able to correctly classify over 94% of frames and has an RMS pitch error of 3.1 Hz in comparison to a reference pitch. Informal listening tests and analysis of spectrograms reveal that speech reconstructed solely from the MFCC vectors is almost indistinguishable from that using the reference pitch.
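
One common way to predict one variable from another under a joint GMM is to weight each component's conditional mean by that component's posterior probability given the observed feature. The sketch below applies this to a scalar stand-in for the MFCC vector. It is an illustrative assumption and may differ in detail from the MAP formulation used in the work. The component parameters and the name `predict_pitch` are invented for the example, and the HMM-based variant is not shown.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predict_pitch(x, components):
    """Estimate pitch from a scalar feature x under a joint GMM over (x, pitch).

    Each component holds a weight "w", the means "mx" and "my", the feature
    variance "vxx" and the feature/pitch covariance "vxy".
    """
    # Unnormalised posterior of each component given the observed feature.
    posts = [c["w"] * gaussian_pdf(x, c["mx"], c["vxx"]) for c in components]
    total = sum(posts)
    estimate = 0.0
    for post, c in zip(posts, components):
        # Conditional mean of pitch given x within this component.
        cond_mean = c["my"] + c["vxy"] / c["vxx"] * (x - c["mx"])
        estimate += post / total * cond_mean
    return estimate

# Two hypothetical components; pitch means are in Hz.
components = [
    {"w": 0.5, "mx": -2.0, "my": 100.0, "vxx": 1.0, "vxy": 5.0},
    {"w": 0.5, "mx": 2.0, "my": 200.0, "vxx": 1.0, "vxy": 5.0},
]
midpoint_estimate = predict_pitch(0.0, components)
low_estimate = predict_pitch(-2.0, components)
```

With real MFCC vectors the division by `vxx` becomes a multiplication by the inverse of the feature covariance matrix.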

Source-filter separation based on an articulatory corpus

Yoshinori Shiga
(University of Edinburgh)

A new approach is presented for estimating voice source and vocal-tract filter characteristics based on an articulatory database. From the viewpoint of acoustics, in order to estimate the transfer function of a system, both the input and output of the system need to be observed. In the case of the source-filter separation problem, however, only the output (i.e. speech) is observable, and the response of the system (vocal tract) and the input (voice source) must be estimated simultaneously. The estimation is hence theoretically impossible, and consequently the estimation problem is generally solved approximately by applying rather oversimplified models. The proposed approach separates these two characteristics under the assumption that each of the characteristics is controlled independently by a different set of factors. The separation is achieved by iterative approximation based on the above assumption using a large speech corpus including electromagnetic articulograph data. The proposed approach enables the independent control of the source and filter characteristics, and thus contributes toward improving speech quality in speech synthesis.

Intelligibility of an ASR-controlled synthetic talking face

Catherine Siciliano
(University College London)

The goal of the SYNFACE project is to develop a multilingual synthetic talking face, driven by an automatic speech recognizer (ASR), to assist hearing impaired people with telephone communication. Previous multilingual experiments with the synthetic face have shown that time-aligned synthesized visual face movements can enhance speech intelligibility in normal hearing and hearing impaired users [C. Siciliano et al., Proc. Int. Cong. Phon. Sci. (2003)]. Similar experiments are in progress to examine whether the synthetic face remains intelligible when driven by ASR output. The recognizer produces phonetic output in real time, in order to drive the synthetic face while maintaining normal dialog turn-taking. Acoustic modelling was performed with a neural network, while an HMM was used for decoding. The recognizer was trained on the SpeechDAT telephone speech corpus. Preliminary results suggest that the currently achieved recognition performance of around 60% frames correct limits the usefulness of the synthetic face movements. This is particularly true for consonants, where correct place of articulation is especially important for visual intelligibility. Errors in the alignment of phone boundaries representative of those arising in the ASR output were also shown to decrease audio-visual intelligibility.

Precision Matrix Modelling for Large Vocabulary Continuous Speech Recognition

K.C. Sim and M.J.F. Gales
(University of Cambridge)

In recent years, structured covariance matrix approximation has been found to outperform the conventional diagonal covariance matrix systems in HMM-based speech recognition. For example, the Factor-analysed HMM (FAHMM) provides a compact covariance matrix representation via the use of a global factor loading matrix. On the other hand, methods such as Semi-tied Covariance (STC) (or MLLT), Extended MLLT (EMLLT) and Subspace for Precision and Mean (SPAM) models model the precision matrix structure instead. This form of modelling is more efficient in terms of decoding costs due to the use of Gaussian distributions. In fact, Heteroscedastic Linear Discriminant Analysis (HLDA), which has been commonly viewed as a feature decorrelation scheme, is also a specific form of STC precision matrix model with global tying of the variances corresponding to the HLDA nuisance dimensions. All the precision matrix models described above can be classified into a generic framework of basis superposition, where the precision matrices are formed by superimposing a set of symmetric matrices (known as basis matrices) weighted by a set of coefficients (known as basis coefficients). We emphasise the implementation of various precision matrix models for Large Vocabulary Continuous Speech Recognition (LVCSR). In particular, issues concerning model training using the Minimum Phone Error (MPE) criterion are addressed.
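
The basis superposition framework described above can be written compactly as follows. The notation is an assumption introduced here for illustration: \mathbf{P}_m is the precision matrix of Gaussian component m, \mathbf{S}_i are the n symmetric basis matrices, and \lambda_i^{(m)} are the component-specific basis coefficients.

```latex
\mathbf{P}_m \;=\; \boldsymbol{\Sigma}_m^{-1} \;=\; \sum_{i=1}^{n} \lambda_i^{(m)} \, \mathbf{S}_i ,
\qquad \mathbf{S}_i = \mathbf{S}_i^{\top}
```

The individual models then differ in how the basis matrices are constrained and in how many of them are used.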

Listening to glimpses of speech

Sarah Simpson and Martin Cooke
(University of Sheffield)

To make sense of speech in everyday conditions, listeners have to employ processing strategies to cope with distortions produced by additive noise, reverberation and channel variability. This contribution highlights one such strategy - glimpsing - which listeners might use to tackle the problem of additive noise. Glimpsing is a process in which listeners identify speech based on spectro-temporal regions with advantageous local SNR. Recent experiments demonstrated that listeners' ability to identify VCVs presented in multi-speaker babble noise can be predicted by a computational model of glimpsing [1]. However, an analysis of consistent listener confusions revealed significant departures from the model, which was based solely on the audibility of the speech target (energetic masking). It is known that the confusability of the target with respect to background sources (informational masking) also plays an important role in speech perception in noise. It is possible that certain listener confusions were due to an incorrect assignment of background glimpses to the foreground source. This paper describes a set of experiments in which the role of energetic masking alone is assessed. Identification performance is measured for signals which have been resynthesised from putative glimpses. One finding is that such signals are surprisingly intelligible in spite of their sparse spectro-temporal energy distribution. The resulting consonant confusion pattern is compared with that found in the earlier experiment and with the computational model.
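
The notion of a glimpse can be sketched as a spectro-temporal cell whose local SNR exceeds some threshold. The snippet below counts the proportion of such cells in a small time-frequency grid. The 3 dB threshold, the function name `glimpse_proportion` and the level values are assumptions for illustration and are not parameters reported in this abstract.

```python
def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise by threshold_db.

    speech_db and noise_db are equally sized grids (lists of rows) of levels in dB.
    """
    glimpsed = total = 0
    for speech_row, noise_row in zip(speech_db, noise_db):
        for s, n in zip(speech_row, noise_row):
            total += 1
            if s - n > threshold_db:  # local SNR in dB
                glimpsed += 1
    return glimpsed / total

# Toy 2 x 3 grids of levels in dB (made-up values).
speech = [[60.0, 40.0, 55.0], [30.0, 65.0, 50.0]]
noise = [[50.0, 50.0, 50.0], [50.0, 50.0, 50.0]]
proportion = glimpse_proportion(speech, noise)
```

A measure of this kind captures only audibility (energetic masking). It says nothing about whether a glimpsed region is assigned to the correct source, which is the informational-masking issue raised in the abstract.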

Simulation of a 'Hole' in Hearing: Perceptual Adaptation by Normally Hearing Listeners to Spectrally-Warped Speech

Matthew Smith and Andrew Faulkner
(University College London)

This study investigated the impact upon speech recognition of creating a 'hole' (420 Hz to 2200 Hz) in the mid-frequencies. Speech was processed and synthesized as the sum of six noise-carrier bands, three apical to and three basal to the hole. Three conditions, "Dropped", "S-warp" and "A-warp", were examined. In the first, frequency information was omitted by eliminating the relevant noise-bands from the output. In the S-warp and A-warp conditions an attempt was made to preserve information from the hole region by reassigning it to noise-bands either side of the hole. In the latter, hole-region frequencies were warped across a restricted frequency range, being routed to the noise-bands immediately adjacent to both sides of the hole. In the former, warping affected the entire frequency range; analysis filters with upper cut-off frequencies below 1011 Hz (the centre of the hole) were mapped to noise-bands apical to the hole, while analysis filters with lower cut-off frequencies above 1011 Hz were mapped to noise-bands basal to the hole.

Eight normally-hearing subjects were trained in the 'preservation' conditions. S-warp performance was consistently better than A-warp performance. Significantly, performance in the 'preservation' conditions improved considerably with training (S-warp sentence scores rose from 32% to 70% correct). Post-training scores were also much higher than baseline scores in the Dropped condition and, for the S-warp processor, comparable to scores obtained with a processor with six tonotopically-matched channels.

A different group of 8 subjects were trained with the Dropped processor. Although some improvement was observed, post-training scores were still considerably lower than those achieved with the S-warp and A-warp processors in part 1.

These results suggest that rerouting spectral information around a hole is better than simply dropping it, even though differences may not be apparent in acute studies.

An MRI and Acoustic Study of Effect of Vowel Context on Fricatives

Khazaimatol Shima Subari and Dr. Christine Shadle
(University of Southampton)

The acoustical and articulatory properties of fricatives are being studied by using magnetic resonance imaging (MRI) and acoustic recordings made of the same subject, same corpus. Because there is evidence that vowel context affects the fricative spectra, different VCV contexts were used. Different imaging techniques requiring different imaging times were also used to ensure that vowel context effects persisted even during fricatives sustained for two minutes.

In the acoustical analysis, the power spectral density was computed at three separate locations within the fricative steady-state: immediately after and before the vowel transitions, and mid-fricative. Comparisons were made between short- and long-sustained tokens (approx. 0.5 and 4 seconds respectively) and between tokens uttered while sitting and lying down in an effort to control for the constraints of MR imaging. It was observed that the effects of vowel context are maintained in sustained tokens, and that sustained spectra are similar to those uttered naturally. The position of the subject did not affect the spectra significantly. However, the imaging process required the subject to sustain the fricative much longer than in the analyzed recordings. Labiodentals show the most variation in spectra across the different vowel contexts.

For image analysis, MR images were processed using 3D-Doctor™. The teeth-airway boundaries were determined using silicone-rubber teeth moulds of the subject, sliced at 4 mm thicknesses, which were scanned and superimposed on the corresponding MR image slice. Area and hydraulic radius functions were obtained using Mermelstein's technique and Blum's Transform. Both techniques produced functions for /f/ that vary significantly in different vowel contexts. Blum's Transform was considered to be more accurate as side-branches are included in its computations. Results from the two techniques differ most in the laryngeal region. Speculations regarding the likely acoustic consequences of these differences will be discussed, as will work currently underway to predict the acoustic output.
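
As a rough illustration of the two quantities named above, the sketch below computes the area and hydraulic radius of a single airway cross-section given as a closed polygon. It assumes the common definition of hydraulic radius as cross-sectional area divided by perimeter; the abstract does not state which definition was used, and the outline coordinates are hypothetical. It does not reproduce Mermelstein's technique or Blum's Transform.

```python
import math

def polygon_area(points):
    """Area of a simple closed polygon (shoelace formula)."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(points):
    """Sum of edge lengths, including the closing edge."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def hydraulic_radius(points):
    """Assumed definition: cross-sectional area divided by perimeter."""
    return polygon_area(points) / polygon_perimeter(points)

# Hypothetical airway outline: a 2 cm by 2 cm square cross-section.
outline = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
```

Repeating the calculation for each slice along the vocal tract would give an area function and a hydraulic radius function of the kind the abstract refers to.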

A multi-modal portal for automated customer information

Ross Tailby, Ben Milner, Richard Dean, Jim Gunn, Dan Smith
(University of East Anglia)

We describe the design and anticipated development of a customer information system operating over several modalities, implemented for a local calibration company, Antech Engineering Ltd. This work is supported by a Knowledge Transfer Partnership (KTP), provided by the Department of Trade and Industry in conjunction with the University of East Anglia. The poster introduces Antech's current Calibration Workflow Management System (CWMS), implemented on a previous KTP scheme, and highlights how our new information portal will be integrated into CWMS functionality, utilising data and providing information to customers that was traditionally only available through direct communication with Antech staff.

The poster highlights how customers will be able to specify their preferences for contact modalities and how, depending upon their choice, the content of replies will be altered, e.g. SMS replies will be significantly more concise than their email equivalents. The system will work by accepting enquiries through any of the offered modalities, classifying each message into one of a number of possible classes and then extracting the meaning with this prior knowledge. This meaning is then transferred into a form suitable for querying the CWMS database, with the results passed into XML output using XSLT. Response messages can then be automatically written and formatted using this data and the customer preferences, and transmitted via the chosen modality.
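
The classify-then-format flow described above can be sketched in miniature as follows. Everything here is a hypothetical illustration: the category names, keywords and reply templates are invented for the example, a simple keyword match stands in for whatever classifier the system uses, and the database query and XSLT stages are omitted.

```python
# All category names, keywords and templates below are hypothetical.
CATEGORY_KEYWORDS = {
    "job_status": {"status", "progress", "ready"},
    "due_date": {"due", "expire", "expires", "recalibration"},
}

def classify_enquiry(message):
    """Assign a message to the category with the most keyword matches."""
    words = set(message.lower().replace("?", " ").split())
    best_category, best_hits = "unknown", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category

def format_reply(category, data, modality):
    """Write a reply whose length depends on the customer's chosen modality."""
    if modality == "sms":
        # SMS replies are kept as concise as possible.
        return f"{data['item']}: {data['value']}"
    topic = category.replace("_", " ")
    return (
        f"Dear {data['customer']},\n\n"
        f"Regarding your {topic} enquiry about {data['item']}: {data['value']}.\n\n"
        "Regards,\nCustomer Services"
    )

# Hypothetical enquiry and query result.
category = classify_enquiry("Is my torque wrench ready?")
result = {"customer": "A. Customer", "item": "Torque wrench", "value": "ready for collection"}
sms_reply = format_reply(category, result, "sms")
email_reply = format_reply(category, result, "email")
```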

Automating this customer interaction process will radically improve the level of customer service offered by Antech, and lessen the workload of customer service agents, allowing them to concentrate on other matters. Customers will have their queries answered almost immediately, supporting their organisational activities and encouraging new business.

Can we modify existing automatic speech recognition technology to reliably and safely monitor respiration in patients sedated with propofol?

Tan L, Cox S, Bell GD and Mansfield M
(University of East Anglia)

It is recommended that a) patients receiving propofol for sedation "should receive care consistent with that required for deep sedation" and b) the use of capnography (as an early warning sign of drug-induced hypoventilation or apnoea) be considered (Anesthesiology 2002;96:1004-1017). The accurate continuous measurement of CO2 concentrations in the breath of sedated patients without an ET tube in situ can, however, be problematic, as indeed can monitoring transcutaneous CO2 tension.

The aim was to develop a computerised method of auscultation which would a) allow continuous real-time monitoring of respiratory rate and b) raise an alarm when hypoventilation occurred.

The signal from the patient's breath sounds was used to build Hidden Markov Models (HMMs) of the different phases of respiration. HMMs model a type of stochastic process, and have been highly successful in automatic speech recognition for modelling the acoustic patterns of speech, which vary in both time and frequency (Cox S 1990. In Speech and Language Processing, Chapman and Hall). The recorded breathing data was divided into a set for training the models and a set for testing.

Results and conclusions: Even using a crude throat microphone positioned over the trachea and a relatively small training set of data, the result achieved on the testing set was an accuracy of almost 80% in recognising the different phases of respiration. Preliminary results using a more sensitive microphone to pick up both breath and heart sounds have been even more encouraging. Our preliminary results suggest that 'computerised auscultation' may well provide a viable, non-invasive and inexpensive alternative to capnography in patients being sedated with propofol.
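
To illustrate how an HMM can label the phases of respiration, the following toy sketch runs Viterbi decoding over a sequence of quantised sound-energy symbols. The three phase names, the discrete observation symbols and all probabilities are assumptions made for the example; a system of the kind described would be trained on acoustic features from recorded breath sounds rather than on hand-set values.

```python
import math

# Hypothetical phases and hand-set probabilities, for illustration only.
STATES = ["inspiration", "expiration", "pause"]
START = {"inspiration": 0.5, "expiration": 0.3, "pause": 0.2}
TRANS = {
    "inspiration": {"inspiration": 0.70, "expiration": 0.25, "pause": 0.05},
    "expiration": {"inspiration": 0.05, "expiration": 0.70, "pause": 0.25},
    "pause": {"inspiration": 0.35, "expiration": 0.05, "pause": 0.60},
}
EMIT = {
    "inspiration": {"high": 0.75, "mid": 0.20, "low": 0.05},
    "expiration": {"high": 0.20, "mid": 0.70, "low": 0.10},
    "pause": {"high": 0.05, "mid": 0.15, "low": 0.80},
}

def viterbi(observations):
    """Return the most likely state sequence, working in the log domain."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this frame.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][obs]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

decoded = viterbi(["high", "high", "mid", "mid", "low", "low"])
```

Respiratory rate could then be estimated by counting transitions into one chosen phase per unit time, and an alarm raised when that rate falls below a threshold; both steps are left out of the sketch.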

Features and Classifiers for Automatic Music Genre Classification

Kris West and Stephen Cox
(University of East Anglia)

Present theories of music perception are far from complete, because it is nearly impossible to examine the processes of music perception introspectively. Duplicating and perhaps thereby understanding perceptual tasks performed on music by humans has been a topic of research interest in recent years, as computational approaches to other perceptual tasks, such as speech recognition, have been at least partially successful. This work is an investigation into a number of factors that affect the automatic classification of audio signals into musical genre classes. It seeks to further our understanding of music perception and identify those areas in music perception research that are the least developed.

The first factor investigated in this work is the suitability of two measures of spectral shape for the parameterization of audio signals, prior to classification. The first procedure, which is familiar to speech researchers, is the calculation of measures of spectral shape called MFCCs, whilst the second is the calculation of the Spectral Contrast feature, which is both a measure of spectral shape and the ratio of harmonic (pitched) to non-harmonic (noisy) components in the spectrum. The success of both the Karhunen-Loeve transform (principal component analysis) and the Cosine transform, which both attempt to reduce the covariance between dimensions of the data, is examined, and the equivalence of the Karhunen-Loeve transform to the Cosine transform in the calculation of MFCC-like coefficients is demonstrated. Short-time temporal modeling of the calculated features is introduced and its effect on genre classification accuracy is explored.
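
The Cosine transform step in the calculation of MFCC-like coefficients can be sketched as below. This is a minimal illustration that assumes the filterbank energies have already been computed and applies an unnormalised DCT-II to their logarithms; the function names and the choice of scaling are assumptions, and the filterbank analysis itself is not shown.

```python
import math

def dct_ii(values, num_coeffs=None):
    """Unnormalised DCT-II: c_k = sum_i x_i * cos(pi * k * (i + 0.5) / N)."""
    n = len(values)
    if num_coeffs is None:
        num_coeffs = n
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]

def cepstral_coefficients(filterbank_energies, num_coeffs):
    """Take logs of the (positive) filterbank energies, then apply the DCT."""
    log_energies = [math.log(e) for e in filterbank_energies]
    return dct_ii(log_energies, num_coeffs)

# A flat spectrum has no spectral shape, so every coefficient after the
# zeroth is (numerically) zero.
flat = cepstral_coefficients([math.e] * 8, 4)
```

The Karhunen-Loeve transform differs in that its basis vectors are estimated from the covariance of the data rather than being fixed cosines.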

The second factor examined is the effect of different classifying schemes and topologies on classification accuracy. The use of single Gaussians and Gaussian mixture models for multi-class classification problems is evaluated, and a new classification scheme, based on the unsupervised construction of a binary-tree structured classifier with different projections of the sample data at each node, is proposed and evaluated. It is also demonstrated how the classifier structure can be examined to explore the importance of each calculated feature to the final classification.
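
The simplest of the schemes mentioned above, a single Gaussian per class, can be sketched as follows. The sketch assumes diagonal covariances and equal class priors, and the genre labels and two-dimensional feature vectors are hypothetical toy data; it does not implement Gaussian mixtures or the proposed tree-structured classifier.

```python
import math

def fit_diagonal_gaussian(vectors, var_floor=1e-6):
    """Estimate per-dimension mean and variance (with a small floor)."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    variances = [
        max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dims)
    ]
    return means, variances

def log_likelihood(vector, model):
    """Log density of a vector under a diagonal-covariance Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(vector, means, variances)
    )

def classify(vector, models):
    """Pick the class whose Gaussian gives the highest log likelihood."""
    return max(models, key=lambda label: log_likelihood(vector, models[label]))

# Hypothetical two-dimensional features for two genres.
training = {
    "classical": [[0.10, 1.0], [0.20, 1.2], [0.15, 0.9]],
    "rock": [[0.80, 3.0], [0.90, 3.2], [0.85, 2.9]],
}
models = {label: fit_diagonal_gaussian(vectors) for label, vectors in training.items()}
```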

Speech and Crosstalk Detection in Multi-Channel Audio

Stuart N Wrigley, Guy J Brown, Vincent Wan, and Steve Renals
(University of Sheffield)

The analysis of scenarios in which a number of microphones record the activity of speakers, such as in a round-table meeting, presents a number of computational challenges. For example, if each participant wears a microphone, it can receive speech from both the microphone's wearer (local speech) and from other participants (crosstalk). The recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe two experiments related to the automatic classification of audio into these four classes. The first experiment attempted to optimise a set of acoustic features for use with a Gaussian mixture model (GMM) classifier. A large set of potential acoustic features were considered, some of which have been employed in previous studies. The best-performing features were found to be kurtosis, 'fundamentalness' and cross-correlation metrics. The second experiment used these features to train an ergodic hidden Markov model (eHMM) classifier. Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation.
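
Of the features named above, kurtosis is the most straightforward to illustrate. The sketch below computes the kurtosis of one frame of samples as the fourth central moment divided by the squared variance; the frame contents are toy values, and whether the authors used this exact normalisation is an assumption.

```python
def kurtosis(samples):
    """Fourth central moment divided by the squared second central moment."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for a constant frame")
    return m4 / (m2 * m2)

# Toy frames: one evenly spread, one dominated by a single large sample.
flat_frame = [-1.0, 1.0, -1.0, 1.0]
peaky_frame = [0.0] * 7 + [10.0]
```

A per-channel value of this kind would be one element of the feature vector passed to the GMM or eHMM classifier.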

Formant-tracking LP models for speech recognition and enhancement in car/train noise

Qin Yan, Esfandiar Zavarehei, Saeed Vaseghi
(Brunel University)

This paper investigates the modelling and estimation of speech parameters at formants. Representation of speech with formant parameters is a form of non-uniform sampling of the speech spectrum at frequencies with higher than average SNRs. The application is noisy speech recognition and enhancement in the presence of car and train noise. Time-varying formant-tracking linear prediction (LP) models are used for speech and noise. The speech parameter tracks include formant tracks, the tracks of bandwidths (or radii of poles at formants), the tracks of magnitude spectrum at formants and SNR tracks at formants.

Formant tracks are modelled from the poles of LP models of successive frames of speech. The variation of the formants across the space of different examples of a phoneme is modelled with a two-dimensional hidden Markov model. The temporal trajectory of a formant is modelled and smoothed by a Kalman filter incorporating a state equation and an observation equation. The state equation is based on a low-order LP model of the trajectory of each formant parameter. The observation equation can include an estimate of the background noise or processing noise.
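
Two of the steps described above can be sketched in simplified form. The first function uses the standard relation between an LP pole and a resonance: the pole angle gives the formant frequency and the pole radius gives the bandwidth. The second is a scalar Kalman filter with a first-order state equation. The random-walk coefficient and the two noise variances are assumptions chosen for illustration, the poles are taken as given rather than solved from LP coefficients, and the two-dimensional HMM is not covered.

```python
import cmath
import math

def pole_to_formant(pole, sample_rate):
    """Convert a complex LP pole to (frequency in Hz, bandwidth in Hz)."""
    frequency = cmath.phase(pole) * sample_rate / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate / math.pi
    return frequency, bandwidth

def kalman_filter_track(observations, a=1.0, process_var=100.0, obs_var=2500.0):
    """Scalar Kalman filter for one formant track.

    State equation:       x[t] = a * x[t-1] + w,  var(w) = process_var
    Observation equation: y[t] = x[t] + v,        var(v) = obs_var
    a = 1.0 (a random walk) and both variances are assumed values.
    """
    estimate = observations[0]
    error_var = obs_var
    track = [estimate]
    for y in observations[1:]:
        # Predict the next state from the state equation.
        predicted = a * estimate
        predicted_var = a * a * error_var + process_var
        # Correct the prediction using the new observation.
        gain = predicted_var / (predicted_var + obs_var)
        estimate = predicted + gain * (y - predicted)
        error_var = (1.0 - gain) * predicted_var
        track.append(estimate)
    return track

# A pole at radius 0.95 and an angle corresponding to 500 Hz at 8 kHz sampling.
example_pole = cmath.rect(0.95, 2.0 * math.pi * 500.0 / 8000.0)
frequency, bandwidth = pole_to_formant(example_pole, 8000.0)

# A raw track with one outlying frame, and its filtered version.
filtered = kalman_filter_track([500.0, 500.0, 900.0, 500.0])
```

The filter pulls the outlying third frame back towards the preceding estimates; how strongly depends on the ratio of the two variances.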

The first set of experimental results presented shows the influence of car and train noise on the distribution and the estimates of the formant trajectories. Due to the shapes of the spectra of speech and noise, the first formant is most affected by noise and the last formant is least affected. The effects of including formant features in speech recognition at different SNRs are presented. It is shown that formant features provide better performance than MFCC features at low SNRs. Finally, for robust estimation of noisy speech, a method based on a combination of LP-spectral subtraction and a Kalman filter is presented. Average formant tracking errors at different SNRs are computed and show that after noise reduction the formant tracking errors are reduced by 50%. The de-noised formant-tracking LP models are used for recognition and enhancement of noisy speech. Audio demonstrations will include comparisons of noise-contaminated speech before and after de-noising.

Discriminative Cluster Adaptive Training

Kai Yu and Mark Gales
(University of Cambridge)

Recently, cluster adaptive training (CAT) and eigenvoices systems have attracted much interest in model-based adaptive training research. In these approaches a multi-cluster HMM, or eigenvoices system, is constructed as the canonical model and a set of interpolation weights is used as the transformation to represent non-speech variabilities. Maximum likelihood (ML) based theory for CAT is well established. Since discriminative training is always employed in state-of-the-art speech recognition systems to obtain the best performance, we are more interested in discriminative cluster adaptive training, especially discriminative training of the multi-cluster HMM. This paper investigates in detail discriminative training of the multi-cluster HMM based on the minimum phone error (MPE) criterion. By using the concept of a weak-sense auxiliary function and redefining appropriate smoothing and prior functions, MPE training for the multi-cluster HMM is obtained. This can also be viewed as discriminatively training an eigenvoices system. MPE training of the interpolation weights in CAT is also investigated, which leads to a complete theory of discriminative CAT. A more complex discriminative adaptive training technique, which uses a combination of CAT and constrained MLLR, referred to as structured transforms (ST), to represent complex non-speech variabilities, is also discussed. In practice, a simplified version of discriminative adaptive training is employed, in which the transformations are estimated using the ML criterion and then fixed, and only the model parameters are updated in the later discriminative training stage. Experiments were performed on a conversational telephone speech task and showed better performance than standard discriminative training schemes.
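
The role of the interpolation weights can be shown with a small sketch: the adapted mean of a Gaussian component is a weighted sum of that component's cluster means. The numbers are toy values and the function name is hypothetical; estimating the weights and cluster means, whether by ML or MPE, is not shown.

```python
def adapted_mean(cluster_means, weights):
    """Interpolate cluster means: mu = sum_c weights[c] * cluster_means[c]."""
    if len(cluster_means) != len(weights):
        raise ValueError("one weight is needed per cluster")
    dims = len(cluster_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, cluster_means))
        for d in range(dims)
    ]

# Two clusters of two-dimensional means and one speaker's weights (toy values).
cluster_means = [[1.0, 2.0], [3.0, 4.0]]
speaker_weights = [0.25, 0.75]
speaker_mean = adapted_mean(cluster_means, speaker_weights)
```

Because one weight vector is shared across many Gaussian components, only a handful of parameters need to be estimated for each speaker or acoustic condition.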

© 2005 Mark Huckvale University College London February 2005