Overview

Speech research is currently focused on:

  • Speaker Recognition - for forensic & security applications.
  • Speech Recognition - for transcription, command & control.
  • Speech Enhancement - for forensic and communication applications.
  • Speech Coding - for storage & communication applications (e.g. wireless radio and Internet telephony).
  • Speech Synthesis - for voice response systems and text-to-speech conversion.
  • Audio Compression & Coding - for Digital Audio Broadcasting, the Internet and High Definition Television.
  • Multi-microphone speech technology
  • Automatic language identification


Organisational unit

Lead unit: Information Security Institute
 

Details

Speaker recognition (verification) by voice

The aim is to verify a person's identity from their voice. Applications include banking, voice locks for security purposes and voice signatures. Several techniques for modelling speakers, based on Gaussian Mixture Models, Neural Networks and Hidden Markov Models, are being investigated. Work is also under way on improving verification performance under adverse conditions. This research is focussed on the mismatch between training and testing conditions caused by noise and telephone channel distortions.
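As an illustration of the Gaussian Mixture Model approach mentioned above, the following minimal sketch scores an identity claim by comparing the likelihood of the test frames under a claimed-speaker model and under a general background model. The background model, the zero threshold and all numeric values are assumptions made for this example; they are not taken from the research described here, and a real system would train both models on speech features.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of frames under a GMM.

    gmm is a list of (weight, mean, var) tuples whose weights sum to 1.
    """
    total = 0.0
    for x in frames:
        # log-sum-exp over the mixture components for numerical stability
        terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)

def verify(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Accept the claim if the log-likelihood ratio exceeds the threshold."""
    score = (gmm_log_likelihood(frames, speaker_gmm)
             - gmm_log_likelihood(frames, background_gmm))
    return score, score > threshold

# Toy 2-D models with hypothetical values (not trained on real speech).
speaker_gmm = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [2.0, 0.0], [0.5, 0.5])]
background_gmm = [(1.0, [0.0, 0.0], [4.0, 4.0])]

test_frames = [[1.1, 0.9], [1.9, 0.2], [1.2, 1.1]]
score, accepted = verify(test_frames, speaker_gmm, background_gmm)
print(f"log-likelihood ratio = {score:.2f}, accepted = {accepted}")
```

Raising the threshold trades false acceptances for false rejections, and mismatched noise or channel conditions shift the scores, which is one way to see why the training-testing mismatch matters.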

Trainable speech synthesis with trended Hidden Markov Models

A trainable speech synthesis system that uses trended Hidden Markov Models to represent phonetic speech units has been implemented. Its performance has been compared with synthesis using the traditional stationary framework, and it yielded a significant improvement in informal modified rhyme tests. Some examples of speech synthesis are provided below:

Speech Synthesis sound files

Speaker recognition (identification)

The aim is to develop speaker identification techniques with large speaker-discriminating capability and to analyse the confidence level of these schemes. The main applications are speaker indexing of conversations, visual data retrieval using speech, and suspect identification in forensic situations.

 

Near-field adaptive beamforming

The following sound files are taken from real recordings made with an 11-microphone array. The desired speaker was located 70 cm directly in front of the centre microphone. A localised noise source was placed at 56 degrees and 2.7 m from the centre microphone.
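To show why the near-field case differs from the usual plane-wave assumption, the sketch below computes the steering delays that would time-align a talker at 70 cm across an 11-microphone array. The linear layout, the 5 cm spacing and the speed of sound are assumptions made for this example, since the array geometry is not given here. The sketch covers only the fixed delay-and-sum steering step, not the adaptive processing used for the recordings.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def near_field_delays(mic_positions, source_position, c=SPEED_OF_SOUND):
    """Return per-microphone delays (seconds) that time-align a near-field source.

    Distances are exact (spherical wavefront) rather than the far-field
    plane-wave approximation. Delays are relative to the farthest microphone,
    so every value is non-negative.
    """
    distances = [math.dist(m, source_position) for m in mic_positions]
    farthest = max(distances)
    return [(farthest - d) / c for d in distances]

# Assumed geometry: 11 microphones on a line along x with a hypothetical
# 5 cm spacing, centre microphone at the origin, talker 0.70 m straight ahead.
SPACING = 0.05
mics = [((i - 5) * SPACING, 0.0) for i in range(11)]
talker = (0.0, 0.70)

delays = near_field_delays(mics, talker)
for i, d in enumerate(delays):
    print(f"mic {i:2d}: delay {d * 1e6:6.1f} microseconds")
```

With this geometry the centre microphone needs the largest delay because it is closest to the talker. Delaying each channel by these amounts and summing reinforces the talker's signal, while sound from other positions adds less coherently.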

Near-field adaptive beamforming 

Speech recognition in adverse environments

Research is under way on robust, small-vocabulary speech recognition techniques that work in adverse noise environments. The main application is voice control of machine operation. Current directions include pre-processing, noise modelling, multi-microphone signal acquisition and multi-modal recognition that fuses lip information.

Speech enhancement (single microphone)

The major goals are the removal of noise, reverberation and co-talker interference. The main applications are enhancement of forensic recordings, hearing aid design, HF communication, and front-end enhancement for speech recognition, speaker recognition and speech coding. Several techniques, including those based on vocal tract models and auditory models, are being investigated. Techniques for measuring the resulting speech quality and intelligibility are also being investigated.

HF radio automatic frequency shift correction

Two samples are available; the first is of a child, the second of an adult male.

Speech enhancement sound files

Speech enhancement (multi-microphone)

The aim is to enhance speech using multiple-microphone reception. Applications include hands-free telephony, such as mobile phones in automobiles, and forensic work. Several techniques, including beamforming and simulation of the auditory system, are being investigated.

Very low bit-rate speech coding

The aim is to reduce the bit rate of the speech signal, where the bit rate can vary with the type of speech activity. Applications include transmission of speech over HF channels and secure speech transmission. Techniques for measuring the quality and intelligibility of coded speech are also being investigated.

Joint coding of speech and audio

The aim of the project is to develop algorithms for scalable coding of audio and speech signals which work over a range of sampling frequencies and bandwidths.

Speech coding using temporal decomposition

These examples demonstrate a method for encoding the spectral characteristics of speech at rates below 180 b/s using hierarchical temporal decomposition (HTD). A set of log-area-ratio (LAR) parameters, extracted from a given block of speech, is approximated through Gaussian interpolation between the most-steady frames detected by the HTD. This yields a smaller set of parameters, which is encoded using vector quantization. We have shown that the new coder at 180 b/s obtains the same spectral distortion as a TD-based coder with scalar quantization at 600 b/s.
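The idea of rebuilding a parameter track from a few steady frames can be sketched as follows. This is only an illustration: the normalised Gaussian weighting, the `width` parameter and the example values are assumptions, and the interpolation functions actually used by the HTD coder are not specified here. The sketch handles a single parameter; each LAR parameter would be treated in the same way.

```python
import math

def gaussian_interpolate(event_frames, event_values, num_frames, width):
    """Rebuild a parameter track from values kept at a few steady frames.

    Each output frame is a normalised, Gaussian-weighted average of the
    event values, so frames near an event stay close to that event's value.
    """
    track = []
    for t in range(num_frames):
        weights = [math.exp(-0.5 * ((t - e) / width) ** 2) for e in event_frames]
        norm = sum(weights)
        track.append(sum(w * v for w, v in zip(weights, event_values)) / norm)
    return track

# Hypothetical block of 20 frames with three steady frames detected.
event_frames = [2, 10, 17]
event_values = [0.8, -0.3, 0.5]   # illustrative values for one LAR parameter
track = gaussian_interpolate(event_frames, event_values, num_frames=20, width=2.0)

for t, value in enumerate(track):
    print(f"frame {t:2d}: {value:+.3f}")
```

In this toy block only three values and their positions would need to be quantised instead of twenty frame values, which is the source of the saving in bit rate.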

Speech coding using temporal decomposition sound files

Speech coding using phonetic vocoding

This example demonstrates a method of speech coding known as phonetic vocoding. The technique compresses speech by using an HMM-based speech recogniser at the coder to quantise the speech into phonetic units; the decoder then uses the HMM statistics to reconstruct the spectral envelope. Forty-nine phonemes are modelled by left-right, single-mixture Hidden Markov Models. The recognition/encoding stage uses Adaptive Mel-cepstral Coefficients (AMC) plus energy and their delta terms, and the synthesis/decoding stage uses models trained on Line Spectral Frequencies (LSF).

The phoneme index, state durations and speaker adaptation information are transmitted along with prosody information. The pitch contour is coded using Piecewise Linear Approximation (PLA). The example may be compared with the original speech and also with speech produced using the same speech synthesiser without any quantisation of the spectral envelope or pitch parameters.
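A piecewise linear approximation of a pitch contour can be sketched as below. The recursive splitting rule, the error tolerance and the example contour are assumptions made for illustration; the fitting procedure used in the coder is not described here, and unvoiced frames are ignored. The contour must contain at least two frames.

```python
def piecewise_linear(contour, max_error):
    """Choose breakpoints so linear interpolation stays within max_error (Hz).

    contour is a list of pitch values, one per frame. Returns the sorted frame
    indices of the breakpoints, always including the first and last frame.
    """
    def split(lo, hi):
        worst, worst_err = None, max_error
        for t in range(lo + 1, hi):
            # value predicted by the straight line from frame lo to frame hi
            predicted = contour[lo] + (contour[hi] - contour[lo]) * (t - lo) / (hi - lo)
            err = abs(contour[t] - predicted)
            if err > worst_err:
                worst, worst_err = t, err
        if worst is None:
            return [lo, hi]
        # split at the worst-fitting frame and approximate each half
        return split(lo, worst)[:-1] + split(worst, hi)

    return split(0, len(contour) - 1)

def reconstruct(breakpoints, values):
    """Linearly interpolate between (frame, value) breakpoints."""
    out = [values[0]]
    points = list(zip(breakpoints, values))
    for (a, va), (b, vb) in zip(points, points[1:]):
        out.extend(va + (vb - va) * (t - a) / (b - a) for t in range(a + 1, b + 1))
    return out

# Hypothetical contour in Hz: a linear rise followed by a linear fall.
contour = [100, 110, 120, 130, 125, 120, 115, 110]
breakpoints = piecewise_linear(contour, max_error=1.0)
decoded = reconstruct(breakpoints, [contour[i] for i in breakpoints])
print(breakpoints)
print(decoded)
```

Only the breakpoint positions and their pitch values would need to be transmitted; the decoder rebuilds the frames in between by linear interpolation.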

Speech coding using phonetic vocoding sound files

Projects