Contents © 2020 Golan Levin and Collaborators

Golan Levin and Collaborators

Mouther

1995 | Golan Levin with Malcolm Slaney and Tom Ngo

Mouther is an experimental prototype I helped develop at Interval Research, in which an animated cartoon face is lip-synched in real time to the user's speech. I developed the prototype concept and artwork using Tom Ngo's in-house "Embedded Constraint Graphics" (ECG) animation engine, while Malcolm Slaney made Mouther possible by connecting these graphics to speech-parsing technologies.

Mouther works by driving an embedded-constraint graphic with the output of a phoneme recognizer. The recognizer uses mel-frequency cepstral coefficients (MFCCs) as features and selects the maximum-likelihood phoneme under Gaussian mixture models (GMMs), one per phoneme. Depending on the amount and diversity of the training data, the GMMs for each phoneme could be made speaker-dependent or speaker-independent. To reduce the system's sensitivity to microphone and room acoustics, the MFCCs were filtered with RASTA (a widely accepted method for reducing the dependence of acoustic features on channel characteristics) prior to classification.
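As a rough illustration of the maximum-likelihood selection described above, here is a minimal sketch in plain Python. It is not Mouther's actual code: the function names, the diagonal-covariance assumption, and the toy two-dimensional model parameters are all hypothetical, and the feature vectors are assumed to have been extracted (and RASTA-filtered) upstream.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    # Log-sum-exp keeps the sum numerically stable for very small densities.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_frame(x, phoneme_gmms):
    """Return the phoneme label whose GMM gives x the highest likelihood."""
    return max(
        phoneme_gmms,
        key=lambda label: gmm_log_likelihood(x, phoneme_gmms[label]),
    )

# Toy 2-D models with made-up parameters (hypothetical; real MFCC vectors
# have more dimensions and the models would be fit to training speech).
TOY_GMMS = {
    "aa": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [1.0, 1.0], [1.0, 1.0])],
    "m": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}

print(classify_frame([0.2, 0.1], TOY_GMMS))  # prints "aa"
print(classify_frame([4.8, 5.3], TOY_GMMS))  # prints "m"
```

Each phoneme gets its own mixture, and a feature vector is labeled with whichever mixture explains it best. Whether the models are speaker-dependent or speaker-independent would be decided by the training data, not by this selection step.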

The result is a talking cartoon face whose animation is driven by the output of the classifier, and in which different mouth positions are displayed for the different phonemes that are spoken. While further work would be needed to increase the reliability of the classifier and the realism of the transitions between visemes (the mouth shapes that correspond to phonemes), the current result is amusing and may already be responsive enough for the quality level needed in children's computer games.
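The step from classifier output to mouth position can be sketched as a lookup table followed by some smoothing. The sketch below is an assumption-laden illustration, not how Mouther was built: the phoneme labels, viseme names, and the `VisemeSmoother` class are hypothetical, and the majority vote over recent frames is simply one plausible way to damp flicker from an unreliable classifier.

```python
from collections import Counter, deque

# Hypothetical phoneme-to-viseme table; several phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "aa": "open",
    "iy": "wide",
    "uw": "rounded",
    "m": "closed",
    "b": "closed",
    "p": "closed",
    "f": "lip_to_teeth",
    "v": "lip_to_teeth",
    "sil": "rest",
}

class VisemeSmoother:
    """Pick the most frequent viseme among the last `window` classified frames."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)

    def update(self, phoneme):
        # Unknown phoneme labels fall back to the resting mouth shape.
        viseme = PHONEME_TO_VISEME.get(phoneme, "rest")
        self.recent.append(viseme)
        return Counter(self.recent).most_common(1)[0][0]

smoother = VisemeSmoother(window=3)
frames = ["m", "m", "aa", "m", "aa", "aa"]
print([smoother.update(p) for p in frames])
# prints ['closed', 'closed', 'closed', 'closed', 'open', 'open']
```

A longer window gives steadier mouth shapes at the cost of added lag, which is the same tradeoff between reliability and responsiveness noted above.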