
SPEAKER IDENTIFICATION AND VERIFICATION
(from LIMSI 1995 Scientific Report, March 1995)
J.L. Gauvain, L.F. Lamel, B. Prouts
The experiments on the telephone corpus were carried out in collaboration with the Vecsys company in the context of a contract with France-Telecom.
Speaker verification has been the subject of active research for many years, and has many potential applications where protecting proprietary information is a concern. Our studies assess performance levels for both high quality speech and telephone speech and for two operational modes, namely text-dependent and text-independent speaker verification.
A statistical modeling approach is taken, where the talker is viewed
as a source of phones, modeled by a fully connected Markov chain[1].
The lexical and syntactic structures of the language are approximated
by local phonotactic constraints, and each phone is in turn modeled by
a 3-state left-to-right HMM. For text-independent identification,
this provides a better model of the talker than can be done with
simpler techniques such as long term spectra, VQ codebooks, or a
simple Gaussian mixture. When applied to speaker identification[2], a
set of phone models is trained for each speaker, and identification of
a speaker from the signal is performed by computing the
phone-based likelihood of the signal for each speaker; the speaker
identity corresponding to the model with the
highest likelihood is then hypothesized. This approach has been shown
to be successful not only for speaker identification but also for
gender and language identification[2]. When the same speaker model is
applied to speaker verification, the likelihood ratio
is compared to a speaker-independent threshold
in order to decide acceptance or rejection.
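
The two decision rules described above can be illustrated with a minimal sketch. It assumes the per-speaker log-likelihoods have already been computed by the phone-based models; the function names, the example scores, the threshold value and the use of a single reference score as the denominator of the ratio are all illustrative assumptions, not details taken from the report.

```python
def identify(log_likelihoods):
    """Identification: return the speaker whose model scores highest.

    log_likelihoods maps a (hypothetical) speaker label to the
    log-likelihood of the signal under that speaker's models.
    """
    return max(log_likelihoods, key=log_likelihoods.get)


def verify(log_lik_claimed, log_lik_reference, threshold):
    """Verification: accept when the log-likelihood ratio reaches the threshold.

    The ratio is taken in the log domain, so it becomes a difference.
    The reference score is an assumption of this sketch; the same
    threshold is applied whatever the claimed speaker.
    """
    return (log_lik_claimed - log_lik_reference) >= threshold


# Made-up scores for three hypothetical speakers.
scores = {"spk_a": -412.7, "spk_b": -398.2, "spk_c": -405.9}
best = identify(scores)
accepted = verify(scores["spk_b"], log_lik_reference=-401.0, threshold=1.5)
rejected = verify(scores["spk_c"], log_lik_reference=-401.0, threshold=1.5)
```

Working in the log domain avoids numerical underflow when likelihoods are accumulated over several seconds of speech.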
The Viterbi algorithm is used to compute the joint likelihood
of the incoming signal and the most likely state
sequence instead of the likelihood of the signal summed over all
state sequences. This implementation is thus
a modified phone recognizer where the output phone string is ignored
and only the acoustic likelihood is taken into account. Maximum a
posteriori (MAP) estimators are used to build speaker-specific models
from a set of speaker-independent models. The speaker-independent
seed models provide estimates of the parameters of the prior densities
and also serve as an initial estimate for the segmental MAP
algorithm[3], allowing a large number of parameters to be
estimated from a small amount of adaptation data.
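
Two steps of the paragraph above can be sketched in simplified form: Viterbi scoring, which keeps only the best state path, and MAP adaptation of a Gaussian mean from a speaker-independent prior. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the emission scores are taken as given, the best path may end in any state, and the MAP update covers only a single Gaussian mean with hard frame assignment and a hypothetical prior weight `tau`.

```python
import numpy as np


def viterbi_log_likelihood(log_emis, log_trans, log_init):
    """Joint log-likelihood of the frames and the single best state path.

    log_emis:  (T, S) log emission densities per frame and state
    log_trans: (S, S) log transition probabilities (row = from, column = to)
    log_init:  (S,)   log initial state probabilities
    """
    delta = log_init + log_emis[0]
    for t in range(1, log_emis.shape[0]):
        # Best predecessor for each destination state, then score frame t.
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emis[t]
    return float(np.max(delta))


def map_adapt_mean(prior_mean, frames, tau):
    """MAP estimate of a Gaussian mean given a prior mean of weight tau.

    With few frames the estimate stays close to the speaker-independent
    prior; with many frames it approaches the sample mean.
    """
    frames = np.atleast_2d(frames)
    n = frames.shape[0]
    return (tau * np.asarray(prior_mean) + frames.sum(axis=0)) / (tau + n)


# Toy 3-state left-to-right topology; forbidden transitions get log(0) = -inf.
with np.errstate(divide="ignore"):
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.6, 0.4],
                                 [0.0, 0.0, 1.0]]))
    log_init = np.log(np.array([1.0, 0.0, 0.0]))
log_emis = np.zeros((3, 3))  # flat emissions, so only transitions matter
score = viterbi_log_likelihood(log_emis, log_trans, log_init)

adapted = map_adapt_mean(np.array([0.0, 0.0]),
                         np.array([[2.0, 2.0], [4.0, 4.0]]),
                         tau=2.0)
```

The prior weight controls how quickly the adapted mean moves away from the seed model, which is what allows many parameters to be estimated from little adaptation data.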
Two corpora have been used for experiments: the BREF corpus which is used to calibrate the algorithm on high quality speech, but was not designed to perform speaker recognition experiments; and a telephone speech corpus which is presently being recorded over dialed-up telephone lines and has been especially designed to evaluate speaker recognition algorithms. For this second corpus each target speaker is recorded for multiple calls over a period of several months.
Speaker-specific phone models for each target speaker have been trained on about 75 sentences (coming from the same session for the BREF corpus, and from 2 recording sessions for the telephone speech corpus) for 50 speakers from BREF and 45 speakers from the telephone corpus. On the BREF corpus, the text-independent identification rate is 99.9% using 4s of speech per trial and a maximum of two trials per validation attempt. In verification mode, the a posteriori equal error rate (the operating point at which the false acceptance and false rejection rates are equal) is 0.2% in text-independent mode when two verification attempts are allowed.
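
As an illustration of how an a posteriori equal error rate can be obtained, the sketch below sweeps the decision threshold over the observed test scores and keeps the point where the two error rates are closest. The function name, the score lists and the threshold sweep are assumptions made for this example; the report does not describe how its EER values were computed.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) where false acceptance and false rejection
    rates are closest, with the threshold chosen on the test scores."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best = None
    for thr in np.sort(np.concatenate([target, impostor])):
        frr = np.mean(target < thr)      # true speakers wrongly rejected
        far = np.mean(impostor >= thr)   # impostors wrongly accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Made-up likelihood-ratio scores for four target and four impostor trials.
eer, thr = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
```

Because the threshold is selected after the scores are known, the resulting figure is an optimistic estimate compared with a threshold fixed in advance.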
The results of the verification experiments on the telephone corpus are shown in Figure 1. In text-dependent mode the equal error rate is 3.5% with 4s of speech per trial and a maximum of two trials per authentication attempt.


Figure 1. ROC (Receiver Operating Characteristics) curves for
different model types and operational modes for the telephone data:
(a) Baseline multi-Gaussian model using a single mixture of 32
Gaussians per speaker; (b) Phone-based approach using 35 phone models,
text independent verification mode; (c) Phone-based approach using 35
phone models, text dependent verification mode; (d) identical to (c)
with 2 trials (When 2 verification trials are authorized for target
speakers and impostors, the average number of attempts is 1.1.); (e)
identical to (d) with exactly 4s of speech. The dotted line shows the
points of equal error (false acceptance/false rejection).
Comparing the equal error rates (with only 1 trial per attempt and an
average of 4.1s of speech per trial), we can observe that the
phone-based approach in text-independent mode performs significantly
better than the Gaussian mixture model (7.3% vs. 9.0% EER), and that
knowing the text reduces the EER to 5.1%. Allowing 2 trials per
attempt reduces the EER to 4.4% and requiring a fixed minimum amount
of 4s of speech (as in the experiment on the BREF corpus) reduces the
error rate to 3.5%.