\documentclass[a4paper]{article}
% for extended summary
%\usepackage[summary]{nnsp2e}
% final paper
\usepackage{nnsp2e}
\usepackage{amssymb,epsfig}
\input{../../common/definitions}
%\input{../2003/akcommon/glodef}
\title{Kernel-Based Speaker Recognition with the ISIP Toolkit}
\author{Joseph Picone, Tales Imbiriba, Naveen Parihar, \\
Aldebaro Klautau and Sridhar Raghavan}
\begin{document}
\maketitle
\begin{abstract}
Notes:
Picone: We had planned by now to do some SVM/RVM speaker
recognition experiments. This would be an excellent topic for
collaboration. We could write about the toolkit and core
technology, and you could write about your experiments.
Aldebaro: the paper must be at most 10 pages long.
\end{abstract}
\section{Introduction}
Speaker recognition comprises two tasks: verification and
identification. The Gaussian mixture model (GMM), typically combined
with a universal background model (UBM), serves as the baseline,
following recent work on BaseIME (the corpus we have).
Identification can be seen as a series of verifications.
We establish a new baseline for BaseIME that is open-source, with
reproducible results.
\section{Classifiers in Speaker Verification}
The speaker recognition problem is closely related to conventional
supervised classification. Hence, we start by discussing a few
related definitions. In the classification scenario, one is given a
\emph{training set} $\{(\bx_1,y_1),\ldots,\allowbreak
(\bx_N,y_N)\}$ containing $N$ \emph{examples}, which are considered to
be independently and identically distributed (iid) samples from an
unknown but fixed distribution $P(\bx,y)$. Each example $(\bx,y)$
consists of a vector $\bx\in\calX$ of dimension $L$ (called
instance) and a label $y\in\set1Y$. A \emph{classifier} is a
mapping $\deck:\calX \to \set1Y$. Of special interest are binary
classifiers, for which $Y=2$, and for mathematical convenience,
sometimes the labels are $y \in \sets{-1,1}$ (e.g., SVM) or $y \in
\sets{0,1}$ (e.g., RVM).
Some classifiers are able to provide \emph{confidence-valued
scores} $f_i(\bx)$, for each class $i=1,\ldots,Y$. A special case
is when the scores correspond to a distribution over the labels,
i.e., $f_i(\bx) \ge 0$ and $\sum_{i=1}^Y f_i(\bx)=1$. Commonly,
these classifiers use the \emph{max-wins rule} $\deck(\bx) =
\arg\max_i f_i(\bx)$. When the classifier is binary, only a
single score $f(\bx) \in \Re$ is needed. For example, if $y \in
\sets{-1,1}$, the final decision $\deck(\bx) = \sign( f(\bx) )$
is the sign of the score.
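The two decision rules above can be sketched in a few lines of Python (an illustrative sketch of ours, not part of the toolkit; the function names are our own):

```python
import numpy as np

def max_wins(scores):
    """Max-wins rule: pick the class index with the highest score f_i(x)."""
    return int(np.argmax(scores))

def binary_decision(f_x):
    """Binary decision for labels in {-1, +1}: the sign of the score f(x)."""
    return 1 if f_x >= 0 else -1
```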
In speech verification systems, the input consists of a matrix
$\bX \in \Re^{T \times Q}$ that corresponds to a \emph{segment} of
speech. The number $T$ of rows is the number of \emph{frames}
(or blocks) of speech and $Q$ (columns) is the
number of parameters representing each frame. If $T$ is fixed
(e.g., corresponding to 5 milliseconds),
$\bX$ could be turned into a vector of dimension $TQ$,
and one would have a conventional classification problem. However,
in unrestricted-text speech verification, any comparison between
elements of two such vectors could fail if they represent different
sounds. Hence, verification systems often adopt alternative architectures,
which are similar, but do not fit exactly the stated definition
of a classifier. We organize the most popular architectures as follows.
\ignore{A crucial distinction
between these systems and conventional classifiers is that
%
If one restricts the training and test sets to be composed by
segments of a fixed duration $T$ (e.g., $T=10$ seconds), the
frames can be concatenated to create a vector of dimension $L = Q
\times T$, and the system can be trained using algorithms for
learning conventional classifiers. However, $L$ can be too large
in this case.}
\subsection{Architectures for speaker recognition based on classifiers}
\subsubsection{a) Generative} Classical architecture using the Fisher
kernel, as proposed in~\cite{Jaakkola98}
and applied to speech verification, e.g., by Nathan
Smith (http://mi.eng.cam.ac.uk/$\sim$nds1002/) and Vincent Wan
(http://www.dcs.shef.ac.uk/$\sim$vinny/svmsvm.html).
\subsubsection{b) Segmental}
An alternative to this approach is to use segmental
features, as done by ISIP for speech recognition. (Need to
elaborate a bit on that.)
\subsubsection{c) Accumulative}
Another alternative is to train a \emph{frame-based} system, where
the input to confidence-valued classifiers is a frame $\bx_t$ with
dimension $L=Q$, and the total score of class $i$ is $\frac 1 T
\sum_{t=1}^T f_i(\bx_t)$. This is the approach adopted in the
conventional GMM-based system. In this case, the number $N$ of
training examples can be too large.
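This accumulative scoring can be sketched as follows (a sketch of ours with hypothetical names; any per-frame, per-class scoring functions could be plugged in):

```python
def accumulative_scores(frames, score_fns):
    """Average per-frame score for each class: (1/T) * sum_t f_i(x_t).

    frames: sequence of T frame vectors x_t;
    score_fns: one confidence-valued scoring function per class."""
    T = len(frames)
    return [sum(f(x) for x in frames) / T for f in score_fns]
```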
\subsection{Gaussian Mixture Model}
The GMM is the most popular technique adopted in speaker
verification systems and corresponds to our baseline. Unlike
the other classifiers we discuss, which are discriminative, the GMM is a
generative classifier (see, e.g.,~\cite{Klautau03-icml}). The GMM is a
special case of a Bayes classifier that adopts a mixture of $G_y$
multivariate Gaussians as its likelihood model for modeling class
$y$, namely
\begin{equation}
\hat{P}(\bx|y) = \sum_{g=1}^{G_y} w_{yg} \calN(\bx | \bmu_{yg},
\bSigma_{yg}),
\end{equation}
where $\bSigma_{yg}$ is a diagonal covariance matrix (which can
be different for each Gaussian).
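For concreteness, the diagonal-covariance mixture likelihood can be sketched in numpy (our own naming; log-sum-exp is used for numerical stability, and this is an illustration rather than the toolkit's implementation):

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log P(x | y) under a diagonal-covariance GMM for class y.

    weights: shape (G,); means, variances: shape (G, L); x: shape (L,)."""
    x = np.asarray(x, dtype=float)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_comp = (np.log(weights) + log_norm
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    m = log_comp.max()                      # log-sum-exp trick
    return float(m + np.log(np.exp(log_comp - m).sum()))
```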
\subsection{Kernel Learning}
Very brief description of SVM and RVM. Review of works about SVM
applied to speaker verification.
Here we briefly describe the kernel classifiers used in this work.
We note that a Bayes classifier is called by some authors a
``kernel'' classifier (see, e.g., page 188 in~\cite{Hastie01}).
However, by kernel classifier we mean the ones obtained through
kernel learning, as defined, e.g., in~\cite{Scholkopf02}.
Two of the kernel classifiers (RVM and IVM) that we consider are
based on Bayesian learning, while the others (SVM and PSVM) are
non-probabilistic.
\subsection{Non-probabilistic kernel classifiers}
SVM and other kernel methods can be related to regularized
function estimation in a reproducing kernel Hilbert space
(RKHS)~\cite{Tikhonov77}. One wants to find the function $\deck$
that minimizes
\begin{equation}
\label{eq:risk-functional} \frac 1 N \sum_{n=1}^N
L(\deck(\bx_n),y_n) + \lambda ||\deck||_{\calH_\calK}^2,
\end{equation}
where $\calH_\calK$ is the RKHS generated by the kernel $\calK$,
$\deck=h+b$, $h \in \calH_\calK$, $b\in \cR$ and
$L(\deck(\bx_n),y_n)$ is a loss function.
Before trying to use a classifier with a non-linear kernel, it is
wise to first check if a linear kernel
$$
\calK(\bx,\bx')= \bx \cdot \bx'
$$
achieves the desired accuracy on the problem. A linear kernel
classifier can be converted to a perceptron, which avoids storing
the support vectors and saves computations during the test stage.
Among non-linear kernels, the Gaussian radial-basis function
kernel (when well-tuned) is competitive in terms of accuracy with
any other kernel (see, e.g.,~\cite{Rifkin02}). The Gaussian kernel
is given by
$$
\calK(\bx,\bx')= e^{- \gamma ||\bx-\bx'||^2}.
$$
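Both kernels are one-liners in code; a sketch (helper names are our own):

```python
import math

def linear_kernel(x, xp):
    """K(x, x') = x . x' (inner product of the two instances)."""
    return sum(a * b for a, b in zip(x, xp))

def gaussian_kernel(x, xp, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)
```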
The solution to the optimization problem described in
Equation~\ref{eq:risk-functional}, as given by the
\emph{representer} theorem~\cite{Kimeldorf71}, is
\begin{equation}
\label{eq:representer-solution} \deck(\bx) = \sum_{n=1}^N \omega_n
\calK(\bx,\bx_n) + b. \end{equation} This expression indicates
that SVM and related classifiers are
\emph{example-based}~\cite{Scholkopf02}: $\deck$ is given in terms
of the training examples $\bx_n$. In other words, assuming a
Gaussian kernel, the mean of a Gaussian is restricted to be a
training example $\bx_n$.
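Evaluating a classifier in this representer form is straightforward; a sketch (names are ours, and any kernel function can be passed in):

```python
def representer_decision(x, support_vectors, omegas, b, kernel):
    """f(x) = sum_n omega_n * K(x, x_n) + b over the support vectors."""
    return sum(w * kernel(x, xn)
               for w, xn in zip(omegas, support_vectors)) + b
```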
Some examples $\bx_n$ may not be used in the final solution (e.g.,
the learning procedure may have assigned $\omega_n=0$). We call
\emph{support vectors} the examples that are actually used in the
final solution, even for classifiers other than SVM (while some
authors prefer to call relevance vectors the support vectors of
RVM classifiers, and so on). For saving memory and computations in
the test stage, it is convenient to learn a sparse $\deck$, with
few support vectors. We can obtain sparse solutions only if the
loss function $L$ is zero over an interval (see, e.g., problem 4.6
in~\cite{Scholkopf02}). SVM achieves a sparse solution (to some
degree) by choosing the loss
$$L(\deck(\bx_n),y_n) = (1 - \deck(\bx_n)y_n)_+,$$
where $(z)_+=\max\{0,z\}$. The PSVM~\cite{Fung01} classifier is
obtained by choosing
$$
L(\deck(\bx_n),y_n) = (\deck(\bx_n)-y_n)^2
$$
and, consequently, ends up using the entire training set as support
vectors. Motivated by the discussion in~\cite{Rifkin02}, which
takes into account some previous work related to PSVM, hereafter
we refer to PSVM as \emph{regularized least-square classifier}
(RLSC).
Learning an RLSC classifier requires solving
$$
(\bK + \lambda \bI) {\bf \omega} = {\bf y},
$$
where $\bK$ is the kernel matrix, ${\bf y}=\{y_n\}$ is the vector
with labels and $\lambda$ plays a role similar to the $C$ constant
in SVM. Training the classifier requires ``only'' a matrix
inversion, but the matrix has size $N \times N$.
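The RLSC training step then reduces to a single linear solve; a numpy sketch (names are ours, and a practical implementation for large $N$ would need to exploit structure rather than form the full kernel matrix):

```python
import numpy as np

def train_rlsc(X, y, kernel, lam):
    """Solve (K + lam * I) omega = y for the RLSC weight vector omega."""
    N = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), np.asarray(y, dtype=float))
```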
Assuming the SVM implements the optimal discriminant function
between two classes that have a given Bayes error, the number of
examples that are misclassified will scale linearly with the size
of the training set, and so will the number of support vectors.
Therefore, techniques to obtain sparser solutions are a current
topic of research. Two of these techniques, namely
RVM~\cite{Tipping01} and IVM~\cite{Lawrence02}, are discussed in
the next subsection.
\subsection{Probabilistic kernel classifiers}
Both RVM and IVM are sparse Bayesian learning methods that can be
related to the regularized risk functional of
Eq.~(\ref{eq:risk-functional}) (see, e.g.,~\cite{Scholkopf02}).
They adopt a model for the posterior given by $g(y_n
\deck(\bx_n))$, where $\deck$ is given by
Eq.~(\ref{eq:representer-solution}) and $g$ is a sigmoid link
function that converts scores into probabilities. Because the link
function is used along the training stage, these methods
potentially provide probabilities that are more
``calibrated''~\cite{Zadrozny02} than the ones obtained, for
example, by fitting a sigmoid after training a SVM~\cite{Platt99}.
We note that having a basic understanding of logistic regression
(see, e.g.,~\cite{Hastie01}) helps the study of probabilistic
kernel methods such as IVM and RVM. For example, while linear
regression obtains the ``best fitting'' equation by using the
least-squares criterion, logistic regression uses the ``logit''
transformation to restrict the output of the classifier to the
range [0, 1], and then uses a maximum likelihood method to train
the classifier. This is the same approach used in some
probabilistic kernel methods. They also use a link function (e.g.,
the logit transformation for RVM and the probit transformation for
IVM) to convert scores (real numbers) into an indication of
probability, and a maximum likelihood method to train the
classifier.
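The two link functions mentioned above are standard; a sketch (function names are ours):

```python
import math

def logit_link(z):
    """Logistic sigmoid (the link used by RVM): maps a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_link(z):
    """Probit link (used by IVM): the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```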
The Bayesian framework described in~\cite{Tipping01} allows, e.g.,
for solving multiclass problems and estimating the variances of
each Gaussian. However, the computational cost is very high. Even
when solving a binary problem with the standard RVM, the method is
slow because the first iterations require inverting an $N \times
N$ matrix.
IVM is an alternative to RVM with a faster training procedure.
Instead of starting with all $N$ examples and then eliminating
some, as RVM does, IVM uses a heuristic to
select potential support vectors in a greedy way.
We note that many other probabilistic kernel methods have been
recently proposed. For example, in~\cite{Zhu01}, the import vector
machine was derived by choosing the following loss function for
the margin
$$
L(\deck(\bx_n),y_n) = \log(1 + e^{-\deck(\bx_n)y_n}).
$$ Only further research will indicate which probabilistic kernel
methods prevail.
For the moment, RVM and IVM seem to be a reasonable sample of
these methods.
\section{ISIP System} (if you want to collapse section I and II into
one longer introduction and split this section into two: a) ISIP
and b) speaker recognition using ISIP, it's ok with me)
\section{Experimental Results} GMM is the baseline; SVM using
conventional kernels (Gaussian, polynomial, linear); SVM using
Fisher kernels, based on generative (GMM) models.
\begin{figure}[htb]
\begin{center}
\includegraphics[width=4in]{ver-cel.eps}
\caption{Verification for cellular phone speech}
\label{fig:ver-cel}
\end{center}
\end{figure}
\begin{figure}[htb]
\begin{center}
\includegraphics[width=4in]{ide-cel.eps}
\caption{Identification for cellular phone speech}
\label{fig:ide-cel}
\end{center}
\end{figure}
\begin{figure}[htb]
\begin{center}
\includegraphics[width=4in]{ver-wir.eps}
\caption{Verification for wired phone speech}
\label{fig:ver-wir}
\end{center}
\end{figure}
\begin{figure}[htb]
\begin{center}
\includegraphics[width=4in]{ide-wir.eps}
\caption{Identification for wired phone speech}
\label{fig:ide-wir}
\end{center}
\end{figure}
\subsection{Improving the baseline for the BaseIME corpus}
Here we describe how we improved upon IME. (Check what they published in
the SBT paper, because we do not want to mention the results in the
classified thesis.)
\begin{table}
\begin{center}
\caption{Identification results (\%)}
\begin{tabular}{||c|c|c|c||}
\hline
& 20 s & 10 s & 5 s\\
\hline
wired phone & 98.73 & 98.74 & 98.55\\ \hline
cell phone & 93.67 & 92.13 & 89.71\\ \hline
\end{tabular}
\end{center}
\end{table}
\section{Conclusions}
% imports yourbibliography.bib file
\bibliography{../../common/biblio}
\bibliographystyle{nnsp}
\end{document}