In the previous article we considered the simplest variant of a speech recognition application. Now I want to suggest to plunge into this topic a little deeper and consider this question more closely.
So, let's say we split an audio signal into a set of frames of certain length and computed related MFCC-vectors. What should we do next?
It is logical to assume that every frame/MFCC-vector matches with some sound. Hence, all we need is a "Sound - MFCC-vector" codebook. Once we have it we'll be able to "read" our audio data and get the words!
However, the reality is not so simple. Such codebook allows us to transform the frames/MFCC-vectors into a sequence of sounds, not letters or words. Also we shouldn't forget the following:
First of all, different people pronounce the same sound in different ways (voice quality, etc). I.e. the "Sound - MFCC-vector" relationship must be implemented as "one to many". Hopefully, that isn't a big problem and it just increases complexity of our codebook structure for a little bit.
Secondly, a sound can lasts longer than one frame. E.g. "a" sound pronounced out of a word (for recording, without forced stretching) lasts for 300ms. For consonants, this value is in 50% lower, but this doesn't change the idea.
As a result, taking into account that the frame length is 50ms, the incoming set of sounds for "odin" word could look like "ooooddiiiinn".
Thirdly, some letters might be pronounced in different ways. E.g. the english and american variant of "can't". In our example the letter "o" sometimes can be pronounced as "a" (the old russian tradition). Given that fact, we must be ready for the "aaaaddiiiinn" sequence as an incoming data.
In other words, using a codebook (see "CODEBOOK" section) can't completely solve the problem of speech recognition. However, it allows to reduce it to another problem, which could be perfectly solved with such a great instrument as Hidden Markov Models are.
Hidden Markov Models
There are thousands of articles and gigabytes of videos (my favorite one) about HMM. That's why I'm not going to describe the idea in detail, but I'll say a few words about using HMM for solving the speech recognition problem. So, HMM is a set of:
- inner/hidden states (letters in the word);
- observed values (sounds that we get as an input data, MFCC-vectors matched by the Codebook);
- initial distribution vector (defines from which letter the word begins);
- transition matrix (defines order of letters in the word, how much the sound "stretches");
- emission matrix (defines the probability to hear Y sound for X letter);
As an example let's consider the word "odin" (russian "one"):
STATES 4 o d i n OBSERVATIONS 5 o a d i n INITIAL 1.0 0.0 0.0 0.0 TRANSITION 0.75 0.25 0.0 0.0 0.0 0.5 0.5 0.0 0.0 0.0 0.75 0.25 0.0 0.0 0.0 1.0 EMISSION 0.5 0.5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
The model above is quite simple:
- the initial distribution vector says us that word "odin" must be started from letter "o" (quite logical, eh?);
- the transition matrix insists that vowels "o" and "i" drag on much longer, than consonants "d" and "n" (i.e. the probability of "o" to "o" transition is 0.75, in the same time "o" to "d" transition has only 0.25 probability points;
- the emissions matrix shows us that the only variant reading is possible for "o" letter (it can be heard as "o" or "a");
Great! But what should we do with this model? Let's say we have the model above (M) and some sequence of observed sounds 0="aaaddiiinn". There are 3 standard HMM usage pattern which make it extremely useful. At the moment we need only 2 of them, so let's consider them below.
Suppose we have a set of prepared models (a set of known words). In this case the problem of O-sequence recognition narrows to searching of the most probable model for the O-sequence. In turn, the probability of an observation sequence O given a HMM model M - P(O|M) is the probability of a sequence of observations for each possible sequence of states of the model. Yes, it sounds a little bit complicated, but, in fact, it is a simple walk through the O-sequence values (mapped by the emission matrix) and the transition matrix. The idea of this is fully described in Forward Algorithm (specification / implementation).
Now imagine that we have a set of pronunciation samples of "odin" word generated by the same person. Using this information we can fit M model parameters to increase the P(O|M) probability for each sample. In other words - we can "teach" the model and improve recognition quality. This "learning" can be done by Baum-Welch algorithm (specification / implementation).