in machine learning ~ read.

Speech recognition for dummies

This article is devoted to understanding such interesting area of software development as Speech Recognition. Unfortunately, I'm not an expert in this stuff, so my clause will have a lot of inaccuracies, mistakes and disappointments. Nonetheless, the chief objective of this paper (as you can see from its topic) is not a professional analysis of the problem, but analysis of the basic concepts, issues and theirs solutions.


Let's start from the point that our speech is a sequence of sounds. A sound, in turn, is a superposition (overlay) of acoustic oscillations (waves) of different frequencies. Finally, a wave, as we all know from physics, has two major characteristics - amplitude and frequency.

Storing a sound signal on a digital device (i.e. your computer) requires splitting the signal into a set of intervals and calculating average amplitude value for each interval.

That is how mechanical vibrations can be transformed to a set of numbers, which can be processed on our phones/computers.

Hence, the speech recognition task is reduced to "matching" a set of numerical values and words from a dictionary (i.e. Russian dictionary :).

Let's see in more details on the way of implementation of such "matching".

Input data

Suppose we have a file/stream with an audio data. First of all we need to understand its internal structure and how it can be read. Let's consider the most simple case - a WAV file.

This format implies the presence of two blocks in the file. The first one is a header with the audio stream information: bitrate, frequency, amount of channels, length of the file, etc. The second one contains the "raw" data - the digital signal, the set of signal's amplitudes values.

The reading logic in this case is quite simple. Read the header, check some restrictions (e.g. no compression), store the data into a special array.



Theoretically, we already know all that we need to compare (element-wise) our sample (which we've read in the previous section) with another one, which text equivalent is known. In other words, we can try to recognize... But we shouldn't do this :)

Our approach must be stable (at least, a little bit) to any changes in the voice (of a man who uttering the word), volume and speed of pronunciation. Of course, we can't guarantee that with element-wise comparison of two audio signals.

That's why we choose the another way.


First of all let's split our data into a set of little time intervals – frames. Notice that frames shouldn't go one by one, instead they should “overlap” each other. Thus the end of the n-frame must intersect with the beginning of n+1-frame.

Frames are more appropriate units for analysis rather than values of singnal's amplitudes just because we are dealing with waves, and waves are supposed to be analyzed on intervals, not in particular points :) Meanwhile overlapping of frames allows us to smooth analysis results by using a frame as a some kind of “window”, which moves along the original signal (its amplitudes values).

Experiments show us that the optimal frame length should correspond to the interval of 10ms, "overlap" - to 50%. Given that the average word length is 500ms (at least in my experiments) - this length will give us approximately 500 / (10 * 0.5) = 100 frames per word.

Splitting words

The first problem, which must be solved in the process of speech recognition is splitting the speech into different words. Let's simplify our task and put that the speech contains some pauses (intervals of silence), that can be used as the words splitter.

In this case we need to find the threshold. All values above it will be words, below – silence. Here we can use several ways:

  • use a constant (the original signal must be always generated under the same circumstances, in the same way);
  • cluster signal's values into two sets: one with words values and another one with silence (works only if silence takes a big part in the original signal);
  • analyze the entropy;

As you can guess, we are going to discuss the last point :) Let's start from the entropy definition - “a measure of disorder” (c). In our case, the entropy shows how much our signal "varies" within a given frame.

In order to calculate the entropy of a given frame, perform the following steps:

  • assume that our signal is normalized and all its values ​​range in [-1;1];
  • build the histogram of the signal's values:
    • split the range into i=0..N intervals (N is a preselected value, recommended from 50 to 100);
    • compute P[i], amount of signal values, matched to the i-th interval;
    • normalize P[i] (divide on the frame's size);
  • calculate the entropy as below;

$$E = \sum_{i=0}^{N-1} P[i]*\log_2(P[i])$$


So, we got the value of the entropy. But this is just an another frame's characteristic. We still need to compare it with something if we want to separate the sound and the silence.
Fortunately, entropy (in contrast to Root Mean Square) is a relatively independent measure. This fact allowed us to put its threshold value as a constant (e.g. 0.1).

Nevertheless, the problems don't end here :( The entropy may sag in the middle of a word (in vowels) or may suddenly jump (in case of a little noise). In order to fix the first problem we need to use the “minimal distance between two words” and “glue” nearby frame sets, which were divided because of the sag. The second problem can be resolved by using “minimal word length” and removing all the candidates which are not as long as we need (and aren't used in the first point).

If the speech doesn't have any pauses between words (or the speaker speaks too fast), we can try to create/extract sets of various subsequences from the original frame set. Then all these subsequences can be used as a source for the recognition process. But that's absolutely another story :)


So, we have the set of frames, which matches with a particular word. We can take the path of least resistance and use average square (root mean square) of all the frame's elements as theirs numerical characteristic. However, such metric gives us a quite little information for further analysis.

That's why it's time to introduce Mel-Frequency Cepstral Coefficients. According to Wikipedia (which, as we know, never lies) MFCC are a kind of signal's energy representation. Pros of their usage are as follows:

  • using of the spectrum of the signal (i.e. expansion by the orthogonal basis of (co)sine functions), which takes into account the “wave” nature of the signal;
  • the spectrum is projected onto a special mel-scale, which allows us to extract the most valuable for human perception frequencies;
  • amount of calculated coefficients may be limited by any value (e.g. 12), that allows us to “squeeze” the frame and, as a consequence, amount of information to further processing;

It's time to consider the process of MFCC computing for a particular frame.

Let's think about our frame as a vector $x[k], 0 <= k < N$, where N is its size.

Fourier transformation

First of all let's calculate the spectrum of the signal by computing the discreet Fourier transformation (preferably its “fast” FFT implementation).

$$X[k] = \sum_{n=0}^{N-1} x[n] * e^{-2 * \pi * i * k * n/N}, 0\leq k< N$$

It's also recommended to apply so-called “window” Hamming function to “smooth” values on the frame's bounds.

$$H[k] = 0.54 - 0.46*\cos((2 * \pi * k)/(N-1))$$

As a result we will have the following vector:

$$X[k] = X[k] * H[k], 0\leq k< N$$

It is important to understand that after this conversion, we have X-axis for the frequency (hz) signal, and the Y-axis for the magnitude (as a way to escape from the complex values​​):

Mel-coefficients calculation

Let's start from the “mel” definition. According to Wikipedia (again), mel is a "psychophysical unit of sound pitch, based on the subjective perception of an average human. It primarily depends on the frequency of the sound (as well as on volume and tone). In other words, this value shows us how much the sound of a certain frequency is "valuable" for us.

Frequency can be converted into mel-values by the following formula (remind it as "formula-1"):

$$M = 1127 * \log(1 + F/700)$$

Backward transformation looks in the following way (remind it as "formula-2"):

$$F = 700*(e^{M/1127}-1)$$

Graph of mel/frequency scale:

But let's get back to our task. Assume that we have a frame which size is N = 256 elements. We know (from the audio format data) that the frequency of the frame is 16000hz. Let's put that the human speech belongs to the range [300; 8000]hz. Amount of mel-coefficients that we want to find is M = 10.

In order to expand the spectrum (which we've calculated above) by the mel-scale, we need to create the “comb” of mel-filters. In fact, each mel-filter is a triangular window function, which summarizes amount of energy at the certain frequency range (thus calculates the mel-coefficient). Hence, since we know the amount of mel-coefficients and the frequency range, we can construct the following set of filters:

Notice, the greater the index number of the mel-coefficient is, the wider the base of the filter is. This is because the process of splitting frequency bands into the filters intervals happens on the mel-scale (not on the frequency-scale).

But I digress. For our case the frequency range is [300, 8000]. According to the “formula-1” on the mel-scale this interval transforms into [401.25; 2834.99].

Next, we need 12 reference points to build 10 triangular filters:

m[i] = [401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 1949.99, 2171.24, 2392.49, 2613.74, 2834.99]

Notice, on the mel-scale the points are located uniformly. Let's transform the values back to frequency using the “formula-2”:

h[i] = [300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33, 3261.62, 4122.63, 5170.76, 6446.70, 8000]

As you can see, our values start to stretch, thus aligning the growth dynamics of the "significance" of the sound on the low and high frequencies.

Now let's impose the scale (which we got above) on the spectrum of our frame. As we remember, X-axis is for frequency. Length of the spectrum is 256 elements, which refers to 16000Hz diapason. Solving the simple proportion we obtain the following formula:

$$f(i) = floor( (frameSize+1) * h(i) / sampleRate)$$ in our case this is equivalent to

$$f(i) = 4, 8, 12, 17, 23, 31, 40, 52, 66, 82, 103, 128$$

That is all! Since we know the reference points on the X-axis of our spectrum we can easily construct the necessary filters by the following formula:

Applying the mel-filters

Applying the filter to the spectrum mean pairwise multiplying the filter values ​​with the spectrum values. The sum of the elements in the result vector is a mel-coefficient. Since we have M-filters, the number of mel-coefficients will be the same.

$$S[m] = \log( \sum_{k=0}^{N-1}|X[k]|^2 * H_m[k]), 0\leq m< M$$

However, we need to apply mel-filters not to the spectrum, but to its energy. Then we must logarithm the result. It's believed that the trick above decreases the sensitivity to noise ratios.

Cosine transformation

Discrete Cosine Transform (DCT) is used in order to get the "cepstral" coefficients. It allows to “squeeze” the results, increasing the importance of the first coefficients and reducing the importance of the last ones.

In current case we use DCT-II without additional multiplication on the scale factor $\sqrt[2]{2/N}$.

$$C[l] = \sum_{m=0}^{M-1} S[m] * \cos(\pi * l *(m + 1/2)/M)), 0\leq l< M$$

Now we have the set of M mfcc-coefficients for each frame. These coefficients can (and will) be used for the further analysis!

Examples of the methods above can be found here.

Recognition algorithm

Here, dear reader, you'll face the major disappointment. In the internets I've seen a lot of highly (and not very) intelligent debates about what is the best way for speech recognizing. Somebody stands up for Hidden Markov Models, someone - for neural networks... someone's thoughts can't be understood at all :)

In any case a lot of people choose HMM, and this is the way which I'm going to implement… in the future :)

At the moment I suggest to stop on much less effective, but much more simple way.

So, let's remember that our task is about recognition words from a dictionary. For simplicity's sake, we will recognize first ten numbers from Russian dictionary: “odin”, “dva”, “tri”, “chetyre”, “pyat”, “shest”, “sem”, “vosem”, “devyat”, “desyat” :)

Now, take your iPhone/Android and walk through your colleagues asking them to dictate these words for recording. Save the audio files on your computer and convert them into Wav format. Next, calculate mfcc-coefficients for each audio file (a set/vector of concatenated mfcc-coefficients of all frames from the file) and store matches between words and related sets of mfcc-coefficients into a local storage (file or database);

Let's call each [word; sets of mfcc-coefficients] match “Model”, and the process of creating these matches - “Machine Learning”! In fact, the simple adding of new samples in the local storage has a very tenuous connection with machine learning ... But it's a painfully trendy term :)

Now our task is reduced to the selection of the most "close" model for the set of mfcc-coefficients (the word which we want to recognize). At first glance, the problem can be solved quite simply:

  • calculate distances between each model and the word we want to recognize (distance between the model and the word is average euclid distance between word's mfcc-coefficients vector and model's mfcc-coefficients vectors);
  • choose as a correct the model with the shortest distance;

However, two different speakers can pronounce the same word with different speed. That means size of mfcc-vector can be different for two sample of the same word.

Fortunately, the problem of comparing sequences of different length has been solved in the form of Dynamic Time Warping algorithm. This dynamic programming algorithm is perfectly described on Wiki.

The only thing that we need to do with that algorithm is changing of the distance calculation logic. We must remember that the mfcc-vector of the model is a sequence of mfcc-subvectors (M-size) obtained from the frames. So, DTW algorithm must calculate distances between these subvectors, not their elements. Hence, elements of distance matrix are Euclid distances between frame based mfcc-subvectors.



I haven't been able to verify the approach on a big amount of “training” samples. As for test results based on 3 samples education... They are quite bad – only 65% correct recognitions.

However, my task was implementation of the simplest speech recognition application. So called “proof of concept” :)


The attentive reader has noticed that the article contains a lot of references on the GitHub project. It's worth to say that it's my first C++ project since the university times. Also it's my first attempt to calculate something more complicated than the arithmetic mean since the same times... In other words “it comes with absolutely no warranty” (с) :)

comments powered by Disqus