This article is a high-level introduction to the basic concepts and techniques of Speech Recognition. Note that I am not an expert in this area; the whole purpose of writing this post was to dive into the speech recognition domain from scratch. Also, this article was written in 2014, and a lot of the problems discussed here can nowadays be solved by modern ML frameworks like PyTorch.


Our speech is a sequence of sounds. A sound, in turn, is a superposition (overlay) of acoustic oscillations (waves) of different frequencies. Finally, a wave, as we all know from physics, has two major characteristics - amplitude and frequency.

In order to represent a sound signal on a digital device, we need to split it up into a set of intervals and calculate an average amplitude value for each interval:

In this way, mechanical vibrations are transformed into a set of numbers suitable for processing on modern computers.

It follows that the problem of speech recognition is reduced to “comparison” of a set of numerical values (digital signal) and words from a certain dictionary (Russian, for example).

Let's see how such comparison can be implemented.

Input data

Let's start with a simple WAV audio file. First of all, we need to understand its internal structure and how to parse it:

This format implies the presence of two blocks in the file. The first one is a header with the audio stream information: bitrate, frequency, number of channels, length of the file, etc. The second one contains "raw" data - a set of amplitude values.

The logic of reading the data, in this case, is quite simple: read the header, check some restrictions (lack of compression, for example), and save the data in a dedicated array (example).
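As a rough sketch (not the code behind the example link), the reading logic might look like this in Python; the function name is mine, and it assumes uncompressed 16-bit PCM, keeping only the first channel:

```python
import struct
import wave

def read_wav_samples(path):
    """Read a PCM WAV file and return amplitude values normalized to [-1, 1]."""
    with wave.open(path, "rb") as wav:
        if wav.getcomptype() != "NONE":          # the "lack of compression" check
            raise ValueError("compressed WAV files are not supported")
        n_channels = wav.getnchannels()
        sample_width = wav.getsampwidth()        # bytes per sample
        raw = wav.readframes(wav.getnframes())

    if sample_width != 2:                        # only 16-bit signed PCM handled here
        raise ValueError("only 16-bit PCM is handled in this sketch")
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)

    # keep the first channel only, normalize to [-1, 1]
    return [s / 32768.0 for s in samples[::n_channels]]
```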


Theoretically, we can now compare (element by element) the sample we have with some other, the text of which we already know. That is, try to “recognize” speech… But it is better not to do it :)

Our approach should be resistant (well, at least a little) to changes in the tone of the voice (the person pronouncing the word), volume and speed of pronunciation. Of course, this cannot be achieved by comparing two audio signals in a single step.

That's why we'll choose another way.


The first step is to break our data down into small time intervals called frames. Moreover, the frames should not strictly follow one another, but should "overlap". That is, the end of one frame must intersect with the beginning of the next.

Frames are a more appropriate unit of data analysis than individual signal values, since it is much more convenient to analyze waves over a certain interval than at specific points. Arranging the frames with an overlap lets us smooth out the results of the frame analysis, turning the sequence of frames into a "window" moving along the original function (the signal values).

It has been found experimentally that the optimal frame length corresponds to an interval of 10 ms, with a 50% overlap. Given that the average word length (at least in my experiments) is about 500 ms, this step gives us roughly 500 / (10 * 0.5) = 100 frames per word.
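The framing step can be sketched as follows; the function name and the default parameters are illustrative:

```python
def split_into_frames(samples, sample_rate, frame_ms=10, overlap=0.5):
    """Split the signal into overlapping frames: 10 ms long, 50% overlap."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(frame_len * (1 - overlap))            # distance between frame starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames
```

With a 16000 Hz signal, a 10 ms frame is 160 samples and the step is 80 samples, so a 500 ms word indeed yields about a hundred frames.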

Splitting words

The next step is to split up our speech into different words. For simplicity, let us assume that in our case speech contains some pauses (intervals of silence), which can be considered “separators” of words.

In this case, we need to find a threshold: values above it belong to a word, values below it to silence. There are several options:

  • set it as a constant (this works if the source signal is always produced under the same conditions, in the same way)
  • cluster the signal values, explicitly allocating the set of values corresponding to silence (this will only work if silence occupies a significant part of the original signal)
  • analyze the entropy

As you have already guessed, we are now going to talk about the last option :) Let's start with the fact that entropy is a "measure of disorder" or "a measure of the uncertainty of an experiment". In our case, entropy means how much our signal "fluctuates" within a given frame.

In order to determine the entropy of a particular frame, we perform the following steps:

  • assume that our signal is normalized and all its values fall within [-1; 1]
  • build the histogram of the signal's values:
  • split the range into i = 0..N intervals (N is a preselected value; 50 to 100 is recommended)
  • compute P[i], the number of signal values that fall into the i-th interval
  • normalize P[i] (divide it by the frame's size)
  • calculate the entropy as:

E = -sum over i of P[i] * log2(P[i])
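Put together, the entropy computation might look like this; N = 64 bins is an arbitrary choice within the recommended range, and the function name is mine:

```python
import math

def frame_entropy(frame, n_bins=64):
    """Shannon entropy of the value histogram of a normalized frame ([-1, 1])."""
    counts = [0] * n_bins
    for v in frame:
        # map [-1, 1] to a bin index, clamping the edge value v == 1.0
        i = min(int((v + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[i] += 1
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / len(frame)              # normalized histogram value P[i]
            entropy -= p * math.log2(p)
    return entropy
```

A constant signal lands in a single bin and yields zero entropy; a signal spread evenly over two bins yields exactly one bit.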


Now we have the entropy value. But this is just another characteristic of the frame, and in order to separate sound from silence, we still need to compare it with something. Some articles recommend taking the entropy threshold equal to the average of its maximum and minimum values (among all frames). However, in my case, this approach did not give good results.
Fortunately, entropy (unlike, say, the mean square of the values) is a relatively independent quantity, which allowed me to set its threshold as a constant (0.1).

However, the problems do not end there: entropy can sag in the middle of a word (on vowels), and it can suddenly jump up because of a small noise. To deal with the first problem, we introduce the concept of a "minimum distance between words" and "glue" adjacent sets of frames that were separated by such a dip. The second problem is solved by applying a "minimum word length" and cutting off all candidates that did not pass it (and were not used in the first step).
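A possible sketch of this splitting logic, with the gluing and minimum-length filtering described above (the function name, threshold, and frame counts are illustrative):

```python
def split_words(entropies, threshold=0.1, min_gap=10, min_len=20):
    """Turn per-frame entropies into (start, end) frame ranges of words.

    min_gap: silent gaps shorter than this (in frames) get glued over
    min_len: minimum word length in frames; shorter candidates are dropped
    """
    # 1. collect raw regions where entropy exceeds the threshold
    regions, start = [], None
    for i, e in enumerate(entropies):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(entropies)))

    # 2. glue regions separated by a gap shorter than min_gap (entropy dips on vowels)
    glued = []
    for r in regions:
        if glued and r[0] - glued[-1][1] < min_gap:
            glued[-1] = (glued[-1][0], r[1])
        else:
            glued.append(r)

    # 3. drop candidates shorter than the minimum word length (noise spikes)
    return [(s, e) for s, e in glued if e - s >= min_len]
```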

If the speech is not "articulate" to begin with, you can try to split the original set of frames into subsequences of a predetermined size and subject each of them to the recognition procedure.


At this point, we have a set of frames corresponding to a particular word. We could follow the path of least resistance and use the root mean square (RMS) of all of a frame's values as its numerical characteristic. However, such a metric carries very little information suitable for further analysis.

This is where the Mel-frequency cepstral coefficients come into play. According to Wikipedia (which, as we know, does not lie), MFCC is a kind of representation of the energy of the signal's spectrum. The advantages of using it are as follows:

  • The signal's spectrum is used (that is, the decomposition of the signal over a basis of orthogonal [co]sine functions), which allows us to take the wave "nature" of the signal into account in further analysis;
  • The spectrum is projected on a special mel-scale, allowing you to select the most significant frequencies for human perception;
  • The number of calculated coefficients can be limited to any value (for example, 12), which allows you to “compress” the frame and, as a result, the amount of information processed;

Let’s look at the process of calculating MFCC coefficients for some frame.

Let’s represent our frame as a vector of signal values:

x = { x[0], x[1], …, x[N-1] }

where N is the size of the frame.

Fourier transformation

First of all, we calculate the signal spectrum using a discrete Fourier transform (preferably its “fast” FFT implementation).

It is also recommended to apply a Hamming window function to the frame's values before the transform, in order to "smooth out" the values at the borders of the frame.

Thus, the result will be a vector of magnitudes of the following form:

X = { X[0], X[1], …, X[N-1] }

It is important to understand that after this transformation, the X-axis holds the frequency (Hz) of the signal, and the Y-axis the magnitude (as a way to get away from complex values):
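To illustrate this step, here is a naive sketch: a Hamming window followed by a direct O(N^2) DFT. A real implementation would use an FFT library instead; the function name is mine:

```python
import cmath
import math

def frame_spectrum(frame):
    """Magnitude spectrum of a frame: Hamming window, then a naive DFT."""
    n = len(frame)
    # Hamming window smooths out the frame borders before the transform
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s))    # magnitude, to get away from complex values
    return spectrum
```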

Mel-coefficients calculation

Let’s start with what a mel is. Again, according to Wikipedia, a mel is a “psychophysical unit of pitch” based on the subjective perception of average people. It depends primarily on the frequency of the sound (as well as on its volume and timbre). In other words, this value shows how “significant” a sound of a certain frequency is to us.

You can convert a frequency to mels using the following formula (remember it as “formula-1”):

M = 1125 * ln(1 + F / 700)

The inverse transformation looks like this (remember it as “formula-2”):

F = 700 * (e^(M / 1125) - 1)
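The two conversions can be sketched in code; the constants 1125 and 700 are the ones that reproduce the mel values used below (401.25 for 300 Hz, 2834.99 for 8000 Hz):

```python
import math

def hz_to_mel(f):
    """Formula-1: frequency in Hz to mels."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Formula-2: the inverse transformation, mels back to Hz."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```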

Mel / frequency dependency graph:

But back to our task. Let’s say we have a frame of 256 elements. We know (from the audio format data) that the sampling rate of the audio is 16000 Hz. Suppose that human speech lies in the range of [300; 8000] Hz. The number of required mel coefficients is M = 10 (a recommended value).

In order to decompose the spectrum obtained above onto the mel scale, we need to create a “comb” of filters. In fact, each mel filter is a triangular window function that sums the amount of energy within a certain frequency range, thus producing a mel coefficient. Knowing the number of mel coefficients and the analyzed frequency range, we can build a set of such filters:

Note that the larger the ordinal number of a mel coefficient, the wider the filter’s base. This is because the splitting of the frequency range of interest into the sub-ranges processed by the filters happens on the mel scale.

But we were distracted again. So, for our case, the frequency range we are interested in is [300, 8000]. According to formula-1, on the mel scale this range turns into [401.25; 2834.99].

Next, in order to build 10 triangular filters, we need 12 reference points:

m[i] = [401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 1949.99, 2171.24, 2392.49, 2613.74, 2834.99]

Note that the points are evenly spaced on the mel scale. Let’s translate them back to Hertz using formula-2:

h[i] = [300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33, 3261.62, 4122.63, 5170.76, 6446.70, 8000]

As you can see, the scale gradually stretches out, thus equalizing the dynamics of the growth of “significance” at low and high frequencies.

Now we need to superimpose the resulting scale onto the spectrum of our frame. As we remember, the X-axis holds the frequency. The length of the spectrum is 256 elements, which spans 16000 Hz. Solving a simple proportion gives the following formula:

f(i) = floor( (frameSize+1) * h(i) / sampleRate)

which in our case is equivalent to

f(i) = 4, 8, 12, 17, 23, 31, 40, 52, 66, 82, 103, 128
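The whole chain m[i] → h[i] → f(i) can be sketched as follows; the function name is mine, and note that, depending on rounding, a couple of the resulting indices may differ by one from the list above:

```python
import math

def mel_filter_points(f_low, f_high, n_filters, frame_size, sample_rate):
    """Compute the n_filters + 2 reference points f(i) on the spectrum's x-axis."""
    hz_to_mel = lambda f: 1125.0 * math.log(1.0 + f / 700.0)    # formula-1
    mel_to_hz = lambda m: 700.0 * (math.exp(m / 1125.0) - 1.0)  # formula-2

    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    n_points = n_filters + 2
    # m[i]: points evenly spaced on the mel scale
    mels = [m_low + i * (m_high - m_low) / (n_points - 1) for i in range(n_points)]
    # h[i]: the same points translated back to Hertz
    hertz = [mel_to_hz(m) for m in mels]
    # f(i) = floor((frameSize + 1) * h(i) / sampleRate)
    return [int((frame_size + 1) * h / sample_rate) for h in hertz]
```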

That’s it! Knowing the reference points on the X-axis of our spectrum, it is easy to construct the necessary filters using the following formula:

H[m, k] = 0                                for k < f[m-1]
H[m, k] = (k - f[m-1]) / (f[m] - f[m-1])   for f[m-1] <= k <= f[m]
H[m, k] = (f[m+1] - k) / (f[m+1] - f[m])   for f[m] <= k <= f[m+1]
H[m, k] = 0                                for k > f[m+1]

where m = 1..M is the filter number and k is the index on the spectrum’s X-axis.

Applying the mel-filters

Applying a filter consists of pairwise multiplication of its values with the values of the spectrum. The result of this operation is a mel coefficient. Since we have M filters, we get M coefficients.

However, we should apply the mel filters not to the raw values of the spectrum but to its energy (the squared magnitudes), and then take the logarithm of the results. It is believed that this reduces the sensitivity of the coefficients to noise.
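A sketch of building and applying the filters; the function names are mine, and the small constant added before the logarithm just guards against log(0):

```python
import math

def make_filterbank(points, spectrum_len):
    """Triangular mel filters built from the reference points f(i)."""
    banks = []
    for m in range(1, len(points) - 1):
        left, center, right = points[m - 1], points[m], points[m + 1]
        filt = [0.0] * spectrum_len
        for k in range(left, center):
            filt[k] = (k - left) / (center - left)      # rising edge
        for k in range(center, right + 1):
            filt[k] = (right - k) / (right - center)    # falling edge
        banks.append(filt)
    return banks

def apply_filterbank(spectrum, banks):
    """Log of the filtered energy: ln( sum_k |X[k]|^2 * H[m, k] ) per filter."""
    energies = [x * x for x in spectrum]                # energy, not raw magnitude
    return [math.log(sum(e * f for e, f in zip(energies, filt)) + 1e-12)
            for filt in banks]
```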

Cosine transformation

The discrete cosine transform (DCT) is used in order to obtain those “cepstral” coefficients. Its purpose is to “compress” the obtained results, increasing the significance of the first coefficients and decreasing that of the later ones.

In this case, DCT-II is used without the multiplication by the scale factor sqrt(2 / N).

Now, for each frame, we have a set of M MFCC coefficients that can be used for further analysis.
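The final cosine-transform step might look like this (the function name and the default of 12 coefficients are illustrative):

```python
import math

def dct2(log_energies, n_coeffs=12):
    """DCT-II over the log filterbank energies, without the sqrt(2/N) scale factor."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n)
                for m, e in enumerate(log_energies))
            for c in range(min(n_coeffs, n))]
```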

Examples of the methods above can be found here.

Recognition algorithm

Here, dear reader, awaits the main disappointment. On the Internet, one can see a lot of highly intelligent (and not so intelligent) debates about the best way to do the recognition. Someone advocates Hidden Markov Models, someone neural networks, and someone's thoughts are basically impossible to understand.

For now, I propose to settle on a much less effective, but far simpler, method.

And so, let’s remember that our task is to recognize a word from a certain dictionary. For simplicity, we will recognize the names of the first ten digits: “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine”, “ten”.

Now let’s take an iPhone/Android in our hands and go around our friends, asking them to dictate these words for recording. Next, we match (in a local database or a simple file) each word to the L sets of MFCC coefficients of the corresponding recordings.

We will call this correspondence a “Model”, and the process itself “Machine Learning”! In fact, simply adding new samples to a database has a very weak connection with machine learning… But it’s too fashionable a term :)

Now our task is to select the “closest” model for a given set of MFCC coefficients (the word being recognized). At first glance, the problem can be solved quite simply:
- for each model, find the average (Euclidean) distance between the recognized MFCC vector and the model’s vectors;
- choose as correct the model with the smallest average distance.

However, words are pronounced at different speeds, so their MFCC vectors have different lengths and cannot be compared element by element.

Fortunately, the problem of comparing sequences of different lengths has already been solved in the form of a Dynamic Time Warping algorithm. This dynamic programming algorithm is perfectly described on Wiki.

The only thing we need to change in that algorithm is the distance calculation logic. We must remember that a model’s MFCC vector is a sequence of MFCC subvectors (of size M) obtained from the frames. So, the DTW algorithm must calculate distances between these subvectors, not their individual elements. Hence, the elements of the distance matrix are Euclidean distances between frame-based MFCC subvectors (example).
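A minimal DTW sketch along these lines, where each sequence element is a frame’s MFCC subvector (the function names are mine):

```python
def dtw_distance(seq_a, seq_b):
    """DTW distance between two MFCC sequences (lists of M-sized subvectors)."""
    def dist(u, v):
        # Euclidean distance between two frame-level MFCC subvectors
        return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

Recognition then reduces to computing this distance against every model and picking the smallest one.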