I’m pretty sure all of us have encountered a speech recognition system at some point. Speech recognition is used in smartphones, automated customer service, and many other gadgets. It’s increasingly being integrated into devices to provide a better hands-free user experience. Apple came up with Siri for the iPhone, and most Android phones have speech recognition enabled in some form or other. But how does it actually work? Many people are annoyed by the quality of speech recognition systems, and I don’t blame them. Part of the frustration comes from not knowing how the machine actually makes sense of their words. I have worked on speech recognition in the past, so I wanted to take a stab at explaining what happens under the hood.
How do machines “hear”?
Speech Recognition refers to the process of translating spoken words into text. This is different from Voice Recognition (also called speaker recognition), where you identify who is speaking rather than what they are trying to say. As with any interactive technology, the speech recognition system needs human input. It uses a microphone to hear what the user is saying. When a user says something, the sound waves leave the mouth and hit the microphone. The microphone then converts the sound waves into an electrical signal. This electrical signal is continuous, and machines need digital signals to work with. So the signal is digitized using an analog-to-digital converter, which measures it at regular intervals and stores each measurement as a number. This data is now ready to be processed by the machine.
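To make the digitization step a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function name digitize, the 440 Hz test tone, the 16 kHz sample rate and the 16-bit depth are all made up for the example, and a real converter does this in hardware.

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bit_depth):
    """Sample a continuous signal at regular intervals and round each
    sample to a signed integer (hypothetical helper, for illustration)."""
    max_level = 2 ** (bit_depth - 1) - 1          # 32767 for 16-bit samples
    num_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(num_samples):
        t = n / sample_rate_hz                    # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip to [-1, 1]
        samples.append(round(amplitude * max_level))
    return samples

def tone(t):
    """Stand-in for the microphone's continuous signal: a 440 Hz sine wave."""
    return 0.5 * math.sin(2 * math.pi * 440.0 * t)

# Assumed settings: 10 ms of audio, 16 kHz sample rate, 16-bit samples.
pcm = digitize(tone, duration_s=0.01, sample_rate_hz=16000, bit_depth=16)
print(len(pcm))   # number of stored samples
print(pcm[:5])    # the first few integer samples
```

The list of integers in pcm is the kind of data the later stages work on.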
How do machines “understand” what they hear?
Now how do we make sense of this digitized signal? We transform it into the frequency domain to see what’s inside. Speech recognition systems use complex transformations into the frequency domain to extract the most useful details from the signal. What this basically means is that we look at the signal from a different perspective, one where it’s easier to understand. We use our vocal cords to produce sound, and we turn that sound into different speech sounds by changing the shape of the vocal tract (the throat, mouth, tongue and lips). So to understand these sounds, we try to model the shape of the vocal tract, which tells us about the exact nature of the source of these sounds.
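Here is a small sketch of what “looking at the signal in the frequency domain” means. It uses the plain discrete Fourier transform, which is a much simpler stand-in for the transformations a real recognizer uses, and the frame itself is synthetic: two sine waves at 400 Hz and 1200 Hz that I picked for the example. The name dft_magnitudes and all the numbers are assumptions.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive discrete Fourier transform: return the magnitude of each
    frequency bin from 0 Hz up to half the sample rate."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = 0j
        for i, x in enumerate(samples):
            total += x * cmath.exp(-2j * math.pi * k * i / n)
        magnitudes.append(abs(total))
    return magnitudes

# Assumed toy frame: 200 samples at 8 kHz containing two pure tones.
sample_rate_hz = 8000
frame_len = 200
frame = [
    math.sin(2 * math.pi * 400 * i / sample_rate_hz)
    + 0.5 * math.sin(2 * math.pi * 1200 * i / sample_rate_hz)
    for i in range(frame_len)
]

mags = dft_magnitudes(frame)
bin_width_hz = sample_rate_hz / frame_len   # each bin covers 40 Hz

# Pick the two strongest bins and convert their indices back to Hz.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs_hz = sorted(k * bin_width_hz for k in strongest)
print(peak_freqs_hz)  # [400.0, 1200.0]
```

In the time domain the frame is just a wiggly line, but in the frequency domain the two tones stand out immediately. The double loop is deliberately naive; practical systems use a fast Fourier transform instead.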
Okay, so the machine has extracted the necessary features from the sound. How does it “understand” the words? Many speech recognition systems are built on Hidden Markov Models (HMMs). These are statistical models that output a sequence of symbols. Explaining HMMs would take a full blog post, so I will keep my explanation short for now. Speech is temporal data, which means the data is spread over time, and HMMs are good at modeling temporal events.
Now why do we need a statistical model and not a deterministic one? The inherent problem with speech is that it has many variations. The same word can be said with a different pitch, intensity, accent, tempo and so on. There are no hard and fast rules here, and all of these variations are correct and perfectly natural in the real world. So a machine can only say that it heard a particular word with high confidence, not with absolute certainty. All the algorithms aim at increasing this confidence level so that even though the machine is not absolutely sure, it is pretty sure about the word that was uttered. We train the machine with a lot of words and their variations before we let it out. The robustness depends on how good the training is. Since speech has so many variations, we need a lot of training data to build a robust speech recognition system.
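To show how a statistical model ends up with a confidence rather than a certainty, here is a toy sketch. It scores a sequence of observed symbols against two tiny HMMs, one per word, using the standard forward algorithm, and then normalizes the scores. Everything about the models is made up for illustration: the words "yes" and "no", the symbols "a", "b" and "c", the two states per word, and every probability. A real system learns these numbers from training data, and the normalization below also assumes both words are equally likely beforehand.

```python
def forward_likelihood(observations, start_p, trans_p, emit_p):
    """Forward algorithm: probability that the HMM produces the
    given sequence of observed symbols."""
    states = list(start_p)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Both toy words share the same left-to-right structure (assumed).
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.0, "s2": 1.0}}

# Hypothetical emission probabilities; real ones come from training.
word_models = {
    "yes": {"s1": {"a": 0.7, "b": 0.2, "c": 0.1},
            "s2": {"a": 0.1, "b": 0.2, "c": 0.7}},
    "no":  {"s1": {"a": 0.1, "b": 0.8, "c": 0.1},
            "s2": {"a": 0.6, "b": 0.3, "c": 0.1}},
}

heard = ["a", "a", "c", "c"]   # symbols extracted from the audio (assumed)

likelihood = {
    word: forward_likelihood(heard, start_p, trans_p, emit_p)
    for word, emit_p in word_models.items()
}

# Turn likelihoods into confidences, assuming equal prior odds per word.
total = sum(likelihood.values())
confidence = {word: p / total for word, p in likelihood.items()}
best_word = max(confidence, key=confidence.get)
print(best_word, confidence)
```

The machine picks the word whose model explains the observations best. Its confidence can get very close to 1, but the other word always keeps a small share, which is exactly the "pretty sure but not absolutely sure" behavior described above.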
How do machines understand what exactly we “mean”?
We have uttered a bunch of words and the machine has recognized all of them. How does it understand the meaning of the sentence? There is a whole field of study dedicated to this, called Natural Language Processing (NLP). NLP aims at extracting meaningful information from natural language inputs and tries to produce meaningful outputs as well. It is an important field of artificial intelligence, and people are working hard to increase the accuracy of these systems. Search engines use it to understand what exactly the user is looking for. I will not discuss NLP techniques here. I just wanted to point out that there is another step in speech recognition systems after the machine recognizes the words. Once the machine understands the sentence, it takes the necessary actions and gives the required outputs.
Speech recognizers are pretty sophisticated systems. So the next time you curse your speech recognizer for not being very accurate, think about what’s happening inside!