Dr. Micha Breakstone, Co-founder @chorus.ai
Somewhere between 2009 and 2010 a new and exciting technology broke into the forefront of Artificial Intelligence research. Within a few months, a combination of advanced computing power and huge amounts of data set the stage for a new era in AI. Sophisticated algorithms, invented back in the 1950s and previously considered no more than an academic thought experiment, were transmuted into the cutting edge of the industry. These algorithms, Deep Neural Networks, broke boundaries, smashed records, and achieved breakthroughs in a field of Artificial Intelligence that had been all but dormant for decades.
One of the areas where these achievements were most prominent was Automatic Speech Recognition (ASR), i.e., the task of automatically transcribing voice recordings into written words. The accuracy of transcription engines surged from around 84% in 2012 to almost 90% in less than two years. To give a sense of what this meant for ASR researchers, imagine Usain Bolt stepping onto the track and crushing his 9.58-second world record, bringing it down to 6.1 seconds: a mere tenth of a second slower than a cheetah.
These achievements cannot be attributed to a single company or research group. In fact, for several years now a scientific “game of thrones” has been at play, the top contenders for the crown being Google, Microsoft, Baidu and IBM. It’s a feverish race indeed, and the record changes hands every few months, sometimes even weeks. But the race will not go on forever. As a matter of fact, according to Microsoft and IBM, we’re nearing the very last stretch.
Researchers estimate human (or manual) transcription accuracy is slightly less than 95%. Obtaining this accuracy may be the end of the race. Once automatic engines reach this level of accuracy, computers will be able to transcribe any conversation automatically, instantly, and (almost) perfectly. The importance of such an achievement cannot be overstated, both scientifically and financially speaking. Imagine the ramifications of having all spoken information in the world transcribed, and thereby rendered searchable, and amenable to automatic analysis.
In October 2016, Microsoft triumphantly announced that it had broken the human parity barrier, and ever since, Microsoft and IBM have been waging a fierce battle across media outlets, exchanging rhetorical blows every few weeks (see, e.g., this ZDNet review).
But as with any scientific topic that reaches the headlines, the situation is actually quite a bit more complex than the corporate world would have it.
To understand why the situation is more complex, we first need to understand how transcription quality is measured. As an example, imagine one of the speakers on a phone call says, “gotta run, finishing up my Learning Machines post” and the ASR engine transcribes “got a run, finishing up mile earning machines post”.
To get a perfect transcription we’d have to apply a set of editing operations on the transcribed utterance:
- Replace: “got” → “gotta”
- Delete: “a”
- Replace: “mile” → “my”
- Replace: “earning” → “learning”
That’s a total of 4 editing operations. Note that Word Error Rate (WER) is conventionally computed against the reference, i.e., the words actually spoken. Given the reference sentence comprises a total of 8 words, we’re talking about 4/8, or a 50% WER. Pretty high compared to human quality, which would max out at approximately a single editing operation for every 20 words (roughly 5% WER).
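To make the measurement concrete, here is a minimal WER calculator in Python, using the standard word-level edit distance. This is an illustrative sketch, not a production scorer; real evaluation tools also handle text normalization, alignment reports, and per-speaker scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: minimum number of substitutions, insertions, and
    deletions needed to turn the hypothesis into the reference, divided
    by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edits to match the first i reference words
    # against the first j hypothesis words (Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("gotta run finishing up my learning machines post",
          "got a run finishing up mile earning machines post"))
# 4 edits over 8 reference words → 0.5
```

Note that the denominator is the reference length (the 8 words actually spoken), which is the standard convention for WER.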
So have Microsoft and IBM indeed achieved the Holy Grail (approximately 5% WER)? The answer is somewhat involved, but in a nutshell: under lab conditions, yes; in practice, we’re still pretty far away.
An in-depth look into Microsoft’s and IBM’s research reveals that the published results were measured on 17-year-old conversations, conducted by native English speakers, under perfect acoustic conditions, and with well-known vocabularies. (Note that because the recordings are 17 years old, the engines and frameworks have had ample time to be inadvertently tailored towards optimizing results on these very conversations.)
Under non-sterile conditions (i.e., background noise, accented speakers, and an unrestricted vocabulary that includes people’s names, abbreviations, and slang), the results are substantially worse. Experimenting with the ASR engines made publicly available by some of these large companies reveals a very different picture, with typical error rates running between 20% and 25% WER. That is, one editing operation for every four or five words; still much better than the example detailed above (roughly twice as good, in fact).
These figures indicate that there is still considerable room for improvement, and with the rate of progress in recent years, it is reasonable to assume that we will see significant advancements in the near future.
But until this future unfolds, it turns out that there’s a pretty good possibility of significant improvement even under real, noisy conditions. Surprisingly, this option is open not only to large corporations but also to smaller teams, including start-ups. To understand this, we first need to understand how transcription engines are built.
The algorithms that decode audio recordings and transcribe them into text are trained on vast amounts of data, resulting in two separate models. The first model is called an Acoustic Model (AM) and it represents a mapping between audio waves and phonemes (i.e., atomic sound units of which words are comprised). The second model is called the Language Model (LM), and it represents the probabilities of different dependencies between words in the sentence. In addition, there is a Phonetic Dictionary (PD) that maps between the AM and the LM.
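To get a feel for what the language model contributes, here is a toy bigram LM sketch in Python. The tiny corpus and probabilities are made up purely for illustration; real LMs are trained on billions of words and their scores are combined with the acoustic model’s phoneme probabilities during decoding.

```python
from collections import Counter

# Toy corpus standing in for billions of words of real training text
# (contents are illustrative only).
corpus = ("finishing up my learning machines post "
          "reading my learning machines newsletter "
          "a mile down the road").split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def bigram_prob(w1: str, w2: str) -> float:
    """Maximum-likelihood estimate of P(w2 | w1) from the toy corpus."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# The LM prefers the plausible word sequence over the acoustically
# similar but implausible one:
print(bigram_prob("my", "learning"))   # 1.0 — always followed by "learning" here
print(bigram_prob("mile", "earning"))  # 0.0 — never seen together
```

This is why the LM can rescue “my learning machines” from the acoustically similar “mile earning machines”: both sound alike to the acoustic model, but only one is a probable word sequence.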
The acoustic model is learned by a deep neural network trained on thousands of hours of conversation, and the language model is trained on sentences that include hundreds of millions and sometimes billions of words. But here comes the twist. While large corporations have amounts of data that vastly outscale what is available to smaller organizations, start-ups may have a significant advantage: the quality of their data.
When the transcription task is focused on a specific challenge, such as transcribing political discussions, financial press conferences, or sales conversations (as we do at Chorus.ai), it turns out that there is a huge advantage to training data drawn from the relevant semantic space.
Thus, a company specializing in transcription for a specific vertical or domain can collect smaller amounts of data and still get better results than the generic engines of the corporate giants. To be sure, we’re still talking massive amounts, i.e., data derived from hundreds of thousands of hours of conversations, but at least it’s not hundreds of millions of hours.
Empirically, given high-quality, domain-specific data, it’s possible to achieve an improvement of at least 15% over generic engines (relatively speaking). This means that smaller players, too, have an important role to play in the race to human parity. That being said, even with domain-specific data, we’re still nowhere near human quality, so the finish line in the race for a perfect transcription engine is likely still quite a few years away.
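Note that the 15% figure is a relative improvement: it reduces the error rate multiplicatively, not by 15 absolute percentage points. A quick sketch, using the roughly 20–25% WER range cited above (the numbers are illustrative):

```python
# A 15% *relative* improvement scales the error rate down,
# rather than subtracting 15 absolute points from it.
generic_wer = 0.25      # generic engine on noisy, real-world audio
relative_gain = 0.15    # 15% relative improvement from domain-specific data
domain_wer = generic_wer * (1 - relative_gain)
print(f"{domain_wer:.4f}")  # 0.2125 — i.e., 21.25% WER, still well above ~5% human parity
```

So even after the domain-specific gain, the error rate remains several times the roughly 5% WER of human transcribers.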
Until we are able to build the perfect engine, a deep understanding of the semantic field in question will no doubt continue to constitute a significant advantage for any company engaged in analyzing and understanding spoken conversations in a given domain.
* The writer holds a PhD in formal Semantics, and is a co-founder of chorus.ai, a startup that specializes in real-time understanding and analysis of sales conversations, and which has trained multiple ASR engines for this purpose.
Automatic Speech Recognition: Artificial Intelligence, Big Data, and the race for Human Parity was originally published in Machine Learnings on Medium.