Analysis performed by Orgad Keller, Amit Ashkenazi, Raphael Cohen (PhD), and Micha Breakstone (PhD)
Chorus.ai is a platform that automatically records, transcribes, and summarizes sales and customer success conversations. Understanding who was on a call and who said what is an easy task for humans (most of the time!), but it's a much harder problem for algorithms than you might think. This post describes some of the challenges pertaining to data accuracy and the benefits of building a solution on our own tech stack rather than relying on a third-party solution.
Why would you want to know who said what during a meeting?
- Talk-time ratio across reps and prospects is a good indicator of conversation quality (see graph below)
- It’s helpful to listen to only what prospects said when reviewing a recording
- Isolating a specific speaker makes it easy to see who was engaged in the conversation (e.g., the CFO didn’t say much, but I want to hear what she said)
- Algorithms that identify important moments (like pain points) perform much better when they can be restricted to only what a prospect is saying
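To make the talk-time ratio mentioned above concrete, here is a minimal sketch in Python. It assumes the recording has already been split into speaker-labeled segments of the form `(speaker, start_seconds, end_seconds)`; the function names and the sample data are hypothetical and are not taken from the Chorus.ai product.

```python
from collections import defaultdict


def talk_time_by_speaker(segments):
    """Sum speaking time in seconds per speaker.

    `segments` is an iterable of (speaker, start, end) tuples, in seconds.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end > start:  # ignore empty or malformed segments
            totals[speaker] += end - start
    return dict(totals)


def talk_ratio(segments, reps):
    """Fraction of all talk time spoken by the speakers named in `reps`."""
    totals = talk_time_by_speaker(segments)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    rep_time = sum(t for name, t in totals.items() if name in reps)
    return rep_time / total


# Hypothetical diarized call: one rep and two people on the prospect side.
segments = [
    ("rep", 0.0, 30.0),
    ("prospect", 30.0, 50.0),
    ("rep", 50.0, 60.0),
    ("cfo", 60.0, 70.0),
]

print(talk_time_by_speaker(segments))
print(round(talk_ratio(segments, {"rep"}), 2))
```

The same per-speaker totals also support the other uses in the list: filtering the segments by speaker gives everything the CFO said, and restricting them to the prospect side gives the input for downstream analysis.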
Identifying multiple speakers on a single-channel audio recording, especially in conference calls with more than two speakers, is extremely difficult and far from a solved problem, even in academic settings. Usually referred to as "speaker separation" or "speaker diarization", it’s generally considered even more difficult than speech recognition.
To attack this problem, our research team developed a patent-pending framework that uses deep learning to automatically generate a “voice fingerprint” for each sales rep from a combination of vocal characteristics. During the sales call itself, we cluster the audio signals based on those characteristics, with each cluster representing a speaker. The stored voice fingerprints play a crucial role not only in associating each speaker with the right cluster, but also in the clustering process itself: the models we trained with the fingerprints allow us to learn and apply mathematical transformations to the audio that make the differences between speakers more pronounced. See the before and after graphs below.
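The step of associating a cluster with a known speaker can be illustrated with a small sketch. This is not the patent-pending Chorus.ai method: it assumes that each audio segment has already been turned into a numeric embedding vector and grouped into clusters, and it simply matches each cluster's centroid to the closest stored fingerprint by cosine similarity. The function names, the 0.8 threshold, and the three-dimensional vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def label_clusters(clusters, fingerprints, threshold=0.8):
    """Map each cluster id to the best-matching fingerprint name.

    Clusters whose centroid is not similar enough to any stored
    fingerprint are labeled "unknown" (for example, a prospect).
    """
    labels = {}
    for cluster_id, vectors in clusters.items():
        center = centroid(vectors)
        best_name, best_score = "unknown", threshold
        for name, fingerprint in fingerprints.items():
            score = cosine(center, fingerprint)
            if score >= best_score:
                best_name, best_score = name, score
        labels[cluster_id] = best_name
    return labels


# Hypothetical stored fingerprint for one rep, and two clusters from a call.
fingerprints = {"alice": [1.0, 0.0, 0.0]}
clusters = {
    0: [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    1: [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}

print(label_clusters(clusters, fingerprints))
```

A production system would also need to handle two clusters matching the same fingerprint; this sketch leaves that case unresolved for brevity.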
Using this approach, Chorus is able to better separate the voices of speakers, even in challenging circumstances, such as when two speakers have dialed in from the same conference room (e.g., your champion and the decision maker) or when an in-person meeting takes place in a noisy acoustic environment. We’ve filed a patent application for this pioneering approach, and as of now, no other provider on the market has developed a comparable solution.
Other factors also come into play when measuring talk time, such as determining when a conversation actually begins.
Let’s look at a couple of examples illustrating this using screenshots from Chorus.ai.
In the first example, both the rep and the prospect joined the call late. Chorus marks the pre-call segment in gray. This allows us to identify the true length of the meeting.
In the second example, the rep joined the call first. The prospect joined a few minutes later and asked to wait a minute until her peer joined. Understanding this is critical for transcription accuracy. Other speech recognition frameworks would attempt to transcribe the silence or noise and miscalculate talk time, providing inaccurate data.
As simple as this may seem for a human, it’s surprisingly complex for algorithms to do at scale. It is nonetheless critical for creating high-quality data that our customers and users can rely on.
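One simple way to approximate where a conversation actually begins is to look for the first point at which speech becomes sustained. The sketch below is a hypothetical heuristic, not the approach Chorus.ai uses: it assumes speech has already been detected as `(start, end)` segments in seconds, and it returns the first segment start after which speech fills at least half of the next 60 seconds. The window length and density threshold are illustrative values.

```python
def speech_in_window(segments, lo, hi):
    """Total seconds of speech that fall inside the window [lo, hi]."""
    return sum(
        max(0.0, min(end, hi) - max(start, lo)) for start, end in segments
    )


def estimate_meeting_start(segments, window=60.0, min_density=0.5):
    """Return the first segment start followed by sustained speech.

    "Sustained" means speech occupies at least `min_density` of the
    next `window` seconds. Returns None if no such point exists.
    """
    for start, _ in sorted(segments):
        spoken = speech_in_window(segments, start, start + window)
        if spoken >= min_density * window:
            return start
    return None


# Hypothetical call: a brief greeting, a short "let's wait for my peer",
# and then the real conversation from the 200-second mark.
segments = [
    (5.0, 8.0),
    (120.0, 124.0),
    (200.0, 230.0),
    (232.0, 260.0),
    (261.0, 300.0),
]

print(estimate_meeting_start(segments))
```

Everything before the returned timestamp can then be treated as the pre-call segment and excluded from talk-time and meeting-length calculations.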
At Chorus.ai, our goal is to automatically summarize and surface insights across all of your business conversations and get those insights to the people who need them. Our investment in R&D and in building a technology stack optimized for sales and customer success conversations allows us to capture 100% of meetings and provide the highest quality data possible - from transcript accuracy to engagement in the conversation and who is saying what. In future posts we’ll share more on how these insights impact sales performance and serve as indicators of how likely a deal is to close or renew.