We’ve recently made a series of updates to our Speaker Diarization service, which identifies who said what in a conversation, leading to improvements across a number of relevant benchmarks. In particular, our new Speaker Diarization model is up to 13% more accurate than its predecessor and is available in five additional languages.
Speaker Diarization improvements
Speaker diarization is the process of identifying “who said what” in a conversation:
Transcripts with diarization ascribe a speaker to each utterance
Speaker diarization increases the readability of transcripts and powers a wide range of downstream features in end-user applications, like automated video clipping, call coaching, and automated dubbing. As a result, improvements to speaker diarization have an outsized impact on end-user experiences for applications that process speech data. Here is an overview of the improvements to our Speaker Diarization model:
Diarization Accuracy
Our Speaker Diarization model demonstrates a 10.1% improvement in Diarization Error Rate (DER) and a 13.2% improvement in concatenated minimum-permutation Word Error Rate (cpWER), two widely adopted metrics that measure the accuracy of a diarization model.
DER measures the fraction of time in the audio file to which an incorrect speaker was ascribed, while cpWER measures the number of errors a speech recognition model makes, where words with incorrectly-ascribed speakers are considered to be incorrect. The Word Error Rate (WER), a classic Speech Recognition accuracy metric, thus serves as a lower bound for the cpWER, which takes into account both transcription and diarization accuracy. Both DER and cpWER ultimately measure errors, so a lower value is better (indicates greater accuracy).
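To make the DER definition concrete, here is a minimal, purely illustrative sketch in Python. It scores toy frame-level speaker labels (one label per fixed-length frame); it is not our evaluation code, and real DER tooling also accounts for missed speech, false alarms, and the optimal mapping between reference and predicted speaker labels.

# Purely illustrative sketch, not AssemblyAI's evaluation code. Toy frame-level
# speaker labels; real DER tooling also handles missed speech, false alarms,
# and the optimal permutation between reference and predicted labels.
reference  = ["A", "A", "A", "B", "B", "B", "B", "A"]   # true speaker per frame
hypothesis = ["A", "A", "B", "B", "B", "B", "A", "A"]   # predicted speaker per frame

# In this simplified setting, DER reduces to the fraction of frames
# attributed to the wrong speaker.
errors = sum(ref != hyp for ref, hyp in zip(reference, hypothesis))
der = errors / len(reference)
print(f"Toy DER: {der:.1%}")  # Toy DER: 25.0%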
Here we report both the DER and cpWER of AssemblyAI’s Speaker Diarization service and of a number of alternative providers. Note that Whisper metrics are not reported here, given that diarization is not a native capability of Whisper; however, Gladia is based on Whisper and can therefore provide a ballpark estimate for those interested.
DER and cpWER for several providers
Speaker Number Accuracy
Our Speaker Diarization model demonstrates an 85.4% reduction in speaker count errors. A speaker count error occurs when a diarization model does not correctly determine the number of unique speakers in an audio file. For example, if two people are having a conversation, a Speaker Diarization model that determines any number of speakers other than two has made a speaker count error.
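As a rough illustration (with made-up numbers, not our benchmark data), the speaker count error rate can be computed as the fraction of files whose predicted number of unique speakers differs from the true number:

# Toy illustration with made-up data -- not our benchmark results.
true_speaker_counts      = [2, 2, 3, 2, 4]   # ground-truth speakers per file
predicted_speaker_counts = [2, 3, 3, 2, 4]   # model-estimated speakers per file

errors = sum(t != p for t, p in zip(true_speaker_counts, predicted_speaker_counts))
error_rate = errors / len(true_speaker_counts)
print(f"Speaker count error rate: {error_rate:.1%}")  # Speaker count error rate: 20.0%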
Properly determining the number of speakers in a file is important not only because it affects diarization accuracy, but also because downstream features often rely on the number of speakers in a file – for example, call center software that expects two people on a call: the agent and the customer.
Below we report the percentage of speaker count errors our Speaker Diarization model makes, along with several other providers. That is, the figure depicts the percentage of audio files in which the model determined an incorrect number of speakers. AssemblyAI’s Speaker Diarization model achieves the lowest error rate, at just 2.9%.
Percentage of test files in which the determined number of speakers was incorrect
Increased Language Support
In addition to improvements to Speaker Diarization itself, we’ve increased language support. Speaker Diarization is now available in five additional languages:
Chinese
Hindi
Japanese
Korean
Vietnamese
We now support Speaker Diarization in 16 languages — almost all languages supported by our Best tier, which you can browse here.
Where do these improvements come from?
These improvements to Speaker Diarization stem from a series of upgrades rolled out recently as part of our continual iteration and shipping. Three recent improvements in particular power many of these diarization improvements:
Universal-1 – Our new Speech Recognition model, Universal-1, demonstrates significant improvements in transcription accuracy as well as in timestamp prediction, which is critical for aligning speaker labels with ASR outputs. Given that the transcript is a key input into the Speaker Diarization model, Universal-1’s improvements propagated to our Speaker Diarization service.
Improved embedding model – We’ve upgraded the speaker-embedding model within our Speaker Diarization model, allowing it to extract more distinctive acoustic features and better differentiate between speakers (a conceptual sketch follows below).
Sampling frequency – We’ve increased the input sampling frequency from 8 kHz to 16 kHz, giving the Speaker Diarization model higher-resolution input data and therefore more information with which to learn the differences between speakers’ voices.
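To give a sense of what a speaker-embedding model does, here is a conceptual sketch with made-up, low-dimensional vectors (real embeddings have hundreds of dimensions and come from a trained neural network, and this is not our production model). Each audio segment is mapped to a vector, and segments whose vectors are highly similar are attributed to the same speaker:

import numpy as np

# Conceptual sketch with made-up vectors -- not AssemblyAI's production model.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

segment_1 = np.array([0.9, 0.1, 0.3])   # hypothetical embedding of a segment spoken by speaker A
segment_2 = np.array([0.8, 0.2, 0.4])   # another segment spoken by speaker A
segment_3 = np.array([0.1, 0.9, 0.2])   # a segment spoken by speaker B

print(cosine_similarity(segment_1, segment_2))  # high similarity -> same speaker
print(cosine_similarity(segment_1, segment_3))  # low similarity  -> different speakers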
Try it yourself
You can test our new diarization model for free, with no code required, by using our Playground. Simply enable Speaker Labels from the list of capabilities and select an example file or upload your own:
Alternatively, you can get an AssemblyAI API key for free to use our API directly. Here we show how to transcribe a file with Speaker Diarization and print the results using AssemblyAI’s Python SDK:
import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"
audio_url = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"
config = aai.TranscriptionConfig(
  speaker_labels=True,
)
transcript = aai.Transcriber().transcribe(audio_url, config)
for utterance in transcript.utterances:
  print(f"Speaker {utterance.speaker}: {utterance.text}")
# Output:
# Speaker A: Smoke from hundreds of wildfires in Canada is …
# Speaker B: Well, there’s a couple of things. The season …
# Speaker A: So what is it in this haze that makes it harmful?
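Continuing from the snippet above, you can also derive the number of unique speakers the model identified from the same transcript object:

# Continuing from the example above: count the unique speakers identified.
speakers = {utterance.speaker for utterance in transcript.utterances}
print(f"Detected {len(speakers)} speakers: {sorted(speakers)}")
# e.g. Detected 2 speakers: ['A', 'B'] for the example file above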
Check out our Docs to learn more about using Speaker Diarization with our SDKs (Python, TypeScript, Go, Java, Ruby), or via HTTP requests through our API reference if we do not yet have an SDK for your language of choice.
Speaker Diarization use cases
Speaker Diarization is a powerful feature that serves a variety of use cases across industries. Here are a few that would not be possible without performant Speaker Diarization:
Transcript readability
The increase in remote work over the past several years means that more meetings are happening remotely and being recorded for those who were not in attendance. Add to this the increase in webinars and recorded live events, and more speech data than ever is being recorded.
Many users prefer to read meeting and event transcripts and summaries rather than watch recordings, so the readability of these transcripts becomes critical to easily digesting the contents of recorded events.
Search experience in-product
Many Conversation Intelligence products and platforms offer search features, allowing users to e.g. search for instances in which “Person A” said “X”. Diarization is a necessary requirement for these sorts of features, and accurate Diarization models ensure you’re surfacing complete and accurate insights to end users.
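As a minimal sketch of such a feature (reusing the transcript object from the Python SDK example above, with a hypothetical helper function), you can filter diarized utterances by speaker and phrase:

# Minimal sketch of a "who said what" search over diarized utterances.
def search_utterances(utterances, speaker, phrase):
    """Return the utterances in which `speaker` said `phrase` (case-insensitive)."""
    return [
        u for u in utterances
        if u.speaker == speaker and phrase.lower() in u.text.lower()
    ]

# Reuses the `transcript` object from the Python SDK example above
for u in search_utterances(transcript.utterances, speaker="A", phrase="wildfires"):
    print(f"Speaker {u.speaker}: {u.text}")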
Downstream analytics and LLMs
Many features are built on top of speech data and transcripts that allow information to be extracted from recorded speech in a meaningful way. Conversational intelligence features and Large Language Model (LLM) post-processing rely on knowing who said what to extract as much useful information as possible from this raw data. For example, customer service software can use speaker information to determine the ratio of time an agent speaks on a call, or to power coaching features that can help agents phrase questions in a more productive way.
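As an illustrative sketch of a talk-ratio calculation (assuming each utterance carries start and end timestamps in milliseconds alongside its speaker label and text), you could compute each speaker’s share of talk time like this:

from collections import defaultdict

# Illustrative sketch: share of talk time per speaker, assuming each utterance
# exposes `start`/`end` timestamps in milliseconds.
def talk_time_ratios(utterances):
    talk_time = defaultdict(float)
    for u in utterances:
        talk_time[u.speaker] += (u.end - u.start) / 1000.0  # seconds spoken
    total = sum(talk_time.values())
    return {speaker: seconds / total for speaker, seconds in talk_time.items()}

for speaker, ratio in talk_time_ratios(transcript.utterances).items():
    print(f"Speaker {speaker} spoke {ratio:.0%} of the time")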
Creator tool features
Transcription and Diarization lie at the foundation of a range of downstream AI-powered features. Their accuracy is therefore paramount in ensuring the utility, integrity, and accuracy of these downstream features, as reflected in the Machine Learning adage “garbage in, garbage out”.
Here are a few downstream AI-powered features in the area of video processing and content creation which rely on Speaker Diarization:
Automated dubbing: Automated dubbing allows creators to adapt their content to international audiences. For content with more than one speaker, diarization is needed to assign a different AI-translated voice to each speaker.
Auto Speaker Focus: Video content can be made more engaging with auto speaker focus, which keeps the camera on talking subjects during camera changes and automatically resizes videos to center active speakers. Performant speaker diarization is required to ensure the video is properly focused on the current speaker.
AI-recommended short clips from long-form content: Short-form video content is an essential part of content creation pipelines, and automatically creating short-form content from long-form videos or podcasts helps creators get the most mileage out of the content they create. Many creator tool companies automatically generate recommendations for short-form clips from long-form content, and these platforms require accurate Speaker Diarization to ensure that their recommendation algorithms have accurate and complete information on which to base their recommendations.