Awesome Speaker Diarization
Overview
This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.
To add items to this page, simply send a pull request. (contributing guide)
Publications
Special topics
Review & survey papers
- A Review of Speaker Diarization: Recent Advances with Deep Learning, 2021
- A review on speaker diarization systems and approaches, 2012
- Speaker diarization: A review of recent research, 2010
Large language model (LLM)
- DiarizationLM: Speaker Diarization Post-Processing with Large Language Models, 2024
- Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach, 2023
- Lexical speaker error correction: Leveraging language models for speaker diarization error correction, 2023
Supervised diarization
- DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, 2023
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization, 2023
- Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis, 2022
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings, 2021
- Supervised online diarization with sample mean loss for multi-domain data, 2019
- Discriminative Neural Clustering for Speaker Diarisation, 2019
- End-to-End Neural Speaker Diarization with Permutation-Free Objectives, 2019
- End-to-End Neural Speaker Diarization with Self-attention, 2019
- Fully Supervised Speaker Diarization, 2018
Joint diarization and ASR
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings, 2022
- Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection, 2021
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR, 2021
- Joint Speech Recognition and Speaker Diarization via Sequence Transduction, 2019
- Says who? Deep learning models for joint speech recognition, segmentation and diarization, 2018
Online speaker diarization
- Speaker Diarization as a Fully Online Bandit Learning Problem in MiniVox, 2021
- Online Speaker Diarization with Relation Network, 2020
- VoiceID on the Fly: A Speaker Recognition System that Learns from Scratch, 2020
Challenges
- M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge, 2022
- The Hitachi-JHU DIHARD III system: Competitive end-to-end neural diarization and x-vector clustering systems combined by DOVER-Lap, 2021
- Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge, 2018
- ODESSA at Albayzin Speaker Diarization Challenge 2018, 2018
- Joint Discriminative Embedding Learning, Speech Activity and Overlap Detection for the DIHARD Challenge, 2018
Audio-Visual Speaker Diarization
- AVA-AVD: Audio-Visual Speaker Diarization in the Wild, 2022
- DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization, 2022
- End-to-End Audio-Visual Neural Speaker Diarization, 2022
- MSDWild: Multi-modal Speaker Diarization Dataset in the Wild, 2022
Other
2021
- Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation
- End-to-end speaker segmentation for overlap-aware resegmentation
- DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding
- DOVER-Lap: A method for combining overlap-aware diarization outputs
- Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks
- AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario
2020
- An End-to-End Speaker Diarization Service for improving Multimedia Content Access
- Spot the conversation: speaker diarisation in the wild
- Speaker Diarization with Region Proposal Network
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
2019
- Overlap-aware diarization: resegmentation using neural end-to-end overlapped speech detection
- Speaker diarization using latent space clustering in generative adversarial network
- A study of semi-supervised speaker diarization system using gan mixture model
- Learning deep representations by multilayer bootstrap networks for speaker diarization
- Enhancements for Audio-only Diarization Systems
- LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
- Meeting Transcription Using Virtual Microphone Arrays
- Speaker diarisation using 2D self-attentive combination of embeddings
- Speaker Diarization with Lexical Information
2018
- Neural speech turn segmentation and affinity propagation for speaker diarization
- Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
- Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks
2017
- Speaker Diarization with LSTM
- Speaker diarization using deep neural network embeddings
- Speaker diarization using convolutional neural network for statistics accumulation refinement
- pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
- Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks
- Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
2016
2015
2014
- A study of the cosine distance-based mean shift for telephone speech diarization
- Speaker diarization with PLDA i-vector scoring and unsupervised calibration
- Artificial neural network features for speaker diarization
2013
2011
- PLDA-based Clustering for Speaker Diarization of Broadcast Streams
- Speaker diarization of meetings based on speaker role n-gram models
2009
2008
2006
- An overview of automatic speaker diarization systems
- A spectral clustering approach to speaker diarization
Software
Framework
Link | Language | Description |
---|---|---|
FunASR | Python & PyTorch | FunASR is an open-source speech toolkit based on PyTorch, which aims at bridging the gap between academic research and industrial applications. |
MiniVox | MATLAB | MiniVox is an open-source evaluation system for the online speaker diarization task. |
SpeechBrain | Python & PyTorch | SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. |
SIDEKIT for diarization (s4d) | Python | An open source package extension of SIDEKIT for Speaker diarization. |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013). |
kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
kaldi-speaker-diarization | Bash | Icelandic speaker diarization scripts using kaldi. |
Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is its set of tools for speaker diarization. |
pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
VBx | Python | Variational Bayes HMM over x-vectors diarization. x-vector extractor recipe |
RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system that lets the user send or record audio of a conversation and receive timestamps of who spoke when. |
StreamingSpeakerDiarization | Python | Streaming speaker diarization, extends pyannote.audio to online processing |
simple_diarizer | Python | Simplified diarization pipeline using some pretrained models. Made to be as simple as possible to go from an input audio file to diarized segments. |
Picovoice Falcon | C & Python | A lightweight, accurate, and fast speaker diarization engine written in C and available in Python, running on CPU with minimal overhead. |
DiaPer | Python | PyTorch implementation of DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, including models pre-trained on free and public data. |
sherpa-onnx | C++ & C & C# & Dart & Go & Java & JavaScript & Kotlin & Pascal & Python & Rust & Swift | Supports speaker diarization, speech recognition, and text-to-speech on various platforms with various language bindings. |
Evaluation
Link | Language | Description |
---|---|---|
pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
DiarizationLM | Python | Implements Word Error Rate (WER), Word Diarization Error Rate (WDER), and concatenated minimum-permutation Word Error Rate (cpWER). |
NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
dscore | Python & Perl | Diarization scoring tools. |
Sequence Match Accuracy | Python | Match the accuracy of two sequences with Hungarian algorithm. |
spyder | Python & C++ | Simple Python package for fast DER computation. |
CDER | Python | Conversational DER from The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines |
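
Several of the tools above score a hypothesis against a reference after finding the best mapping between hypothesis and reference speaker labels. A minimal sketch of that idea for frame-level label sequences follows. It is not taken from any of the listed packages: the function name `sequence_match_accuracy` is hypothetical, and it uses a brute-force search over label mappings instead of the Hungarian algorithm, which is only practical for a handful of speakers. It is also not a full DER, which additionally counts missed speech and false alarms relative to total reference speech time.

```python
from itertools import permutations


def sequence_match_accuracy(reference, hypothesis):
    """Best frame accuracy over all one-to-one speaker label mappings.

    Hypothetical helper for illustration only. Brute force, so the cost
    grows factorially with the number of distinct speakers.
    Assumes None is not used as a reference label.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 1.0
    # Distinct labels, in order of first appearance.
    ref_labels = list(dict.fromkeys(reference))
    hyp_labels = list(dict.fromkeys(hypothesis))
    # Pad with None so extra hypothesis speakers can stay unmatched.
    padded = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))
    best = 0
    for perm in permutations(padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(mapping[h] == r for r, h in zip(reference, hypothesis))
        best = max(best, correct)
    return best / len(reference)


# Hypothesis labels are arbitrary; only the grouping of frames matters.
print(sequence_match_accuracy([0, 0, 1, 1, 2], ["b", "b", "a", "a", "a"]))
```

For real evaluations, the libraries in the table above handle time-stamped segments, overlap, and collars, which this sketch ignores.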
Clustering
Link | Language | Description |
---|---|---|
uis-rnn | Python & PyTorch | Google’s Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations, auto-tune, and speaker turn constraints. |
sklearn.cluster | Python | scikit-learn clustering algorithms. |
PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |
Speaker embedding
Link | Method | Language | Description |
---|---|---|---|
resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of “Generalized End-to-End Loss for Speaker Verification” by Wan, Li et al. With UIS-RNN integration. |
Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” (SV2TTS) with a vocoder that works in real-time. |
conformer-speaker-encoder | d-vector | Python & TFLite | Massively multilingual conformer-based speaker recognition models in TFLite format. |
deep-speaker | d-vector | Python & Keras | Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |
ASVtorch | i-vector | Python & PyTorch | ASVtorch is a toolkit for automatic speaker recognition. |
asv-subtools | i-vector & x-vector | Kaldi & PyTorch | ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The ‘sub’ of ‘subtools’ means that there are many modular tools and the parts constitute the whole. |
WeSpeaker | x-vector & r-vector | Python & C++ & PyTorch | WeSpeaker is a research and production oriented speaker verification, recognition and diarization toolkit, which supports very strong recipes with on-the-fly data preparation, model training and evaluation, as well as runtime C++ code. |
ReDimNet | improved ResNet | PyTorch | Neural network architecture presented in the paper Reshape Dimensions Network for Speaker Recognition. |
Speaker change detection
Link | Language | Description |
---|---|---|
change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |
tidydiarize | Python | Diarization inside OpenAI Whisper decoder |
Audio feature extraction
Link | Language | Description |
---|---|---|
LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
Audio data augmentation
Link | Language | Description |
---|---|---|
pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration |
rir_simulator_python | Python | Room impulse response simulator using python |
WavAugment | Python & PyTorch | WavAugment performs data augmentation on audio data. The audio data is represented as pytorch tensors |
EEND_dataprep | Bash & Python | Recipes for generating simulated conversations used to train end-to-end diarization models. |
Other software
Link | Language | Description |
---|---|---|
VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |
DOVER-Lap | Python | Python package for combining diarization system outputs. |
Diar-az | Python | Data formatting tool to support the ruv-di dataset. Kaldi to Gecko to Kaldi and corpus and back |
Datasets
Diarization datasets
Audio | Diarization ground truth | Language | Pricing | Additional information |
---|---|---|---|---|
2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
2003 NIST Rich Transcription Evaluation Data | Together with audios | en, ar, zh | $2000.00 | telephone speech, broadcast news |
CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
The ICSI Meeting Corpus | Together with audios | en | Free | License |
The AMI Meeting Corpus | Together with audios (need to be processed) | Multiple | Free | License |
Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos |
MiniVox Benchmark | MiniVox Benchmark | en | Free | MiniVox is an automatic framework to transform any speaker-labelled dataset into continuous speech datastream with episodically revealed label feedbacks. |
The AliMeeting Corpus | Together with audios | zh | Free | |
Speaker embedding training sets
Name | Utterances | Speakers | Language | Pricing | Additional information |
---|---|---|---|---|---|
TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
VCTK | 43K+ | 109 | en | Free | Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker’s accent. |
LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos - videos where people share their opinions on books - from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, … ) as well as regional information and linguistic information. |
VoxBlink2 | 10M | 100k+ | 18 languages (en, pt, es, ru, ar, …) | CC BY-NC-SA 4.0 | Multilingual dataset from VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark |
Augmentation noise sources
Name | Utterances | Pricing | Additional information |
---|---|---|---|
AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
Conferences
Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
---|---|---|---|---|
ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |
IJCB | Annual | 8 | IEEE & IAPR TC-4 | Yes |
Other learning materials
Online courses
- Course on Udemy: A Tutorial on Speaker Diarization
Books
- Voice Identity Techniques: From core algorithms to engineering practice (Chinese) by Quan Wang, 2020
Tech blogs
- Literature Review For Speaker Change Detection by Halil Erdoğan
- Speaker Diarization: Separation of Multiple Speakers in an Audio File by Jaspreet Singh
- Speaker Diarization with Kaldi by Yoav Ramon
- Who spoke when! How to Build your own Speaker Diarization Module by Rahul Saxena
Video tutorials
- pyannote audio: neural building blocks for speaker diarization by Hervé Bredin
- Google’s Diarization System: Speaker Diarization with LSTM by Google
- Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
- Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection by Google
- Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
- Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research
- [Synced & Broadview] Introduction to Voiceprint Technology, Lecture 2: Speaker Diarization and Other Applications (Chinese) by Quan Wang
Products
Company | Product |
---|---|
Google | Recorder app |
Google | Google Cloud Speech-to-Text API |
Amazon | Amazon Transcribe |
IBM | Watson Speech To Text API |
DeepAffects | Speaker Diarization API |
Alibaba | Tingwu (听悟) |
Microsoft | Azure Conversation Transcription API |