Awesome Speaker Diarization
Table of contents
Overview
This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.
To add items to this page, simply send a pull request (see the contributing guide).
Publications
Special topics
Review & survey papers
Large language model (LLM)
Supervised diarization
Joint diarization and ASR
Online speaker diarization
Challenges
Audio-Visual Speaker Diarization
Other
2021
2020
2019
2018
2017
2016
2015
2014
2013
2011
2009
2008
2006
Software
Framework
| Link | Language | Description |
| --- | --- | --- |
| FunASR | Python & PyTorch | FunASR is an open-source speech toolkit based on PyTorch, which aims at bridging the gap between academic research and industrial applications. |
| MiniVox | MATLAB | MiniVox is an open-source evaluation system for the online speaker diarization task. |
| SpeechBrain | Python & PyTorch | SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. |
| SIDEKIT for diarization (s4d) | Python | An open-source package extension of SIDEKIT for speaker diarization. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
| AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
| LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013). |
| kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
| kaldi-speaker-diarization | Bash | Icelandic speaker diarization scripts using Kaldi. |
| Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is its set of tools for speaker diarization. |
| pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
| pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
| Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
| EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
| VBx | Python | Variational Bayes HMM over x-vectors diarization. x-vector extractor recipe |
| RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system; it allows the user to send/record audio of a conversation and receive timestamps of who spoke when. |
| StreamingSpeakerDiarization | Python | Streaming speaker diarization, extends pyannote.audio to online processing. |
| simple_diarizer | Python | Simplified diarization pipeline using some pretrained models. Made to be as simple as possible to go from an input audio file to diarized segments. |
| Picovoice Falcon | C & Python | A lightweight, accurate, and fast speaker diarization engine written in C and available in Python, running on CPU with minimal overhead. |
| DiaPer | Python | PyTorch implementation for DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, including models pre-trained on free and public data. |
Evaluation
Clustering
| Link | Language | Description |
| --- | --- | --- |
| uis-rnn | Python & PyTorch | Google’s Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
| uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
| DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
| SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations, auto-tune, and speaker turn constraints. |
| sklearn.cluster | Python | scikit-learn clustering algorithms. |
| PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
| PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
| Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |
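The clustering step in a diarization pipeline groups per-segment speaker embeddings so that each group corresponds to one speaker. The sketch below illustrates that idea with a greedy average-linkage agglomerative clustering over cosine similarity, using only the Python standard library. It is not the algorithm of any library listed above; the function names, the toy 3-dimensional embeddings, and the `threshold` value of 0.7 are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings, threshold=0.7):
    """Merge the most similar pair of clusters until no pair reaches `threshold`.

    Returns one integer label per embedding. The number of speakers is not
    fixed in advance; it falls out of the (hypothetical) threshold.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def average_similarity(c1, c2):
        # Average-linkage: mean pairwise similarity between two clusters.
        sims = [cosine_similarity(embeddings[i], embeddings[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_score, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = average_similarity(clusters[a], clusters[b])
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings for five speech segments (made-up values).
segments = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [1.0, 0.0, 0.1],
]
labels = agglomerative_cluster(segments, threshold=0.7)
print(labels)
```

In practice the embeddings would come from a speaker embedding model, and the stopping threshold would be tuned on held-out data or replaced by one of the supervised or auto-tuning methods in the table.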
Speaker embedding
| Link | Method | Language | Description |
| --- | --- | --- | --- |
| resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
| Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
| PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of “Generalized End-to-End Loss for Speaker Verification” by Wan, Li et al. With UIS-RNN integration. |
| Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” (SV2TTS) with a vocoder that works in real-time. |
| conformer-speaker-encoder | d-vector | Python & TFLite | Massively multilingual conformer-based speaker recognition models in TFLite format. |
| deep-speaker | d-vector | Python & Keras | Third-party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
| x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
| kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
| voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
| pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |
| ASVtorch | i-vector | Python & PyTorch | ASVtorch is a toolkit for automatic speaker recognition. |
| asv-subtools | i-vector & x-vector | Kaldi & PyTorch | ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The ‘sub’ of ‘subtools’ means that there are many modular tools and the parts constitute the whole. |
| WeSpeaker | x-vector & r-vector | Python & C++ & PyTorch | WeSpeaker is a research and production oriented speaker verification, recognition and diarization toolkit, which supports very strong recipes with on-the-fly data preparation, model training and evaluation, as well as runtime C++ code. |
| ReDimNet | improved ResNet | PyTorch | Neural network architecture presented in the paper Reshape Dimensions Network for Speaker Recognition. |
Speaker change detection
| Link | Language | Description |
| --- | --- | --- |
| change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |
| tidydiarize | Python | Diarization inside OpenAI Whisper decoder |
Audio feature extraction
| Link | Language | Description |
| --- | --- | --- |
| LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
| python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
Audio data augmentation
| Link | Language | Description |
| --- | --- | --- |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration |
| rir_simulator_python | Python | Room impulse response simulator using Python |
| WavAugment | Python & PyTorch | WavAugment performs data augmentation on audio data. The audio data is represented as PyTorch tensors |
| EEND_dataprep | Bash & Python | Recipes for generating simulated conversations used to train end-to-end diarization models. |
Other software
| Link | Language | Description |
| --- | --- | --- |
| VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |
| DOVER-Lap | Python | Python package for combining diarization system outputs |
| Diar-az | Python | Data formatting tool to support the ruv-di dataset. Kaldi to Gecko to Kaldi and corpus and back |
Datasets
Diarization datasets
Speaker embedding training sets
| Name | Utterances | Speakers | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- | --- |
| TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK | 43K+ | 109 | en | Free | Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker’s accent. |
| LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
| Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
| LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
| VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
| CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
| BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos (videos where people share their opinions on books) from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
| DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
| NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, …) as well as regional information and linguistic information. |
| VoxBlink2 | 10M | 100K+ | 18 languages (en, pt, es, ru, ar, …) | CC BY-NC-SA 4.0 | Multilingual dataset from VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark |
Augmentation noise sources
| Name | Utterances | Pricing | Additional information |
| --- | --- | --- | --- |
| AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
| MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
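Noise recordings such as these are commonly mixed into clean speech at a chosen signal-to-noise ratio (SNR) to augment training data. The sketch below shows the gain computation on plain Python lists of samples. The function names, the synthetic sine "speech", the random "noise", and the 10 dB target are all assumptions for illustration; real pipelines would load waveforms from files and typically operate on arrays.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise ratio is `snr_db`.

    The noise is tiled if it is shorter than the speech, then truncated to
    the same length. Both inputs are assumed to be non-silent.
    """
    if len(noise) < len(speech):
        repeats = len(speech) // len(noise) + 1
        noise = noise * repeats
    noise = noise[:len(speech)]

    # SNR in dB is 20 * log10(rms_speech / rms_noise), so solve for the
    # noise RMS that yields the requested ratio and derive the gain.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for real recordings (made-up signals).
sample_rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1600)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(500)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the scaled noise and check the achieved SNR.
residual = [m - s for m, s in zip(mixed, speech)]
achieved_snr_db = 20 * math.log10(rms(speech) / rms(residual))
print(round(achieved_snr_db, 3))
```

Sampling the target SNR from a range per utterance, rather than fixing it, is a common way to get more varied training conditions.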
Conferences
| Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
| --- | --- | --- | --- | --- |
| ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
| InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
| Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
| SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
| ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
| WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |
| IJCB | Annual | 8 | IEEE & IAPR TC-4 | Yes |
Other learning materials
Online courses
Books
Tech blogs
Video tutorials
Products