Awesome Speaker Diarization

Overview
Publications
Software
Datasets
Conferences
Other learning materials
Products

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, simply send a pull request. (contributing guide)

Publications

Special topics

Review & survey papers

Large language model (LLM)

Supervised diarization

Joint diarization and ASR

Online speaker diarization

Challenges

Audio-Visual Speaker Diarization

Other

2021

2020

2019

2018

2017

2016

A Speaker Diarization System for Studying Peer-Led Team Learning Groups

2015

Diarization resegmentation in the factor analysis subspace

2014

2013

Unsupervised methods for speaker diarization: An integrated and iterative approach

2011

2009

Speaker Diarization for Meeting Room Audio

2008

Stream-based speaker segmentation using speaker factors and eigenvoices

2006

Software

Framework

Link	Language	Description
FunASR	Python & PyTorch	FunASR is an open-source speech toolkit based on PyTorch, which aims at bridging the gap between academic research and industrial applications.
MiniVox	MATLAB	MiniVox is an open-source evaluation system for the online speaker diarization task.
SpeechBrain	Python & PyTorch	SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.
SIDEKIT for diarization (s4d)	Python	An open source package extension of SIDEKIT for Speaker diarization.
pyAudioAnalysis	Python	Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.
AaltoASR	Python & Perl	Speaker diarization scripts, based on AaltoASR.
LIUM SpkDiarization	Java	LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013).
kaldi-asr	Bash	Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation.
kaldi-speaker-diarization	Bash	Icelandic speaker diarization scripts using Kaldi.
Alize LIA_SpkSeg	C++	ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is its toolset for speaker diarization.
pyannote-audio	Python	Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding.
pyBK	Python	Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data.
Speaker-Diarization	Python	Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers.
EEND	Python & Bash & Perl	End-to-End Neural Diarization.
VBx	Python	Variational Bayes HMM over x-vectors diarization. x-vector extractor recipe
RE-VERB	Python & JavaScript	RE: VERB is a speaker diarization system that allows the user to send or record audio of a conversation and receive timestamps of who spoke when.
StreamingSpeakerDiarization	Python	Streaming speaker diarization, extends pyannote.audio to online processing
simple_diarizer	Python	Simplified diarization pipeline using some pretrained models. Made to be as simple as possible to go from an input audio file to diarized segments.
Picovoice Falcon	C & Python	A lightweight, accurate, and fast speaker diarization engine written in C and available in Python, running on CPU with minimal overhead.
DiaPer	Python	PyTorch implementation for DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors including models pre-trained on free and public data.
sherpa-onnx	C++ & C & `C#` & Dart & Go & Java & JavaScript & Kotlin & Pascal & Python & Rust & Swift	Supports speaker diarization, speech recognition, and text-to-speech on various platforms with various language bindings.
FluidAudio	Swift	A native Swift speaker diarization library for Apple platforms, using CoreML for efficient, real-time audio processing with high accuracy.

Evaluation

Link	Language	Description
pyannote-metrics	Python	A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems.
SimpleDER	Python	A lightweight library to compute Diarization Error Rate (DER).
DiarizationLM	Python	Implements Word Error Rate (WER), Word Diarization Error Rate (WDER), and concatenated minimum-permutation Word Error Rate (cpWER).
NIST md-eval	Perl	(1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant
dscore	Python & Perl	Diarization scoring tools.
Sequence Match Accuracy	Python	Match the accuracy of two sequences with Hungarian algorithm.
spyder	Python & C++	Simple Python package for fast DER computation.
CDER	Python	Conversational DER from The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines
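
As an illustration of what the DER tools above compute, here is a minimal frame-level sketch. It assumes one label per frame (`None` for non-speech), no overlapping speech, and no forgiveness collar. The function name `frame_der` and the frame-label representation are made up for this example and are not the interface of any library in the table. The optimal speaker mapping is found by brute force over permutations, which is only practical for a handful of speakers; scoring tools typically use an assignment algorithm instead and work on time segments rather than frames.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level Diarization Error Rate.

    DER = (missed speech + false alarm + speaker confusion) / reference speech.
    Both inputs are equal-length lists of string speaker labels, with None
    marking non-speech frames (an assumption of this sketch).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Count frames where a given reference speaker co-occurs with a
    # given hypothesis speaker.
    overlap = {}
    for r, h in pairs:
        if r is not None and h is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1
    both_speech = sum(overlap.values())

    ref_labels = sorted({r for r in reference if r is not None})
    hyp_labels = sorted({h for h in hypothesis if h is not None})

    # Pad the shorter label list with None so that some speakers may
    # stay unmatched in the one-to-one mapping.
    n = max(len(ref_labels), len(hyp_labels))
    padded_ref = ref_labels + [None] * (n - len(ref_labels))
    padded_hyp = hyp_labels + [None] * (n - len(hyp_labels))

    # Brute-force search for the mapping with the most correctly
    # attributed frames.
    best_correct = 0
    for perm in permutations(padded_ref):
        correct = sum(overlap.get((r, h), 0) for r, h in zip(perm, padded_hyp))
        best_correct = max(best_correct, correct)

    confusion = both_speech - best_correct
    return (missed + false_alarm + confusion) / total_speech


reference = ["A", "A", "A", "B", "B", None, None, "B"]
hypothesis = ["x", "x", "y", "y", "y", None, "y", None]
print(frame_der(reference, hypothesis))
```

In this toy example there are 6 reference speech frames, with 1 missed frame, 1 false-alarm frame, and 1 confused frame under the best mapping (x to A, y to B), giving a DER of 0.5.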

Clustering

Link	Language	Description
uis-rnn	Python & PyTorch	Google’s Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised.
uis-rnn-sml	Python & PyTorch	A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data.
DNC	Python & ESPnet	Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised.
SpectralCluster	Python	Spectral clustering with affinity matrix refinement operations, auto-tune, and speaker turn constraints.
sklearn.cluster	Python	scikit-learn clustering algorithms.
PLDA	Python	Probabilistic Linear Discriminant Analysis & classification, written in Python.
PLDA	C++	Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis).
Auto-Tuning Spectral Clustering	Python	Auto-tuning Spectral Clustering method that does not need development set or supervised tuning.
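
For readers new to the clustering step, the sketch below shows the general idea of grouping speaker embeddings by similarity. It uses plain average-linkage agglomerative clustering with a cosine-similarity stopping threshold, which is not the algorithm of any specific library in the table. The function names, the toy 2-D embeddings, and the threshold value of 0.5 are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_embeddings(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering (illustrative sketch).

    Repeatedly merges the two most similar clusters until the best
    average cosine similarity falls below `threshold`. Returns one
    integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise similarity between members of two clusters.
        sims = [cosine_similarity(embeddings[i], embeddings[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = linkage(clusters[a], clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings: two segments near one direction, two near another.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(cluster_embeddings(embeddings, threshold=0.5))
```

The number of speakers is not fixed in advance here; it falls out of the threshold, which in practice would need tuning on development data.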

Speaker embedding

Link	Method	Language	Description
resemble-ai/Resemblyzer	d-vector	Python & PyTorch	PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization.
Speaker_Verification	d-vector	Python & TensorFlow	TensorFlow implementation of generalized end-to-end loss for speaker verification.
PyTorch_Speaker_Verification	d-vector	Python & PyTorch	PyTorch implementation of “Generalized End-to-End Loss for Speaker Verification” by Wan, Li et al. With UIS-RNN integration.
Real-Time Voice Cloning	d-vector	Python & PyTorch	Implementation of “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” (SV2TTS) with a vocoder that works in real-time.
conformer-speaker-encoder	d-vector	Python & TFLite	Massively multilingual conformer-based speaker recognition models in TFLite format.
deep-speaker	d-vector	Python & Keras	Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System.
x-vector-kaldi-tf	x-vector	Python & TensorFlow & Perl	TensorFlow implementation of x-vector topology on top of Kaldi recipe.
kaldi-ivector	i-vector	C++ & Perl	Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure.
voxceleb-ivector	i-vector	Perl	Voxceleb1 i-vector based speaker recognition system.
pytorch_xvectors	x-vector	Python & PyTorch	PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification.
ASVtorch	i-vector	Python & PyTorch	ASVtorch is a toolkit for automatic speaker recognition.
asv-subtools	i-vector & x-vector	Kaldi & PyTorch	ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The ‘sub’ of ‘subtools’ means that there are many modular tools and the parts constitute the whole.
WeSpeaker	x-vector & r-vector	Python & C++ & PyTorch	WeSpeaker is a research and production oriented speaker verification, recognition and diarization toolkit, which supports very strong recipes with on-the-fly data preparation, model training and evaluation, as well as runtime C++ code.
ReDimNet	improved ResNet	PyTorch	Neural network architecture presented in the paper Reshape Dimensions Network for Speaker Recognition

Speaker change detection

Link	Language	Description
change_detection	Python & Keras	Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks.
tidydiarize	Python	Diarization inside OpenAI Whisper decoder

Audio feature extraction

Link	Language	Description
LibROSA	Python	Python library for audio and music analysis. https://librosa.github.io/
python_speech_features	Python	This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/
pyAudioAnalysis	Python	Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.

Audio data augmentation

Link	Language	Description
pyroomacoustics	Python	Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io
gpuRIR	Python	Python library for Room Impulse Response (RIR) simulation with GPU acceleration
rir_simulator_python	Python	Room impulse response simulator using Python
WavAugment	Python & PyTorch	WavAugment performs data augmentation on audio data. The audio data is represented as PyTorch tensors
EEND_dataprep	Bash & Python	Recipes for generating simulated conversations used to train end-to-end diarization models.

Other software

Link	Language	Description
VB Diarization	Python	VB Diarization with Eigenvoice and HMM Priors.
DOVER-Lap	Python	Python package for combining diarization system outputs
Diar-az	Python	Data formatting tool to support the ruv-di dataset. Kaldi to Gecko to Kaldi and corpus and back

Datasets

Diarization datasets

Audio	Diarization ground truth	Language	Pricing	Additional information
2000 NIST Speaker Recognition Evaluation	Disk-6 (Switchboard), Disk-8 (CALLHOME)	Multiple	$2400.00	Evaluation Plan
2003 NIST Rich Transcription Evaluation Data	Together with audios	en, ar, zh	$2000.00	telephone speech, broadcast news
CALLHOME American English Speech	CALLHOME American English Transcripts	en	$1500.00 + $1000.00	CH109 whitelist
The ICSI Meeting Corpus	Together with audios	en	Free	License
The AMI Meeting Corpus	Together with audios (need to be processed)	Multiple	Free	License
Fisher English Training Speech Part 1 Speech	Fisher English Training Speech Part 1 Transcripts	en	$7000.00 + $1000.00
Fisher English Training Part 2, Speech	Fisher English Training Part 2, Transcripts	en	$7000.00 + $1000.00
VoxConverse	TBD	TBD	Free	VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos
MiniVox Benchmark	MiniVox Benchmark	en	Free	MiniVox is an automatic framework to transform any speaker-labelled dataset into a continuous speech datastream with episodically revealed label feedback.
The AliMeeting Corpus	Together with audios	zh	Free

Speaker embedding training sets

Name	Utterances	Speakers	Language	Pricing	Additional information
TIMIT	6K+	630	en	$250.00	Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets.
VCTK	43K+	109	en	Free	Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker’s accent.
LibriSpeech	292K	2K+	en	Free	Large-scale (1000 hours) corpus of read English speech.
Multilingual LibriSpeech (MLS)	?	?	en, de, nl, es, fr, it, pt, pl	Free	Multilingual LibriSpeech (MLS) is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish.
LibriVox	180K	9K+	Multiple	Free	Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long.
VoxCeleb 1&2	1M+	7K	Multiple	Free	VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
The Spoken Wikipedia Corpora	5K	879	en, de, nl	Free	Volunteer readers reading Wikipedia articles.
CN-Celeb	130K+	1K	zh	Free	A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University.
BookTubeSpeech	8K	8K	en	Free	Audio samples extracted from BookTube videos - videos where people share their opinions on books - from YouTube. The dataset can be downloaded using BookTubeSpeech-download.
DeepMine	540K	1850	fa, en	Unknown	A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems.
NISP-Dataset	?	345	hi, kn, ml, ta, te (all Indian languages)	Free	This dataset contains speech recordings along with speaker physical parameters (height, weight, … ) as well as regional information and linguistic information.
VoxBlink2	10M	100k+	18 languages (en, pt, es, ru, ar, …)	CC BY-NC-SA 4.0	Multilingual dataset from VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark

Augmentation noise sources

Name	Utterances	Pricing	Additional information
AudioSet	2M	Free	A large-scale dataset of manually annotated audio events.
MUSAN	N/A	Free	MUSAN is a corpus of music, speech, and noise recordings.

Conferences

Conference/Workshop	Frequency	Page Limit	Organization	Blind Review
ICASSP	Annual	4 + 1 (ref)	IEEE	No
InterSpeech	Annual	4 + 1 (ref)	ISCA	No
Speaker Odyssey	Biennial	8 + 2 (ref)	ISCA	No
SLT	Biennial	6 + 2 (ref)	IEEE	Yes
ASRU	Biennial	6 + 2 (ref)	IEEE	Yes
WASPAA	Biennial	4 + 1 (ref)	IEEE	No
IJCB	Annual	8	IEEE & IAPR TC-4	Yes

Other learning materials

Online courses

Course on Udemy: A Tutorial on Speaker Diarization

Books

Voice Identity Techniques: From core algorithms to engineering practice (Chinese) by Quan Wang, 2020

Tech blogs

Video tutorials

pyannote audio: neural building blocks for speaker diarization by Hervé Bredin
Google’s Diarization System: Speaker Diarization with LSTM by Google
Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection by Google
Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research
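【机器之心&博文视点】入门声纹技术｜第二讲：声纹分割聚类与其他应用 (Introduction to Voiceprint Technology, Lecture 2: Speaker Diarization and Other Applications; in Chinese) by Quan Wang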

Products

Company	Product
Google	Recorder app
Google	Google Cloud Speech-to-Text API
Amazon	Amazon Transcribe
IBM	Watson Speech To Text API
DeepAffects	Speaker Diarization API
Alibaba	Tingwu (听悟)
Microsoft	Azure Conversation Transcription API