Awesome Speaker Diarization

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, simply send a pull request (see the contributing guide).

Publications

Special topics

Review & survey papers

Supervised diarization

Joint diarization and ASR

Challenges

Other

2020

2019

2018

2017

2016

2015

2014

2013

2011

2009

2008

2006

Software

Framework

| Link | Language | Description |
|---|---|---|
| SpeechBrain | Python & PyTorch | SpeechBrain is an open-source, all-in-one speech toolkit based on PyTorch. |
| SIDEKIT for diarization (s4d) | Python | An open-source extension of SIDEKIT for speaker diarization. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
| AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
| LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java and includes the most recent developments in the domain (as of 2013). |
| kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
| Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is its set of tools for speaker diarization. |
| pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
| pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
| Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
| EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
| VBDiarization | Python | Speaker diarization based on Kaldi x-vectors, using a pretrained model trained in Kaldi (kaldi-asr/kaldi) and converted to ONNX format (onnx/onnx), running in ONNXRuntime (Microsoft/onnxruntime). |
| RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system that lets the user send or record audio of a conversation and receive timestamps of who spoke when. |

Evaluation

| Link | Language | Description |
|---|---|---|
| pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
| SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
| NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
| dscore | Python & Perl | Diarization scoring tools. |
| Sequence Match Accuracy | Python | Match the accuracy of two sequences with the Hungarian algorithm. |
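
The Diarization Error Rate reported by the tools above sums missed speech, false-alarm speech, and speaker confusion, then divides by the total amount of reference speech. As a rough illustration only, the sketch below computes a frame-level version, assuming one label per fixed-length frame, no overlapping speech, and no forgiveness collar. The function name `frame_der` and the label format are hypothetical and are not the API of SimpleDER, pyannote-metrics, or md-eval. A brute-force search over speaker mappings stands in for the Hungarian algorithm, so it is only practical for a handful of speakers.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER sketch (hypothetical helper, not a library API).

    Both inputs are equal-length lists with one speaker label per frame;
    None marks a non-speech frame. Overlapping speech is not modelled.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    total_ref = sum(r is not None for r in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Count frames where both sides are speech, per (ref, hyp) speaker pair.
    overlap = {}
    for r, h in pairs:
        if r is not None and h is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1
    both_speech = sum(overlap.values())

    # Pad the shorter speaker list with None so that permutations of the
    # hypothesis side enumerate every one-to-one speaker mapping.
    ref_speakers = list({r for r in reference if r is not None})
    hyp_speakers = list({h for h in hypothesis if h is not None})
    size = max(len(ref_speakers), len(hyp_speakers))
    ref_speakers += [None] * (size - len(ref_speakers))
    hyp_speakers += [None] * (size - len(hyp_speakers))

    # Brute-force stand-in for the Hungarian algorithm: keep the mapping
    # that maximises the number of correctly attributed frames.
    best_match = 0
    for perm in permutations(hyp_speakers):
        matched = sum(overlap.get((r, h), 0) for r, h in zip(ref_speakers, perm))
        best_match = max(best_match, matched)

    confusion = both_speech - best_match
    return (missed + false_alarm + confusion) / total_ref


reference = ["A", "A", "A", "B", "B", None]
hypothesis = ["x", "x", "y", "y", "y", "y"]
der = frame_der(reference, hypothesis)
```

In this example there is one false-alarm frame and one confused frame against five reference speech frames. The scoring tools listed above work on time-stamped segments and handle details such as collars and overlapping speech, which this sketch deliberately leaves out.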

Clustering

| Link | Language | Description |
|---|---|---|
| uis-rnn | Python & PyTorch | Google’s Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
| uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
| DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
| SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations. |
| sklearn.cluster | Python | scikit-learn clustering algorithms. |
| PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
| PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
| Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |

Speaker embedding

| Link | Method | Language | Description |
|---|---|---|---|
| resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
| Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
| PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of “Generalized End-to-End Loss for Speaker Verification” by Wan, Li et al. With UIS-RNN integration. |
| Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” (SV2TTS) with a vocoder that works in real-time. |
| deep-speaker | d-vector | Python & Keras | Third-party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
| x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
| kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
| voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
| pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |
| ASVtorch | i-vector | Python & PyTorch | ASVtorch is a toolkit for automatic speaker recognition. |
| asv-subtools | i-vector & x-vector | Kaldi & PyTorch | ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The ‘sub’ of ‘subtools’ means that there are many modular tools and the parts constitute the whole. |

Speaker change detection

| Link | Language | Description |
|---|---|---|
| change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |

Audio feature extraction

| Link | Language | Description |
|---|---|---|
| LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
| python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |

Audio data augmentation

| Link | Language | Description |
|---|---|---|
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator using Python. |
| WavAugment | Python & PyTorch | WavAugment performs data augmentation on audio data. The audio data is represented as PyTorch tensors. |

Other software

| Link | Language | Description |
|---|---|---|
| VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |

Datasets

Diarization datasets

| Audio | Diarization ground truth | Language | Pricing | Additional information |
|---|---|---|---|---|
| 2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
| 2003 NIST Rich Transcription Evaluation Data | Together with audio | en, ar, zh | $2000.00 | Telephone speech, broadcast news |
| CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
| The ICSI Meeting Corpus | Together with audio | en | Free | License |
| The AMI Meeting Corpus | Together with audio (needs to be processed) | Multiple | Free | License |
| Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
| Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
| VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos. |

Speaker embedding training sets

| Name | Utterances | Speakers | Language | Pricing | Additional information |
|---|---|---|---|---|---|
| TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK | 43K+ | 109 | en | Free | Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker’s accent. |
| LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
| Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
| LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
| VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
| CN-Celeb | 130K+ | 1K | zh | Free | A free Chinese speaker recognition corpus released by CSLT@Tsinghua University. |
| BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos (YouTube videos where people share their opinions on books). The dataset can be downloaded using BookTubeSpeech-download. |
| DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
| NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, …) as well as regional information and linguistic information. |

Augmentation noise sources

| Name | Utterances | Pricing | Additional information |
|---|---|---|---|
| AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
| MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
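
Noise corpora like these are typically mixed into clean speech at a chosen signal-to-noise ratio (SNR). A minimal sketch of that mixing step follows, assuming both signals are plain lists of float samples at the same sample rate. The function name `mix_at_snr` is hypothetical and does not come from any of the libraries listed on this page; the noise is simply tiled when it is shorter than the speech.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the result has the requested SNR in dB.

    Hypothetical helper for illustration; inputs are lists of float samples.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Tile (or truncate) the noise to the length of the speech signal.
    noise = [noise[i % len(noise)] for i in range(len(speech))]

    # Average power of each signal.
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # Choose the gain so that 10 * log10(p_speech / (gain^2 * p_noise)) == snr_db.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: a sine wave as "speech" and an alternating sequence as "noise".
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.5 * (-1) ** i for i in range(300)]
mixed = mix_at_snr(speech, noise, snr_db=10)
```

Real augmentation pipelines usually also randomize the SNR and the noise offset per utterance, and may convolve the speech with a room impulse response first; none of that is shown here.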

Conferences

| Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
|---|---|---|---|---|
| ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
| InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
| Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
| SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
| ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
| WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |

Other learning materials

Books

Tech blogs

Video tutorials

Products

| Company | Product |
|---|---|
| Google | Google Cloud Speech-to-Text API |
| Amazon | Amazon Transcribe |
| IBM | Watson Speech To Text API |
| DeepAffects | Speaker Diarization API |