Voice and disease audio datasets

| Dataset | Modality | Language | Task Type | Number of Samples | Year | License |
|---|---|---|---|---|---|---|
| COUGHVID | Cough audio | English | COVID-19 detection | 20,000 | 2020 | CC-BY 4.0 |
| Coswara | Cough, breath, speech | English | COVID-19 detection | 5,000 | 2022 | CC-BY 4.0 |
| UK COVID-19 Vocal Audio Dataset | Cough, breath, speech | English | COVID-19 detection | 70,000 | 2023 | OGL v3.0 |
| Respiratory Sound Database | Lung auscultation sounds | English | Respiratory disease classification | 920 | 2017 | CC-BY 4.0 |
| smarty4covid | Cough, breath, voice | English | COVID-19 detection | 4,600 | 2023 | CC-BY 4.0 |
| Bridge2AI-Voice | Voice recordings | English | Voice biomarker research | Not specified | 2025 | Apache-2.0 |
| VOICED | Voice recordings | English | Pathological voice analysis | 208 | 2018 | ODC-BY 1.0 |
| Perceptual Voice Qualities Dataset | Voice recordings | English | Perceptual voice quality | 360+ | 2020 | CC-BY 4.0 |
| COVID-19 Voice Dataset | Voice recordings | English | COVID-19 detection | Not specified | 2023 | CC-BY 4.0 |
| ALS IAC Speech Corpus | Speech | English | ALS | Not specified | 2024 | CC-BY 4.0 |
| PMC COVID-19 Voice Dataset | Voice recordings | English | COVID-19 detection | Not specified | 2022 | OGL v3.0 |

Pulse datasets

| Dataset | Modality | Language | Task Type | Number of Samples | Year | License |
|---|---|---|---|---|---|---|
| PulseDB | ECG, PPG, ABP waveforms | English | Cuff-less blood pressure estimation | 5,245,454 | 2023 | ODbL |
| MIMIC-BP | ECG, PPG, ABP waveforms | English | Blood pressure estimation | 12,000 | 2024 | ODC-By 1.0 |
| Pulse-ECG | ECG images | English | ECG interpretation | 1,160,000 | 2023 | Apache-2.0 |
| MTHS Dataset | Video-PPG, ECG signals | English | Heart rate and SpO₂ estimation | 65 | 2023 | CC BY-NC-ND 4.0 |
| Welltory Dataset | Video-PPG, ECG signals | English | Heart rate variability analysis | 21 | 2023 | CC BY-NC-ND 4.0 |
| BUT-PPG Dataset | Video-PPG signals | English | Heart rate estimation | 65 | 2023 | CC-BY 4.0 |

Long-form TCM dialogue datasets

| Dataset | Modality | Language | Task Type | Number of Samples | Year | License |
|---|---|---|---|---|---|---|
| Huatuo-26M | Text | Chinese | QA, Dialogue | 26M+ | 2023 | CC-BY 4.0 |
| TCMD | Text | Chinese | Syndrome-Finding Mapping | 100,000+ | 2024 | CC-BY 4.0 |
| CMD | Text | Chinese | Medical Dialogue | 25,000+ dialogues | 2020 | MIT |