These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in the field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.^[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.^[2]^[3]^[4]^[5]

Many organizations, including governments, publish and share their datasets. The datasets are classified, based on their licenses, as open data or non-open data.

Datasets from various governmental bodies are listed in List of open government data sites. These datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API. The portals sort the datasets into various types and subtypes.

Type	Subtypes
Specific category	Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality
Scope	Supranational Union, National, Subnational, Municipality, Urban, Rural
Language	Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali
Data type	Tabular, Graph, Text, Image, Sound, Video
Usage	Training, validating, and testing
File-Formats	CSV, JSON, XML, KML, GeoJSON, Shapefile, GML
Licenses	Creative-Commons, GPL, Other Non-Open data licenses
Last-Updated	Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year
File-Size	Minimum, Maximum, Range
Status	Verified, In-Preparation, Deactivated (or Deprecated)
Number of records	100s, 1000s, 10000s, 100000s, Millions
Number of variables	Less than 10, 10s, 100s, 1000s, 10000s
Services	Individual, Aggregation
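
The "Usage" row above distinguishes training, validating, and testing. A minimal sketch of how one dataset might be partitioned into those three subsets, using only the Python standard library. The function name `split_dataset`, the 80/10/10 fractions, and the fixed seed are illustrative assumptions, not conventions taken from any portal.

```python
from __future__ import annotations

import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_dataset(
    records: Sequence[T],
    train_frac: float = 0.8,
    val_frac: float = 0.1,
    seed: int = 0,
) -> tuple[list[T], list[T], list[T]]:
    """Shuffle records and split them into train/validation/test lists.

    The test set receives whatever remains after the first two fractions.
    """
    if not 0 <= train_frac <= 1 or not 0 <= val_frac <= 1:
        raise ValueError("fractions must lie in [0, 1]")
    if train_frac + val_frac > 1:
        raise ValueError("train_frac + val_frac must not exceed 1")

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical stand-in for a dataset: 100 integer records.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when a downloaded file is ordered (for example by date or class); for time-series data a chronological split may be more appropriate than the random one sketched here.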

Portal-name	License	List of installations of the portal	Typical usages
Comprehensive Knowledge Archive Network (CKAN)	AGPL	https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN	GPL	https://getdkan.org/community	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse	Apache	https://dataverse.org/installations https://dataverse.org/metrics	Data Management Solution for Research Institutes
DSpace	BSD	https://registry.lyrasis.org/	Data Management Solution for Research Institutes
OpenML	BSD	https://www.openml.org/search?type=data&sort=runs&status=active	Data Management Solution to share datasets, algorithms, and experiment results through APIs.
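
Portals such as CKAN expose their catalogues through a web API. The sketch below builds a search request for CKAN's Action API (`/api/3/action/package_search`) and pulls resource links out of a response. The portal address `https://portal.example.org`, the helper names `build_search_url` and `extract_resources`, and the canned response are assumptions for illustration; a live portal may return additional fields or restrict access.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode


def build_search_url(portal: str, query: str, rows: int = 10) -> str:
    """Return a CKAN package_search URL for the given portal base address."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def extract_resources(response_text: str, wanted_format: str = "CSV") -> list[tuple[str, str]]:
    """Return (dataset title, resource URL) pairs whose format matches."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("portal reported an unsuccessful request")
    matches = []
    for dataset in payload["result"]["results"]:
        for resource in dataset.get("resources", []):
            # Some portals leave the format empty, so guard against None.
            fmt = (resource.get("format") or "").upper()
            if fmt == wanted_format.upper():
                matches.append((dataset.get("title", ""), resource.get("url", "")))
    return matches


# Hypothetical portal address; substitute a real CKAN installation.
url = build_search_url("https://portal.example.org/", "air quality", rows=5)

# Canned response in the general shape CKAN returns, so the example runs offline.
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{
            "title": "City air quality",
            "resources": [
                {"format": "CSV", "url": "https://portal.example.org/files/air.csv"},
                {"format": "GeoJSON", "url": "https://portal.example.org/files/air.geojson"},
            ],
        }],
    },
})
resources = extract_resources(sample_response)
print(url)
print(resources)
```

To query a live portal, the URL could be fetched with `urllib.request.urlopen` and the decoded body passed to `extract_resources`.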

Academic Torrents	https://academictorrents.com
Amazon Datasets	https://registry.opendata.aws/
Awesome Public Datasets Collection	https://github.com/awesomedata/awesome-public-datasets
data.world	https://data.world/datasets/machine-learning
Datahub – Core Datasets	https://datahub.io/docs/core-data
DataONE	https://www.dataone.org/
DataPortals	https://dataportals.org/
Datasetlist.com	https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation	https://index.okfn.org/ Archived 25 May 2020 at the Wayback Machine
Google Dataset Search	https://datasetsearch.research.google.com/
Hugging Face	https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange	https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data	https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle	https://www.kaggle.com/datasets
Machine learning datasets	https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data	https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets	https://msropendata.com/datasets
Open Data Inception	https://opendatainception.io/
Opendatasoft	https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR	https://v2.sherpa.ac.uk/opendoar/
OpenML	https://www.openml.org/search?type=data
Papers with Code	https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks	https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs	https://github.com/public-apis/public-apis
Registry of Open Access Repositories	http://roar.eprints.org/
REgistry of REsearch Data REpositories	https://www.re3data.org/
UCI Machine Learning Repository	http://mlr.cs.umass.edu/ml/ Archived 26 June 2020 at the Wayback Machine
Speech Dataset	https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery	https://visualdata.io/discovery

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Amazon reviews	US product reviews from Amazon.com.	None.	233.1 million	Text	Classification, sentiment analysis	2015 (2018)	^[6]^[7]	McAuley et al.
OpinRank Review Dataset	Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively.	None.	42,230 / ~259,000 respectively	Text	Sentiment analysis, clustering	2011	^[8]^[9]	K. Ganesan et al.
MovieLens	22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.	None.	~ 22M	Text	Regression, clustering, classification	2016	^[10]	GroupLens Research
Yahoo! Music User Ratings of Musical Artists	Over 10M ratings of artists by Yahoo users.	None described.	~ 10M	Text	Clustering, regression	2004	^[11]^[12]	Yahoo!
Car Evaluation Data Set	Car properties and their overall acceptability.	Six categorical features given.	1728	Text	Classification	1997	^[13]^[14]	M. Bohanec
YouTube Comedy Slam Preference Dataset	User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.	Video metadata given.	1,138,562	Text	Classification	2012	^[15]^[16]	Google
Skytrax User Reviews Dataset	User reviews of airlines, airports, seats, and lounges from Skytrax.	Ratings are fine-grain and include many aspects of airport experience.	41396	Text	Classification, regression	2015	^[17]	Q. Nguyen
Teaching Assistant Evaluation Dataset	Teaching assistant reviews.	Features of each instance such as class, class size, and instructor are given.	151	Text	Classification	1997	^[18]^[19]	W. Loh et al.
Vietnamese Students’ Feedback Corpus (UIT-VSFC)	Students’ Feedback.	Comments	16,000	Text	Classification	1997	^[20]	Nguyen et al.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC)	Users’ Facebook Comments.	Comments	6,927	Text	Classification	1997	^[21]	Nguyen et al.
Vietnamese Open-domain Complaint Detection dataset (ViOCD)	Customer product reviews	Comments	5,485	Text	Classification	2021	^[22]	Nguyen et al.
ViHOS: Hate Speech Spans Detection for Vietnamese	Social Media Texts	Comments	Containing 26k spans on 11k comments	Text	Span Detection	2021	^[23]	Hoang et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
NYSK Dataset	English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.	Filtered and presented in XML format.	10,421	XML, text	Sentiment analysis, topic extraction	2013	^[24]	Dermouche, M. et al.
The Reuters Corpus Volume 1	Large corpus of Reuters news stories in English.	Fine-grain categorization and topic codes.	810,000	Text	Classification, clustering, summarization	2002	^[25]	Reuters
The Reuters Corpus Volume 2	Large corpus of Reuters news stories in multiple languages.	Fine-grain categorization and topic codes.	487,000	Text	Classification, clustering, summarization	2005	^[26]	Reuters
Thomson Reuters Text Research Collection	Large corpus of news stories.	Details not described.	1,800,370	Text	Classification, clustering, summarization	2009	^[27]	T. Rose et al.
Saudi Newspapers Corpus	31,030 Arabic newspaper articles.	Metadata extracted.	31,030	JSON	Summarization, clustering	2015	^[28]	M. Alhagri
RE3D (Relationship and Entity Extraction Evaluation Dataset)	Entity and Relation marked data from various news and government sources. Sponsored by Dstl	Filtered, categorisation using Baleen types	not known	JSON	Classification, Entity and Relation recognition	2017	^[29]	Dstl
Examiner Spam Clickbait Catalogue	Clickbait, spam, crowd-sourced headlines from 2010 to 2015	Publish date and headlines	3,089,781	CSV	Clustering, Events, Sentiment	2016	^[30]	R. Kulkarni
ABC Australia News Corpus	Entire news corpus of ABC Australia from 2003 to 2019	Publish date and headlines	1,186,018	CSV	Clustering, Events, Sentiment	2020	^[31]	R. Kulkarni
Worldwide News – Aggregate of 20K Feeds	One week snapshot of all online headlines in 20+ languages	Publish time, URL and headlines	1,398,431	CSV	Clustering, Events, Language Detection	2018	^[32]	R. Kulkarni
Reuters News Wire Headline	11 Years of timestamped events published on the news-wire	Publish time, Headline Text	16,121,310	CSV	NLP, Computational Linguistics, Events	2018	^[33]	R. Kulkarni
The Irish Times Ireland News Corpus	24 Years of Ireland News from 1996 to 2019	Publish time, Headline Category and Text	1,484,340	CSV	NLP, Computational Linguistics, Events	2020	^[34]	R. Kulkarni
News Headlines Dataset for Sarcasm Detection	High quality dataset with Sarcastic and Non-sarcastic news headlines.	Clean, normalized text	26,709	JSON	NLP, Classification, Linguistics	2018	^[35]	Rishabh Misra

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Enron Email Dataset	Emails from employees at Enron organized into folders.	Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com.	~ 500,000	Text	Network analysis, sentiment analysis	2004 (2015)	^[36]^[37]	Klimt, B. and Y. Yang
Ling-Spam Dataset	Corpus containing both legitimate and spam emails.	Four versions of the corpus, differing in whether a lemmatiser or stop-list was enabled.	2,412 Ham 481 Spam	Text	Classification	2000	^[38]^[39]	Androutsopoulos, J. et al.
SMS Spam Collection Dataset	Collected SMS spam messages.	None.	5,574	Text	Classification	2011	^[40]^[41]	T. Almeida et al.
Twenty Newsgroups Dataset	Messages from 20 different newsgroups.	None.	20,000	Text	Natural language processing	1999	^[42]	T. Mitchell et al.
Spambase Dataset	Spam emails.	Many text features extracted.	4,601	Text	Spam detection, classification	1999	^[43]	M. Hopkins et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
MovieTweetings	Movie rating dataset based on public and well-structured tweets		~710,000	Text	Classification, regression	2018	^[44]	S. Dooms
Twitter100k	Pairs of images and tweets		100,000	Text and Images	Cross-media retrieval	2017	^[45]^[46]	Y. Hu, et al.
Sentiment140	Tweet data from 2009 including original text, time stamp, user and sentiment.	Classified using distant supervision from presence of emoticon in tweet.	1,578,627	Tweets, comma-separated values	Sentiment analysis	2009	^[47]^[48]	A. Go et al.
ASU Twitter Dataset	Twitter network data, not actual tweets. Shows connections between a large number of users.	None.	11,316,811 users, 85,331,846 connections	Text	Clustering, graph analysis	2009	^[49]^[50]	R. Zafarani et al.
SNAP Social Circles: Twitter Database	Large Twitter network data.	Node features, circles, and ego networks.	1,768,149	Text	Clustering, graph analysis	2012	^[51]^[52]	J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis	Arabic tweets.	Samples hand-labeled as positive or negative.	2000	Text	Classification	2014	^[53]^[54]	N. Abdulla
Buzz in Social Media Dataset	Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.	Data is windowed so that the user can attempt to predict the events leading up to social media buzz.	140,000	Text	Regression, Classification	2013	^[55]^[56]	F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT)	This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled.	tokenization, part-of-speech and named entity tagging	18,762	Text	Regression, Classification	2015	^[57]^[58]	Xu et al.
Geoparse Twitter benchmark dataset	This dataset contains tweets during different news events in different countries. Manually labeled location mentions.	location annotations added to JSON metadata	6,386	Tweets, JSON	Classification, Information Extraction	2014	^[59]^[60]	S.E. Middleton et al.
Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS)	Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples		30,000	Tweet IDs, CSV	Classification	2020	^[61]^[62]	B. Shmueli et al.
Dutch Social media collection	This dataset contains COVID-19 tweets made by Dutch speakers or users from the Netherlands. The data has been machine labeled.	Classified for sentiment; tweet text and user description translated to English; industry mentions extracted.	271,342	JSONL	Sentiment, multi-label classification, machine translation	2020	^[63]^[64]^[65]	Aaaksh Gupta, CoronaWhy
ReactionGIF dataset	A dataset of 30K tweets and their GIF reactions	Classified for sentiment, reaction, and emotion	30,000	Tweet IDs, JSONL	Classification (sentiment, reaction, emotion)	2021	^[66]^[67]	B. Shmueli et al.
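
The Sentiment140 entry above was labeled by distant supervision: the presence of an emoticon serves as a noisy sentiment label instead of a human annotation. A minimal sketch of that idea follows. The emoticon lists, the function name `distant_label`, and the choice to discard tweets with no emoticon or with conflicting emoticons are assumptions for illustration, not the exact rules used to build the dataset.

```python
from __future__ import annotations

from typing import Optional

# Illustrative emoticon lists (assumed, not taken from the dataset's documentation).
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def distant_label(tweet: str) -> Optional[tuple[str, str]]:
    """Return (text without emoticons, noisy label), or None if no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # neither kind of emoticon, or both kinds at once

    # Remove the emoticons so a classifier cannot simply memorize them.
    cleaned = tweet
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace

    return (cleaned, "positive" if has_pos else "negative")


# Hypothetical example tweets.
for text in ["Loving the new album :)", "Missed my flight :(", "Just landed"]:
    print(repr(text), "->", distant_label(text))
```

Labels obtained this way are cheap but noisy, which is why such corpora can be far larger than hand-labeled ones of the same kind.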

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
NPS Chat Corpus	Posts from age-specific online chat rooms.	Hand privacy masked, tagged for part of speech and dialogue-act.	~ 500,000	XML	NLP, programming, linguistics	2007	^[68]	Forsyth, E., Lin, J., & Martell, C.
Twitter Triple Corpus	A-B-A triples extracted from Twitter.		4,232	Text	NLP	2016	^[69]	Sordoni, A. et al.
UseNet Corpus	UseNet forum postings.	Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English.	7 billion	Text		2011	^[70]	Shaoul, C., & Westbury C.
NUS SMS Corpus	SMS messages collected between two users, with timing analysis.		~ 10,000	XML	NLP	2011	^[71]	Kan, M.
Reddit All Comments Corpus	All Reddit comments (as of 2015).		~ 1.7 billion	JSON	NLP, research	2015	^[72]	Stuck_In_the_Matrix
Ubuntu Dialogue Corpus	Dialogues extracted from Ubuntu chat stream on IRC.		930 thousand dialogues, 7.1 million utterances	CSV	Dialogue Systems Research	2015	^[73]	Lowe, R. et al.
Dialog State Tracking Challenge	The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems.	Transcription of spoken dialogs with labelling	DSTC2 contains ~3.2k calls – DSTC3 contains ~2.3k calls	JSON	Dialogue state tracking	2014	^[74]	Henderson, M., Thomson, B., & Williams, J. D.
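
The UseNet Corpus entry above describes a document-level filter: postings shorter than 500 words, longer than 500,000 words, or less than 90% English were omitted. A sketch of such a filter is shown below. How the corpus authors measured the English fraction is not stated here, so the word-list lookup, the punctuation stripping, and the name `keep_document` are assumptions for illustration.

```python
from __future__ import annotations


def keep_document(
    text: str,
    english_words: set[str],
    min_words: int = 500,
    max_words: int = 500_000,
    min_english: float = 0.90,
) -> bool:
    """Return True if the document passes the length and language thresholds."""
    words = text.lower().split()
    if not words or not (min_words <= len(words) <= max_words):
        return False

    # Assumed heuristic: a token counts as English if it appears in the word list
    # after surrounding punctuation is stripped.
    english = sum(1 for w in words if w.strip(".,;:!?\"'()") in english_words)
    return english / len(words) >= min_english


# Hypothetical three-word vocabulary and documents, just to exercise the filter.
vocab = {"the", "cat", "sat"}
long_english = " ".join(["The cat sat."] * 200)            # 600 words, all in vocab
too_short = "The cat sat."                                  # 3 words
mixed = " ".join(["the cat sat"] * 167 + ["xyzzy"] * 100)   # 601 words, about 83% in vocab

print(keep_document(long_english, vocab))
print(keep_document(too_short, vocab))
print(keep_document(mixed, vocab))
```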

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
FreeLaw	Filtered data from Court Listener, part of the FreeLaw project.	Cleaned and normalized text	4,940,710	JSON	NLP, linguistics	2020	^[75]	T. Hoppe
Pile of Law	Corpus of legal and administrative data	Cleaned, normalized, and privatized	~50,000,000	JSON	NLP, linguistics, sentiment	2022	^[76]^[77]	L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho
Caselaw Access Project	All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States.	Cleaned and normalized text	~10,000	JSON	NLP, linguistics	2022	^[78]	A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Web of Science Dataset	Hierarchical Datasets for Text Classification	None.	46,985	Text	Classification, Categorization	2017	^[79]^[80]	K. Kowsari et al.
Legal Case Reports	Federal Court of Australia cases from 2006 to 2009.	None.	4,000	Text	Summarization, citation analysis	2012	^[81]^[82]	F. Galgani et al.
Blogger Authorship Corpus	Blog entries of 19,320 people from blogger.com.	Blogger self-provided gender, age, industry, and astrological sign.	681,288	Text	Sentiment analysis, summarization, classification	2006	^[83]^[84]	J. Schler et al.
Social Structure of Facebook Networks	Large dataset of the social structure of Facebook.	None.	100 colleges covered	Text	Network analysis, clustering	2012	^[85]^[86]	A. Traud et al.
Dataset for the Machine Comprehension of Text	Stories and associated questions for testing comprehension of text.	None.	660	Text	Natural language processing, machine comprehension	2013	^[87]^[88]	M. Richardson et al.
The Penn Treebank Project	Naturally occurring text annotated for linguistic structure.	Text is parsed into semantic trees.	~ 1M words	Text	Natural language processing, summarization	1995	^[89]^[90]	M. Marcus et al.
DEXTER Dataset	Task given is to determine, from features given, which articles are about corporate acquisitions.	Features extracted include word stems. Distractor features included.	2600	Text	Classification	2008	^[91]	Reuters
Google Books N-grams	N-grams from a very large corpus of books	None.	2.2 TB of text	Text	Classification, clustering, regression	2011	^[92]^[93]	Google
Personae Corpus	Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.	In addition to normal texts, syntactically annotated texts are given.	145	Text	Classification, regression	2008	^[94]^[95]	K. Luyckx et al.
PushShift	Archives of social media websites, including Reddit, Twitter, and Hacker News.	Text extracted and normalized from WARCs	~100,000,000 posts	JSON	NLP, sentiment, linguistics	2022	^[96]^[97]	J. Baumgartner
SEC Filings	EDGAR company filings.	Text extracted.		CSV	NLP
CNAE-9 Dataset	Categorization task for free text descriptions of Brazilian companies.	Word frequency has been extracted.	1080	Text	Classification	2012	^[98]^[99]	P. Ciarelli et al.
Sentiment Labeled Sentences Dataset	3000 sentiment labeled sentences.	Sentiment of each sentence has been hand labeled as positive or negative.	3000	Text	Classification, sentiment analysis	2015	^[100]^[101]	D. Kotzias
BlogFeedback Dataset	Dataset to predict the number of comments a post will receive based on features of that post.	Many features of each post extracted.	60,021	Text	Regression	2014	^[102]^[103]	K. Buza
PubMed Central	PubMed® comprises more than 35 million citations for biomedical literature from MEDLINE, life science journals, and online books.	None	35 Million	Text	NLP
USPTO	The United States Patent and Trademark Office			Text	NLP
PhilPapers	Open access collection of philosophy publications			Text	NLP
Book Corpus	A popular large-scale text corpus.	None		Text	NLP	2015	^[104]	Zhu, Yukun, et al.
Stanford Natural Language Inference (SNLI) Corpus	Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs.	Entailment class labels, syntactic parsing by the Stanford PCFG parser	570,000	Text	Natural language inference/recognizing textual entailment	2015	^[105]	S. Bowman et al.
DSL Corpus Collection (DSLCC)	A multilingual collection of short excerpts of journalistic texts in similar languages and dialects.	None	294,000 phrases	Text	Discriminating between similar languages	2017	^[106]	Tan, Liling et al.
Urban Dictionary Dataset	Corpus of words, votes and definitions	User names anonymised	2,580,925	CSV	NLP, Machine comprehension	May 2016	^[107]	Anonymous
T-REx	Wikipedia abstracts aligned with Wikidata entities	Alignment of Wikidata triples with Wikipedia abstracts	11M aligned triples	JSON and NIF	NLP, Relation Extraction	2018	^[108]	H. Elsahar et al.
General Language Understanding Evaluation (GLUE)	Benchmark of nine tasks	Various	~1M sentences and sentence pairs		NLU	2018	^[109]^[110]^[111]	Wang et al.
Contract Understanding Atticus Dataset (CUAD) (formerly known as Atticus Open Contract Dataset (AOK))	Dataset of legal contracts with rich expert annotations		~13,000 labels	CSV and PDF	Natural language processing, question answering	2021		The Atticus Project
Vietnamese Image Captioning Dataset (UIT-ViIC)	Vietnamese Image Captioning Dataset		19,250 captions for 3,850 images	CSV and PDF	Natural language processing, Computer vision	2020	^[112]	Lam et al.
Vietnamese Names annotated with Genders (UIT-ViNames)	Vietnamese Names annotated with Genders		26,850 Vietnamese full names annotated with genders	CSV	Natural language processing	2020	^[113]	To et al.
Vietnamese Constructive and Toxic Speech Detection Dataset (UIT-ViCTSD)	Vietnamese Constructive and Toxic Speech Detection Dataset		10,000 Vietnamese users' comments on online newspapers on 10 domains	CSV	Natural Language Processing	2021	^[114]	Nguyen et al.
PG-19	A set of books extracted from the Project Gutenberg books library			Text	Natural Language Processing	2019		J. W. Rae et al.
DeepMind Mathematics	Mathematical question and answer pairs.			Text	Natural Language Processing	2018	^[115]	D. Saxton et al.
Anna's Archive	A comprehensive archive of published books and papers	None	100,356,641	Text, EPUB, PDF	Natural Language Processing	2024

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Zero Resource Speech Challenge 2015	Spontaneous speech (English), Read speech (Xitsonga).	None, raw WAV files.	English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers	WAV (audio only)	Unsupervised discovery of speech features/subword units/word units	2015	^[116]^[117]	Versteegh et al.
Parkinson Speech Dataset	Multiple recordings of people with and without Parkinson's Disease.	Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale.	1,040	Text	Classification, regression	2013	^[118]^[119]	B. E. Sakar et al.
Spoken Arabic Digits	Spoken Arabic digits from 44 male and 44 female speakers.	Time-series of mel-frequency cepstrum coefficients.	8,800	Text	Classification	2010	^[120]^[121]	M. Bedda et al.
ISOLET Dataset	Spoken letter names.	Features extracted from sounds.	7797	Text	Classification	1994	^[122]^[123]	R. Cole et al.
Japanese Vowels Dataset	Nine male speakers uttered two Japanese vowels successively.	12-degree linear prediction analysis applied to each utterance to obtain a discrete-time series with 12 cepstrum coefficients.	640	Text	Classification	1999	^[124]^[125]	M. Kudo et al.
Parkinson's Telemonitoring Dataset	Multiple recordings of people with and without Parkinson's Disease.	Sound features extracted.	5875	Text	Classification	2009	^[126]^[127]	A. Tsanas et al.
TIMIT	Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.	Speech is lexically and phonemically transcribed.	6300	Text	Speech recognition, classification.	1986	^[128]^[129]	J. Garofolo et al.
Arabic Speech Corpus	A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level.	Speech is orthographically and phonetically transcribed with stress marks.	~1900	Text, WAV	Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education.	2016	^[130]	N. Halabi
Common Voice	A public domain database of crowdsourced data across a wide range of dialects.	Validation by other users.	English: 1,118 hours	MP3 with corresponding text files	Speech recognition	June 2017 (December 2019)	^[131]	Mozilla
LJSpeech	A single-speaker corpus of English public-domain audiobook recordings, split into short clips at punctuation marks.	Quality check, normalized transcription alongside the original.	13,100	CSV, WAV	Speech synthesis	2017	^[132]	Keith Ito, Linda Johnson
Arabic Speech Commands Dataset	Collected from 30 contributors and grouped into 40 keywords.	Raw WAV files	12,000	WAV, CSV	Speech recognition, keyword spotting	2021	^[133]	Abdulkader Ghandoura

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Geographic Origin of Music Data Set	Audio features of music samples from different locations.	Audio features extracted using MARSYAS software.	1,059	Text	Geographic classification, clustering	2014	^[134]^[135]	F. Zhou et al.
Million Song Dataset	Audio features from one million different songs.	Audio features extracted.	1M	Text	Classification, clustering	2011	^[136]^[137]	T. Bertin-Mahieux et al.
MUSDB18	Multi-track popular music recordings	Raw audio	150	MP4, WAV	Source Separation	2017	^[138]	Z. Rafii et al.
Free Music Archive	Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text.	Raw audio and audio features.	106,574	Text, MP3	Classification, recommendation	2017	^[139]	M. Defferrard et al.
Bach Choral Harmony Dataset	Bach chorale chords.	Audio features extracted.	5665	Text	Classification	2014	^[140]^[141]	D. Radicioni et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
UrbanSound	Labeled sound recordings of sounds like air conditioners, car horns and children playing.	Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file.	1,059	Sound (WAV)	Classification	2014	^[142]^[143]	J. Salamon et al.
AudioSet	10-second sound snippets from YouTube videos, and an ontology of over 500 labels.	128-d PCA'd VGG-ish features every 1 second.	2,084,320	Text (CSV) and TensorFlow Record files	Classification	2017	^[144]	J. Gemmeke et al., Google
Bird Audio Detection challenge	Audio from environmental monitoring stations, plus crowdsourced recordings		17,000+		Classification	2016 (2018)	^[145]^[146]	Queen Mary University and IEEE Signal Processing Society
WSJ0 Hipster Ambient Mixtures	Audio from WSJ0 mixed with noise recorded in the San Francisco Bay Area	Noise clips matched to WSJ0 clips	28,000	Sound (WAV)	Audio source separation	2019	^[147]	Wichern, G., et al., Whisper and MERL
Clotho	4,981 audio samples of 15 to 30 seconds each, every sample having five different captions of 8 to 20 words.		24,905	Sound (WAV) and text (CSV)	Automated audio captioning	2020	^[148]^[149]	K. Drossos, S. Lipping, and T. Virtanen

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Witty Worm Dataset	Dataset detailing the spread of the Witty worm and the infected computers.	Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers.	55,909 IP addresses	Text	Classification	2004	^[150]^[151]	Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset	Cleaned vital signals from human patients which can be used to estimate blood pressure.	125 Hz vital signs have been cleaned.	12,000	Text	Classification, regression	2015	^[152]^[153]	M. Kachuee et al.
Gas Sensor Array Drift Dataset	Measurements from 16 chemical sensors utilized in simulations for drift compensation.	Extensive number of features given.	13,910	Text	Classification	2012	^[154]^[155]	A. Vergara
Servo Dataset	Data covering the nonlinear relationships observed in a servo-amplifier circuit.	Levels of various components as a function of other components are given.	167	Text	Regression	1993	^[156]^[157]	K. Ullrich
UJIIndoorLoc-Mag Dataset	Indoor localization database to test indoor positioning systems. Data is magnetic field based.	Train and test splits given.	40,000	Text	Classification, regression, clustering	2015	^[158]^[159]	D. Rambla et al.
Sensorless Drive Diagnosis Dataset	Electrical signals from motors with defective components.	Statistical features extracted.	58,508	Text	Classification	2015	^[160]^[161]	M. Bator

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio)	People performing five standard actions while wearing motion trackers.	None.	165,632	Text	Classification	2013	^[162]^[163]	Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset	Features extracted from video of people doing various gestures.	Features extracted aim at studying gesture phase segmentation.	9900	Text	Classification, clustering	2014	^[164]^[165]	R. Madeo et al.
Vicon Physical Action Data Set	10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker.	Many parameters recorded by 3D tracker.	3000	Text	Classification	2011	^[166]^[167]	T. Theodoridis
Daily and Sports Activities Dataset	Motor sensor data for 19 daily and sports activities.	Many sensors given, no preprocessing done on signals.	9120	Text	Classification	2013	^[168]^[169]	B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset	Gyroscope and accelerometer data from people wearing smartphones and performing normal actions.	Actions performed are labeled, all signals preprocessed for noise.	10,299	Text	Classification	2012	^[170]^[171]	J. Reyes-Ortiz et al.
Australian Sign Language Signs	Australian sign language signs captured by motion-tracking gloves.	None.	2565	Text	Classification	2002	^[172]^[173]	M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units	Five variations of the biceps curl exercise monitored with IMUs.	Some statistics calculated from raw data.	39,242	Text	Classification	2013	^[174]^[175]	W. Ugulino et al.
sEMG for Basic Hand movements Dataset	Two databases of surface electromyographic signals of 6 hand movements.	None.	3000	Text	Classification	2014	^[176]^[177]	C. Sapsanis et al.
REALDISP Activity Recognition Dataset	Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition.	None.	1419	Text	Classification	2014	^[177]^[178]	O. Banos et al.
Heterogeneity Activity Recognition Dataset	Data from multiple different smart devices for humans performing various activities.	None.	43,930,257	Text	Classification, clustering	2015	^[179]^[180]	A. Stisen et al.
Indoor User Movement Prediction from RSS Data	Temporal wireless network data that can be used to track the movement of people in an office.	None.	13,197	Text	Classification	2016	^[181]^[182]	D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset	18 different types of physical activities performed by 9 subjects wearing 3 IMUs.	None.	3,850,505	Text	Classification	2012	^[183]	A. Reiss
OPPORTUNITY Activity Recognition Dataset	Human Activity Recognition from wearable, object, and ambient sensors is a dataset devised to benchmark human activity recognition algorithms.	None.	2551	Text	Classification	2012	^[184]^[185]	D. Roggen et al.
Real World Activity Recognition Dataset	Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors.	None.	3,150,000 (per sensor)	Text	Classification	2016	^[186]	T. Sztyler et al.
Toronto Rehab Stroke Pose Dataset	3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot.	None.	10 healthy people and 9 stroke survivors (3500–6000 frames per person)	CSV	Classification	2017	^[187]^[188]^[189]	E. Dolatabadi et al.
Corpus of Social Touch (CoST)	7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm.	Touch gestures performed are segmented and labeled.	7805 gesture captures	CSV	Classification	2016	^[190]^[191]	M. Jung et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Wine Dataset	Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.	13 properties of each wine are given	178	Text	Classification, regression	1991	^[192]^[193]	M. Forina et al.
Combined Cycle Power Plant Data Set	Data from various sensors within a power plant running for 6 years.	None	9568	Text	Regression	2014	^[194]^[195]	P. Tufekci et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
HIGGS Dataset	Monte Carlo simulations of particle accelerator collisions.	28 features of each collision are given.	11M	Text	Classification	2014	^[196]^[197]^[198]	D. Whiteson
HEPMASS Dataset	Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise.	28 features of each collision are given.	10,500,000	Text	Classification	2016	^[197]^[198]^[199]	D. Whiteson

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Yacht Hydrodynamics Dataset	Yacht performance based on dimensions.	Six features are given for each yacht.	308	Text	Regression	2013	^[200]^[201]	R. Lopez
Robot Execution Failures Dataset	5 data sets that center around robotic failure to execute common tasks.	Integer valued features such as torque and other sensor measurements.	463	Text	Classification	1999	^[202]	L. Seabra et al.
Pittsburgh Bridges Dataset	Design description is given in terms of several properties of various bridges.	Various bridge features are given.	108	Text	Classification	1990	^[203]^[204]	Y. Reich et al.
Automobile Dataset	Data about automobiles, their insurance risk, and their normalized losses.	Car features extracted.	205	Text	Regression	1987	^[205]^[206]	J. Schlimmer et al.
Auto MPG Dataset	MPG data for cars.	Eight features of each car given.	398	Text	Regression	1993	^[207]	Carnegie Mellon University
Energy Efficiency Dataset	Heating and cooling requirements given as a function of building parameters.	Building parameters given.	768	Text	Classification, regression	2012	^[208]^[209]	A. Xifara et al.
Airfoil Self-Noise Dataset	A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections.	Data about frequency, angle of attack, etc., are given.	1503	Text	Regression	2014	^[210]	R. Lopez
Challenger USA Space Shuttle O-Ring Dataset	Attempt to predict O-ring problems given past Challenger data.	Several features of each flight, such as launch temperature, are given.	23	Text	Regression	1993	^[211]^[212]	D. Draper et al.
Statlog (Shuttle) Dataset	NASA space shuttle datasets.	Nine features given.	58,000	Text	Classification	2002	^[213]	NASA

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Volcanoes on Venus – JARtool experiment Dataset	Venus images returned by the Magellan spacecraft.	Images are labeled by humans.	not given	Images	Classification	1991	^[214]^[215]	M. Burl
MAGIC Gamma Telescope Dataset	Monte Carlo generated high-energy gamma particle events.	Numerous features extracted from the simulations.	19,020	Text	Classification	2007	^[215]^[216]	R. Bock
Solar Flare Dataset	Measurements of the number of certain types of solar flare events occurring in a 24-hour period.	Many solar flare-specific features are given.	1389	Text	Regression, classification	1989	^[217]	G. Bradshaw
CAMELS Multifield Dataset	2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range of values of the cosmological and astrophysical parameters.	Each map and grid has 6 cosmological and astrophysical parameters associated with it.	405,000 2D maps and 405,000 3D grids	2D maps and 3D grids	Regression	2021	^[218]	Francisco Villaescusa-Navarro et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Volcanoes of the World	Volcanic eruption data for all known volcanic events on earth.	Details such as region, subregion, tectonic setting, dominant rock type are given.	1535	Text	Regression, classification	2013	^[219]	E. Venzke et al.
Seismic-bumps Dataset	Seismic activities from a coal mine.	Seismic activity was classified as hazardous or not.	2584	Text	Classification	2013	^[220]^[221]	M. Sikora et al.
CAMELS-US	Catchment hydrology dataset with hydrometeorological timeseries and various attributes	see Reference	671	CSV, Text, Shapefile	Regression	2017	^[222]^[223]	N. Addor et al. / A. Newman et al.
CAMELS-Chile	Catchment hydrology dataset with hydrometeorological timeseries and various attributes	see Reference	516	CSV, Text, Shapefile	Regression	2018	^[224]	C. Alvarez-Garreton et al.
CAMELS-Brazil	Catchment hydrology dataset with hydrometeorological timeseries and various attributes	see Reference	897	CSV, Text, Shapefile	Regression	2020	^[225]	V. Chagas et al.
CAMELS-GB	Catchment hydrology dataset with hydrometeorological timeseries and various attributes	see Reference	671	CSV, Text, Shapefile	Regression	2020	^[226]	G. Coxon et al.
CAMELS-Australia	Catchment hydrology dataset with hydrometeorological timeseries and various attributes	see Reference	222	CSV, Text, Shapefile	Regression	2021	^[227]	K. Fowler et al.
LamaH-CE	Catchment hydrology dataset with hydrometeorological timeseries and various attributes	see Reference	859	CSV, Text, Shapefile	Regression	2021	^[228]	C. Klingler et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Concrete Compressive Strength Dataset	Dataset of concrete properties and compressive strength.	Nine features are given for each sample.	1030	Text	Regression	2007	^[229]^[230]	I. Yeh
Concrete Slump Test Dataset	Concrete slump flow given in terms of properties.	Features of concrete given such as fly ash, water, etc.	103	Text	Regression	2009	^[231]^[232]	I. Yeh
Musk Dataset	Predict if a molecule, given the features, will be a musk or a non-musk.	168 features given for each molecule.	6598	Text	Classification	1994	^[233]	Arris Pharmaceutical Corp.
Steel Plates Faults Dataset	Steel plates of 7 different types.	27 features given for each sample.	1941	Text	Classification	2010	^[234]	Semeion Research Center
Noble Metal Monometallic Nanoparticles Datasets	Processing and structural features of monometallic nanoparticles, labels being formation energy.	85-182 features given for each sample.	425 to 4000	CSV	Regression	2017 to 2023	^[235]^[236]^[237]^[238]^[239]^[240]	A. Barnard and G. Opletal
Noble Metal Bimetallic Nanoparticles Datasets	Processing and structural features of bimetallic nanoparticles, labels being formation energy.	922 features given for each sample.	138147 to 162770	CSV	Regression	2023	^[241]^[242]^[243]^[244]^[245]^[246]^[247]^[248]^[249]^[250]^[251]^[252]	J. Ting et al.
AuPdPt Trimetallic Nanoparticles Dataset	Processing and structural features of AuPdPt nanoparticles, labels being formation energy.	1958 features given for each sample.	48136	CSV	Regression	2023	^[253]	K. Lu et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Age Dataset	A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain.	A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project.	1,223,009	Text	Regression, Classification	2022	Paper^[254] Dataset^[255]	Amoradnejad et al.
Synthetic Fundus Dataset^[256]	Photorealistic retinal images and vessel segmentations. Public domain.	2500 images with 1500*1152 pixels useful for segmentation and classification of veins and arteries on a single background.	2500	Images	Classification, Segmentation	2020	^[257]	C. Valenti et al.
EEG Database	Study to examine EEG correlates of genetic predisposition to alcoholism.	Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second.	122	Text	Classification	1999	^[258]	H. Begleiter
P300 Interface Dataset	Data from nine subjects collected using P300-based brain-computer interface for disabled subjects.	Split into four sessions for each subject. MATLAB code given.	1,224	Text	Classification	2008	^[259]^[260]	U. Hoffman et al.
Heart Disease Data Set	Attributes of patients with and without heart disease.	75 attributes given for each patient with some missing values.	303	Text	Classification	1988	^[261]^[262]	A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) Dataset	Dataset of features of breast masses. Diagnosis by physician is given.	10 features for each sample are given.	569	Text	Classification	1995	^[263]^[264]	W. Wolberg et al.
National Survey on Drug Use and Health	Large scale survey on health and drug use in the United States.	None.	55,268	Text	Classification, regression	2012	^[265]	United States Department of Health and Human Services
Lung Cancer Dataset	Lung cancer dataset without attribute definitions	56 features are given for each case	32	Text	Classification	1992	^[266]^[267]	Z. Hong et al.
Arrhythmia Dataset	Data for a group of patients, of which some have cardiac arrhythmia.	276 features for each instance.	452	Text	Classification	1998	^[268]^[269]	H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset	9 years of readmission data across 130 US hospitals for patients with diabetes.	Many features of each readmission are given.	100,000	Text	Classification, clustering	2014	^[270]^[271]	J. Clore et al.
Diabetic Retinopathy Debrecen Dataset	Features extracted from images of eyes with and without diabetic retinopathy.	Features extracted and conditions diagnosed.	1151	Text	Classification	2014	^[272]^[273]	B. Antal et al.
Diabetic Retinopathy Messidor Dataset	Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR)	Features retinopathy grade and risk of macular edema	1200	Images, Text	Classification, Segmentation	2008	^[274]^[275]	Messidor Project
Liver Disorders Dataset	Data for people with liver disorders.	Seven biological features given for each patient.	345	Text	Classification	1990	^[276]^[277]	Bupa Medical Research Ltd.
Thyroid Disease Dataset	10 databases of thyroid disease patient data.	None.	7200	Text	Classification	1987	^[278]^[279]	R. Quinlan
Mesothelioma Dataset	Mesothelioma patient data.	Large number of features, including asbestos exposure, are given.	324	Text	Classification	2016	^[280]^[281]	A. Tanrikulu et al.
Parkinson's Vision-Based Pose Estimation Dataset	2D human pose estimates of Parkinson's patients performing a variety of tasks.	Camera shake has been removed from trajectories.	134	Text	Classification, regression	2017	^[282]^[283]^[284]	M. Li et al.
KEGG Metabolic Reaction Network (Undirected) Dataset	Network of metabolic pathways. A reaction network and a relation network are given.	Detailed features for each network node and pathway are given.	65,554	Text	Classification, clustering, regression	2011	^[285]	M. Naeem et al.
Modified Human Sperm Morphology Analysis Dataset (MHSMA)	Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail.	Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created.	1,540	.npy files	Classification	2019	^[286]^[287]	S. Javadi and S.A. Mirroshandel

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Abalone Dataset	Physical measurements of Abalone. Weather patterns and location are also given.	None.	4177	Text	Regression	1995	^[288]	Marine Research Laboratories – Taroona
Zoo Dataset	Artificial dataset covering 7 classes of animals.	Animals are classed into 7 categories and features are given for each.	101	Text	Classification	1990	^[289]	R. Forsyth
Demospongiae Dataset	Data about marine sponges.	503 sponges in the Demosponge class are described by various features.	503	Text	Classification	2010	^[290]	E. Armengol et al.
Farm animals data	PLF data inventory (cows, pigs; location, acceleration, etc.).	Labeled datasets.	List is constantly updated	Text	Classification	2020	^[291]	V. Bloch
Splice-junction Gene Sequences Dataset	Primate splice-junction gene sequences (DNA) with associated imperfect domain theory.	None.	3190	Text	Classification	1992	^[267]	G. Towell et al.
Mice Protein Expression Dataset	Expression levels of 77 proteins measured in the cerebral cortex of mice.	None.	1080	Text	Classification, Clustering	2015	^[292]^[293]	C. Higuera et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
UCI Mushroom Dataset	Mushroom attributes and classification.	Many properties of each mushroom are given.	8124	Text	Classification	1987	^[294]	J. Schlimmer
Secondary Mushroom Dataset	Mushroom attributes and classification	Simulated data from larger and more realistic primary mushroom entries. Fully reproducible.	61069	Text	Classification	2020	^[295]^[296]	D. Wagner et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Forest Fires Dataset	Forest fires and their properties.	13 features of each fire are extracted.	517	Text	Regression	2008	^[297]^[298]	P. Cortez et al.
Iris Dataset	Three types of iris plants are described by 4 different attributes.	None.	150	Text	Classification	1936	^[299]^[300]	R. Fisher
Plant Species Leaves Dataset	Sixteen samples of leaf each of one-hundred plant species.	Shape descriptor, fine-scale margin, and texture histograms are given.	1600	Text	Classification	2012	^[301]^[302]	J. Cope et al.
Soybean Dataset	Database of diseased soybean plants.	35 features for each plant are given. Plants are classified into 19 categories.	307	Text	Classification	1988	^[303]	R. Michalski et al.
Seeds Dataset	Measurements of geometrical properties of kernels belonging to three different varieties of wheat.	None.	210	Text	Classification, clustering	2012	^[304]^[305]	Charytanowicz et al.
Covertype Dataset	Data for predicting forest cover type strictly from cartographic variables.	Many geographical features given.	581,012	Text	Classification	1998	^[306]^[307]	J. Blackard et al.
Abscisic Acid Signaling Network Dataset	Data for a plant signaling network. Goal is to determine set of rules that governs the network.	None.	300	Text	Causal-discovery	2008	^[308]	J. Jenkens et al.
Folio Dataset	20 photos of leaves for each of 32 species.	None.	637	Images, text	Classification, clustering	2015	^[309]^[310]	T. Munisami et al.
Oxford Flower Dataset	17 category dataset of flowers.	Train/test splits, labeled images.	1360	Images, text	Classification	2006	^[311]^[312]	M-E Nilsback et al.
Plant Seedlings Dataset	12 category dataset of plant seedlings.	Labelled images, segmented images.	5544	Images	Classification, detection	2017	^[313]	Giselsson et al.
Fruits-360	Database with images of 131 fruits and vegetables.	100x100 pixels, white background.	90483	Images (jpg)	Classification	2017–2024	^[314]	Mihai Oltean

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Ecoli Dataset	Protein localization sites.	Various features of the protein localization sites are given.	336	Text	Classification	1996	^[315]^[316]	K. Nakai et al.
MicroMass Dataset	Identification of microorganisms from mass-spectrometry data.	Various mass spectrometer features.	931	Text	Classification	2013	^[317]^[318]	P. Mahe et al.
Yeast Dataset	Predictions of cellular localization sites of proteins.	Eight features given per instance.	1484	Text	Classification	1996	^[319]^[320]	K. Nakai et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Numenta Anomaly Benchmark (NAB)	Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.	None	50+ files	CSV	Anomaly detection	2016 (continually updated)	^[322]	Numenta
Skoltech Anomaly Benchmark (SKAB)	Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed.	There are two markups for Outlier detection (point anomalies) and Changepoint detection (collective anomalies) problems	30+ files (v0.9)	CSV	Anomaly detection	2020 (continually updated)	^[323] ^[324]	Iurii D. Katser and Vyacheslav O. Kozitsin
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study	Most data files are adapted from UCI Machine Learning Repository data, some are collected from the literature.	treated for missing values, numerical attributes only, different percentages of anomalies, labels	1000+ files	ARFF	Anomaly detection	2016 (possibly updated with new datasets and/or results)	^[325]	Campos et al.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
DBpedia Neural Question Answering (DBNQA) Dataset	A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base.	This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts.	894,499	Question-query pairs	Question Answering	2018	^[326]^[327]	Hartmann, Soru, and Marx et al.
Vietnamese Question Answering Dataset (UIT-ViQuAD)	A large collection of Vietnamese questions for evaluating MRC models.	This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.	23,074	Question-answer pairs	Question Answering	2020	^[328]	Nguyen et al.
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus (ViMMRC)	A collection of Vietnamese multiple-choice questions for evaluating MRC models.	This corpus includes 2,783 Vietnamese multiple-choice questions.	2,783	Question-answer pairs	Question Answering/Machine Reading Comprehension	2020	^[329]	Nguyen et al.
Open-Domain Question Answering Goes Conversational via Question Rewriting	An end-to-end open-domain question answering dataset.	This dataset includes 14,000 conversations with 81,000 question-answer pairs.		Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source. Further details are provided in the project's GitHub repository and respective Hugging Face dataset card.	Question Answering	2021	^[330]	Anantha and Vakulenko et al.
UnifiedQA	Question-answer data	Processed dataset			Question Answering	2020	^[331]	Khashabi et al.

Dataset Name	Brief description	Preprocessing	Format	Default Task	Created (updated)	Reference	Creator
Taskmaster	"The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."^[332]	Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains. Taskmaster-2: 17,289 dialogs in the seven domains (restaurants, food ordering, movies, hotels, flights, music and sports). Taskmaster-3: 23,757 movie ticketing dialogs.	Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id Taskmaster-3: conversation id, utterances, vertical, scenario, instructions. For further details check the project's GitHub repository or the Hugging Face dataset cards (taskmaster-1, taskmaster-2, taskmaster-3).	Dialog/Instruction prompted	2019	^[333]	Byrne and Krishnamoorthi et al.
DrRepair	A labeled dataset for program repair.	Pre-processed data	Check format details in the project's worksheet.	Dialog/Instruction prompted	2020	^[334]	Michihiro et al.
Natural Instructions v2	Large dataset that covers a wider range of reasoning abilities.		Each task consists of input/output pairs and a task definition. Further information is provided in the GitHub repository of the project and the Hugging Face data card.	Input/Output and task definition	2022	^[335]	Wang et al.
LAMBADA	"LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."^[336]		Information about this dataset's format is available in the Hugging Face dataset card and the project's website. The dataset can be downloaded here, and the rejected data here.		2016	^[337]	Paperno et al.
FLAN		A re-preprocessed version of the FLAN dataset, with updates made since the original FLAN dataset was released, is available on Hugging Face: test data, train data, validation data. The scripts to process the data are available in the GitHub repo mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan. Another FLAN GitHub repo was created as well; this is the one associated with the dataset card on Hugging Face.			2021	^[338]	Wei et al.

Dataset Name	Brief description	Preprocessing	Format	Reference	Creator
MITRE ATT&CK	ATT&CK is a globally accessible knowledge base of adversary tactics and techniques.		Data can be downloaded from these two GitHub repositories: version 2.1 and version 2.0	^[339]	MITRE ATT&CK
CAPEC	Common Attack Pattern Enumeration and Classification		Data can be downloaded from CAPEC's website: Mechanisms of Attack Domains of Attack	^[340]	CAPEC
CVE	CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services.		Data can be downloaded from: Allitems	^[341]	CVE
CWE	Common Weakness Enumeration data.		Data can be downloaded from: Software Development, Hardware Design, Research Concepts	^[342]	CWE
MalwareTextDB	Annotated database of malware texts.		The GitHub repository of the project contains the data to download.	^[343]	Kiat et al.
USENIX Security Symposium proceedings	Collection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022.	This data is not pre-processed.	1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022.	^[344]	USENIX Security Symposium
APTNotes	Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data.	This data is not pre-processed.	The GitHub repository of the project contains a file with links to the data stored in box. Data files can also be downloaded here.	^[345]	APT Notes
arXiv Cryptography and Security papers	Collection of articles about cybersecurity	This data is not pre-processed.	All articles available here.	^[346]	arXiv
Security eBooks for free	Small collection of security eBooks, and security presentations publicly available.	This data is not pre-processed.		^[347]^[348]^[349]^[350]^[351]^[352]^[353]^[354]^[355]^[356]^[357]^[358]
National Cyber Security strategy repository	Repository of worldwide strategy documents about cybersecurity.	This data is not pre-processed.		^[359]
Cyber Security Natural Language Processing	Data about cybersecurity strategies from more than 75 countries.	Tokenization, removal of meaningless frequent words.		^[360]	Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin
APT Reports collection	Sample of APT reports, malware, technology, and intelligence collection	Raw and tokenized data available.	All data is available in this GitHub repository.	^{[citation needed]}	blackorbird
Offensive Language Identification Dataset (OLID)			Data available in the project's website. Data is also available here.	^[361]	Zampieri et al.
Cyber reports from the National Cyber Security Centre		This data is not pre-processed.	Threat reports, reports and advisory, news, blog-posts, speeches. Alternate list of reports.	^[362]
APT reports by Kaspersky		This data is not pre-processed.		^[363]
The cyberwire		This data is not pre-processed.	Newsletters, podcasts, and stories.	^[364]
Databreaches news		This data is not pre-processed.	News, list of news from Aug 2022 to Feb 2023	^[365]
Cybernews		This data is not pre-processed.	News, curated list of news	^[366]
Bleepingcomputer		This data is not pre-processed.	News	^[367]
Therecord		This data is not pre-processed.	Cybercrime news	^[368]
Hackread		This data is not pre-processed.	Hacking news	^[369]
Securelist		This data is not pre-processed.	APT reports, archive, DDOS reports, incidents, Kaspersky security bulletin, industrial threats, malware-reports, opinions, publications, research, and SAS.	^[370]
Stucco project	The Stucco project collects data not typically integrated into security systems.	This data is not pre-processed	Project's website with data information Reviewed source with links to data sources	^[371]
Farsightsecurity	Website with technical information, reports, and more about security topics.	This data is not pre-processed	Technical information, research, reports.	^[372]
Schneier	Website with academic papers about security topics.	This data is not pre-processed	Papers per category, papers archive by date.	^[373]
Trendmicro	Website with research, news, and perspectives about security topics.	This data is not pre-processed	Reviewed list of Trendmicro research, news, and perspectives.	^[374]
The Hacker News	News about cybersecurity topics.	This data is not pre-processed	data breaches, cyberattacks, vulnerabilities, malware news.	^[375]
Krebsonsecurity	Security news and investigation	This data is not pre-processed	curated list of news	^[376]
Mitre Defend	Matrix of Defend artifacts		json files	^[377]
Mitre Atlas	Mitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations.	This data is not pre-processed		^[378]
Mitre Engage	MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals.	This data is not pre-processed		^[379]
Hacking Tutorials		This data is not pre-processed		^[380]

Dataset Name	Brief description	Preprocessing	Format	Reference	Creator
TCFD reports	Database of company reports that include TCFD-related disclosures.	This data is not pre-processed	Direct link to reports Curated list of reports	^[381]	TCFD Knowledge Hub
Corporate Social Responsibility Reports	A listing of responsibility reports on the internet.	This data is not pre-processed	Curated list of reports	^[382]	ResponsibilityReports
The Intergovernmental Panel on Climate Change (IPCC)	A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response options	This data is not pre-processed	Reports Curated list of reports	^[383]	IPCC
Alliance for Research on Corporate Sustainability		This data is not pre-processed	Curated list of blog posts	^[384]	ARCS
ESG corpus: Knowledge Hub of the Accounting for Sustainability		This data is not pre-processed	Guides, case studies, blogs, and reports & surveys.	^[385]	Mehra et al.
CLIMATE-FEVER	A dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet.	Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute or do not give enough information to validate the claim, totalling 7,675 claim-evidence pairs.^[386]	Dataset HF card, and project's GitHub repository.	^[387]	Diggelmann et al.
Climate News dataset	A dataset for NLP and climate change media researchers	The dataset is made up of a number of data artifacts (JSON, JSONL & CSV text files & SQLite database)	Climate news DB, Project's GitHub repository	^[388]	ADGEfficiency
Climatext	Climatext is a dataset for sentence-based climate change topic detection.		HF dataset	^[389]	University of Zurich
GreenBiz	Collection of articles and news about climate and sustainability	This data is not pre-processed	Curated list of climate articles Curated list of sustainability articles	^[390]
Top research pre-prints in climate and sustainability	List of pre-prints from researchers in the Reuters Hot List	This data is not pre-processed	Curated list of pre-prints	^[391]	Maurice Tamman
ARCS		This data is not pre-processed	Curated list of corporate sustainability blogs	^[392]
GreenBiz	Website with articles about climate and sustainability	This data is not pre-processed		^[393]	GreenBiz
CSRWIRE		This data is not pre-processed	Curated list of articles	^[394]	CSRWIRE
CDP	Articles about climate, water, and forests	This data is not pre-processed		^[395]	CDP

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
The Stack	A 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages.	Filtered through license detection and deduplication.	6 TB, 51.76B files (prior to deduplication); 3 TB, 5.28B files (after). 358 programming languages.	Parquet	Language modeling, autocompletion, program synthesis.	2022	^[396]^[397]	D. Kocetkov, R. Li, L. Ben Allal, L. von Werra, H. de Vries
GitHub repositories		This data is not pre-processed		Curated list of repositories from GitHub: 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 101
IBM Public GitHub repositories		This data is not pre-processed		Curated list of repositories from GitHub
Red Hat Public GitHub repositories		This data is not pre-processed		Curated list of repositories from GitHub
StackExchange Public Archive.org files		This data is not pre-processed		Curated list of files from Archive.org
Gitlab Public repositories		This data is not pre-processed		Curated list of repositories from Gitlab: 1 2
Ansible Collections public repositories		This data is not pre-processed		Curated list of repositories from GitHub.
CodeParrot GitHub Code Dataset		This data is not pre-processed		Curated list of repositories from Hugging Face: 1 2 3 4 5 6 7 8 9 10
OKD	The Community Distribution of Kubernetes that powers Red Hat OpenShift	This data is not pre-processed		List of GitHub repositories of the project
OpenShift	The developer and operations friendly Kubernetes distro			List of GitHub repositories of the project
Kubernetes		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Developer	GitHub home of the Red Hat Developer program	This data is not pre-processed		List of GitHub repositories of the project
Red Hat Workshops		This data is not pre-processed		List of GitHub repositories of the project
Kubernetes SIGs		This data is not pre-processed		List of GitHub repositories of the project
Konveyor		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Marketplace		This data is not pre-processed		List of GitHub repositories of the project
Red Hat blog		This data is not pre-processed					^[398]
Kubernetes io		This data is not pre-processed					^[399]
Docs Openshift		This data is not pre-processed					^[400]
cncf io		This data is not pre-processed					^[401]
Kubernetes presentations	List of publicly available Kubernetes presentations	This data is not pre-processed		data link
Red Hat Open Innovation Labs		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Demos		This data is not pre-processed		List of GitHub repositories of the project
Red Hat OpenShift Online		This data is not pre-processed		List of GitHub repositories of the project
Software Collections		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Insights		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Government		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Consulting		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Communities of Practice		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Partner Tech		This data is not pre-processed		List of GitHub repositories of the project
Red Hat Documentation		This data is not pre-processed		List of GitHub repositories of the project
IBM		This data is not pre-processed		List of GitHub repositories of the project
IBM Cloud		This data is not pre-processed		List of GitHub repositories of the project
Build Lab Team		This data is not pre-processed		List of GitHub repositories of the project
Terraform IBM Modules		This data is not pre-processed		List of GitHub repositories of the project
Cloud Schematics		This data is not pre-processed		List of GitHub repositories of the project
OCP Power Demos		This data is not pre-processed		List of GitHub repositories of the project
IBM App Modernization		This data is not pre-processed		List of GitHub repositories of the project
Kubernetes OperatorHub		This data is not pre-processed		List of GitHub repositories of the project
Cloud Native Computing Foundation (CNCF)		This data is not pre-processed		List of GitHub repositories of the project
Operator Framework		This data is not pre-processed		List of GitHub repositories of the project			^[402]
GitHub repositories referenced in artifacthub.io		This data is not pre-processed		List of GitHub repositories in artifacthub.io
Red Hat Communities of Practice		This data is not pre-processed		List of GitHub repositories of the project
Red Hat partner		This data is not pre-processed		List of GitHub repositories of the project
IBM Repositories		This data is not pre-processed		List of GitHub repositories for the project
Build Lab Team		This data is not pre-processed		List of GitHub repositories for the project
Operator Framework		This data is not pre-processed		List of GitHub repositories for the project
GitHub repositories		This data is not pre-processed		List of GitHub repositories for the project
Red Hat		This data is not pre-processed		List of GitHub repositories of the project
Kubernetes Patterns		This data is not pre-processed		List of GitHub repositories of the project
Kubernetes Deployment & Security Patterns		This data is not pre-processed		List of GitHub repositories of the project
Kubernetes for Full-Stack Developers		This data is not pre-processed		List of GitHub repositories of the project
Load Balancer Cloudwatch Metrics		This data is not pre-processed		GitHub repository of the project
Dynatrace		This data is not pre-processed		[5]
AIOps Challenge 2020 Data		This data is not pre-processed		GitHub repository of the project
Loghub		This data is not pre-processed		List of repositories
HTML Pages		This data is not pre-processed		List of HTML pages
OpenShift ebooks		This data is not pre-processed					^[403]
Kubernetes ebooks		This data is not pre-processed		Kubernetes Patterns, Kubernetes Deployment, Kubernetes for Full-Stack Developers
Kubernetes for Full-Stack Developers		This data is not pre-processed		Kubernetes for Full-Stack Developers
List of public and licensed GitHub repositories		This data is not pre-processed		List of repositories

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Dow Jones Index	Weekly data of stocks from the first and second quarters of 2011.	Calculated values such as percentage change and lags are included.	750	Comma separated values	Classification, regression, time series	2014	^[404]^[405]	M. Brown et al.
Statlog (Australian Credit Approval)	Credit card applications either accepted or rejected and attributes about the application.	Attribute names are removed as well as identifying information. Factors have been relabeled.	690	Comma separated values	Classification	1987	^[406]^[407]	R. Quinlan
eBay auction data	Auction data for various items on eBay.com, from auctions of varying lengths.	Contains all bids, bidder IDs, bid times, and opening prices.	~ 550	Text	Regression, classification	2012	^[408]^[409]	G. Shmueli et al.
Statlog (German Credit Data)	Binary credit classification into "good" or "bad" with many features.	Various financial features of each person are given.	1,000	Text	Classification	1994	^[410]	H. Hofmann
Bank Marketing Dataset	Data from a large marketing campaign carried out by a large bank.	Many attributes of the clients contacted are given. Whether the client subscribed is also given.	45,211	Text	Classification	2012	^[411]^[412]	S. Moro et al.
Istanbul Stock Exchange Dataset	Several stock indexes tracked for almost two years.	None.	536	Text	Classification, regression	2013	^[413]^[414]	O. Akbilgic
Default of Credit Card Clients	Credit default data for Taiwanese credit card clients.	Various features about each account are given.	30,000	Text	Classification	2016	^[415]^[416]	I. Yeh
StockNet	Stock movement prediction from tweets and historical stock prices	None		Text	NLP	2018	^[417]	Yumo Xu and Shay B. Cohen

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Cloud DataSet	Data about 1024 different clouds.	Image features extracted.	1024	Text	Classification, clustering	1989	^[418]	P. Collard
El Nino Dataset	Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.	12 weather attributes are measured at each buoy.	178080	Text	Regression	1999	^[419]	Pacific Marine Environmental Laboratory
Greenhouse Gas Observing Network Dataset	Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather.	None.	2921	Text	Regression	2015	^[420]	D. Lucas
Atmospheric CO₂ from Continuous Air Samples at Mauna Loa Observatory	Continuous air samples in Hawaii, USA. 44 years of records.	None.	44 years	Text	Regression	2001	^[421]	Mauna Loa Observatory
Ionosphere Dataset	Radar data from the ionosphere. Task is to classify into good and bad radar returns.	Many radar features given.	351	Text	Classification	1989	^[279]^[422]	Johns Hopkins University
Ozone Level Detection Dataset	Two ground ozone level datasets.	Many features given, including weather conditions at time of measurement.	2536	Text	Classification	2008	^[423]^[424]	K. Zhang et al.