Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
FERET (facial recognition technology) | 11338 images of 1199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | [6][7] | United States Department of Defense |
CMU Pose, Illumination, and Expression (PIE) | 41,368 color images of 68 people in 13 different poses. | Images labeled with expressions. | 41,368 | Images, text | Classification, face recognition | 2000 | [8][9] | R. Gross et al. |
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. | Files labelled with expression. Perceptual validation ratings provided by 319 raters. | 7,356 | Video, sound files | Classification, face recognition, voice recognition | 2018 | [10][11] | S.R. Livingstone and F.A. Russo |
SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | [12][13] | M. Grgic et al. |
YouTube Faces DB | Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames. | Identity of those appearing in videos and descriptors. | 3,425 videos | Video, text | Video classification, face recognition | 2011 | [14][15] | L. Wolf et al. |
300 videos in-the-Wild | 114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame. | None | 114 videos, 218,000 frames. | Video, annotation file. | Facial landmark tracking. | 2015 | [16] | Shen, Jie et al. |
Grammatical Facial Expressions Dataset | Grammatical Facial Expressions from Brazilian Sign Language. | Microsoft Kinect features extracted. | 27,965 | Text | Facial gesture recognition | 2014 | [17] | F. Freitas et al. |
CMU Face Images Dataset | Images of faces. Each person is photographed multiple times to capture different expressions. | Labels and features. | 640 | Images, Text | Face recognition | 1999 | [18][19] | T. Mitchell |
Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | [20][21] | J. Yang et al. |
Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | [22][23] | T. Kanade et al. |
FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | [24][25] | H. Ng et al. |
BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | [26][27] | BioID |
Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R, values extracted. | 245,057 | Text | Segmentation, classification | 2012 | [28][29] | R. Bhatt. |
Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | [30][31] | A Savran et al. |
UOY 3D-Face | neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | labeling. | 5250 | Images, text | Face recognition, classification | 2004 | [32][33] | University of York |
CASIA | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | [34][35] | Institute of Automation, Chinese Academy of Sciences |
CASIA | Expressions: anger, disgust, fear, happiness, sadness, surprise. | None. | 480 | Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second | Face recognition, classification | 2011 | [36] | Zhao, G. et al. |
BU-3DFE | neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | [37] | Binghamton University |
Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | [38][39] | National Institute of Standards and Technology |
Gavabdb | Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | [40][41] | King Juan Carlos University |
3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | [42][43] | Royal Military Academy (Belgium) |
SoF | Images of 112 persons (66 males and 46 females) wearing glasses under different illumination conditions. | A set of synthetic filters (blur, occlusions, noise, and posterization) with different levels of difficulty. | 42,592 (2,662 original images × 16 synthetic images) | Images, Mat file | Gender classification, face detection, face recognition, age estimation, and glasses detection | 2017 | [44][45] | Afifi, M. et al. |
IMDB-WIKI | IMDB and Wikipedia face images with gender and age labels. | None | 523,051 | Images | Gender classification, face detection, face recognition, age estimation | 2015 | [46] | R. Rothe, R. Timofte, L. V. Gool |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Human Motion DataBase (HMDB51) | 51 action categories, each containing at least 101 clips, extracted from a range of sources. | None. | 6,766 video clips | video clips | Action classification | 2011 | [47] | H. Kuehne et al. |
TV Human Interaction Dataset | Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. | None. | 6,766 video clips | video clips | Action prediction | 2013 | [48] | Patron-Perez, A. et al. |
UT Interaction | People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip. | None. | 120 video clips | video clips | Action prediction | 2009 | [49] | Ryoo, M. S. et al. |
UT Kinect | 10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting. | None. | 200 video clips with depth information at 15 frames per second | video clips with depth information | Action classification | 2012 | [50] | Xia, L. et al. |
SBU Interact | Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting. | None. | Around 300 interactions | video clips with depth information | Action classification | 2012 | [51] | Yun, K. et al. |
Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions | MoCap pre-processing | 660 action samples | 8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [52] | Ofli, F. et al. |
UCF 101 Dataset | Self described as 'a dataset of 101 human actions classes from videos in the wild.' Dataset is large with over 27 hours of video. | Actions classified and labeled. | 13,000 | Video, images, text | Classification, action detection | 2012 | [53][54] | K. Soomro et al. |
THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [55][56] | Y. Jiang et al. |
Activitynet | Large video dataset for activity recognition and detection. | Actions classified and labeled. | 10,024 | Video, images, text | Classification, action detection | 2015 | [57] | Heilbron et al. |
MSP-AVATAR | Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns. | Actions classified and labeled. | 74 sessions | Motion-captured video, audio | Classification, action detection | 2015 | [58] | Sadoughi, N. et al. |
LILiR Twotalk Corpus | Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding. | Actions classified and labeled. | 527 | Video | Action detection | 2011 | [59] | Sheerman-Chase et al. |
MEXAction2 | Video dataset for action localization and spotting | Actions classified and labeled. | 1000 | Video | Action detection | 2014 | [60] | Stoian et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Visual Genome | Images and their description | | 108,000 | images, text | Image captioning | 2016 | [61] | R. Krishna et al. |
DAVIS: Densely Annotated VIdeo Segmentation 2017 | 150 video sequences containing 10459 frames with a total of 376 objects annotated. | Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation. In each video sequence multiple instances are annotated. | 10,459 | Frames annotated | Video object segmentation | 2017 | [62] | Pont-Tuset, J. et al. |
DAVIS: Densely Annotated VIdeo Segmentation 2016 | 50 video sequences containing 3455 frames with a total of 50 objects annotated. | Dataset released with the CVPR 2016 paper. The videos contain several types of objects and humans with a high quality segmentation annotation. In each video sequence a single instance is annotated. | 3,455 | Frames annotated | Video object segmentation | 2016 | [63] | Perazzi, F. et al. |
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects | 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object. | 6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses. | 49,000 | RGB-D images, 3D object models | 6D object pose estimation, object detection | 2017 | [64] | T. Hodan et al. |
Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling. | 849 | labeled images, text | Object recognition | 2014 | [65][66] | A. Janoch et al. |
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) | 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300. | Each image segmented by five different subjects on average. | 500 | Segmented images | Contour detection and hierarchical image segmentation | 2011 | [67] | University of California, Berkeley |
Microsoft Common Objects in Context (COCO) | complex everyday scenes of common objects in their natural context. | Object highlighting, labeling, and classification into 91 object types. | 2,500,000 | Labeled images, text | Object recognition | 2015 | [68][69] | T. Lin et al. |
SUN Database | Very large scene and object recognition database. | Places and objects are labeled. Objects are segmented. | 131,067 | Images, text | Object recognition, scene recognition | 2014 | [70][71] | J. Xiao et al. |
ImageNet | Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge | Labeled objects, bounding boxes, descriptive words, SIFT features | 14,197,122 | Images, text | Object recognition, scene recognition | 2009 (2014) | [72][73][74] | J. Deng et al. |
Open Images | A large set of images listed as having CC BY 2.0 license with image-level labels and bounding boxes spanning thousands of classes. | Image-level labels, Bounding boxes | 9,178,275 | Images, text | Classification, Object recognition | 2017 | [75] | |
TV News Channel Commercial Detection Dataset | TV commercials and news broadcasts. | Audio and video features extracted from still images. | 129,685 | Text | Clustering, classification | 2015 | [76][77] | P. Guha et al. |
Statlog (Image Segmentation) Dataset | The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. | Many features calculated. | 2310 | Text | Classification | 1990 | [78] | University of Massachusetts |
Caltech 101 | Pictures of objects. | Detailed object outlines marked. | 9146 | Images | Classification, object recognition. | 2003 | [79][80] | F. Li et al. |
Caltech-256 | Large dataset of images for object classification. | Images categorized and hand-sorted. | 30,607 | Images, Text | Classification, object detection | 2007 | [81][82] | G. Griffin et al. |
SIFT10M Dataset | SIFT features of Caltech-256 dataset. | Extensive SIFT feature extraction. | 11,164,866 | Text | Classification, object detection | 2016 | [83] | X. Fu et al. |
LabelMe | Annotated pictures of scenes. | Objects outlined. | 187,240 | Images, text | Classification, object detection | 2005 | [84] | MIT Computer Science and Artificial Intelligence Laboratory |
Cityscapes Dataset | Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. | Pixel-level segmentation and labeling | 25,000 | Images, text | Classification, object detection | 2016 | [85] | Daimler AG et al. |
PASCAL VOC Dataset | Large number of images for classification tasks. | Labeling, bounding box included | 500,000 | Images, text | Classification, object detection | 2010 | [86][87] | M. Everingham et al. |
CIFAR-10 Dataset | Many small, low-resolution, images of 10 classes of objects. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [73][88] | A. Krizhevsky et al. |
CIFAR-100 Dataset | Like CIFAR-10, above, but 100 classes of objects are given. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [73][88] | A. Krizhevsky et al. |
CINIC-10 Dataset | A unified contribution of CIFAR-10 and Imagenet with 10 classes, and 3 splits. Larger than CIFAR-10. | Classes labelled, training, validation, test set splits created. | 270,000 | Images | Classification | 2018 | [89] | Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, Amos J. Storkey |
Fashion-MNIST | A MNIST-like fashion product database | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2017 | [90] | Zalando SE |
notMNIST | Some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts. | Classes labelled, training set splits created. | 500,000 | Images | Classification | 2011 | [91] | Yaroslav Bulatov |
German Traffic Sign Detection Benchmark Dataset | Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. | Signs manually labeled | 900 | Images | Classification | 2013 | [92][93] | S Houben et al. |
KITTI Vision Benchmark Dataset | Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. | Many benchmarks extracted from data. | >100 GB of data | Images, text | Classification, object detection | 2012 | [94][95] | A Geiger et al. |
Linnaeus 5 dataset | Images of 5 classes of objects. | Classes labelled, training set splits created. | 8000 | Images | Classification | 2017 | [96] | Chaladze & Kalatozishvili |
FieldSAFE | Multi-modal dataset for obstacle detection in agriculture including stereo camera, thermal camera, web camera, 360-degree camera, lidar, radar, and precise localization. | Classes labelled geographically. | >400 GB of data | Images and 3D point clouds | Classification, object detection, object localization | 2017 | [97] | M. Kragh et al. |
11K Hands | 11,076 hand images (1600 x 1200 pixels) of 190 subjects, of varying ages between 18 – 75 years old, for gender recognition and biometric identification. | None | 11,076 hand images | Images and (.mat, .txt, and .csv) label files | Gender recognition and biometric identification | 2017 | [98] | M Afifi |
CORe50 | Collection of more than 500 videos (30 fps) of 50 domestic objects belonging to 10 different categories, designed specifically for continuous/lifelong learning and object recognition. | Classes labelled, training set splits created based on a 3-way, multi-runs benchmark. | 164,866 RGB-D images | images (.png or .pkl) and (.pkl, .txt, .tsv) label files | Classification, Object recognition | 2017 | [99] | V. Lomonaco and D. Maltoni |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [100] | H. Guvenir et al. |
Letter Dataset | Upper case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [101][102] | D. Slate et al. |
Character Trajectories Dataset | Labeled samples of pen tip trajectories for people writing simple characters. | 3-dimensional pen tip velocity trajectory matrix for each sample | 2858 | Text | Handwriting recognition, classification | 2008 | [103][104] | B. Williams |
Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [105] | T. de Campos |
UJI Pen Characters Dataset | Isolated handwritten characters | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [106][107] | F. Prat et al. |
Gisette Dataset | Handwriting samples from the often-confused 4 and 9 characters. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [108] | Yann LeCun et al. |
MNIST database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1998 | [109][110] | National Institute of Standards and Technology |
Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [111] | E. Alpaydin et al. |
Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [112][113] | E. Alpaydin et al. |
Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [114] | T. Srl |
HASYv2 | Handwritten mathematical symbols | All symbols are centered and of size 32px x 32px. | 168233 | Images, text | Classification | 2017 | [115] | Martin Thoma |
Noisy Handwritten Bangla Dataset | Includes Handwritten Numeral Dataset (10 classes) and Basic Character Dataset (50 classes), each dataset has three types of noise: white gaussian, motion blur, and reduced contrast. | All images are centered and of size 32x32. | Numeral Dataset: 23330, Character Dataset: 76000 | Images, text | Handwriting recognition, classification | 2017 | [116] | M. Karki et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial Classification, object detection | 2013 | [117][118] | J. Yuan et al. |
KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~ 150 | Images with paths | People tracking, aerial tracking | 2012 | [119][120] | M. Butenuth et al. |
Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [121][122] | B. Johnson |
Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [123][124] | B. Johnson |
Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [125][126] | F. Tanner et al. |
SpaceNet | SpaceNet is a corpus of commercial satellite imagery and labeled training data. | GeoTiff and GeoJSON files containing building footprints. | >17533 | Images | Classification, Object Identification | 2017 | [127][128][129] | DigitalGlobe, Inc. |
UC Merced Land Use Dataset | These images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the US. | This is a 21 class land use image dataset meant for research purposes. There are 100 images for each class. | 2,100 | Image chips of 256x256, 30 cm (1 foot) GSD | Land cover classification | 2010 | [130] | Yi Yang and Shawn Newsam |
SAT-4 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-4 has four broad land cover classes, includes barren land, trees, grassland and a class that consists of all land cover classes other than the above three. | 500,000 | Images | Classification | 2015 | [131] | S. Basu et al. |
SAT-6 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-6 has six broad land cover classes, includes barren land, trees, grassland, roads, buildings and water bodies. | 405,000 | Images | Classification | 2015 | [131] | S. Basu et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Quantum simulations of an electron in a two dimensional potential well | Labelled images of raw input to a simulation of 2D quantum mechanics | Raw data (in HDF5 format) and output labels from quantum simulation | 1.3 million images | Labeled images | Regression | 2017 | [132] | K. Mills et al. |
MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. | 881,755 frames | Labeled video, images, text | Classification | 2012 | [133][134] | M. Rohrbach et al. |
FAMOS Dataset | 5,000 unique microstructures, all samples have been acquired 3 times with two different cameras. | Original PNG files, sorted per camera and then per acquisition. MATLAB datafiles with one 16384 times 5000 matrix per camera per acquisition. | 30,000 | Images and .mat files | Authentication | 2012 | [135] | S. Voloshynovskiy, et al. |
PharmaPack Dataset | 1,000 unique classes with 54 images per class. | Class labeling, many local descriptors, like SIFT and aKaZE, and local feature aggregators, like Fisher Vector (FV). | 54,000 | Images and .mat files | Fine-grain classification | 2017 | [136] | O. Taran and S. Rezaeifar, et al. |
Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [137][138] | A. Khosla et al. |
The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~ 7,400 | Images, text | Classification, object detection | 2012 | [138][139] | O. Parkhi et al. |
Corel Image Features Data Set | Database of images with features extracted. | Many features including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [140][141] | M. Ortega-Bindenberger et al. |
Online Video Characteristics and Transcoding Time Dataset | Transcoding times for various videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [142] | T. Deneke et al. |
Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences | 81,743 | Images, text | Visual storytelling | 2016 | [143] | Microsoft Research |
Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given | 11,788 | Images, text | Classification | 2011 | [144][145] | C. Wah et al. |
YouTube-8M | Large and diverse labeled video dataset | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities | 8 million | Video, text | Video classification | 2016 | [146][147] | S. Abu-El-Haija et al. |
YFCC100M | Large and diverse labeled image and video dataset | Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags) | 100 million | Video, Image, Text | Video and Image classification | 2016 | [148][149] | B. Thomee et al. |
Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [150] | Y. Baveye et al. |
Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [151] | Y. Baveye et al. |
MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10900 | Video | Video emotion elicitation detection | 2015 | [152] | Y. Baveye et al. |
Leeds Sports Pose | Articulated human pose annotations in 2000 natural sports images from Flickr. | Rough crop around single person of interest with 14 joint labels | 2000 | Images plus .mat file labels | Human pose estimation | 2010 | [153] | S. Johnson and M. Everingham |
Leeds Sports Pose Extended Training | Articulated human pose annotations in 10,000 natural sports images from Flickr. | 14 joint labels via crowdsourcing | 10000 | Images plus .mat file labels | Human pose estimation | 2011 | [154] | S. Johnson and M. Everingham |
MCQ Dataset | 6 different real multiple choice-based exams (735 answer sheets and 33,540 answer boxes) to evaluate computer vision techniques and systems developed for multiple choice test assessment systems. | None | 735 answer sheets and 33,540 answer boxes | Images and .mat file labels | Development of multiple choice test assessment systems | 2017 | [155][156] | Afifi, M. et al. |
Surveillance Videos | Real surveillance videos covering a long surveillance period (7 days with 24 hours each). | None | 19 surveillance videos (7 days with 24 hours each). | Videos | Data compression | 2016 | [157] | Taj-Eddin, I. A. T. F. et al. |
LILA BC | Labeled Information Library of Alexandria: Biology and Conservation. Labeled images that support machine learning research around ecology and environmental science. | None | ~10M images | Images | Classification | 2019 | [158] | LILA working group |
Can We See Photosynthesis? | 32 videos for eight live and eight dead leaves recorded under both DC and AC lighting conditions. | None | 32 videos | Videos | Liveness detection of plants | 2017 | [159] | Taj-Eddin, I. A. T. F. et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Amazon reviews | US product reviews from Amazon.com. | None. | ~ 82M | Text | Classification, sentiment analysis | 2015 | [160] | McAuley et al. |
OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [161][162] | K. Ganesan et al. |
MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~ 22M | Text | Regression, clustering, classification | 2016 | [163] | GroupLens Research |
Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~ 10M | Text | Clustering, regression | 2004 | [164][165] | Yahoo! |
Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [166][167] | M. Bohanec |
YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [168][169] | |
Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | [170] | Q. Nguyen |
Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [171][172] | W. Loh et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [173] | Dermouche, M. et al. |
The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [174] | Reuters |
The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [175] | Reuters |
Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [176] | T. Rose et al. |
Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [177] | M. Alhagri |
RE3D (Relationship and Entity Extraction Evaluation Dataset) | Entity and Relation marked data from various news and government sources. Sponsored by Dstl | Filtered, categorisation using Baleen types | not known | JSON | Classification, Entity and Relation recognition | 2017 | [178] | Dstl |
Examiner Pseudo-News Corpus | Clickbait, spam, crowd-sourced headlines from 2010 to 2015 | Publish date and headlines | 3,089,781 | CSV | Clustering, Events, Sentiment | 2017 | [179] | R. Kulkarni |
ABC Australia News Corpus | Entire news corpus of ABC Australia from 2003 to 2017 | Publish date and headlines | 1,103,664 | CSV | Clustering, Events, Sentiment | 2017 | [180] | R. Kulkarni |
Worldwide News - Aggregate of 20K Feeds | One week snapshot of all online headlines in 20+ languages | Publish time, URL and headlines | 1,398,431 | CSV | Clustering, Events, Language Detection | 2017 | [181] | R. Kulkarni |
Reuters News Wire Headline | 11 Years of timestamped events published on the news-wire | Publish time, Headline Text | 16,121,310 | CSV | NLP, Computational Linguistics, Events | 2018 | [182] | R. Kulkarni |
The Irish Times IRS | 23 Years of Events From Ireland | Publish time, Headline Text | 1,425,460 | CSV | NLP, Computational Linguistics, Events | 2018 | [183] | R. Kulkarni |
News Headlines Dataset for Sarcasm Detection | High quality dataset with Sarcastic and Non-sarcastic news headlines. | Clean, normalized text | 26,709 | JSON | NLP, Classification, Linguistics | 2018 | [184] | Rishabh Misra |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~ 500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [185][186] | Klimt, B. and Y. Yang |
Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus involving whether or not a lemmatiser or stop-list was enabled. | | Text | Classification | 2000 | [187][188] | Androutsopoulos, J. et al. |
SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5,574 | Text | Classification | 2011 | [189][190] | T. Almeida et al. |
Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [191] | T. Mitchell et al. |
Spambase Dataset | Spam emails. | Many text features extracted. | 4,601 | Text | Spam detection, classification | 1999 | [192] | M. Hopkins et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
MovieTweetings | Movie rating dataset based on public and well-structured tweets | | ~710,000 | Text | Classification, regression | 2018 | [193] | S. Dooms |
Twitter100k | Pairs of images and tweets | | 100,000 | Text and Images | Cross-media retrieval | 2017 | [194][195] | Y. Hu et al. |
Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [196][197] | A. Go et al. |
ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [198][199] | R. Zafarani et al. |
SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [200][201] | J. McAuley et al. |
Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [202][203] | N. Abdulla |
Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, Classification | 2013 | [204][205] | F. Kawala et al. |
Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled. | tokenization, part-of-speech and named entity tagging | 18,762 | Text | Regression, Classification | 2015 | [206][207] | Xu et al. |
Geoparse Twitter benchmark dataset | This dataset contains tweets during different news events in different countries. Manually labeled location mentions. | location annotations added to JSON metadata | 6,386 | Tweets, JSON | Classification, Information Extraction | 2014 | [208][209] | S.E. Middleton et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
NPS Chat Corpus | Posts from age-specific online chat rooms. | Hand privacy masked, tagged for part of speech and dialogue-act. | ~ 500,000 | XML | NLP, programming, linguistics | 2007 | [210] | Forsyth, E., Lin, J., & Martell, C. |
Twitter Triple Corpus | A-B-A triples extracted from Twitter. | | 4,232 | Text | NLP | 2016 | [211] | Sordoni, A. et al. |
UseNet Corpus | UseNet forum postings. | Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English. | 7 billion | Text | | 2011 | [212] | Shaoul, C., & Westbury, C. |
NUS SMS Corpus | SMS messages collected between two users, with timing analysis. | | ~ 10,000 | XML | NLP | 2011 | [213] | Kan, M. |
Reddit All Comments Corpus | All Reddit comments (as of 2015). | | ~ 1.7 billion | JSON | NLP, research | 2015 | [214] | Stuck_In_the_Matrix |
Ubuntu Dialogue Corpus | Dialogues extracted from Ubuntu chat stream on IRC. | | | CSV | Dialogue Systems Research | 2015 | [215] | Lowe, R. et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Web of Science Dataset | Hierarchical Datasets for Text Classification | None. | 46,985 | Text | Classification, Categorization | 2017 | [216][217] | K. Kowsari et al. |
Legal Case Reports | Federal Court of Australia cases from 2006 to 2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | [218][219] | F. Galgani et al. |
Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | [220][221] | J. Schler et al. |
Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | [222][223] | A. Traud et al. |
Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | [224][225] | M. Richardson et al. |
The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~ 1M words | Text | Natural language processing, summarization | 1995 | [226][227] | M. Marcus et al. |
DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | [228] | Reuters |
Google Books N-grams | N-grams from a very large corpus of books | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | [229][230] | |
Personae Corpus | Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | [231][232] | K. Luyckx et al. |
CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | [233][234] | P. Ciarelli et al. |
Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | [235][236] | D. Kotzias |
BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | [237][238] | K. Buza |
Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | [239] | S. Bowman et al. |
DSL Corpus Collection (DSLCC) | A multilingual collection of short excerpts of journalistic texts in similar languages and dialects. | None | 294,000 phrases | Text | Discriminating between similar languages | 2017 | [240] | Tan, Liling et al. |
Urban Dictionary Dataset | Corpus of words, votes and definitions | User names anonymised | 2,606,522 | CSV | NLP, Machine comprehension | 2016-05 | [241] | Anonymous |
T-REx | Wikipedia abstracts aligned with Wikidata entities | Alignment of Wikidata triples with Wikipedia abstracts | 11M aligned triples | JSON and NIF | NLP, Relation Extraction | 2018 | [242] | H. Elsahar et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Zero Resource Speech Challenge 2015 | Spontaneous speech (English), Read speech (Xitsonga). | raw wav | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | sound | Unsupervised discovery of speech features/subword units/word units | 2015 | [243][244] | Versteegh et al. |
Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale | 1,040 | Text | Classification, regression | 2013 | [245][246] | B. E. Sakar et al. |
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [247][248] | M. Bedda et al. |
ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [249][250] | R. Cole et al. |
Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | Applied 12-degree linear prediction analysis to it to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [251][252] | M. Kudo et al. |
Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [253][254] | A. Tsanas et al. |
TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification. | 1986 | [255][256] | J. Garofolo et al. |
Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. | 2016 | [257] | N. Halabi |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Geographical Original of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographical classification, clustering | 2014 | [258][259] | F. Zhou et al. |
Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [260][261] | T. Bertin-Mahieux et al. |
Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [262] | M. Defferrard et al. |
Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [263][264] | D. Radicioni et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound (WAV) | Classification | 2014 | [265][266] | J. Salamon et al. |
AudioSet | 10-second sound snippets from YouTube videos, and an ontology of over 500 labels. | 128-d PCA'd VGG-ish features every 1 second. | 2,084,320 | Text (CSV) and TensorFlow Record files | Classification | 2017 | [267] | J. Gemmeke et al., Google |
Bird Audio Detection challenge | Audio from environmental monitoring stations, plus crowdsourced recordings | | 17,000+ | | Classification | 2016 (2018) | [268][269] | Queen Mary University and IEEE Signal Processing Society |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [270][271] | Center for Applied Internet Data Analysis |
Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [272][273] | M. Kachuee et al. |
Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [274][275] | A. Vergara |
Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [276][277] | K. Ullrich |
UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [278][279] | D. Rambla et al. |
Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [280][281] | M. Bator |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [282][283] | Pontifical Catholic University of Rio de Janeiro |
Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [284][285] | R. Madeo et al. |
Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [286][287] | T. Theodoridis |
Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | [288][289] | B. Barshan et al. |
Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [290][291] | J. Reyes-Ortiz et al. |
Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [292][293] | M. Kadous |
Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [294][295] | W. Ugulino et al. |
sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [296][297] | C. Sapsanis et al. |
REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [297][298] | O. Banos et al. |
Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [299][300] | A. Stisen et al. |
Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [301][302] | D. Bacciu |
PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [303] | A. Reiss |
OPPORTUNITY Activity Recognition Dataset | Human Activity Recognition from wearable, object, and ambient sensors is a dataset devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [304][305] | D. Roggen et al. |
Real World Activity Recognition Dataset | Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [306] | T. Sztyler et al. |
Toronto Rehab Stroke Pose Dataset | 3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot. | None. | 10 healthy people and 9 stroke survivors (3500-6000 frames per person) | CSV | Classification | 2017 | [307][308][309] | E. Dolatabadi et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given | 178 | Text | Classification, regression | 1991 | [310][311] | M. Forina et al. |
Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None | 9568 | Text | Regression | 2014 | [312][313] | P. Tufekci et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11M | Text | Classification | 2014 | [314][315][316] | D. Whiteson |
HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | [315][316][317] | D. Whiteson |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Yacht Hydrodynamics Dataset | Yacht performance based on dimensions. | Six features are given for each yacht. | 308 | Text | Regression | 2013 | [318][319] | R. Lopez |
Robot Execution Failures Dataset | 5 data sets that center around robotic failure to execute common tasks. | Integer valued features such as torque and other sensor measurements. | 463 | Text | Classification | 1999 | [320] | L. Seabra et al. |
Pittsburgh Bridges Dataset | Design description is given in terms of several properties of various bridges. | Various bridge features are given. | 108 | Text | Classification | 1990 | [321][322] | Y. Reich et al. |
Automobile Dataset | Data about automobiles, their insurance risk, and their normalized losses. | Car features extracted. | 205 | Text | Regression | 1987 | [323][324] | J. Schlimmer et al. |
Auto MPG Dataset | MPG data for cars. | Eight features of each car given. | 398 | Text | Regression | 1993 | [325] | Carnegie Mellon University |
Energy Efficiency Dataset | Heating and cooling requirements given as a function of building parameters. | Building parameters given. | 768 | Text | Classification, regression | 2012 | [326][327] | A. Xifara et al. |
Airfoil Self-Noise Dataset | A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. | Data about frequency, angle of attack, etc., are given. | 1503 | Text | Regression | 2014 | [328] | R. Lopez |
Challenger USA Space Shuttle O-Ring Dataset | Attempt to predict O-ring problems given past Challenger data. | Several features of each flight, such as launch temperature, are given. | 23 | Text | Regression | 1993 | [329][330] | D. Draper et al. |
Statlog (Shuttle) Dataset | NASA space shuttle datasets. | Nine features given. | 58,000 | Text | Classification | 2002 | [331] | NASA |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | [332][333] | M. Burl |
MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | [333][334] | R. Bock |
Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | [335] | G. Bradshaw |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Volcanoes of the World | Volcanic eruption data for all known volcanic events on earth. | Details such as region, subregion, tectonic setting, dominant rock type are given. | 1535 | Text | Regression, classification | 2013 | [336] | E. Venzke et al. |
Seismic-bumps Dataset | Seismic activities from a coal mine. | Seismic activity was classified as hazardous or not. | 2584 | Text | Classification | 2013 | [337][338] | M. Sikora et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | [339][340] | I. Yeh |
Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | [341][342] | I. Yeh |
Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | [343] | Arris Pharmaceutical Corp. |
Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | [344] | Semeion Research Center |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism. | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. | 122 | Text | Classification | 1999 | [345] | H. Begleiter |
P300 Interface Dataset | Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Text | Classification | 2008 | [346][347] | U. Hoffman et al. |
Heart Disease Data Set | Attributes of patients with and without heart disease. | 75 attributes given for each patient with some missing values. | 303 | Text | Classification | 1988 | [348][349] | A. Janosi et al. |
Breast Cancer Wisconsin (Diagnostic) Dataset | Dataset of features of breast masses. Diagnosis by physician is given. | 10 features for each sample are given. | 569 | Text | Classification | 1995 | [350][351] | W. Wolberg et al. |
National Survey on Drug Use and Health | Large scale survey on health and drug use in the United States. | None. | 55,268 | Text | Classification, regression | 2012 | [352] | United States Department of Health and Human Services |
Lung Cancer Dataset | Lung cancer dataset without attribute definitions | 56 features are given for each case | 32 | Text | Classification | 1992 | [353][354] | Z. Hong et al. |
Arrhythmia Dataset | Data for a group of patients, of which some have cardiac arrhythmia. | 276 features for each instance. | 452 | Text | Classification | 1998 | [355][356] | H. Altay et al. |
Diabetes 130-US hospitals for years 1999–2008 Dataset | 9 years of readmission data across 130 US hospitals for patients with diabetes. | Many features of each readmission are given. | 100,000 | Text | Classification, clustering | 2014 | [357][358] | J. Clore et al. |
Diabetic Retinopathy Debrecen Dataset | Features extracted from images of eyes with and without diabetic retinopathy. | Features extracted and conditions diagnosed. | 1151 | Text | Classification | 2014 | [359][360] | B. Antal et al. |
Diabetic Retinopathy Messidor Dataset | Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR) | Features retinopathy grade and risk of macular edema | 1200 | Images, text | Classification, Segmentation | 2008 | [361][362] | Messidor Project |
Liver Disorders Dataset | Data for people with liver disorders. | Seven biological features given for each patient. | 345 | Text | Classification | 1990 | [363][364] | Bupa Medical Research Ltd. |
Thyroid Disease Dataset | 10 databases of thyroid disease patient data. | None. | 7200 | Text | Classification | 1987 | [365][366] | R. Quinlan |
Mesothelioma Dataset | Mesothelioma patient data. | Large number of features, including asbestos exposure, are given. | 324 | Text | Classification | 2016 | [367][368] | A. Tanrikulu et al. |
Parkinson's Vision-Based Pose Estimation Dataset | 2D human pose estimates of Parkinson's patients performing a variety of tasks. | Camera shake has been removed from trajectories. | 134 | Text | Classification, regression | 2017 | [369][370][371] | M. Li et al. |
KEGG Metabolic Reaction Network (Undirected) Dataset | Network of metabolic pathways. A reaction network and a relation network are given. | Detailed features for each network node and pathway are given. | 65,554 | Text | Classification, clustering, regression | 2011 | [372] | M. Naeem et al. |
Modified Human Sperm Morphology Analysis Dataset (MHSMA) | Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail. | Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created. | 1,540 | .npy files | Classification | 2019 | [373][374] | S. Javadi and S.A. Mirroshandel |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Abalone Dataset | Physical measurements of Abalone. Weather patterns and location are also given. | None. | 4177 | Text | Regression | 1995 | [375] | Marine Research Laboratories – Taroona |
Zoo Dataset | Artificial dataset covering 7 classes of animals. | Animals are classed into 7 categories and features are given for each. | 101 | Text | Classification | 1990 | [376] | R. Forsyth |
Demospongiae Dataset | Data about marine sponges. | 503 sponges in the Demosponge class are described by various features. | 503 | Text | Classification | 2010 | [377] | E. Armengol et al. |
Splice-junction Gene Sequences Dataset | Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. | None. | 3190 | Text | Classification | 1992 | [354] | G. Towell et al. |
Mice Protein Expression Dataset | Expression levels of 77 proteins measured in the cerebral cortex of mice. | None. | 1080 | Text | Classification, Clustering | 2015 | [378][379] | C. Higuera et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Forest Fires Dataset | Forest fires and their properties. | 13 features of each fire are extracted. | 517 | Text | Regression | 2008 | [380][381] | P. Cortez et al. |
Iris Dataset | Three types of iris plants are described by 4 different attributes. | None. | 150 | Text | Classification | 1936 | [382][383] | R. Fisher |
Plant Species Leaves Dataset | Sixteen samples of leaf each of one-hundred plant species. | Shape descriptor, fine-scale margin, and texture histograms are given. | 1600 | Text | Classification | 2012 | [384][385] | J. Cope et al. |
Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | [386] | J. Schlimmer |
Soybean Dataset | Database of diseased soybean plants. | 35 features for each plant are given. Plants are classified into 19 categories. | 307 | Text | Classification | 1988 | [387] | R. Michalski et al. |
Seeds Dataset | Measurements of geometrical properties of kernels belonging to three different varieties of wheat. | None. | 210 | Text | Classification, clustering | 2012 | [388][389] | Charytanowicz et al. |
Covertype Dataset | Data for predicting forest cover type strictly from cartographic variables. | Many geographical features given. | 581,012 | Text | Classification | 1998 | [390][391] | J. Blackard et al. |
Abscisic Acid Signaling Network Dataset | Data for a plant signaling network. Goal is to determine set of rules that governs the network. | None. | 300 | Text | Causal-discovery | 2008 | [392] | J. Jenkens et al. |
Folio Dataset | 20 photos of leaves for each of 32 species. | None. | 637 | Images, text | Classification, clustering | 2015 | [393][394] | T. Munisami et al. |
Oxford Flower Dataset | 17 category dataset of flowers. | Train/test splits, labeled images. | 1360 | Images, text | Classification | 2006 | [139][395] | M-E Nilsback et al. |
Plant Seedlings Dataset | 12 category dataset of plant seedlings. | Labelled images, segmented images. | 5544 | Images | Classification, detection | 2017 | [396] | Giselsson et al. |
Fruits 360 dataset | Database with images of 100 fruits. | 100x100 pixels, white background. | 69277 | Images (jpg) | Classification | 2017 | [397][398] | Mihai Oltean, Horea Muresan |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Ecoli Dataset | Protein localization sites. | Various features of the protein localization sites are given. | 336 | Text | Classification | 1996 | [399][400] | K. Nakai et al. |
MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | [401][402] | P. Mahe et al. |
Yeast Dataset | Prediction of cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | [403][404] | K. Nakai et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12707 | Text | Classification | 2016 | [405] | A. Mayr et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Numenta Anomaly Benchmark (NAB) | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | None | 50+ files | Comma separated values | Anomaly detection | 2016 (continually updated) | [406] | Numenta |
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study | Most data files are adapted from UCI Machine Learning Repository data; some are collected from the literature. | Treated for missing values, numerical attributes only, different percentages of anomalies, labels. | 1000+ files | ARFF | Anomaly detection | 2016 (possibly updated with new datasets and/or results) | | Campos et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
DBpedia Neural Question Answering (DBNQA) Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | [408][409] | Hartmann, Soru, and Marx et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values such as percentage change and lags are included. | 750 | Comma separated values | Classification, regression, time series | 2014 | [410][411] | M. Brown et al. |
Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [412][413] | R. Quinlan |
eBay auction data | Auction data for various eBay.com items over auctions of various lengths. | Contains all bids, bidderID, bid times, and opening prices. | ~ 550 | Text | Regression, classification | 2012 | [414][415] | G. Shmueli et al. |
Statlog (German Credit Data) | Binary credit classification into 'good' or 'bad' with many features. | Various financial features of each person are given. | 1,000 | Text | Classification | 1994 | [416] | H. Hofmann |
Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed is also given. | 45,211 | Text | Classification | 2012 | [417][418] | S. Moro et al. |
Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [419][420] | O. Akbilgic |
Default of Credit Card Clients | Credit default data for Taiwanese creditors. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [421][422] | I. Yeh |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [423] | P. Collard |
El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | [424] | Pacific Marine Environmental Laboratory |
Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [425] | D. Lucas |
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [426] | Mauna Loa Observatory |
Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [366][427] | Johns Hopkins University |
Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [428][429] | K. Zhang et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [430] | United States Census Bureau |
Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [431][432] | United States Census Bureau |
IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None | 256,932 | Text | Classification, regression | 1999 | [433] | IPUMS |
US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [434] | United States Census Bureau |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [435][436] | H. Fanaee-T |
New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | [437] | New York City Taxi and Limousine Commission |
Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | [438][439] | M. Ferreira et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | [440] | V. Granville |
Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [441][442] | N. Kushmerick |
Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [443] | D. Cook |
URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [444][445] | J. Ma |
Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | [446] | R. Mustafa et al. |
Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [447] | D. Chen |
Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [448][449] | Freebase |
Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [450][451] | C. Masterharm et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Poker Hand Dataset | 5-card hands from a standard 52-card deck. | Attributes of each hand are given, including the poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [452] | R. Cattral |
Connect-4 Dataset | Contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [453] | J. Tromp |
Chess (King-Rook vs. King) Dataset | Endgame Database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | [454][455] | M. Bain et al. |
Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | [456] | R. Holte |
Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [457] | D. Aha |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [458] | D. Harrison et al. |
The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [459] | Getty Center |
Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [460][461] | Chu et al. |
British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [462] | British Oceanographic Data Centre |
Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [463] | J. Schlimmer |
Entree Chicago Recommendation Dataset | Record of user interactions with the Entree Chicago recommendation system. | Each user's usage of the app is recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [464] | R. Burke |
Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [465][466] | P. van der Putten |
Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [467][468] | V. Rajkovic et al. |
University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [469] | S. Sounders et al. |
Blood Transfusion Service Center Dataset | Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [470][471] | I. Yeh |
Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [472][473] | University of Mainz |
Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [474][475] | Nomao Labs |
Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [476] | G. Wiederhold |
Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~ 30,000 | Text | Classification, clustering, regression | 2015 | [477][478] | J. Kuzilek et al. |
Mobile phone records | Telecommunications activity and interactions. | Aggregated per geographical grid cell and per 15-minute interval. | large | Text | Classification, clustering, regression | 2015 | [479] | G. Barlacchi et al. |