Best Image Datasets for Machine Learning and Data Science (ML002)
You can check out the previous post - Why we need to know about Machine Learning? (ML001)
This is the second post in my machine learning series (you can check out the first part of this series). In this post I will discuss machine learning datasets in particular. Some people start with algorithms, but I think one should start with the data itself, which is the most important aspect. We will look at some of the best available machine learning image datasets and go through different use cases and their datasets. In future posts we will also go in depth on how to visualize and understand the data in order to choose the right algorithms.
The datasets listed below are some of the best image datasets available for machine learning research and development. With high-quality datasets we can achieve a lot in this field, but finding them is often a challenge, and they are expensive to build because of the time needed to label the data. Even datasets for unsupervised learning, where labelling is not required, are hard to collect.
Where to find Machine learning datasets?
Good datasets are available nowadays on many websites, including university websites, and they serve as a great starting point for machine learning projects.
Some of the best resources are listed below -
Kaggle - Kaggle is a machine learning website that hosts ML competitions and provides quality datasets. It is a great community for getting started: visit it, start using it and interact with the community to learn a lot. I will write a detailed post on how to get started with Kaggle.
UCI ML Repository - This repository is maintained by the University of California, Irvine, and is another rich resource for machine learning. It has some state-of-the-art datasets which one can easily download and get started on.
VisualData - It has a great collection of datasets and also contains datasets from recent conferences such as CVPR2020 and ECCV2020. Do check it out.
CMU Libraries - It is a repository by Carnegie Mellon University and contains some of the best datasets in the field of AI and Machine learning.
Google Dataset Search - It is just like a search engine, but it's only for datasets.
Now that we have seen some of the online resources where you can find datasets to get started with, let us look at the best datasets by use case.
Machine learning datasets (by use case) -
Image Datasets -
Facial Recognition Datasets
Action Recognition Datasets
Object detection and recognition
Handwriting and character recognition
Aerial Images
Image Datasets -
Facial Recognition Datasets -
FERET (Face Recognition Technology) - 11338 images of 1199 individuals in different positions and at different times.
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities.
SCFace - Color images of faces at various angles.
Yale Face Database - Faces of 15 individuals in 11 different expressions.
Cohn-Kanade AU-Coded Expression Database - Large database of images with labels for expressions.
JAFFE Facial Expression Database - 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.
FaceScrub - Images of public figures scrubbed from image searching.
BioID Face Database - Images of faces with eye positions marked.
Skin Segmentation Dataset - Randomly sampled color values from face images.
Bosphorus - 3D Face image database.
UOY 3D-Face - neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised.
CASIA 3D Face Database - Expressions: anger, smile, laugh, surprise, closed eyes.
CASIA NIR - Expressions: anger, disgust, fear, happiness, sadness, surprise.
BU-3DFE - neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted.
Face Recognition Grand Challenge Dataset - Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data.
GavabDB - Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images.
3D-RMA - Up to 100 subjects, expressions mostly neutral. Several poses as well.
SoF - Images of 112 persons (66 males and 46 females) wearing glasses under different illumination conditions.
IMDB-WIKI - IMDB and Wikipedia face images with gender and age labels.
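Several of the datasets above encode their labels in the file names rather than in a separate annotation file. As a rough sketch, here is how the RAVDESS naming scheme could be decoded, assuming the seven-part numeric identifier described in the dataset's documentation (modality, vocal channel, emotion, intensity, statement, repetition, actor). The function name and the example file name are only for illustration, so check the codes against the official description before relying on them.

```python
# Emotion and intensity codes as described in the RAVDESS documentation
# (assumption: verify against the official dataset page before use).
EMOTIONS = {
    1: "neutral", 2: "calm", 3: "happy", 4: "sad",
    5: "angry", 6: "fearful", 7: "disgust", 8: "surprised",
}
INTENSITIES = {1: "normal", 2: "strong"}


def parse_ravdess_name(filename):
    """Decode emotion, intensity and actor from a RAVDESS-style file name."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 dash-separated fields, got {len(parts)}")
    # Field order: modality, vocal channel, emotion, intensity,
    # statement, repetition, actor.
    _, _, emotion, intensity, _, _, actor = (int(p) for p in parts)
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
        "actor": actor,
    }


# Illustrative file name that follows the scheme above.
info = parse_ravdess_name("03-01-06-01-02-01-12.wav")
print(info)
```

Other datasets use their own conventions, so the same idea applies but the field order and codes will differ.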
Action Recognition Datasets -
TV Human Interaction Dataset - Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none.
Berkeley Multimodal Human Action Database (MHAD) - Recordings of a single person performing 12 actions.
THUMOS Dataset - Large video dataset for action classification.
MEXAction2 - Video dataset for action localization and spotting.
Object detection and recognition -
Visual Genome - Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.
Berkeley 3-D Object Dataset - 849 images taken in 75 different scenes. About 50 different object classes are labeled.
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) - 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300.
Microsoft Common Objects in Context (COCO) - Complex everyday scenes of common objects in their natural context.
SUN Database - Very large scene and object recognition database.
ImageNet - Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge.
Open Images - A large set of images listed as having CC BY 2.0 license with image-level labels and bounding boxes spanning thousands of classes.
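Some of the datasets above, such as BSDS500, already come separated into disjoint train, validation and test subsets. When a dataset does not ship with such a split, we can make one ourselves. The snippet below is a minimal sketch of one way to do it with only the standard library. The function name, the fractions and the file names are made up for illustration and are not taken from any of the datasets listed here.

```python
import random


def split_disjoint(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle items and cut them into disjoint train/val/test lists."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val and test fractions must sum to less than 1")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical file names; a real dataset has its own naming scheme.
image_names = [f"img_{i:03d}.jpg" for i in range(500)]
train, val, test = split_disjoint(image_names)
print(len(train), len(val), len(test))
```

If a dataset publishes an official split, it is usually better to use that one so that results stay comparable with other work.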
Handwriting and character recognition -
Artificial Characters Dataset - Artificially generated data describing the structure of 10 capital English letters.
Letter Dataset - Upper case printed letters.
CASIA-HWDB - Offline handwritten Chinese character database. 3755 classes in the GB 2312 character set.
Character Trajectories Dataset - Labeled samples of pen tip trajectories for people writing simple characters.
Chars74K Dataset - Character recognition in natural images of symbols used in both English and Kannada.
UJI Pen Characters Dataset - Isolated handwritten characters.
Gisette Dataset - Handwriting samples from the often-confused 4 and 9 characters.
Omniglot dataset - 1623 different handwritten characters from 50 different alphabets.
MNIST database - Database of handwritten digits.
Optical Recognition of Handwritten Digits Dataset - Normalized bitmaps of handwritten data.
Pen-Based Recognition of Handwritten Digits Dataset - Handwritten digits on electronic pen-tablet.
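The original MNIST files are distributed in a simple binary layout known as IDX: a 4-byte magic number (two zero bytes, a data type code, and the number of dimensions), then one big-endian 32-bit size per dimension, then the raw values. The sketch below parses that layout for unsigned-byte data, under the assumption that the file follows this description. To stay self-contained it builds a tiny synthetic buffer instead of downloading anything, and the function name is only illustrative.

```python
import struct


def parse_idx(buffer):
    """Parse an unsigned-byte IDX buffer into (dims, flat bytes)."""
    # Magic number: 2 zero bytes, 1 type code byte, 1 dimension-count byte.
    zero, dtype_code, ndim = struct.unpack_from(">HBB", buffer, 0)
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # One big-endian 32-bit size per dimension follows the magic number.
    dims = struct.unpack_from(f">{ndim}I", buffer, 4)
    offset = 4 + 4 * ndim
    count = 1
    for size in dims:
        count *= size
    data = buffer[offset:offset + count]
    if len(data) != count:
        raise ValueError("buffer is shorter than its header claims")
    return dims, data


# Synthetic stand-in for an image file: 2 "images" of 2 rows x 3 columns.
header = struct.pack(">HBB3I", 0, 0x08, 3, 2, 2, 3)
pixels = bytes(range(12))
dims, data = parse_idx(header + pixels)
print(dims, list(data))
```

Many mirrors serve these files gzip-compressed, in which case they need to be decompressed first (for example with Python's gzip module) before the bytes are passed to a parser like this one.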
Aerial Images -
Aerial Image Segmentation Dataset - 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0 meters.
KIT AIS Data Set - Multiple labeled training and evaluation datasets of aerial images of crowds.
Wilt Dataset - Remote sensing data of diseased trees and other land cover.
MASATI dataset - Maritime scenes of optical aerial images from the visible spectrum. It contains color images in dynamic marine environments; each image may contain one or multiple targets in different weather and illumination conditions.
Forest Type Mapping Dataset - Satellite imagery of forests in Japan.
Overhead Imagery Research Data Set - Annotated overhead imagery. Images with multiple objects.
SpaceNet - SpaceNet is a corpus of commercial satellite imagery and labeled training data.
So, these are some of the datasets which you can try your hand at. We will keep updating our repository with more such image datasets. We will also cover datasets from other domains. Thank you for reading this blog. If you find it helpful, do like, comment and share this post. If you have any questions, do mail me at firstname.lastname@example.org.