Best Image Datasets for Machine Learning and Data Science (ML002)

Updated: Nov 30, 2020

You can check out the previous post - Why we need to know about Machine Learning? (ML001)

This is the second post in my Machine learning series (you can check out the first part of this series). In this post, I will be discussing Machine learning datasets in particular. Some people start with algorithms, but I think one should start with the data itself, which is the most important aspect. We will look at some of the best Machine learning image datasets available and go through different use cases and their datasets. In the future, we will also go in-depth on how to visualize and understand the data well in order to choose the right algorithms.

Best machine learning and data science datasets

The datasets listed below are some of the best image datasets available for Machine learning research and development. With high-quality datasets, we can achieve a lot in this field, but finding them is often a challenge, and they are expensive to produce because of the time needed to label a dataset. Even datasets for unsupervised learning, where labeling is not required, are very hard to collect.


Where to find Machine learning datasets?


Good datasets are available nowadays on many websites, including university websites, and they serve as a great starting point for machine learning projects.


So, we will list out some of the best resources below -

  • Kaggle - Kaggle is a machine learning website that hosts different ML competitions and provides quality datasets. It is a great community to get started with: use Kaggle and interact with the community, and you will learn a lot. I will write a detailed post on how to get started with Kaggle.

  • UCI ML Repository - The University of California, Irvine maintains this repository, another rich resource for Machine learning. It has widely used datasets that one can easily download and get started with.

  • VisualData - It has a great collection of datasets, including datasets from recent conferences such as CVPR 2020 and ECCV 2020. Do check it out.

  • CMU Libraries - It is a repository by Carnegie Mellon University and contains some of the best datasets in the field of AI and Machine learning.

  • Google Dataset Search - It works just like a search engine, but only for datasets.
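
Once you have downloaded an image dataset from one of these sources, a quick first check is how many images each class contains. Here is a minimal sketch using only the Python standard library. It assumes a hypothetical folder layout of root/<class_name>/<image files>, which many image datasets follow but not all; the function name and the list of file extensions are my own choices for illustration.

```python
from pathlib import Path

# Assumed set of image file extensions; extend it to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff"}


def count_images_per_class(root):
    """Count image files under each immediate subfolder of `root`.

    Assumes a hypothetical layout root/<class_name>/<image files>.
    Files in nested folders below a class folder are counted too.
    """
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip loose files such as a README at the top level
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling count_images_per_class("path/to/dataset") returns a dictionary that maps each class folder name to its image count, which makes heavily imbalanced classes easy to spot before training.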

Now that we have seen some of the online resources where you can find datasets to get started with, let us look at the best datasets by use case.


Machine learning datasets (by use case) -


Image Datasets -

  • Facial Recognition Datasets

  • Action Recognition Datasets

  • Object detection and recognition

  • Handwriting and character recognition

  • Aerial Images


Facial Recognition Datasets -

  • FERET (Facial Recognition Technology) - 11,338 images of 1,199 individuals in different positions and at different times.

  • Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities.

  • SCFace - Color images of faces at various angles.

  • Yale Face Database - Faces of 15 individuals in 11 different expressions.

  • Cohn-Kanade AU-Coded Expression Database - Large database of images with labels for expressions.

  • JAFFE Facial Expression Database - 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.

  • FaceScrub - Images of public figures collected from image search results.

  • BioID Face Database - Images of faces with eye positions marked.

  • Skin Segmentation Dataset - Randomly sampled color values from face images.

  • Bosphorus - 3D Face image database.

  • UOY 3D-Face - neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised.

  • CASIA 3D Face Database - Expressions: anger, smile, laugh, surprise, closed eyes.

  • CASIA NIR - Expressions: anger, disgust, fear, happiness, sadness, surprise.

  • BU-3DFE - neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted.

  • Face Recognition Grand Challenge Dataset - Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data.

  • Gavabdb - Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images.

  • 3D-RMA - Up to 100 subjects, expressions mostly neutral. Several poses as well.

  • SoF - Images of 112 persons (66 males and 46 females) wearing glasses under different illumination conditions.

  • IMDB-WIKI - IMDB and Wikipedia face images with gender and age labels.


Action Recognition Datasets -

Object detection and recognition -

  • Visual Genome - Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.

  • Berkeley 3-D Object Dataset - 849 images taken in 75 different scenes. About 50 different object classes are labeled.

  • Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) - 500 natural images, explicitly separated into disjoint train, validation, and test subsets, plus benchmarking code. Based on BSDS300.

  • Microsoft Common Objects in Context (COCO) - Complex everyday scenes of common objects in their natural context.

  • SUN Database - Very large scene and object recognition database.

  • ImageNet - Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge.

  • Open Images - A large set of images listed as having a CC BY 2.0 license, with image-level labels and bounding boxes spanning thousands of classes.
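
Some of the datasets above, such as BSDS500, already come with disjoint train, validation, and test subsets. When a dataset ships without a predefined split, you can create one yourself. Below is a minimal sketch using only the Python standard library. The function name, the default fractions, and the seed are placeholders of my own and do not reproduce the official split of any dataset listed here.

```python
import random


def split_dataset(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `items` and cut them into disjoint train/val/test lists.

    The fractions are illustrative defaults, not an official split.
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    # Consecutive, non-overlapping slices guarantee the subsets are disjoint.
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

For example, split_dataset(image_paths, seed=42) returns three lists of file paths that never share an element, so no test image leaks into training. If a dataset provides its own split, prefer that one so your results stay comparable with published numbers.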

Handwriting and character recognition -

Aerial Images -

  • Aerial Image Segmentation Dataset - 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0.

  • KIT AIS Data Set - Multiple labeled training and evaluation datasets of aerial images of crowds.

  • Wilt Dataset - Remote sensing data of diseased trees and other land covers.

  • MASATI dataset - Maritime scenes of optical aerial images from the visible spectrum. It contains color images in dynamic marine environments; each image may contain one or multiple targets in different weather and illumination conditions.

  • Forest Type Mapping Dataset - Satellite imagery of forests in Japan.

  • Overhead Imagery Research Data Set - Annotated overhead imagery. Images with multiple objects.

  • SpaceNet - SpaceNet is a corpus of commercial satellite imagery and labeled training data.

Hope you liked my article. If you have any questions and doubts related to this topic or any topic in AI and machine learning, do let me know in the comment section, and I will be more than happy to help you out. Do hit like on this article and share it among your friends who are in AI. Follow us on Instagram and Twitter. Let's democratize AI.

 


© Subham Tewari