In this article, we will take a look at various datasets that can be used to practice different machine learning problems.
Table of contents
- General classification
- Image/ video classification
- Object detection
- Medical image analysis
- Pose detection
- Text segmentation
- Recommender systems
- Regression
- Time series analysis
- Multipurpose datasets for different NLP Tasks
- Handwriting recognition
With so many datasets available on the internet and new ones being added frequently, it is often difficult to find the right one for practicing the machine learning concepts we have learned. Let us now explore the datasets that are popularly used for different machine learning tasks.
General classification
- Iris dataset - This is a classic dataset used to practice building classification models. It contains four attributes (sepal length, sepal width, petal length and petal width) along with the species label for 3 varieties of iris flowers. Each class contains 50 records.
- Palmer penguin dataset - This dataset is the new iris dataset! It contains data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.
- Wine Classification Dataset - This dataset is an example of a multi-class classification problem. It is great for practicing classification on imbalanced data, as normal wines outnumber the poor and excellent ones, though it can also be modeled as a regression problem. It is a set of two datasets related to the red and white variants of the Portuguese "Vinho Verde" wine.
- Pima Indians Diabetes dataset - This healthcare dataset can be used for practicing binary classification. It consists of several medical predictor variables, such as the number of pregnancies the patient has had, their BMI, insulin level and age, which are used to predict whether the patient is diabetic or not.
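When practicing classification on a small, balanced dataset such as Iris, it helps to keep the class proportions the same in the training and test sets. The following is a minimal sketch of such a stratified split using only the Python standard library. It works on placeholder labels rather than real measurements, and the helper name `stratified_split`, the 80/20 ratio and the seed are illustrative assumptions, not part of the dataset.

```python
import random
from collections import defaultdict

# The three Iris species, each with 50 records as described above.
SPECIES = ["setosa", "versicolor", "virginica"]
RECORDS_PER_CLASS = 50


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class ratio in both parts.

    Hypothetical helper written for this illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels only: record i belongs to species number i // 50.
labels = [SPECIES[i // RECORDS_PER_CLASS] for i in range(len(SPECIES) * RECORDS_PER_CLASS)]
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
for species in SPECIES:
    # Number of test records per species; equal for every class.
    print(species, sum(1 for i in test_idx if labels[i] == species))
```

With real data, the same indices would be used to select rows from the feature table, so that each species contributes equally to both parts.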
Image/ video classification
- ImageNet dataset - This is regarded as the king of computer vision datasets. Its images are organized according to the WordNet hierarchy, in which each concept is described by a set of words or phrases. The main task for this dataset is image classification, and it is widely used in academia.
- YouTube-8M dataset - This dataset is published by Google and contains about 8 million labeled YouTube videos along with their annotations and IDs, which makes it one of the largest datasets available for multi-label video classification.
- Fashion MNIST dataset - This dataset is a variation of the MNIST dataset and is a go-to dataset for practicing image classification. It has the same structure as the MNIST dataset and contains 70,000 labeled fashion images of size 28x28.
Object detection
- COCO dataset - This is a high-quality yet challenging dataset used for object detection, segmentation and captioning. It features two object detection tasks: one with bounding box output and one with object segmentation output.
- CIFAR-10 dataset - This is a labeled subset of the 80 million tiny images dataset. It consists of 60,000 colour images of size 32x32 in 10 classes, with 6,000 images per class. It has 50,000 training images and 10,000 test images. Each image carries a single class label, which makes it an ideal dataset for object recognition.
- Open Images V6 dataset - This dataset contains about 9 million images. The images are annotated with image-level labels, and subsets of them also carry object bounding boxes and segmentation masks.
Medical image analysis
- Breast Histopathology Images dataset - This is a reduced version of the original dataset. It contains 277,524 patches of size 50 x 50 extracted from whole mount slide images and is imbalanced, with 198,738 negative and 78,786 positive patches. It can be used to train models that detect breast cancer.
- Brain Tumor MRI dataset - This is a combination of three datasets and contains 7,022 human brain MRI images. It is used to detect brain tumors and classify them. The images fall into 4 categories: glioma, meningioma, no tumor and pituitary.
- Chest X-ray dataset - This can be used for the detection of COVID-19. The dataset contains two types of chest X-ray images: those of patients infected with COVID-19 and those of normal chests.
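The class imbalance in the Breast Histopathology Images dataset is easy to quantify from the patch counts quoted above. The sketch below computes each class's share, the accuracy a trivial majority-class model would reach, and inverse-frequency class weights. The weighting formula `total / (n_classes * class_count)` is one common heuristic (the one scikit-learn documents for `class_weight="balanced"`) and is an assumption here, not something the dataset prescribes.

```python
# Patch counts quoted above for the Breast Histopathology Images dataset.
counts = {"negative": 198_738, "positive": 78_786}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset.
fractions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(fractions.values())

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

print(f"total patches: {total}")
for label in counts:
    print(f"{label}: share={fractions[label]:.3f}, weight={class_weights[label]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because always predicting "negative" already scores above 70% accuracy, metrics such as recall or F1 on the positive class are usually more informative for this dataset than plain accuracy.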
Pose detection
- CrowdPose dataset - This dataset was introduced in the paper 'CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark' by Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang and Cewu Lu. It is used for estimating the body poses of all the individuals in a crowded scene.
- MPII Human Pose dataset - This is another dataset for human pose estimation. It has around 25,000 images that contain over 40,000 people with their body joints annotated. The dataset covers 410 human activities.
Text segmentation
- Malach Corpus - This dataset contains interviews and their transcripts in English. It is drawn from an archive of over 115,000 hours of natural speech from more than 50,000 speakers in 32 different languages. 10% of this dataset has been manually segmented for the task of topic segmentation.
- COCO-Text dataset - This dataset contains over 63,600 images with more than 239,000 annotated text instances. The three labeled attributes for every word are: machine-printed vs. handwritten, legible vs. illegible, and English vs. non-English.
Recommender systems
- MovieLens dataset - This is a classic dataset for getting started with recommender systems, where the task is to predict which movie to recommend. It contains over 20 million movie ratings from 1995 to 2015. It also contains details on tagging activity, but no demographic information. The data is spread across 6 files.
- Goodreads-books dataset - This dataset gives detailed information on various books spread over 12 columns, some of which are title, authors, isbn, publication_date and language_code. It is ideal for building a book recommender system.
- Netflix Movies and TV Shows dataset - This dataset contains listings of all the movies and TV shows on Netflix and is updated periodically. It contains details such as cast, title, director, country, date_added, duration and many more. It is perfect for building a recommendation system that gives us exposure to a practical real-world application.
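A simple first experiment with a ratings dataset such as MovieLens is a popularity baseline: recommend the movies with the highest average rating that the user has not rated yet. The sketch below illustrates the idea with the standard library only. The sample rows are made up for illustration and are not real MovieLens ratings, the column names `userId`, `movieId` and `rating` are assumed to match the ratings file, and the helper `recommend` with its `min_ratings` threshold is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration; these are NOT real MovieLens ratings.
# The column names are assumed to match the ratings file.
SAMPLE_RATINGS = """userId,movieId,rating
1,10,4.0
1,20,5.0
2,10,3.0
2,30,4.5
3,20,4.0
3,30,5.0
3,40,2.0
"""


def recommend(ratings_csv, user_id, min_ratings=2, top_n=1):
    """Recommend the best-rated movies the user has not rated yet.

    Hypothetical helper: a popularity baseline, not a personalised model.
    """
    ratings_by_movie = defaultdict(list)
    seen = set()
    for row in csv.DictReader(io.StringIO(ratings_csv)):
        ratings_by_movie[row["movieId"]].append(float(row["rating"]))
        if row["userId"] == str(user_id):
            seen.add(row["movieId"])

    # Average rating per movie, ignoring movies with too few ratings
    # and movies the user has already rated.
    candidates = [
        (sum(values) / len(values), movie)
        for movie, values in ratings_by_movie.items()
        if len(values) >= min_ratings and movie not in seen
    ]
    candidates.sort(reverse=True)
    return [movie for _, movie in candidates[:top_n]]


print(recommend(SAMPLE_RATINGS, user_id=1))
```

The `min_ratings` threshold keeps a movie with a single enthusiastic rating from outranking widely rated ones. A baseline like this is also a useful reference point when evaluating collaborative filtering models later.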
Regression
- Bike sharing demand dataset - This is a slightly challenging dataset for beginners. It contains the hourly and daily counts of rental bikes in the Capital Bikeshare program in Washington, D.C. from 2011 to 2012, along with additional information about the weather.
- Boston house prices dataset - This is one of the classic datasets used for practicing regression problems. It contains data collected by the U.S. Census Service concerning housing in Boston, Massachusetts, and has 506 records.
- WHO life expectancy dataset - This dataset covers various factors to consider when estimating life expectancy and contains just under three thousand country-level records, one per country per year. It can be used effectively for practicing multiple linear regression, EDA and data visualization.
Time series analysis
- E-Commerce dataset - This dataset can be used for the classic time series forecasting job of predicting sales. It is a transactional dataset that contains all transactions made between 1 December 2010 and 9 December 2011 for a UK-based gift company whose main customers are wholesalers.
- Daily minimum temperatures - This dataset contains daily minimum temperatures in the city of Melbourne, Australia for a period of 10 years from 1981 to 1990. It has only 2 columns: Date and Daily minimum temperature.
- NIFTY-50 Stock Market Data - This dataset contains the price history and trading volumes of the fifty stocks in the NIFTY 50 index of the National Stock Exchange, India. The data starts from January 1, 2000 and is updated every month to include the latest information. It can be used for stock market prediction.
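Before fitting any forecasting model to a series such as the daily minimum temperatures, it is worth measuring a naive persistence baseline, which predicts each day's value as the previous day's value. The sketch below parses a two-column CSV and computes the mean absolute error of that baseline. The sample rows are made up for illustration and are not values from the dataset, and the column names `Date` and `Temp` are assumptions about the file layout.

```python
import csv
import io
from datetime import date

# Made-up sample rows for illustration; these are NOT values from the dataset.
# The column names "Date" and "Temp" are assumptions about the file layout.
SAMPLE_CSV = """Date,Temp
1981-01-01,20.0
1981-01-02,18.0
1981-01-03,19.0
1981-01-04,15.0
1981-01-05,16.0
"""


def load_series(csv_text):
    """Parse the two-column CSV into a list of (date, temperature) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(date.fromisoformat(row["Date"]), float(row["Temp"])) for row in reader]


def persistence_mae(series):
    """Mean absolute error of forecasting each day with the previous day's value."""
    values = [temp for _, temp in series]
    errors = [abs(actual - previous) for previous, actual in zip(values, values[1:])]
    return sum(errors) / len(errors)


series = load_series(SAMPLE_CSV)
print(persistence_mae(series))  # prints 2.0 for the sample rows above
```

Any forecasting model trained on the real data should beat this baseline on a held-out period; if it does not, the extra complexity is not paying off.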
Multipurpose datasets for different NLP Tasks
- Amazon reviews dataset - This has been the go-to dataset for sentiment analysis and is extensively used. It contains a few million Amazon customer reviews, which are the input texts, and their star ratings, which are the output labels.
- BBC News dataset - It is a textual dataset by BBC that contains 2225 articles, each of which is labeled as one of 5 categories: tech, business, politics, entertainment and sport. This can be used for text classification and various other NLP tasks.
- ArXiv dataset - It is a collection of scholarly papers. This can be used for text generation, knowledge graph construction, summarization and building semantic search interfaces.
Handwriting recognition
- MNIST - This is a dataset of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a classic dataset that is widely used.
- Quick, Draw! Dataset - This dataset is by Google and contains more than 50 million drawings across 345 categories. It can be used to build classifiers that identify different kinds of drawings.
- Handwriting recognition dataset - This dataset contains more than four hundred thousand handwritten names which were collected through charity projects. This is a great dataset to work with for OCR tasks.