In this article, we will look at some of the data sources from which we can download datasets for free and use them in our data science projects.
Table of contents
- Kaggle
- UCI machine learning repository
- Google cloud public datasets
- Google Dataset search
- AWS Open Data Registry
- Data.world
- Data.gov
- Earthdata
Working with data is great fun! So why let only big organizations have all the fun? The best way to get the hang of data science is by working on a project. Thanks to platforms that provide data resources for free, everyone now has access to numerous datasets. Let us take a look at some of these sources.
Kaggle
Kaggle is one of the most widely used data sources and is popular among data scientists. It hosts more than 150,000 datasets on a wide range of topics, from healthcare, education, food and stocks to cartoons and music! We can find datasets of most kinds here: text, audio, image, video and numerical. Besides hosting datasets, Kaggle also offers worked-out code examples as notebooks, where we can see how various ML algorithms are implemented. Because it allows its users to publish their own datasets, the cleanliness of a dataset may vary. Check out their datasets here.
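Since the cleanliness of user-published datasets varies, it can help to count missing values before starting a project. The following is a minimal sketch using only the Python standard library. The list of tokens treated as "missing" and the sample data are assumptions made for illustration, not something Kaggle defines.

```python
import csv
import io
from collections import Counter

# Tokens treated as "missing"; this list is an assumption, adjust it per dataset.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "?"}


def missing_value_report(csv_text):
    """Count missing cells per column in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = Counter({name: 0 for name in fieldnames})
    total_rows = 0
    for row in reader:
        total_rows += 1
        for name in fieldnames:
            value = row.get(name)
            # A short row yields None for the columns it lacks.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                missing[name] += 1
    return total_rows, dict(missing)


# Tiny made-up sample standing in for a downloaded CSV file.
SAMPLE = "title,year,rating\nAlpha,2001,7.5\nBeta,,8.1\nGamma,1999,N/A\n"

row_count, report = missing_value_report(SAMPLE)
print(row_count, report)
```

For a real file, we would read its contents with open(path, newline="") and pass the text to missing_value_report.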
UCI machine learning repository
The UCI machine learning repository was one of the first data sources available on the internet; it was created in 1987. It contains more than 600 datasets and is very beginner-friendly. We can filter datasets by the type of data we need, the ML task we aim to perform, the field of study, the attribute type, and the number of attributes and instances. Most of the datasets are small and clean, and can be easily downloaded and used for machine learning projects. Check out their datasets here.
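As a rough illustration of how little code a small, clean dataset needs, here is a sketch that parses a header-less CSV file into records using the standard library. It assumes the file has no header row and that we supply the column names ourselves, which is how several classic UCI files are laid out. The URL and column names below are assumptions based on the well-known Iris dataset and should be verified on the repository site.

```python
import csv
import io
import urllib.request

# Assumed location of the classic Iris file; verify it on the repository site.
IRIS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Column names are our own labels; the file itself has no header row.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def parse_headerless_csv(text, columns):
    """Turn header-less CSV text into a list of dicts keyed by `columns`."""
    records = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # skip blank lines, which often trail these files
            continue
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        records.append(dict(zip(columns, fields)))
    return records


def fetch_text(url, timeout=30):
    """Download a small text file; this needs network access."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline demonstration with two rows in the Iris layout.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n\n"
records = parse_headerless_csv(sample, IRIS_COLUMNS)
print(len(records), records[0]["species"])
```

With network access, parse_headerless_csv(fetch_text(IRIS_URL), IRIS_COLUMNS) would load the full file, provided the assumed URL is still valid.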
Google cloud public datasets
Google hosts datasets too! More than 200 datasets are hosted on BigQuery and Cloud Storage. These can be easily accessed and downloaded. Even this small number of datasets is neatly categorized. Google cloud public datasets has data from various sources such as Bitcoin, GitHub, NASA and many more. Check out their datasets here.
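Datasets hosted on BigQuery are usually explored with SQL. The sketch below builds a simple aggregation query and, optionally, runs it with the google-cloud-bigquery client. The table name and its license column are assumptions about one of the GitHub public tables and should be checked in the BigQuery console. Running the query also requires a Google Cloud project with credentials configured.

```python
# Assumed public table and column; check the name and schema before use.
TABLE = "bigquery-public-data.github_repos.licenses"


def build_license_query(limit=10):
    """Build a SQL query that counts repositories per license."""
    limit = int(limit)  # keep the interpolated value strictly numeric
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT license, COUNT(*) AS repo_count\n"
        f"FROM `{TABLE}`\n"
        "GROUP BY license\n"
        "ORDER BY repo_count DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql):
    """Run SQL on BigQuery; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily: optional dependency

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


# Building the query works offline; run_query is only called when we choose to.
print(build_license_query(5))
```

Keeping query construction separate from execution lets us inspect the SQL first, which matters because BigQuery bills by the amount of data a query scans.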
Google Dataset search
Google Dataset search works like a search engine for datasets, similar to how Google Scholar works for academic literature. It allows us to find datasets hosted on various platforms and websites, and to filter the search results by download format, usage rights, last updated date, topic and pricing. One drawback is that, due to the huge number of datasets it covers, it may take us some time to find the right one. Check it out here.
AWS Open Data Registry
Amazon hosts a number of datasets on its AWS open data registry. As with Kaggle, users can add datasets here too. Each dataset is tagged, but the registry does not have a feature that allows us to filter the datasets. One main advantage is that the AWS open data registry provides many examples of how each dataset is used, for every dataset available on the platform. Check it out here.
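Many datasets in the registry live in Amazon S3 buckets, and some of those buckets allow anonymous reads. Under that assumption, the sketch below builds a public object URL and lists a few keys without credentials using boto3. The bucket and key names are hypothetical placeholders, and whether unsigned access works depends on how each dataset is published.

```python
from urllib.parse import quote


def public_object_url(bucket, key):
    """Build the HTTPS URL for an object in a publicly readable bucket."""
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def list_public_keys(bucket, prefix="", max_keys=10):
    """List a few object keys anonymously; needs boto3 and network access."""
    import boto3  # imported lazily: optional dependency
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests skip credential lookup, which suits public buckets.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=max_keys)
    return [item["Key"] for item in response.get("Contents", [])]


# Hypothetical bucket and key, used only to show the URL shape.
print(public_object_url("example-open-data-bucket", "csv/by year/2020.csv"))
```

The actual bucket name for a dataset is listed on its registry page, so it should be copied from there instead of guessed.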
Data.world
Data.world is yet another data source that hosts modern data. It calls itself a "collaborative data community" and is home to over 100,000 datasets in varied categories ranging from crime to social media. Check out the datasets here.
Data.gov
Data.gov is home to the US Government's open data. It was created by the government in an attempt to be more transparent, and it hosts over 300,000 datasets related to various fields such as the environment, oceans, agriculture and many more. We can access all the available datasets for free, but some require us to accept license terms and other conditions before we download them. Check it out here.
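With this many datasets, searching programmatically can be handy. The sketch below assumes that the Data.gov catalog exposes a CKAN-style package_search endpoint at the URL shown; that endpoint and the response layout are assumptions to confirm against the site's current documentation. The sample payload is made up so that the parsing step can be tried offline.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed CKAN search endpoint of the Data.gov catalog; confirm before use.
SEARCH_ENDPOINT = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a search URL for `query`, asking for at most `rows` results."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'rows': rows})}"


def extract_titles(payload):
    """Pull dataset titles out of a CKAN-style package_search JSON payload."""
    data = json.loads(payload)
    if not data.get("success"):
        raise RuntimeError("catalog reported an unsuccessful search")
    return [item["title"] for item in data["result"]["results"]]


def search_datasets(query, rows=5, timeout=30):
    """Query the catalog and return matching titles; needs network access."""
    with urllib.request.urlopen(build_search_url(query, rows), timeout=timeout) as resp:
        return extract_titles(resp.read().decode("utf-8"))


# Offline demonstration with a made-up payload in the assumed layout.
sample_payload = json.dumps(
    {"success": True, "result": {"count": 1, "results": [{"title": "Example Ocean Dataset"}]}}
)
print(build_search_url("ocean temperature"))
print(extract_titles(sample_payload))
```

Each search result would normally also carry license information, which is worth reading given that some datasets come with terms to accept.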
Earthdata
Earthdata was created by NASA as part of its Earth Science Data Systems Program, more specifically the Earth Observing System Data and Information System (EOSDIS). It hosts data related to Earth and space collected from various NASA aircraft and satellites, along with field data from ground stations. Check it out here.