Data Collection

Jason Drummond
5 min read · Apr 13, 2021

We generate huge amounts of data on a daily basis simply by using our phones, watching a YouTube video, or paying for a meal. The companies providing these services collect that data internally and use it to make data-driven decisions. In general, the public will never get to see that data unless the company decides to make it freely available. One of the first problems you will face as a data scientist is where to find the data for your project. Here we will go over some resources for finding great datasets, and how to create your own for your next project!

Open Data Sets

This is by far the easiest way to find a dataset in terms of preparation. A variety of websites exist simply to serve as repositories of datasets. Often you will find that the datasets there have either been used many times over or are brand new, challenging the data science community to build a model for them. We will go over a few of the most notable sites here.

UCI Machine Learning Repository

The UCI Machine Learning Repository was created in 1987 by David Aha and a number of fellow graduate students at UC Irvine. It has been widely used by students and educators to cover the basics of machine learning algorithms. Most of its datasets are already cleaned, so minimal effort is needed for that step, making it a gentle introduction to what a data science project entails. Here is a link to the repository so you can check out its many datasets, including some of the most popular, such as the iris and wine datasets, which you may well have heard of already.

https://archive.ics.uci.edu/ml/datasets.php
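As a quick illustration, the classic iris dataset from the UCI repository is so widely used that it ships with scikit-learn, so you can load it without downloading anything (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

# Load the classic UCI iris dataset bundled with scikit-learn
iris = load_iris()

print(iris.data.shape)    # 150 samples, 4 features each
print(iris.target_names)  # the three iris species
```

Because the data arrives as clean NumPy arrays, you can jump straight to exploration and modeling, which is exactly why these repository datasets make a gentle first project.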

Kaggle

Kaggle got its start by offering machine learning competitions in which people from across the world are tasked with building their best model from a given dataset or prompt, with the winners generally earning cash prizes. It has since grown into a space where users can find real-world datasets to practice their data cleaning, EDA, and model-building skills. One of the many benefits of the Kaggle website is that you can view other users' notebooks to see how they analyzed a dataset, how they built their models, and what the general flow of a project looks like. They also have mini courses if you are just starting out on your journey. All in all, it is one of the better places to find a dataset, as it is continually updated by the community.

Public Records

Public records are another great source of data. They can be collected and shared by international organizations like the World Bank, the UN, or the WTO; by national statistical offices, which use census and survey data; or by government agencies, which make information about, for example, the weather, environment, or population publicly available. Here in the US we have the site data.gov, which has health, education, and commerce data available for free download.

APIs

First off, you may be wondering just what an API is: API stands for Application Programming Interface. It is an easy way of requesting data from a third party over the internet. Many companies have public APIs that let anyone access their data, or at least the portion of it they choose to expose. There are many great APIs out there that will let you do some amazing things; some notable ones include Twitter, Spotify, Yelp, and Google Maps. We'll go over one way that you can use an API to gather data here.
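At its core, calling a web API usually means building a URL with query parameters and sending a GET request. The sketch below builds such a URL for a hypothetical endpoint (the base URL, parameter names, and auth scheme are made up for illustration; every real API documents its own):

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration only -- a real API's docs
# will give you its actual base URL, paths, parameters, and auth scheme.
BASE_URL = "https://api.example.com/v1/search"

def build_request_url(query, limit=10):
    """Build the URL for a GET request to a (hypothetical) JSON API."""
    params = urlencode({"q": query, "limit": limit})
    return f"{BASE_URL}?{params}"

url = build_request_url("falcon and the winter soldier", limit=5)
print(url)
# A real call would then look like:
#   requests.get(url, headers={"Authorization": "Bearer <token>"}).json()
```

Most APIs return JSON, which Python parses straight into dictionaries and lists, making the response easy to turn into a dataset.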

Twitter API

Twitter is one of my favorite APIs, as it is easy to work with and can be used for a host of different types of projects. Suppose we wanted to track Tweets about the new Marvel show The Falcon and the Winter Soldier. We can find a hashtag that people use when discussing the show and use the Twitter API to request all Tweets with that hashtag. We could then perform sentiment analysis on the text of each Tweet and get an idea of how people are enjoying the show. We could also simply track how often the hashtag appears each week to see whether its core audience is staying engaged or it is losing viewership. There is a multitude of different projects you can do; an API can even be used to supplement a dataset you already have.
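The weekly-tracking idea above boils down to grouping matching Tweets by week. Here is a minimal offline sketch: the sample records below stand in for what the API would return (a real project would fetch them through a client library such as Tweepy, and the hashtag and dates are invented for illustration):

```python
from collections import Counter
from datetime import date

# Sample records standing in for Tweets returned by the Twitter API
tweets = [
    {"text": "Loving #FalconAndWinterSoldier so far!", "created": date(2021, 4, 2)},
    {"text": "That ending! #FalconAndWinterSoldier", "created": date(2021, 4, 9)},
    {"text": "Weekly watch party #FalconAndWinterSoldier", "created": date(2021, 4, 9)},
]

def hashtag_counts_by_week(tweets, hashtag):
    """Count how often a hashtag appears in each ISO calendar week."""
    counts = Counter()
    for tweet in tweets:
        if hashtag.lower() in tweet["text"].lower():
            week = tweet["created"].isocalendar()[1]
            counts[week] += 1
    return dict(counts)

print(hashtag_counts_by_week(tweets, "#FalconAndWinterSoldier"))
```

Swapping the sample list for live API results, and the counting step for a sentiment scorer, gives you the two project ideas described above.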

Web Scraping

Web scraping is a bit more of an involved process but can also be the best route if you want a highly customized dataset. Often you will find a website that has the perfect data for your project, but there is no API to retrieve it and no publicly available dataset. This is where web scraping comes in: with packages like BeautifulSoup, you can use Python to get the data directly from the website it is hosted on.
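To show the idea, here is a minimal BeautifulSoup sketch that pulls rows out of an HTML table. The page is a small in-memory string for illustration; in practice the HTML would come from something like `requests.get(url).text`:

```python
from bs4 import BeautifulSoup

# A tiny in-memory page standing in for a real site's HTML
html = """
<table id="scores">
  <tr><td>Episode 1</td><td>8.1</td></tr>
  <tr><td>Episode 2</td><td>7.9</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table", id="scores").find_all("tr"):
    cells = [td.get_text() for td in tr.find_all("td")]
    rows.append((cells[0], float(cells[1])))

print(rows)  # [('Episode 1', 8.1), ('Episode 2', 7.9)]
```

The pattern is always the same: fetch the page, locate the tags holding your data, and convert their text into typed values you can load into a DataFrame.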

It is worth noting here that you need to be careful when web scraping. It is possible to get your IP address blocked by a website, since a scraper can make a large number of requests per second, most likely flagging you as some type of bot. Always use caution: adding sleep timers between requests is one of the best ways to ensure that you don't get blocked, and you should also read the documentation before playing around with this library!
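The sleep-timer advice can be wrapped into a small helper. This sketch takes the fetch function as a parameter (here a dummy lambda so it runs offline; in a real scraper you would pass something like `lambda url: requests.get(url).text`):

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between requests so the
    target site is not flooded (and your IP is less likely to be blocked)."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Dummy fetch so the example runs without a network connection
pages = fetch_politely(
    ["page1", "page2"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(pages)
```

One second between requests is a conservative starting point; a site's terms of service or robots.txt may spell out what it actually allows.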

Conclusion

Finding a dataset is one of the first problems you will face at the onset of your project. These resources should be of great value in your quest for that perfect dataset. I especially recommend getting used to working with APIs, as they are a good middle ground between an open dataset that's ready to use right away and one you would have to scrape, and most likely clean, before it is ready.
