Datasets

CS 301 is a data-oriented course. On this page, we'll maintain a collection of free, online datasets (or collections of datasets) you can use. We'll use these for examples in class as well as projects. You may also find datasets here that you want to explore to satisfy your own curiosity. If you discover interesting new datasets, please email them to us, and we'll post them here if they seem to be of general interest.

Google Dataset Search: This search tool is in Beta, but it provides high-quality results integrated with Google Scholar, so you can see how many papers cite a given dataset.

FiveThirtyEight: This news outlet, started by Nate Silver, focuses on data-driven stories, often about sports and politics. Many of the datasets underlying their stories are available on GitHub.

OpenPayments: This dataset makes financial relationships between physicians and medical companies more transparent. You can extract the total dollar value of payments made to individual physicians.

IMDb: Data about movies (cast, title, length, etc.) is available here.

OpenStreetMap: You can download data about the coordinates and types of streets and paths.

Sloan Digital Sky Survey: Like OpenStreetMap, but for the known universe.

US Census Data: You can find various stats about populations and demographics here.

Yelp: Yelp has an API for fetching recent data about businesses and reviews. They have also released a sample of data from 10 metropolitan areas, covering 188K businesses.

Project Gutenberg: This is a great source of online books, which may be useful for various kinds of textual analysis.

r/datasets: This subreddit is a forum for requesting and sharing sources of interesting data.

Beazley on City Data: David Beazley gives a fast-paced Python tutorial on how to analyze a variety of city datasets (e.g., for bus routes, potholes, and restaurant inspections). The datasets, slides, and videos are available online.

Madison Open Data: City of Madison data about parking, bike traffic, and much more.

Amazon Product Data: This dataset contains 143M reviews of products on Amazon, spanning a variety of categories and 18 years.

openFDA: The FDA has posted many datasets here on food recalls, adverse reactions to drugs, and approved medical devices.

Caren on Social Science: Neal Caren has pulled together a collection of Python examples on a variety of topics, from associating crime with the weather to identifying trends in poetry style.

Open Access Directory: This is a rich collection of datasets across a variety of fields (archaeology, astronomy, linguistics, medicine, and much more).

Registry of Research Data Repositories: Over 2000 data repositories across a variety of browsable topics.

Global Health Observatory: There are a variety of health-related datasets here, from child mortality to road safety.

Inter-University Consortium for Political and Social Research: There are many political and social datasets here. They have also created data exercises to get you started working with the data.

National Center for Education Statistics: This site hosts education data about things like dropout rates and funding.
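
Datasets like the ones above are often distributed as CSV files. As a starting point, here is a minimal Python sketch of one common first step: totaling a numeric column for each distinct value of another column (for example, total payments per physician, as mentioned for OpenPayments). The sample rows, the column names (physician_name, payment_usd), and the helper total_by_key are all made up for illustration; a real dataset will have its own column names and may need extra cleaning before the numbers can be summed.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for a downloaded CSV file. The column
# names are hypothetical and will differ in any real dataset.
SAMPLE_CSV = """physician_name,payment_usd
A. Smith,120.50
B. Jones,75.00
A. Smith,30.25
"""


def total_by_key(csv_file, key_column, amount_column):
    """Sum amount_column for each distinct value of key_column."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row[key_column]] += float(row[amount_column])
    return dict(totals)


# io.StringIO lets us treat the sample string like an open file.
totals = total_by_key(io.StringIO(SAMPLE_CSV), "physician_name", "payment_usd")
for name, total in sorted(totals.items()):
    print(f"{name}: ${total:.2f}")
```

To try this on a file you have downloaded, you could replace the io.StringIO(...) argument with a file opened via open(path, newline=""), and pass whichever column names your dataset actually uses.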