Remark
Please be aware that these lecture notes are accessible online in an ‘early access’ format. They are actively being developed, and certain sections will be further enriched to provide a comprehensive understanding of the subject matter.
1.2. Best Sources for Public Datasets
Public datasets are invaluable resources for data scientists, researchers, and analysts. Here are some of the best sources for accessing high-quality public datasets:
1.2.1. Government Sources
Data.gov: The U.S. government’s open data portal lists roughly 300,000 datasets covering topics such as health, education, and climate. It serves as a central clearinghouse for open data from federal, state, and local governments. We can access the data at https://data.gov/open-gov/.
U.S. Census Bureau: Provides comprehensive demographic, economic, and geographic data about the United States population. We can access the data at https://www.census.gov/.
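National Centers for Environmental Information (NCEI): Formerly the National Climatic Data Center, NOAA’s NCEI offers extensive climate and weather datasets through its Climate Data Online service. We can access the data at https://www.ncei.noaa.gov/cdo-web/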
HealthData.gov: Focuses on health-related datasets from various U.S. government agencies. We can access the data at https://healthdata.gov/
USAspending.gov: The official open data source of federal spending information, tracking how federal money is spent across America and beyond. We can access the data at https://www.usaspending.gov/
GSA Open Data: Provides a wide variety of datasets from the General Services Administration, including information on government operations and facilities. We can access the data at https://open.gsa.gov/data/
Open Government Data Act Repository: Found at resources.data.gov, this repository contains tools, best practices, and schema standards to facilitate the adoption of open data practices across the Federal Government.
Open Government Portal (Canada): The central hub for Canadian government open data, offering a wide range of datasets and resources. We can find the data at https://search.open.canada.ca/data/
Statistics Canada Open Data: Provides various open databases, including:
Open Database of Businesses
Open Database of Infrastructure
Open Database of Greenhouses
Open Database of Buildings
Open Database of Educational Facilities
Open Database of Healthcare Facilities
Open Database of Cultural and Art Facilities
Open Database of Addresses
Open Database of Recreational and Sport Facilities
We can find the data at https://www.statcan.gc.ca/en/our-data/where/open-data
Open Science and Data Platform: Provided by Natural Resources Canada and Environment and Climate Change Canada, this platform offers environmental data and scientific publications. We can find the data at https://osdp-psdo.canada.ca/dp/en
UK Government Data Hub: The central repository for UK government open data, providing access to thousands of datasets from public sector organizations. We can access the data at https://data.gov.uk/
Office for National Statistics (ONS): The UK’s largest independent producer of official statistics, offering comprehensive demographic, economic, and social data. We can access the data at https://www.ons.gov.uk/
Met Office: Provides comprehensive climate and weather datasets for the United Kingdom, including historical climate data and weather observations. We can access the data at https://www.metoffice.gov.uk/
Natural England: Offers environmental and ecological datasets related to England’s natural resources, including biodiversity and habitat information. We can access the data at https://www.gov.uk/government/organisations/natural-england
UK Health Security Agency Data: Provides public health datasets and epidemiological data for disease monitoring and health surveillance. We can access the data at https://ukhsa-dashboard.data.gov.uk/
Ordnance Survey Open Data: Offers free geographic and mapping data for Great Britain, including boundaries, roads, and place names. We can access the data at https://osdatahub.os.uk/data/downloads/open
data.gov.au: The central source of Australian open government data, aggregating datasets from federal, state, and local governments. We can access the data at https://data.gov.au/
Australian Bureau of Statistics (ABS): Australia’s national statistical agency, providing official statistics on a wide range of economic, social, population, and environmental matters. We can access the data at https://www.abs.gov.au/
Bureau of Meteorology (BOM): Provides Australia’s weather, climate, and water information, including historical climate data and real-time observations. We can access the data at http://www.bom.gov.au/climate/data/
Geoscience Australia: Offers spatial data covering Australia’s geology, geography, and earth sciences, useful for mapping and resource analysis. We can access the data at https://www.ga.gov.au/
Australian Institute of Health and Welfare (AIHW): Provides authoritative information and statistics on Australia’s health and welfare. We can access the data at https://www.aihw.gov.au/
CSIRO Data Access Portal: The Commonwealth Scientific and Industrial Research Organisation’s portal for accessing research data across various scientific disciplines. We can access the data at https://data.csiro.au/
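Many government portals can also be searched programmatically. The sketch below assumes that the Data.gov catalogue exposes the standard CKAN `package_search` action at `https://catalog.data.gov/api/3/action/package_search`; that endpoint is an assumption not stated above and should be checked against the portal's current documentation. The function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalogue serves the standard CKAN action API here.
CKAN_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5, base_url=CKAN_SEARCH_URL):
    """Return a CKAN package_search URL for a free-text query."""
    return f"{base_url}?{urlencode({'q': query, 'rows': rows})}"


def extract_titles(payload):
    """Pull dataset titles out of a decoded package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [item["title"] for item in payload["result"]["results"]]


def search_datasets(query, rows=5):
    """Query the catalogue over the network and return matching titles."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return extract_titles(json.load(response))
```

Calling `search_datasets("air quality")` would return the titles of the first few matching datasets, provided the endpoint behaves as assumed. Keeping URL construction and response parsing in separate functions lets us check both without network access.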
1.2.2. State and Provincial Datasets
Every state in the USA, every province in Canada, every country within the UK, and every state in Australia has its own publicly available data. Many cities in these regions also have their own datasets. Here are some examples.
New York Open Data: Offers a wide range of datasets from various state agencies and local governments in New York. We can access the data at https://data.ny.gov/
California Open Data Portal: Provides access to public data from California state agencies and departments. We can access the data at https://data.ca.gov/
Chicago Data Portal: Offers datasets specific to the City of Chicago, covering topics like transportation, public safety, and city services. We can access the data at https://data.cityofchicago.org/
Ontario Data Catalogue: Offers datasets published by the Government of Ontario, covering topics such as agriculture, economy, education, environment, and health. We can access the data at https://data.ontario.ca/
British Columbia Data Catalogue: Provides access to datasets from various provincial ministries and agencies in British Columbia. We can access the data at https://catalogue.data.gov.bc.ca/
Alberta Open Data Portal: Provides access to datasets from Government of Alberta ministries and agencies, covering topics like environment, energy, transportation, and health. We can access the data at https://open.alberta.ca/
Montreal Open Data Portal: Offers datasets specific to the City of Montreal, covering urban planning, transportation, and public services. We can access the data at https://donnees.montreal.ca/
England Open Data Portal: Offers datasets specific to England, covering topics like local government services, planning, and public services. We can access the data at https://data.england.gov.uk/
Scotland Open Data Hub: Provides access to open datasets from Scottish public bodies and local authorities. We can access the data at https://data.spatialhub.scot/
Welsh Government Data Hub: Offers datasets from Welsh public sector organizations covering topics like environment, health, and social services. We can access the data at https://data.gov.wales/
NSW Government Open Data: The central portal for New South Wales government data, offering datasets on transport, environment, and health. We can access the data at https://data.nsw.gov.au/
DataVic (Victoria): Provides access to Victorian government open data, including spatial data and datasets from various departments. We can access the data at https://www.data.vic.gov.au/
Queensland Open Data: Offers datasets from the Queensland Government, covering areas such as infrastructure, community, and environment. We can access the data at https://www.data.qld.gov.au/
DataWA (Western Australia): The central repository for Western Australia’s public sector data, including geospatial and statistical datasets. We can access the data at https://data.wa.gov.au/
Melbourne Open Data: Provides datasets specific to the City of Melbourne, including real-time sensor data, pedestrian counts, and urban planning information. We can access the data at https://data.melbourne.vic.gov.au/
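Some of the portals above, including the New York and Chicago portals, are (to our knowledge, at the time of writing) built on the Socrata platform, which serves each dataset through a SODA endpoint of the form `https://<domain>/resource/<dataset-id>.json`. The following is a minimal sketch under that assumption. The dataset identifier `abcd-1234` is a placeholder, not a real dataset, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_soda_url(domain, dataset_id, limit=100, where=None):
    """Build a SODA query URL; dataset_id is the portal's identifier."""
    params = {"$limit": limit}
    if where:
        # $where takes a SoQL filter expression, e.g. "year = 2020".
        params["$where"] = where
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


def fetch_rows(domain, dataset_id, limit=100, where=None):
    """Download up to `limit` rows as a list of dictionaries."""
    url = build_soda_url(domain, dataset_id, limit, where)
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# Placeholder identifier: replace it with one copied from the portal.
example_url = build_soda_url("data.cityofchicago.org", "abcd-1234", limit=5)
```

The real identifier for a dataset is normally shown on its page in the portal, usually as part of the API endpoint listed there.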
1.2.3. Academic and Research Institutions
Universities and research organizations often share datasets:
UCI Machine Learning Repository: Hosted by the University of California, Irvine, it features 673 datasets ideal for machine learning projects. The repository serves as a valuable resource for the machine learning community, offering datasets across various domains such as healthcare, finance, and social sciences. It’s particularly useful for benchmarking machine learning algorithms and comparing model performance. The data can be accessed at http://archive.ics.uci.edu.
Harvard Dataverse: An open repository where researchers can share and explore data across various disciplines. It’s open to all researchers from any field, both inside and outside of the Harvard community. Harvard Dataverse offers features such as dataset and file-level DOIs, data citations, and tools for data analysis and visualization. The data can be accessed at https://dataverse.harvard.edu/.
Figshare: An open access data repository where researchers can preserve, share, cite, and explore research outputs such as datasets, images, and videos. It assigns a digital object identifier (DOI) for citations and offers 20 GB of free storage for researchers to comply with funder, publisher, and institutional mandates. The data can be accessed at https://figshare.com/
Open Science Framework (OSF): A free, open-source research management and collaboration tool designed to help researchers document their project’s lifecycle and archive materials. It’s built and maintained by the nonprofit Center for Open Science. OSF supports preregistration, collaboration, and project visibility management. The data can be accessed at https://osf.io/
Borealis: A bilingual, multidisciplinary, secure, Canadian research data repository, supported by academic libraries and research institutions across Canada. It supports open discovery, management, sharing, and preservation of Canadian research data. Borealis uses the open-source Dataverse software and provides features such as automatic DOI assignment and metadata harvesting. The data can be accessed at https://borealisdata.ca/
1.2.4. Technology Companies
Major tech companies host large public datasets:
Google Dataset Search: A powerful search engine indexing millions of datasets from various sources. Google Dataset Search helps researchers locate online data that is freely available for use. It features datasets across various fields such as machine learning, social sciences, government data, geosciences, biology, and more. The service was launched in September 2018 and moved out of beta in January 2020. Google Dataset Search complements Google Scholar and relies heavily on dataset providers’ use of metadata in accordance with schema.org standards. You can access Google Dataset Search at https://datasetsearch.research.google.com/
Amazon Web Services (AWS) Public Datasets: Offers large datasets accessible via AWS cloud services. AWS hosts a variety of public datasets that anyone can access for free. The AWS Open Data Team provides storage infrastructure to host and make data available, as well as tools for analysis. The Registry of Open Data on AWS (RODA) is a website designed to help researchers find datasets publicly available through AWS. As of 2018, it contained 59 datasets covering various topic areas and domains. You can explore AWS Public Datasets at https://registry.opendata.aws/
Google Public Datasets: Available through Google Cloud Platform’s BigQuery tool. The Google Cloud Public Datasets Program facilitates access to high-demand public datasets, making it easy for users to access and uncover new insights in the cloud. As of 2019, the program hosted more than 130 datasets in BigQuery and Google Cloud Storage. These datasets cover various domains, including weather and climate data from providers like NOAA, NASA, and the European Space Agency. You can explore Google Public Datasets at https://cloud.google.com/public-datasets
Microsoft Azure Open Datasets: Provides curated public datasets that are ready to use in machine learning projects. These datasets are integrated into Azure Machine Learning, making it easy to incorporate them into your workflows. Azure Open Datasets cover various domains such as weather, census data, and geographic information. You can access Azure Open Datasets at https://azure.microsoft.com/en-us/services/open-datasets/
IBM Data Asset eXchange (DAX): Offers a curated collection of open datasets for AI. DAX provides datasets in various domains, including computer vision, natural language processing, and time-series analysis. The platform also includes sample notebooks to help users get started with the datasets. You can explore IBM DAX at https://developer.ibm.com/exchanges/data/
1.2.5. Specialized Repositories
These focus on specific types of data:
Kaggle: A popular platform for data science competitions, offering a wide range of user-contributed datasets. Kaggle allows users to create, share, and explore datasets across various domains. It features a data explorer for quick browsing of file contents and supports multiple file formats, including CSV, SQLite, and archives. We can access Kaggle datasets at https://www.kaggle.com/datasets.
FiveThirtyEight: Known for data journalism, FiveThirtyEight publishes the datasets behind its articles, focusing on politics, sports, and culture. The datasets are available through a GitHub repository and an R package called ‘fivethirtyeight’. We can access FiveThirtyEight datasets in the fivethirtyeight/data repository at https://github.com/fivethirtyeight/data.
NASA Open Data: Provides datasets related to earth science and space exploration. NASA’s Open Data Portal (data.nasa.gov) serves as a clearinghouse for publicly available NASA datasets, with many dataset entries providing metadata and links to data hosted on other NASA archive sites. We can access NASA Open Data at https://data.nasa.gov.
Forecasting Data Platform: Provides datasets and resources designed for forecasting research and evaluation. We can access it at https://forecastingdata.org/.
Note
Platforms like Kaggle offer a wide variety of datasets, but many are curated and uploaded by users, so their quality, accuracy, and reliability can vary significantly. Always verify the source, methodology, and any potential biases in user-contributed data before using it for analysis or research.
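A few basic checks can be automated before any analysis begins. The sketch below is one possible starting point, not a complete validation procedure: it counts rows, exact duplicate rows, and empty cells per column in a CSV file. The function name and the tiny sample are made up for illustration.

```python
import csv
import io


def summarize_quality(csv_text):
    """Return row count, duplicate-row count, and empty cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = {name: 0 for name in fieldnames}
    seen = set()
    duplicates = 0
    rows = 0
    for row in reader:
        rows += 1
        # A row counts as a duplicate if every field matches an earlier row.
        key = tuple(row[name] for name in fieldnames)
        if key in seen:
            duplicates += 1
        seen.add(key)
        for name in fieldnames:
            # Short rows are padded with None, so treat None as empty too.
            if (row[name] or "").strip() == "":
                missing[name] += 1
    return {"rows": rows, "duplicates": duplicates, "missing": missing}


# Synthetic sample: one repeated row and one empty cell.
sample = "id,value\n1,10\n1,10\n2,\n"
report = summarize_quality(sample)
```

Checks like these only catch mechanical problems. Questions about provenance, sampling, and bias still require reading the dataset's documentation.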
1.2.6. General-Purpose Repositories
These platforms host diverse types of datasets:
Figshare: An open-access repository for researchers to share various research outputs, including datasets, figures, images, videos, and more. Figshare provides unlimited public space and 20GB of free private space. It assigns DOIs to all publicly shared content, making it easily citable. Figshare supports version control and allows users to track views, downloads, and citations of their research outputs. We can access Figshare at https://figshare.com/.
Zenodo: Developed by CERN, it’s a general-purpose open-access repository. Zenodo allows researchers to deposit research papers, data sets, research software, reports, and any other research-related digital artifacts. It provides integration with GitHub, making it easy for researchers to create DOIs for their code repositories. Zenodo supports various data formats and offers up to 50GB of free storage per dataset. We can access Zenodo at https://zenodo.org/.
data.world: Described as a “social network for data people,” it offers a collaborative platform for sharing and analyzing datasets. data.world provides tools for data discovery, preparation, and collaboration. It supports various data formats and allows users to create projects, combine datasets, and share insights. The platform also offers integration with popular data analysis tools and supports SPARQL queries. We can access data.world at https://data.world/.
1.2.7. Earth Observation and Geospatial Data
Google Earth Engine: A cloud-based platform for planetary-scale environmental data analysis. It provides access to a multi-petabyte catalog of satellite imagery and geospatial datasets, allowing users to visualize and analyze changes to the Earth’s surface. We can access Google Earth Engine at https://earthengine.google.com/
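Copernicus Open Access Hub: Provided free and open access to Sentinel-1, Sentinel-2, Sentinel-3, and Sentinel-5P satellite data from the European Space Agency (ESA) at https://scihub.copernicus.eu/. The hub was retired in 2023, and Sentinel data is now distributed through the Copernicus Data Space Ecosystem. We can access this data at https://dataspace.copernicus.eu/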
USGS Earth Explorer: Offers access to satellite imagery, aerial photography, and cartographic products from various U.S. government agencies. We can access USGS Earth Explorer at https://earthexplorer.usgs.gov/
1.2.8. Biological and Life Sciences
Gene Expression Omnibus (GEO): A public repository for high-throughput gene expression and other functional genomics data. We can access GEO at https://www.ncbi.nlm.nih.gov/geo/
European Nucleotide Archive (ENA): Provides a comprehensive record of nucleotide sequencing information, raw sequencing data, and assembly information. We can access ENA at https://www.ebi.ac.uk/ena/browser/home
1.2.9. Social Science and Humanities
ICPSR (Inter-university Consortium for Political and Social Research): One of the world’s largest archives of digital social science data. It provides access to an extensive collection of social science data, including political, sociological, economic, and historical data. We can access ICPSR at https://www.icpsr.umich.edu/
World Bank Open Data: Offers free access to global development data, including economic, social, and environmental indicators for countries around the world. We can access World Bank Open Data at https://data.worldbank.org/
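World Bank indicators can also be retrieved through the World Bank's public API rather than the website. The sketch below assumes the version 2 endpoint at `https://api.worldbank.org/v2`, which returns a two-element JSON list (paging metadata followed by the observations), and it assumes annual data so that each `date` field is a plain year. The indicator code `SP.POP.TOTL` (total population) and the function names are used for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: version 2 of the World Bank API is served from this base URL.
WORLD_BANK_API = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=100):
    """Return the URL for one indicator and one country, as JSON."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{WORLD_BANK_API}/country/{country}/indicator/{indicator}?{query}"


def parse_series(payload):
    """Map year -> value, skipping years with no reported value."""
    if len(payload) < 2 or payload[1] is None:
        return {}
    return {
        int(row["date"]): row["value"]
        for row in payload[1]
        if row["value"] is not None
    }


def fetch_series(country, indicator):
    """Download an annual series over the network."""
    url = build_indicator_url(country, indicator)
    with urlopen(url, timeout=30) as response:
        return parse_series(json.load(response))
```

For example, `fetch_series("CA", "SP.POP.TOTL")` would return Canada's total population by year if the API responds as assumed. Series with more than `per_page` observations would need additional requests for later pages.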