Ultimate List of AI Training and Learning Datasets

Created: November 4, 2025

On this page, AISOTA.com has compiled a list of commonly used datasets for AI professionals. These datasets can be used for training, testing, and evaluating machine learning and AI models. If any resources are unavailable or if you have new resource recommendations, please feel free to contact us.

AI Datasets Index

I. Comprehensive Platforms & Search Engines
II. Life Sciences & Healthcare
III. Earth Sciences & Environment
IV. Government & Public Data
V. Natural Language Processing (NLP)
VI. Computer Vision
VII. Autonomous Driving & Robotics
VIII. Audio & Music
- 8.1 Speech Recognition
- 8.2 Music & Sound Effects
IX. Social Science & Networks
X. Law & Public Affairs
- 10.1 Legal Datasets
XI. Multimodal & Reinforcement Learning
- 11.1 Multimodal Datasets
- 11.2 Reinforcement Learning Environments & Benchmarks

AI Datasets Detail List

I. Comprehensive Platforms & Search Engines

Description: The "search engines" or "aggregators" for datasets, serving as the primary starting point for data discovery.

1.1 Cross-domain Search Engines

Google Dataset Search: Indexes tens of millions of datasets from thousands of repositories across the web, making it a natural first stop for finding cross-domain data.
Data.gov: The U.S. government's open data portal, offering hundreds of thousands of national-level datasets from agriculture and climate to transportation and finance.
EU Open Data Portal: The European Union's open data platform, providing datasets on economy, society, environment, etc., that span across member states.
Data.gov.uk: The UK government's open data website, covering various sectors like business, education, health, and transport.
Data.gov.au: The Australian government's open data portal, providing data on the country's environment, economy, society, and more.
Dataverse Project: An open-source research data repository software adopted by numerous universities and research institutions worldwide for publishing and sharing academic research data.
Zenodo: A multidisciplinary open-access repository developed by CERN for storing research data, software, and other research outputs.
Figshare: An open-access repository where researchers can upload and share their research outputs, including datasets, images, videos, and code.
Datahub.io: A platform originally launched by the Open Knowledge Foundation and now maintained by Datopian, offering a collection of easy-to-use, well-packaged datasets, including reference and global development data.
World Bank Open Data: Provides free access to a vast collection of datasets on global development, including indicators on poverty, education, health, and economy.

1.2 Research & Academic Data

Kaggle Datasets: A highly active data science community featuring a vast collection of community-uploaded datasets and competition data, often accompanied by shared code from members.
UCI Machine Learning Repository: A classic repository containing a large number of small datasets, commonly used for machine learning education and algorithm benchmarking.
Papers with Code: Links the latest academic papers with their implementation code and the datasets used, making it an excellent resource for tracking cutting-edge research datasets.
Hugging Face Datasets: While famous for NLP, it also provides thousands of cross-domain, easy-to-load datasets with a unified Python interface.
VizieR: Provides catalogs of astronomical data in astrometry, photometry, and spectroscopy.
ICPSR (Inter-university Consortium for Political and Social Research): A vast archive of social science data used by universities and research institutions worldwide.
Dryad: An open, non-profit international data repository primarily serving research data in the fields of science and medicine.
KONECT (The Koblenz Network Collection): A large collection focusing on various types of network data, such as social networks and hyperlink networks.
OpenML: An open, online machine learning platform that allows users to share data, code, and experiments to promote reproducible research.
Nature Scientific Data: A journal by Nature that publishes peer-reviewed data descriptor articles and links to accessible datasets.

1.3 Cloud Service Provider Data Registries

AWS Open Data Registry: A registry of large datasets hosted on the AWS cloud, covering life sciences, satellite imagery, public transport, etc., which can be used directly with AWS services.
Google Cloud Public Datasets: Large datasets hosted on BigQuery and Google Cloud, covering climate, genomics, cryptocurrencies, etc., for easy analysis within the Google Cloud ecosystem.
Azure Open Datasets: Curated datasets from Microsoft Azure, covering census, weather, holidays, etc., pre-loaded into the Azure cloud for use in machine learning pipelines.
IBM Data Asset eXchange (DAX): An open data repository curated by IBM, designed for AI and data analysis use cases, containing selected datasets, Jupyter notebooks, and tutorials.
Oracle Cloud Infrastructure (OCI) Data Science Public Datasets: A series of curated public datasets provided by Oracle Cloud, aimed at accelerating data science projects.
Alibaba Tianchi: Alibaba Group's Tianchi platform, which not only provides a large number of public datasets but also frequently hosts data science competitions, with many datasets related to e-commerce, logistics, etc.
SberCloud Public Datasets: Public datasets provided by SberBank's cloud service in Russia, containing unique data in Russian and other fields.
NVIDIA NGC Catalog: Primarily offers optimized AI models and industry-specific SDKs, but also includes curated datasets for training these models, especially in computer vision and speech.

II. Life Sciences & Healthcare

Description: Covers genomics, proteomics, medical imaging, clinical records, and more.

2.1 Genomics & Bioinformatics

NCBI Datasets: Provides standardized bulk downloads of data from NCBI databases such as GenBank, Gene, Assembly, and BioSample.
Ensembl: Offers high-quality genome annotation data for various vertebrates and model organisms, including information on genes, transcripts, and variations.
1000 Genomes Project: A public catalog containing data on single nucleotide polymorphisms, structural variants, and haplotypes from thousands of human genomes from multiple populations worldwide.
gnomAD: The Genome Aggregation Database, which aggregates genetic variation data from hundreds of thousands of exomes and tens of thousands of genomes gathered by large-scale sequencing projects, and is widely used to look up variant allele frequencies when assessing pathogenicity.
UCSC Genome Browser: An interactive platform that provides browsing, analysis, and data download functions for a large number of vertebrate and invertebrate genomes.
DNA Data Bank of Japan (DDBJ): One of the three major international nucleotide sequence databases, alongside GenBank and ENA, collecting sequencing data from Japan and Asia.
European Nucleotide Archive (ENA): Part of the European Molecular Biology Laboratory, providing a comprehensive collection of publicly available nucleotide sequences and their associated metadata.
Gene Expression Omnibus (GEO): A public repository that stores and freely distributes data generated by microarrays, next-generation sequencing, and other high-throughput functional genomics experiments.
The Cancer Genome Atlas (TCGA): A landmark project containing multi-omics data (genomic, epigenomic, transcriptomic, proteomic) from more than 20,000 tumor and matched normal samples across 33 cancer types.
International Cancer Genome Consortium (ICGC): An international collaborative project aimed at obtaining a comprehensive description of genomic changes occurring in major cancer types worldwide.

2.2 Medical Imaging

The Cancer Imaging Archive (TCIA): A large, open archive of medical images (e.g., CT, MRI, PET) from cancer patients, often linked with clinical and genomic data.
Medical Segmentation Decathlon: A challenge aimed at promoting AI research in medical image segmentation, providing CT and MRI datasets for ten different anatomical sites.
CheXpert: A large dataset containing hundreds of thousands of chest X-rays, labeled with common pathologies found in radiology reports, used for developing automated chest disease detection algorithms.
MIMIC-CXR: A large, public chest X-ray dataset associated with radiology reports from the Beth Israel Deaconess Medical Center.
Brain Tumor Segmentation (BraTS) Challenges: Multi-year challenges providing a large number of multimodal (e.g., T1, T2, FLAIR) brain MRI scans for brain tumor segmentation tasks.
OASIS (Open Access Series of Imaging Studies): A project aimed at providing free access to brain MRI datasets for the scientific community, with a focus on neurodegenerative diseases like Alzheimer's.
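NIH ChestX-ray8 / ChestX-ray14: Over 100,000 anonymized chest X-ray images released by the NIH Clinical Center. The original ChestX-ray8 release carries image-level labels for 8 diseases, and the expanded ChestX-ray14 version extends them to 14.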
DeepLesion: A large-scale CT image dataset from the NIH, containing various types of radiologically annotated lesions.
ISIC (International Skin Imaging Collaboration) Archive: An authoritative platform for dermoscopic image analysis, containing a large number of skin lesion images and their diagnostic metadata.
LIDC-IDRI (Lung Image Database Consortium): A chest CT scan database designed to provide resources for lung cancer detection and diagnosis research, containing annotations from multiple radiologists.

2.3 Protein & Drug Discovery

Protein Data Bank (PDB): The global single archive for 3D structural data of proteins, nucleic acids, and other complex assemblies, primarily determined by X-ray crystallography, NMR, and cryo-EM.
ChEMBL: A manually curated database containing information on bioactive molecules with drug-like properties, such as binding affinity, pharmacology, and ADMET data.
STITCH: A database that integrates known and predicted chemical-protein interactions, covering millions of proteins and chemicals from multiple species.
PubChem: An open chemistry repository providing bioactive data, chemical structure information, and screening datasets for small molecules.
DrugBank: A unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive target information.
KEGG (Kyoto Encyclopedia of Genes and Genomes): A database resource that integrates genomic, chemical, and systemic functional information, known for its pathway maps and disease networks.
UniProt (Universal Protein Resource): A comprehensive, high-quality, and freely accessible database of protein sequence and functional information.
Reactome: An open-source, open-access, peer-reviewed, and continuously updated pathway knowledgebase for visualizing biological processes.
BindingDB: A public, web-accessible database of measured binding affinities, focusing on interactions between proteins considered drug targets and small, drug-like molecules.
ZINC: A free virtual compound database containing over 230 million "purchasable" molecules for virtual screening and high-throughput screening.

III. Earth Sciences & Environment

Description: Data about our planet, including climate, geography, ecology, and satellite data.

3.1 Climate & Meteorology

Copernicus Climate Data Store (CDS): Operated by the European Centre for Medium-Range Weather Forecasts as part of the Copernicus Climate Change Service (C3S), it provides global climate monitoring data, reanalysis data (like ERA5), and seasonal forecast data.
NOAA National Centers for Environmental Information (NCEI): One of the world's largest environmental data archives, managed by the U.S. National Oceanic and Atmospheric Administration, covering oceanic, atmospheric, geophysical, and solar data.
World Meteorological Organization (WMO) Data Repository: The official data repository of the WMO, providing key meteorological and hydrological data collected by its global members.
NASA Global Change Master Directory (GCMD): A comprehensive directory that helps users discover data, services, and educational resources related to Earth and global change research.
Berkeley Earth: An independent non-profit organization that provides global land temperature datasets, known for its transparent analysis of historical temperature records and open-source methodology.
Climate Data Online (CDO) - NOAA: An online platform provided by NOAA for searching and downloading historical meteorological and climate data for the U.S. and the world.
Intergovernmental Panel on Climate Change (IPCC) Data Distribution Centre: Provides climate, socio-economic, and environmental datasets and scenarios that support the IPCC assessment reports.
National Center for Atmospheric Research (NCAR) Data Archive: Offers a vast collection of datasets generated by the atmospheric and Earth system research community, including climate model outputs and observational data.
The PRISM Climate Group: Provides very high spatial resolution climate data for the United States, such as temperature and precipitation, widely used in environmental and ecological modeling.
European Climate Assessment & Dataset (ECA&D): Provides daily meteorological station data for Europe, used for climate change monitoring and impact studies.

3.2 Satellite & Remote Sensing

NASA Earthdata Search: The portal for NASA's Earth Science Data Systems, providing search and access to a vast amount of data from NASA's Earth observation satellites, aircraft, and field measurements.
Sentinel Hub: Provides convenient access, visualization, and processing services for data from the European Copernicus program's Sentinel series of satellites (optical, radar, etc.).
USGS EarthExplorer: The portal of the United States Geological Survey, providing access to a large number of satellite images, aerial photographs, and digital map products from series like Landsat, MODIS, and ASTER.
Google Earth Engine: A planetary-scale geospatial analysis platform that combines satellite imagery and geographic datasets with powerful computing capabilities for scientific analysis and visualization.
ESA's Earth Online: The European Space Agency's portal for Earth observation missions and data, providing metadata, documentation, and data access for missions like ERS, Envisat, and Sentinel.
MODIS (Moderate Resolution Imaging Spectroradiometer) Web: Provides data products from the MODIS sensor on NASA's Terra and Aqua satellites, covering vegetation, fire, ocean color, and more.
JAXA Global Rainfall Watch: Global precipitation data provided by the Japan Aerospace Exploration Agency, primarily based on its satellite observations.
Planetary Computer - Microsoft: A platform launched by Microsoft that provides large-scale environmental and Earth science data, equipped with computing power to accelerate sustainable development solutions.
Digital Earth Africa: A platform that uses Earth observation data to provide insights for the African continent, primarily based on Landsat and Sentinel data, to monitor changes in water resources, agriculture, forests, etc.
GIBS (Global Imagery Browse Services): A near real-time global satellite imagery browsing service provided by NASA, supporting multiple projections and resolutions.

3.3 Geography & GIS Data

OpenStreetMap (OSM): A global, freely editable map database and project built by volunteers worldwide, providing rich vector map data.
Natural Earth Data: Provides public domain vector and raster map data suitable for creating small-scale maps, including cultural, physical, and raster categories.
USGS National Map: The USGS provides various geographic base data for the United States, including topography, elevation, land cover, hydrology, transportation, and administrative boundaries.
OpenTopography: A portal that provides high-resolution, LiDAR-based topographic data, tools, and resources.
Global Administrative Areas (GADM): Provides spatial data for the boundaries of all countries and their administrative units (e.g., provinces, counties) worldwide.
UNEP Environmental Data Explorer: An authoritative database maintained by the United Nations Environment Programme, containing over 500 variables covering various environmental themes.
WorldPop: Provides high-resolution spatial datasets on global population distribution, dynamics, and characteristics, supporting applications in development, health, and environmental protection.
OpenLandMap: An open-source system that provides global maps for a range of land-related parameters such as soil properties, land cover, and topography.
GEBCO (General Bathymetric Chart of the Oceans): Provides global gridded elevation data covering both the seafloor (bathymetry) and land.
Copernicus Land Monitoring Service: Provides a range of geographic information products on land cover, land use, vegetation state, water bodies, and snow cover, primarily covering Europe.

IV. Government & Public Data

Description: Authoritative data released by government agencies, covering economy, society, transportation, etc.

4.1 International Economic Organizations

World Bank Open Data: Provides global development data, covering thousands of indicators on poverty, education, health, economy, environment, etc.
IMF Data: The International Monetary Fund provides macroeconomic data on international finance, debt, exchange rates, government finance, and trade.
OECD Data: The Organisation for Economic Co-operation and Development provides comparable statistics on its member countries in economic, social, environmental, and other areas.
UN Data: The United Nations Statistics Division integrates global statistical information, providing datasets on multiple themes from various UN agencies and member countries.
WTO Data: The World Trade Organization provides statistical data on international trade and tariffs.
ILOSTAT: The International Labour Organization provides data on global labor markets, employment, working conditions, etc.
WHO Data: The World Health Organization's global health data platform, covering diseases, health systems, mortality rates, etc.
FAOSTAT: The Food and Agriculture Organization of the United Nations provides global food and agriculture statistics.
UNESCO Institute for Statistics: Provides comparable data on education, science, culture, and communication worldwide.
UN Comtrade Database: Provides detailed global import and export trade data.

4.2 National Data Portals

data.gov: The U.S. government's open data portal, providing hundreds of thousands of datasets from hundreds of federal agencies.
data.gov.uk: The UK government's open data platform, covering data from central and local governments.
data.gov.au: The Australian government's open data portal, providing data from national, state, and local governments.
data.gov.in: The Indian government's open data platform, aimed at promoting the accessibility and use of government data.
Statistics Canada: Canada's national statistical agency, providing extensive economic, social, and census data.
data.gouv.fr: The French government's open data portal.
GovData (Germany): The open data portal for Germany's federal, state, and local governments.
Data.gov.cn (China Government Open Data): Official statistical data released by the central government of China and the National Bureau of Statistics.
Statistics Bureau of Japan (e-Stat): The official statistics portal of the Japanese government, integrating data from various government departments.
Mexico Open Data: The Mexican government's open data platform.
Brazil Open Data Portal: The Brazilian government's open data portal.
South Africa Open Data Portal: The South African government's open data initiative.
Singapore Government Data: The Singapore government's open data platform.

4.3 City Data

NYC Open Data: The open data portal of the New York City government, providing data from various departments such as business, transportation, and public safety.
London Datastore: The Mayor of London's open data platform, providing data on population, transport, environment, etc.
Los Angeles Open Data: The open data portal of the city of Los Angeles.
Chicago Data Portal: The open data platform of the city of Chicago.
SF OpenData: The open data portal of the city of San Francisco.
Amsterdam Open Data: The open data platform of the city of Amsterdam.
Berlin Open Data: The open data portal of Berlin, Germany.
Tokyo Statistical Yearbook (Open Data): Statistical data and open datasets from the Tokyo Metropolitan Government.
Hong Kong Data.gov.hk: The open data platform of the Hong Kong Special Administrative Region Government.
Shanghai Open Data: The public data open platform of the Shanghai Municipal Government.
Barcelona Open Data BCN: The open data portal of the city of Barcelona, Spain.
Melbourne Open Data: The open data platform of the city of Melbourne, Australia.

V. Natural Language Processing (NLP)

Description: Data used to train machines to understand and generate human language.

5.1 General Corpora & Platforms

Hugging Face Datasets: Provides a unified access interface to thousands of NLP datasets, supporting one-click loading and preprocessing, and is one of the most active communities today.
Common Crawl: A non-profit organization that continuously crawls web data, providing massive, multilingual raw text corpora, which is fundamental for training large language models like the GPT series.
Google C4 (Colossal Clean Crawled Corpus): A carefully cleaned subset of the Common Crawl English corpus, used to train well-known models like T5.
The Pile: A large-scale, diverse, open-source language modeling dataset created by EleutherAI, containing 22 high-quality subsets covering academic, professional, and web text.
Wikipedia (Dumps): Full-text data dumps regularly released by Wikipedia, serving as a vital source of high-quality, structured multilingual text.
BookCorpus: A corpus of text from free novels by unpublished authors, used to pretrain early language models such as GPT and BERT.
Project Gutenberg: Offers tens of thousands of public domain e-books whose copyrights have expired, serving as a source of high-quality literary text.
LAION Datasets: Primarily provides large-scale image-text pairs collected from the web, crucial for training multimodal models like OpenCLIP and Stable Diffusion. The releases consist of image URLs with their captions and metadata rather than the images themselves.
Reddit Data: Large-scale conversational and community text data collected from Reddit forums, used for training dialogue models and social language analysis.
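Twitter Data (via Academic API): Tweet data collected in compliance with the platform's terms, used for studying social discourse, public opinion analysis, and real-time language trends. The Academic Research API track was discontinued in 2023, so access now depends on the platform's current API tiers.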

5.2 Task-Specific Datasets

Question Answering

SQuAD: The Stanford Question Answering Dataset, extracted from Wikipedia paragraphs, where the answer is contained within the given passage.
Natural Questions: A question-answering dataset released by Google, containing real user questions posed to search engines and long/short answers from Wikipedia.
TriviaQA: A large-scale reading comprehension dataset containing question-answer pairs and evidence documents from the web.
HotpotQA: A multi-hop reading comprehension dataset that requires the model to aggregate information from multiple supporting documents to answer a question.
MS MARCO: A large-scale question-answering and reading comprehension dataset released by Microsoft, based on real user search queries.
BoolQ: A question-answering dataset containing natural yes/no questions, with questions from user queries and passages from Wikipedia.
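
The "answer is contained within the given passage" property of SQuAD-style data can be made concrete with a small sketch. The record below is hypothetical and written only for illustration (it is not taken from SQuAD), and the class and function names are assumptions of this example rather than part of any dataset's tooling. The idea shown is that an extractive answer is stored as text plus a character offset into the passage, so the two can be checked against each other.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ExtractiveExample:
    """One SQuAD-style record: the answer is a character span of the context."""
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside `context`


def answer_span(example: ExtractiveExample) -> Tuple[int, int]:
    """Return (start, end) offsets after checking the span matches the text."""
    start = example.answer_start
    end = start + len(example.answer_text)
    if example.context[start:end] != example.answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return start, end


# Hypothetical record for illustration only.
CONTEXT = "The Pile was created by EleutherAI and contains 22 subsets."
example = ExtractiveExample(
    context=CONTEXT,
    question="Who created The Pile?",
    answer_text="EleutherAI",
    answer_start=CONTEXT.index("EleutherAI"),
)

start, end = answer_span(example)
print(example.context[start:end])
```

A consistency check like this is a cheap way to catch offset drift after text normalization or re-tokenization.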

Text Summarization

CNN/Daily Mail: A classic news summarization dataset containing news articles and their bullet-point summaries.
XSum: An extreme abstractive summarization dataset that requires generating a single, highly concise summary for a news article.
Gigaword: A large news headline generation dataset aimed at compressing the first sentence of a news article into a headline.
SAMSum: A dataset containing everyday conversations and their summaries, focusing on the task of dialogue summarization.
BillSum: A dataset of bill texts from the U.S. Congress and California legislature, along with their summaries.
Multi-News: A dataset for multi-document summarization, requiring the generation of a summary for a cluster of news from multiple source documents.

Sentiment Analysis

IMDb Reviews: Contains 50,000 movie reviews from IMDb with positive or negative sentiment labels.
Amazon Reviews: A massive product review dataset containing hundreds of millions of reviews in its recent releases, usable for rating prediction and fine-grained sentiment analysis.
SST (Stanford Sentiment Treebank): Provides not only sentence-level sentiment but also fine-grained sentiment labels for each phrase in the syntactic tree.
Yelp Reviews: Similar to Amazon Reviews, containing business reviews and ratings from the Yelp platform.
Sentiment140: A dataset containing 1.6 million tweets with sentiment labels.
Financial PhraseBank: A dataset containing sentences from financial news and their sentiment polarity (positive/negative/neutral).

Machine Translation

WMT Series: Multilingual parallel corpora released at the annual Workshop on Machine Translation, one of the most authoritative benchmarks in the field.
OPUS: An open-source project that collects a large number of parallel corpora from the web, covering numerous language pairs.
Tatoeba: A community platform for collecting example sentences and translations, providing sentence-aligned data for a large number of language pairs.
UN Parallel Corpus: Contains parallel corpora of official UN documents in six languages (English, French, Spanish, Russian, Chinese, Arabic), with formal and high-quality text.
Europarl: Contains parallel corpora of the proceedings of the European Parliament, covering multiple European languages.
ParaCrawl: A large-scale parallel corpus between English and multiple European languages, based on web crawling.

5.3 Knowledge Graphs

WordNet: A large lexical database of English, grouping words into sets of synonyms (synsets) and connecting them through semantic relations.
DBpedia: A large-scale, multilingual knowledge graph built by extracting structured content from Wikipedia infoboxes.
YAGO: A large knowledge base that integrates information from Wikipedia, WordNet, and GeoNames, known for its high accuracy.
Freebase: A large, collaboratively built knowledge graph. Although it is no longer maintained, its data dumps remain available and part of its content has been migrated into Wikidata.
Wikidata: A free, collaborative, multilingual knowledge base operated by the Wikimedia Foundation. It provides structured data for Wikipedia and is one of the most active public knowledge graphs today.
ConceptNet: A knowledge graph containing common-sense relationships between words and phrases, designed to help machines understand the meanings in human language.
Google Knowledge Graph: The powerful knowledge base that supports Google Search. While not fully open, some information can be accessed via its API.
Microsoft Concept Graph: A large-scale concept graph released by Microsoft that associates entities with abstract concepts.
NELL (Never-Ending Language Learner): A project at Carnegie Mellon University that aims to automatically build and expand a knowledge base by continuously reading the web.
BabelNet: A multilingual lexicalized knowledge graph that integrates WordNet with Wikipedia and other multilingual resources.
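
Most of the resources in this subsection can be read as collections of (subject, relation, object) triples. The sketch below uses a handful of toy triples invented for illustration (they are not an extract from WordNet, Wikidata, or any other graph), and the relation names and helper functions are assumptions of this example. It shows how semantic relations can be indexed and followed transitively, for instance to walk an "is a" hierarchy.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Toy triples invented for illustration; not taken from any real graph.
TRIPLES: List[Triple] = [
    ("dog", "is_a", "canine"),
    ("canine", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("dog", "synonym", "domestic dog"),
]


def build_index(triples: List[Triple]) -> Dict[str, Dict[str, Set[str]]]:
    """Index triples as subject -> relation -> set of objects."""
    index: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        index[subject][relation].add(obj)
    return index


def ancestors(index: Dict[str, Dict[str, Set[str]]], entity: str,
              relation: str = "is_a") -> List[str]:
    """Follow `relation` edges transitively, returning nodes in BFS order."""
    seen: List[str] = []
    frontier = [entity]
    while frontier:
        node = frontier.pop(0)
        for parent in sorted(index.get(node, {}).get(relation, ())):
            if parent not in seen:
                seen.append(parent)
                frontier.append(parent)
    return seen


index = build_index(TRIPLES)
print(ancestors(index, "dog"))
```

Real knowledge graphs are usually queried through their own dumps, APIs, or SPARQL endpoints; the in-memory index here only illustrates the data model.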

VI. Computer Vision

Description: Data used to train machines to "see" and understand images and videos.

6.1 Image Classification

ImageNet: Contains over 14 million images classified according to the WordNet hierarchy, and is the cornerstone of the deep learning revival in computer vision.
CIFAR-10 & CIFAR-100: Each contains 60,000 small 32x32 pixel color images, divided into 10 and 100 classes respectively, widely used for algorithm prototyping and teaching.
MNIST: Contains 70,000 grayscale images of handwritten digits (0-9), and is the most classic and commonly used introductory dataset in machine learning.
Fashion-MNIST: Designed as a drop-in replacement for MNIST, it contains 70,000 grayscale images of Zalando fashion articles in 10 categories and is more challenging than digit recognition.
Caltech-101 & Caltech-256: Contain images of 101 and 256 object categories respectively. Caltech-101 has roughly 40 to 800 images per category, and both are often used for object recognition research.
Oxford Flowers 102: An image classification dataset of 102 common British flower species, with 40 to 258 images per category.
Stanford Cars: Contains 16,185 images of 196 car classes, with data labeled by manufacturer, model, and year, used for fine-grained image classification.
FGVC-Aircraft: A fine-grained aircraft classification dataset containing 10,000 images covering 100 aircraft models.
Food-101: Contains 101,000 images of 101 food categories (1,000 per category), used for food recognition tasks.
Places: A large dataset containing 10 million labeled scene images, covering over 400 unique scene types from homes to natural landscapes.

6.2 Object Detection & Segmentation

COCO (Common Objects in Context): A dataset for general object detection, segmentation, and image captioning, containing over 200,000 labeled images and 80 object categories. It remains one of the most widely used benchmarks.
PASCAL VOC: A classic benchmark for object detection, segmentation, and classification. Although smaller in scale, its evaluation criteria have had a profound impact.
Google's Open Images: A very large-scale dataset of about 9 million images, annotated with image-level labels spanning thousands of classes, bounding boxes for 600 object categories, segmentation masks, and visual relationships.
LVIS (Large Vocabulary Instance Segmentation): An instance segmentation dataset for a large number of object categories (over 1200), featuring a long-tail distribution.
Objects365: A large-scale object detection dataset containing over 600,000 images and 365 object categories, focusing on common objects.
ADE20K: A complex dataset for scene parsing and segmentation, containing over 20,000 images with dense annotations of objects and object parts.
Cityscapes: Focuses on semantic and instance segmentation of urban street scenes, containing high-quality pixel-level annotations from street scenes in 50 cities.
KITTI Vision Benchmark: A classic autonomous driving dataset providing annotations for various tasks such as 2D/3D object detection, tracking, and road segmentation.
Mapillary Vistas: A large-scale street-level image dataset containing 25,000 high-resolution images from around the world, with pixel-level and instance-level annotations.
DAVIS (Densely Annotated Video Segmentation): A video object segmentation benchmark with high-quality, dense pixel-level annotations, evaluated in semi-supervised, unsupervised, and interactive settings.
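
Detection benchmarks such as PASCAL VOC and COCO score a predicted box by its intersection over union (IoU) with a ground-truth box; the classic PASCAL VOC criterion counts a detection as correct when IoU exceeds 0.5. The following is a minimal sketch of that computation, assuming boxes in the COCO annotation layout `[x, y, width, height]` with the origin at the top-left corner. The function names and sample boxes are illustrative and do not come from any benchmark toolkit.

```python
from typing import Sequence, Tuple


def xywh_to_xyxy(box: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert [x, y, width, height] to corner form (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two [x, y, width, height] boxes."""
    ax1, ay1, ax2, ay2 = xywh_to_xyxy(box_a)
    bx1, by1, bx2, by2 = xywh_to_xyxy(box_b)
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Illustrative boxes: the prediction is shifted 50 pixels to the right.
ground_truth = [10, 10, 100, 100]
prediction = [60, 10, 100, 100]

score = iou(ground_truth, prediction)
print(f"IoU = {score:.3f}, correct at 0.5 threshold: {score > 0.5}")
```

Official evaluation code (for example the COCO API) additionally averages precision over several IoU thresholds and handles crowd regions, which this sketch leaves out.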

6.3 Faces & People

CelebA: Contains over 200,000 celebrity face images, each annotated with 40 attribute features and 5 key points.
LFW (Labeled Faces in the Wild): A database for studying face recognition in unconstrained environments, containing over 13,000 face images collected from the web.
FFHQ (Flickr-Faces-HQ): A dataset of 70,000 high-resolution (1024x1024), high-quality face images, often used for training and evaluating generative adversarial networks like StyleGAN.
VGGFace2: A large-scale face recognition dataset containing over 3 million images of more than 9,000 individuals, covering large variations in pose, age, and ethnicity.
WIDER FACE: A large-scale face detection benchmark dataset containing 32,203 images and 393,000 face bounding boxes, with high diversity in scale, pose, and occlusion.
MS-Celeb-1M: A large-scale dataset of about 10 million images covering roughly 100,000 celebrity identities drawn from a list of 1 million, built to advance large-scale face recognition research. Microsoft withdrew the dataset in 2019.
300-VW: A dataset for facial landmark detection in videos in the wild, containing 114 video sequences covering various poses, lighting, and occlusion conditions.
Market-1501: A large-scale person re-identification dataset containing 32,668 bounding boxes of 1,501 different identities captured by 6 cameras.
DukeMTMC-reID: Another large-scale person re-identification dataset, containing 36,411 bounding boxes of 1,812 identities captured from 8 camera views.
MPII Human Pose: A dataset for evaluating human pose estimation, containing about 25,000 images of over 40,000 people with annotated body joints, covering 410 human activities.

6.4 Scene Understanding

Cityscapes: Focuses on semantic segmentation, instance segmentation, and dense pixel annotation of urban street scenes, and is a core benchmark for autonomous driving scene understanding.
ADE20K/MIT Scene Parsing: A complex dataset for scene parsing, containing over 20,000 images; the scene parsing benchmark uses 150 object and stuff categories.
ScanNet: A rich 3D dataset of indoor scenes, containing about 2.5 million RGB-D views from more than 1,500 scans of real-world indoor spaces, along with 3D camera poses, surface reconstructions, and instance-level semantic annotations.
SUN RGB-D: An RGB-D dataset for 3D scene understanding, containing over 10,000 RGB-D images with 2D polygon and 3D bounding box annotations.
NYU Depth Dataset V2: An RGB-D image dataset of various indoor scenes, recorded with a Microsoft Kinect, featuring dense pixel-level labels.
KITTI-360: A large-scale dataset that records 360-degree perception of urban scenes with rich annotations via an autonomous driving platform.
Hypersim: A photorealistic synthetic dataset for indoor scene understanding, containing 77,400 images with detailed per-pixel annotations and corresponding ground truth geometry.
Matterport3D: An RGB-D dataset composed of 90 building-scale scenes, containing 10,800 panoramic views with dense 2D and 3D semantic annotations.
GTA-5: A synthetic dataset of 24,966 images rendered by the game Grand Theft Auto V, with pixel-level semantic annotations compatible with the Cityscapes classes, widely used for synthetic-to-real domain adaptation in driving scenes.
BDD100K: A large-scale autonomous driving dataset containing 100,000 videos, annotated with 2D bounding boxes, drivable areas, lane markings, and full-frame instance segmentation.

VII. Autonomous Driving & Robotics

Description: Contains multi-sensor fusion data (camera, LiDAR, radar, etc.) for autonomous systems.

7.1 Large-Scale Public Datasets

Waymo Open Dataset: Provides high-quality, multi-sensor data (camera, LiDAR) covering complex urban environments, focusing on 2D/3D detection, tracking, and domain adaptation.
nuScenes: A large-scale autonomous driving dataset containing a full sensor suite of data from cameras, LiDAR, and radar, with detailed 3D bounding box annotations and map information.
KITTI Vision Benchmark Suite: An early and classic autonomous driving dataset benchmark, providing stereo images, LiDAR point clouds, and 3D object annotations, which has spurred a great deal of algorithm research.
Argoverse: Provides two datasets: a 3D tracking dataset with 3D tracking annotations and HD maps, and a motion forecasting dataset with rich trajectory data for studying future path prediction of vehicles and pedestrians.
A*3D: An autonomous driving dataset collected in Singapore by A*STAR, focusing on challenging environments, with about 39,000 frames of camera images and LiDAR point clouds and roughly 230,000 3D object annotations.
Lyft Level 5 AV Dataset: Contains a perception dataset of over 55,000 human-labeled 3D annotated frames with camera and LiDAR data, and a motion prediction dataset of over 1,000 hours of driving collected by a fleet of 20 autonomous vehicles.
PandaSet: Provides data collected by a mechanical spinning LiDAR, a forward-facing LiDAR, and six cameras, suitable for autonomous driving research, with annotations for over 16,000 LiDAR sweeps and over 48,000 camera images.
H3D (Honda Research Institute 3D Dataset): A multi-sensor dataset with full-surround 3D annotations, collected using a 3D LiDAR, providing detailed object annotations.
ONCE (One millioN sCenEs): A large-scale autonomous driving dataset containing 1 million scenes, aimed at advancing supervised and self-supervised learning for 3D object detection.
ZOD (Zenseact Open Dataset): An autonomous driving dataset for the global research community, containing multi-sensor data under various weather and lighting conditions, and supporting tasks like 2D/3D object detection and tracking.
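
Several entries above ship 3D object annotations as plain-text label files. As a minimal sketch, the snippet below parses one line in the 15-field layout used by the KITTI object detection labels (type, truncation, occlusion, observation angle, 2D box, 3D dimensions, 3D location, yaw). The class name `KittiObject`, the function name, and the sample line are illustrative assumptions, not part of any official toolkit.

```python
from dataclasses import dataclass


@dataclass
class KittiObject:
    category: str       # e.g. "Car", "Pedestrian", "Cyclist", "DontCare"
    truncated: float    # 0.0 (fully inside the image) to 1.0 (fully truncated)
    occluded: int       # 0 = visible, 1 = partly, 2 = largely occluded, 3 = unknown
    alpha: float        # observation angle in radians
    bbox: tuple         # 2D box in pixels: (left, top, right, bottom)
    dimensions: tuple   # 3D size in metres: (height, width, length)
    location: tuple     # 3D position in camera coordinates: (x, y, z)
    rotation_y: float   # yaw around the camera Y axis in radians


def parse_kitti_label_line(line: str) -> KittiObject:
    """Parse one whitespace-separated label line into a KittiObject."""
    fields = line.split()
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:15]]
    return KittiObject(
        category=fields[0],
        truncated=values[0],
        occluded=int(values[1]),
        alpha=values[2],
        bbox=tuple(values[3:7]),
        dimensions=tuple(values[7:10]),
        location=tuple(values[10:13]),
        rotation_y=values[13],
    )


# Hypothetical sample line, written only to illustrate the field order.
SAMPLE = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"

obj = parse_kitti_label_line(SAMPLE)
print(obj.category, obj.bbox, obj.location)
```

Other datasets in this list use their own schemas (often JSON or protocol buffers), so the field order here should not be assumed to carry over.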

7.2 Driving Behavior & Video

Berkeley DeepDrive BDD100K: One of the largest and most diverse autonomous driving datasets available, containing 100,000 driving videos annotated with 2D bounding boxes, drivable areas, lane markings, and full-frame instance segmentation.
Cityscapes: Focuses on semantic and instance segmentation of urban street scenes, containing high-quality pixel-level annotations from street scenes in 50 cities, and is an important benchmark for scene understanding.
ApolloScape: A large-scale autonomous driving dataset providing annotated data for multiple tasks such as scene parsing, lane segmentation, self-localization, and trajectory prediction.
DR(eye)VE: An autonomous driving video dataset that incorporates driver gaze data, used for studying the relationship between driver attention and autonomous driving perception systems.
HEV-I (Honda Egocentric View-Intersection): Provides egocentric driving video clips recorded at intersections, used for studying future vehicle localization and trajectory prediction.
LISA (Laboratory for Intelligent and Safe Automobiles): Provides datasets on traffic signs, traffic lights, vehicle detection, and driver behavior, focusing on specific driving scenarios.
comma2k19: A dataset containing over 33 hours of highway commute driving recorded on California's Highway 280, aimed at promoting reproducible research in autonomous driving systems.
D²-City: A large-scale dashcam dataset focusing on detection and tracking tasks in diverse driving scenarios.
TUM Traffic Dataset: Provides data on complex urban traffic scenes, including detailed trajectory annotations for vehicles, bicycles, and pedestrians.
KITTI-360: A large-scale dataset that records 360-degree perception of urban scenes with rich annotations via an autonomous driving platform, including semantic instance annotations and dynamic object trajectories.

7.3 Robotics

RoboNet: A large-scale, open-source dataset of robot interaction videos, containing data from multiple robot platforms performing grasping tasks in different environments.
RLBench: A large-scale, diverse robot learning and benchmarking platform, providing a large number of customizable vision-driven manipulation tasks.
Matterport3D: An RGB-D dataset composed of 90 building-scale scenes, suitable for indoor robot navigation and scene understanding research.
Gibson Environment: A virtual environment database based on real-world scans, providing a realistic simulation platform for robot navigation and vision tasks.
Open X-Embodiment: A very large-scale robot manipulation dataset that aggregates over 1 million trajectories from 21 institutions, covering 22 robot embodiments and more than 500 skills.
YCB Object and Model Set: A widely used set of objects, including 3D models and physical properties, serving as a benchmark object set for robot grasping and manipulation research.
House3D: A rich, interactive virtual environment built on the SUNCG dataset, suitable for robot navigation and scene understanding research.
ScanNet: A dataset containing 3D reconstructions of real-world indoor scenes, providing rich semantic annotations suitable for robot perception and planning in complex environments.
The Stanford 2D-3D-Semantics Dataset (2D-3D-S): Provides comprehensive, dense indoor scene annotations from multiple sensor modalities, supporting both 2D and 3D scene understanding.
CARLA Simulator: An open-source autonomous driving simulator. While primarily used for autonomous driving, it also provides a highly configurable virtual testing platform for robot navigation and research.

VIII. Audio & Music

Description: Data for speech recognition, music information retrieval, and audio generation.

8.1 Speech Recognition

LibriSpeech: An English speech dataset of approximately 1000 hours based on LibriVox audiobooks, which has become a standard benchmark for training and evaluating speech recognition systems.
Common Voice: A crowdsourcing project initiated by Mozilla to create a free, open, multilingual speech dataset, currently containing thousands of hours of speech data in dozens of languages.
TIMIT: A classic English speech recognition dataset containing broadband recordings of 630 speakers, particularly suitable for phoneme recognition research.
VoxCeleb1 & VoxCeleb2: Large-scale speaker recognition datasets extracted from YouTube videos, together containing over one million utterances from more than 7,000 speakers.
Libri-Light: A large-scale English speech dataset of about 60,000 hours of unlabeled audio from LibriVox, plus small labeled subsets, aimed at advancing semi-supervised and unsupervised learning research in speech recognition.
Google's Speech Commands Dataset: A dataset for limited-vocabulary speech recognition and keyword spotting, containing spoken short words (e.g., "yes", "no", "stop", "go").
TED-LIUM: A speech recognition corpus built from TED talk audio and their transcripts, containing high-quality educational and speech content.
AISHELL: A large-scale open-source Mandarin Chinese speech database recorded by 400 speakers, containing 178 hours of high-quality recordings.
CHiME: A series of challenge datasets focusing on speech recognition and enhancement in noisy everyday environments, containing multi-channel audio.
AMI Meeting Corpus: Contains 100 hours of meeting recordings, with close-talk and microphone array audio, as well as transcriptions, summaries, and annotations.

8.2 Music & Sound Effects

Million Song Dataset (MSD): A large-scale dataset containing audio features and metadata for one million contemporary popular music tracks, a cornerstone of music information retrieval research.
Freesound: A collaborative, crowdsourced database containing hundreds of thousands of Creative Commons licensed sound effects and short audio clips, with extremely diverse content.
NSynth (Neural Audio Synthesis): A large-scale dataset of over 300,000 musical notes from about 1,000 instruments, each annotated with pitch, velocity, and instrument information.
MusicNet: A dataset of 330 freely licensed classical music recordings, with over 1 million labels indicating the timing of each note and the instrument that plays it.
MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization): A dataset containing over 200 hours of paired audio and MIDI data from piano performances.
Lakh MIDI Dataset: A collection of 176,581 unique MIDI files, 45,129 of which are matched and aligned to entries in the Million Song Dataset, widely used for music generation and symbolic music analysis.
GTZAN Genre Collection: A classic dataset for music genre classification, containing 100 30-second audio clips for each of 10 genres.
MedleyDB: A dataset containing multitrack audio, suitable for research on tasks like melody extraction and source separation.
FMA (Free Music Archive): A high-quality, legally licensed audio dataset containing over 100,000 tracks, with rich metadata and pre-extracted features.
AudioSet: A large-scale audio event dataset released by Google, containing over 2 million 10-second YouTube video clips annotated with 632 audio event categories.
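
The clip counts and clip lengths quoted above translate directly into total audio duration, which helps when estimating storage and training time. The short calculation below is a sketch using only the figures stated in this list (GTZAN: 10 genres with 100 clips of 30 seconds each; AudioSet: about 2 million clips of 10 seconds); the AudioSet result is a lower-bound estimate because the list says "over 2 million".

```python
def total_hours(num_clips: int, seconds_per_clip: float) -> float:
    """Total duration in hours for equally long clips."""
    return num_clips * seconds_per_clip / 3600


# GTZAN: 10 genres x 100 clips per genre, 30 seconds each.
gtzan_clips = 10 * 100
gtzan_hours = total_hours(gtzan_clips, 30)

# AudioSet: roughly 2 million clips of 10 seconds each (lower bound).
audioset_hours = total_hours(2_000_000, 10)

print(f"GTZAN: {gtzan_clips} clips, about {gtzan_hours:.1f} hours")
print(f"AudioSet: at least about {audioset_hours:,.0f} hours")
```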

IX. Social Science & Networks

Description: Data on human social behavior, network structures, and economic activities.

9.1 Social Networks

SNAP (Stanford Network Analysis Project): Provides numerous large-scale social and information network datasets, covering online social platforms, citation networks, collaboration networks, etc.
Network Repository: An interactive data warehouse containing thousands of graph/network datasets from various domains (social, biological, technological, etc.).
Konect (Koblenz Network Collection): A large collection of hundreds of different types of networks (e.g., social, hyperlink, communication networks).
The Colorado Index of Complex Networks (ICON): A searchable index of complex network datasets, linking to hundreds of sources worldwide.
Social Computing Data Repository at Arizona State University: Provides datasets from social media platforms (like Twitter, Flickr) for social computing research.
Twitter API (Academic Research track): Formerly gave approved researchers access to the full archive of public tweets for studying public conversations and social network structures; the track was discontinued in 2023, and access now goes through the paid X API tiers.
Facebook Data for Good (e.g., Social Connectedness Index): Provides anonymized, aggregated data based on the Facebook social graph for studying social connections and population mobility.
Reddit Submission and Comment Data (via Pushshift.io): An archive of public Reddit data containing billions of posts and comments, valuable for studying online communities and discussions; since 2023, access to the Pushshift service has been restricted to approved Reddit moderators.
Wikipedia Link Data: Provides hyperlink network data between Wikipedia pages, which can be used to study information structure and knowledge dissemination.
Ethereum Transaction Network: Public transaction network data based on the blockchain, used to study transaction behavior and network evolution in the cryptocurrency economy.
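
Many of the network collections above distribute graphs as plain-text edge lists. The sketch below reads the layout commonly used by SNAP files, where lines starting with `#` are comments and every other line holds a source and a target node ID separated by whitespace, then counts node degrees while treating edges as undirected. The function names and the toy graph are assumptions made for illustration.

```python
from collections import Counter


def read_edge_list(lines):
    """Return (source, target) integer pairs, skipping comments and blanks."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def degree_counts(edges):
    """Count how many edges touch each node (edges treated as undirected)."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree


# Hypothetical toy graph in edge-list form.
SAMPLE_LINES = [
    "# Toy graph for illustration only",
    "# FromNodeId\tToNodeId",
    "0\t1",
    "0\t2",
    "1\t2",
    "2\t3",
]

edges = read_edge_list(SAMPLE_LINES)
degrees = degree_counts(edges)
print(len(edges), dict(degrees))
```

For real graphs with millions of edges, a dedicated graph library is usually a better fit than plain Python containers.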

9.2 Recommendation Systems

MovieLens: Released by the GroupLens research group, it provides movie rating datasets ranging from 100,000 to 25 million ratings, and is a classic benchmark for recommendation system research and teaching.
Book-Crossing Dataset: Contains about 1.1 million ratings of roughly 270,000 books from the Book-Crossing community.
Amazon Product Data: Contains product reviews and metadata (like "also bought" relationships) from Amazon, spanning a long period and covering a wide range of categories.
Yelp Dataset: Contains business, review, and user data from Yelp, suitable for business recommendation and user behavior analysis.
Last.fm Music Dataset: Contains user music listening information and social network data from the Last.fm platform, used for music recommendation research.
Goodbooks-10k: A curated, high-quality dataset containing 10,000 popular books and about 6 million ratings.
Jester: A collaborative filtering dataset of joke ratings, given on a continuous scale from -10 to +10.
Criteo AI Lab's Ad Click Prediction Dataset: A large-scale dataset for ad click-through rate prediction, containing billions of impression records.
Netflix Prize Dataset: Although the original competition dataset is no longer officially available, its derivatives and simulated datasets are still widely used in recommendation system research.
MIND (Microsoft News Dataset): A large-scale dataset for news recommendation research, containing user click behavior and news article information.
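
Rating datasets like these are typically distributed as flat files with one interaction per line. As a minimal sketch, the snippet below parses lines in the tab-separated layout of the MovieLens 100K `u.data` file (user ID, item ID, rating, timestamp) and computes the mean rating per item. The function name and the sample rows are hypothetical; larger MovieLens releases use CSV files with a header instead.

```python
from collections import defaultdict


def mean_rating_per_item(lines):
    """Average the ratings of each item from tab-separated rating lines."""
    totals = defaultdict(lambda: [0.0, 0])  # item_id -> [rating sum, rating count]
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _user_id, item_id, rating, _timestamp = line.split("\t")
        entry = totals[int(item_id)]
        entry[0] += float(rating)
        entry[1] += 1
    return {item: total / count for item, (total, count) in totals.items()}


# Hypothetical sample rows: user, item, rating, Unix timestamp.
SAMPLE_LINES = [
    "1\t10\t4\t881250949",
    "2\t10\t2\t881250950",
    "1\t20\t5\t881250951",
    "3\t20\t5\t881250952",
]

means = mean_rating_per_item(SAMPLE_LINES)
print(means)
```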

9.3 Economics & Finance

Quandl (now Nasdaq Data Link): An aggregator of economic and financial data, providing millions of time-series datasets covering stocks, futures, macroeconomic indicators, etc.
Yahoo Finance: Provides extensive historical stock prices, dividends, and split data through its historical data pages; programmatic access typically relies on unofficial client libraries, since the official API was retired.
WRDS (Wharton Research Data Services): A comprehensive data platform for business schools and researchers worldwide, containing many authoritative business and financial databases like Compustat and CRSP (usually requires institutional subscription).
FRED (Federal Reserve Economic Data): Maintained by the St. Louis Fed, it provides hundreds of thousands of U.S. and international macroeconomic time-series data, such as GDP, inflation, and employment.
World Bank Open Data: Provides free and comprehensive data on global development, including thousands of indicators on poverty, education, health, economy, etc.
IMF Data: The International Monetary Fund publishes data on international financial statistics, exchange rates, government finance, external debt, etc.
OECD.Stat: The data portal of the Organisation for Economic Co-operation and Development, providing comparative statistics and indicators for its member countries.
Kaggle Financial Datasets: The Kaggle platform hosts a large number of finance-related datasets, such as stock prices, cryptocurrency data, and loan risk.
Cryptocurrency Historical Prices (e.g., from CoinMarketCap): Provides historical daily/hourly market price and trading volume data for thousands of cryptocurrencies.
Eurostat: The statistical office of the European Union, providing high-quality statistical information on EU member states to compare situations across European countries.

X. Law & Public Affairs

Description: Legal documents, court judgments, government files, etc.

10.1 Legal Datasets

Harvard Dataverse: An open data repository containing a large amount of social science research data, including numerous datasets related to law, politics, and public policy.
European Union Law (EUR-Lex): The official legal database of the European Union, providing comprehensive access to EU treaties, legislation, case law, and preparatory acts.
Caselaw Access Project (CAP): Aims to provide free access to the complete digital versions of all published U.S. court cases, containing over 6.5 million cases.
CourtListener: A free legal search engine and database providing opinions, orders, and related information from U.S. state and federal courts.
Oyez: A free project providing a complete archive of audio, transcripts, case summaries, and opinions from the U.S. Supreme Court.
WorldLII (World Legal Information Institute): A global legal information portal that integrates judgments, legislation, and treaties from multiple legal databases around the world.
ICLR (The Incorporated Council of Law Reporting): Provides authoritative law reports for England and Wales, an important resource for the common law system.
PACER (Public Access to Court Electronic Records): The public access service for electronic records of U.S. federal courts, providing online access to U.S. federal court documents (usually requires registration and per-page fees).
UN Treaty Collection: Provides an online database of United Nations treaties, including detailed information on all registered UN treaties and their status.
OpenLaw: An open-source platform providing searchable legal documents, judgments, and regulations, and exploring the application of blockchain in the legal field.
Legal Natural Language Processing (NLP) Datasets: Benchmarks such as LegalBench and LexGLUE, built to evaluate NLP models in the legal domain.
Google Scholar Case Law: Google Scholar's case law search engine, containing a wide range of legal opinions from U.S. state and federal courts.
ACLU (American Civil Liberties Union) Data: Provides datasets and analytical reports related to civil rights, liberties, and litigation in the United States.
OpenCorporates: The world's largest open database of companies, providing registration information for over 200 million companies, used for research on corporate legal structures.
GovTrack.us: Tracks the status of U.S. congressional legislation and provides data on legislators and voting records, all of which is available for free download.

XI. Multimodal & Reinforcement Learning

Description: Combines multiple data types (e.g., image + text) or is used to train agents to act in an environment.

11.1 Multimodal Datasets

Conceptual Captions: Contains about 3.3 million images and their descriptions obtained from the web, with a highly automated and large-scale annotation process.
MS-COCO (Common Objects in Context): In addition to providing excellent object detection and segmentation annotations, each image also includes five human-written descriptive captions, making it a cornerstone of multimodal research.
Flickr30k: Contains 31,000 images, each with five independent descriptive sentences written by crowdsourced workers, often used for image captioning and retrieval.
Visual Genome: A very densely annotated dataset containing over 100,000 images with region descriptions, objects, attributes, relationships, and question-answer pairs.
LAION-5B: A massive multimodal dataset containing 5.85 billion image-text pairs, built by filtering Common Crawl data with CLIP, and used to train large multimodal models such as OpenCLIP and Stable Diffusion.
HowTo100M: A dataset of 136 million video clips from YouTube instructional videos, accompanied by narrative text generated by automatic speech recognition, used for learning visual and language correspondence.
AudioSet: A large-scale audio event dataset. Although it primarily focuses on audio, its data is sourced from YouTube videos, making it inherently multimodal (audio + video frames).
VQA (Visual Question Answering) v2: A visual question answering dataset containing open-ended questions from COCO images, requiring the model to understand both the image content and the question text to answer correctly.
ImageNet: Although primarily used for image classification, each of its categories is associated with a WordNet synset, providing structured semantic information that can be used to explore the relationship between vision and semantics.
YouCook2: A dataset for cooking video description, containing 2000 long videos annotated with procedural segments and descriptions.
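
Image-captioning datasets such as MS-COCO and Flickr30k attach several captions to each image, so a common first step is grouping captions by image. The sketch below does this for a small inline JSON document that follows the COCO captions layout (an `images` list with `id` and `file_name`, and an `annotations` list with `image_id` and `caption`). The sample content and the function name are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical miniature annotation file in the COCO captions layout.
SAMPLE_JSON = """
{
  "images": [
    {"id": 1, "file_name": "a.jpg"},
    {"id": 2, "file_name": "b.jpg"}
  ],
  "annotations": [
    {"id": 101, "image_id": 1, "caption": "A dog runs on grass."},
    {"id": 102, "image_id": 1, "caption": "A brown dog playing outside."},
    {"id": 103, "image_id": 2, "caption": "Two people ride bicycles."}
  ]
}
"""


def captions_by_file(coco: dict) -> dict:
    """Map each image file name to the list of its captions."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"])
    return dict(grouped)


pairs = captions_by_file(json.loads(SAMPLE_JSON))
for file_name, captions in pairs.items():
    print(file_name, len(captions))
```

Web-scale sets such as LAION-5B are instead shipped as tables of image URLs and text, so they need a download step before this kind of pairing applies.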

11.2 Reinforcement Learning Environments & Benchmarks

OpenAI Gym (now Gymnasium): An open-source toolkit for developing and comparing reinforcement learning algorithms, providing a diverse range of environments from classic control to Atari games.
DeepMind Control Suite: A set of continuous control tasks based on the MuJoCo physics engine, serving as a benchmark for reinforcement learning agents, focusing on physical movement and control.
Atari Grand Challenge Dataset: Contains recorded human gameplay demonstrations on Atari games (including frame sequences and actions), which can be used for imitation learning or offline reinforcement learning research.
DeepMind Lab (DM Lab): Built by DeepMind on id Software's Quake III Arena engine, it provides a first-person perspective 3D navigation and task-solving environment.
Procgen Benchmark: A set of 16 procedurally generated game environments designed to measure how well reinforcement learning agents generalize instead of memorizing a fixed set of levels.
Meta-World: A benchmark composed of 50 different robotic manipulation tasks, used to measure the performance of meta-reinforcement learning and multi-task learning algorithms.
D4RL (Datasets for Deep Data-Driven Reinforcement Learning): A benchmark designed to advance offline reinforcement learning, providing standardized datasets in multiple domains (e.g., Adroit robotic hand, MuJoCo locomotion tasks).
MineRL: A large-scale dataset and environment built around the game Minecraft, aimed at solving complex, hierarchical decision-making tasks through imitation learning and reinforcement learning.
StarCraft II Learning Environment: A reinforcement learning environment based on StarCraft II. Its huge state and action spaces and partial observability make it one of the most demanding testbeds for studying multi-agent reinforcement learning and macro-level decision making.
OpenSpiel: A framework that integrates multiple games and environments for studying general reinforcement learning and multi-agent interaction, including games from chess to poker.
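
Most of the environments above are driven through the same agent-environment loop. The sketch below illustrates that loop with a hypothetical toy environment, `CorridorEnv`, which mimics the Gymnasium calling convention (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, and info) without importing the library. The environment and the policies are invented for illustration and are not part of any benchmark listed here.

```python
class CorridorEnv:
    """Hypothetical toy task: start at cell 0 and walk right to the last cell."""

    def __init__(self, length: int = 5, max_steps: int = 20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed=None):
        # The seed is accepted for interface parity; this toy task is deterministic.
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action: int):
        # Action 1 moves right, any other action moves left (clamped to the corridor).
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        terminated = self.position == self.length - 1           # goal reached
        truncated = not terminated and self.steps >= self.max_steps  # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy, seed=0):
    """Run one episode and return (total reward, number of steps taken)."""
    obs, _info = env.reset(seed=seed)
    total_reward = 0.0
    done = False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total_reward += reward
        done = terminated or truncated
    return total_reward, env.steps


always_right = run_episode(CorridorEnv(), lambda obs: 1)
always_left = run_episode(CorridorEnv(), lambda obs: 0)
print(always_right, always_left)
```

Separating `terminated` (the task ended) from `truncated` (a time limit cut the episode short) matters for value-based methods, which is why the loop checks both flags.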

This article was written by the author with the assistance of artificial intelligence (such as outlining, draft generation, and improving readability), and the final content was fully fact-checked and reviewed by the author.