Supervised machine learning (ML) algorithms require labeled data for training and validation. What counts as labeled data depends on the problem being addressed. For example, an algorithm designed to analyze screening chest CT images and estimate the probability of a positive screening result (the probability that the person has cancer) may only need an outcome measure such as a pathology-confirmed cancer diagnosis, whereas a researcher focused on improving image segmentation would need accurate expert segmentations of the appropriate regions of interest. TCIA continues to improve its support for Artificial Intelligence and Machine Learning research by asking data submitters for more detailed information and by improving our approaches to helping researchers find the data they need.

Finding annotated datasets on TCIA

Many collections on TCIA contain annotations that can be used for training and testing artificial intelligence models. However, users who are not familiar with medical imaging may benefit from help identifying and using data types suitable for common deep learning tasks such as image classification, object detection, and object segmentation. This page summarizes key information about finding image labels that may not be obvious to researchers who do not have a background in radiology or histopathology. It also links to third-party resources that may be useful to data scientists who are new to working with medical images.

Object Segmentation

Many collections on TCIA contain information about regions of interest indicated in the images. In radiology images these are typically created by one or more radiologists hand-drawing regions of interest around objects such as the patient's tumor(s) or organs on each image. These kinds of data can be shared using a few different file formats.

DICOM provides support for these kinds of data using SEG and RTSTRUCT modalities.
Many popular open-source tools export these labels in other formats. Popular formats include NIfTI, NRRD, and MHA.

On TCIA you can find these data in a couple of ways.

When Browsing Collections you can filter for SEG, RTSTRUCT (which represent DICOM modalities) or Segmentation (non-DICOM like NIfTI/NRRD/MHA) in the modality column.
When Browsing Analysis Results derived from TCIA collections, use the filter above the table to search for "segmentations", which will find any instance of this term in the Analysis Artifacts column.
We have a Jupyter notebook that shows how to use Python to search for and visualize segmentation data via our REST APIs.
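
As a rough illustration of the API-based route, the sketch below builds a series query and filters the returned metadata for segmentation modalities using only the Python standard library. The base URL, the `getSeries` endpoint, and the `Collection` and `Modality` query parameters reflect our understanding of the public NBIA REST API and should be verified against the current API guide. The collection name and the sample records are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public NBIA REST API; verify against the API guide.
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1"

# DICOM modalities that carry segmentation-style annotations.
SEGMENTATION_MODALITIES = {"SEG", "RTSTRUCT"}


def build_series_query(collection, modality=None):
    """Return a getSeries URL for a collection, optionally limited to one modality."""
    params = {"Collection": collection}
    if modality:
        params["Modality"] = modality
    return f"{BASE_URL}/getSeries?{urlencode(params)}"


def fetch_series(collection, modality=None, timeout=30):
    """Download series metadata as a list of dicts (requires network access)."""
    with urlopen(build_series_query(collection, modality), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    # An empty body is treated as "no matching series".
    return json.loads(body) if body.strip() else []


def segmentation_series(series):
    """Keep only series whose Modality is SEG or RTSTRUCT."""
    return [s for s in series if s.get("Modality") in SEGMENTATION_MODALITIES]


# Hypothetical records shaped like series metadata, so the filter runs offline.
sample = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "SEG"},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "RTSTRUCT"},
]
print(build_series_query("Example-Collection", "SEG"))
print([s["SeriesInstanceUID"] for s in segmentation_series(sample)])
```

Calling `fetch_series` with a real collection name and passing the result to `segmentation_series` would apply the same filter to live metadata. Non-DICOM segmentations (NIfTI/NRRD/MHA) are not DICOM series, so this modality filter would not be expected to find them.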

Image classification

TCIA includes a wealth of non-image data that could be utilized for image classification purposes. When Browsing Collections you can try the following:

Image Analyses
1. You can filter the table for "Image Analyses" which may include expert-derived image annotations (e.g. Where is the tumor located? What is the shape of the tumor?) or quantitative imaging features (e.g. What is the tumor volume? What is the texture of the tumor?).
2. The Browse Analysis Results page also contains similar types of data that were published by researchers who analyzed TCIA collections.
Clinical data - Filter the table for Demographic, Diagnosis, Molecular Test, Treatment, or Follow-Up values in the Data Types column.
Healthy controls or other diseases (for studying cancer detection) - Filter the table for "non-cancer" to show datasets that come from screening studies, include other diagnoses (e.g. COVID-19), or contain healthy controls.
Distinguishing between cancer types (e.g. low grade vs high grade gliomas) - Cancer Type is one of the columns on the Browse Collections page, making it easy to filter or search for datasets based on this criterion.
Genomic/Proteomic subtypes - Filter the table for "genomics" or "proteomics" on the Browse Collections page to find datasets with this type of information. In most cases you will need to retrieve specific details about the patients' genomic/proteomic data from external databases such as NCI's Genomic Data Commons or Proteomic Data Commons. Please note these websites are not supported by TCIA staff, but we do coordinate with the teams that operate these archives to ensure common patient identifiers are used, which enables you to link these data to TCIA images.
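
Because clinical, genomic, and proteomic data are linked to images through common patient identifiers, a classification label set can often be assembled by joining a clinical table to image series on the patient ID. The sketch below shows one way this might look in plain Python. The `Grade` field, the patient IDs, and the series UIDs are hypothetical; real column names vary by collection.

```python
from collections import Counter


def attach_labels(series, clinical, label_field):
    """Pair each image series with its patient's label; skip unlabeled patients."""
    # Map PatientID -> label, ignoring rows where the label is missing or empty.
    labels = {
        row["PatientID"]: row[label_field]
        for row in clinical
        if row.get(label_field)
    }
    return [
        {
            "SeriesInstanceUID": s["SeriesInstanceUID"],
            "PatientID": s["PatientID"],
            "label": labels[s["PatientID"]],
        }
        for s in series
        if s["PatientID"] in labels
    ]


# Hypothetical image series metadata (one patient may have several series).
series = [
    {"SeriesInstanceUID": "1.2.3.1", "PatientID": "P001"},
    {"SeriesInstanceUID": "1.2.3.2", "PatientID": "P001"},
    {"SeriesInstanceUID": "1.2.3.3", "PatientID": "P002"},
    {"SeriesInstanceUID": "1.2.3.4", "PatientID": "P003"},
]

# Hypothetical clinical table; P003 has no recorded grade.
clinical = [
    {"PatientID": "P001", "Grade": "high grade"},
    {"PatientID": "P002", "Grade": "low grade"},
    {"PatientID": "P003", "Grade": ""},
]

labeled = attach_labels(series, clinical, "Grade")
print(Counter(item["label"] for item in labeled))
```

Counting labels after the join is a quick check for class imbalance and for patients who were dropped because no label was available.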

Interactive Python notebooks and tcia_utils package

A series of notebooks demonstrates how to access and work with TCIA datasets using Python and our REST APIs. Most of them rely heavily on tcia_utils, a Python package that provides functions to make it easier to work with TCIA datasets.

Guidance on sharing and using datasets related to Machine Learning or Artificial Intelligence studies on TCIA

In the case of "radiomics" and other quantitative imaging features it is critical to use standardized image feature definitions such as those outlined in this publication.
Radiology: Artificial Intelligence has initiated a collection of articles to address challenges of bias in medical imaging AI systems which we recommend researchers keep in mind when publishing or using datasets on TCIA.
Please review our letter to the editors of Radiology: Imaging Cancer to learn more about TCIA's plans and recommendations to address demographic disparities in our datasets.
We encourage researchers to consider the use of https://aime-registry.org/ as a way to provide a detailed reporting record of their AI systems.

List of Deep Learning Parameters

Information about deep learning parameters is also necessary for researchers to reproduce deep learning experiments. Where applicable, we recommend that data submitters include the following key pieces of information in their dataset summaries so that TCIA users can easily reproduce their study and compare their analysis results.

Deep Neural Network (DNN) Name - for example, VGG16, ResNet-101, UNet, etc., or a link to GitHub repository or manuscript for customized DNNs if applicable.
Data Augmentation Methods - for example, color augmentation (HSV or RGB color space), transformation, noise, GAN, patch generation, downsizing parameters, etc.
Training, Validation, and Testing Set Configuration - for example, number of samples in each set, total number of samples, etc.
Hyperparameters - for example, learning rate, early stopping, batch size, number of epochs, etc.
Training Statistics - for example, wall time spent in training, accuracy metrics and whether the average score or best score is reported, etc.
Training Environment - for example, GPU type, Deep Learning framework used such as TensorFlow/PyTorch, number of GPUs, number of nodes, etc.
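
One possible way to capture the items above in a machine-readable form is a small record that can be saved alongside a dataset summary. The sketch below is only an illustration: the `TrainingReport` class, its field names, and every value shown are hypothetical and are not a TCIA-defined schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingReport:
    """Hypothetical record mirroring the recommended deep learning parameters."""

    network: str                # DNN name or link to a repository/manuscript
    augmentations: list         # data augmentation methods
    split_sizes: dict           # samples in each of train/validation/test
    hyperparameters: dict       # learning rate, batch size, epochs, ...
    training_statistics: dict   # wall time, metrics, average vs best score
    environment: dict           # GPU type/count, framework, number of nodes

    def total_samples(self):
        """Total number of samples across all sets."""
        return sum(self.split_sizes.values())


# Illustrative values only; they do not describe any real experiment.
report = TrainingReport(
    network="ResNet-101",
    augmentations=["rotation", "gaussian noise"],
    split_sizes={"train": 700, "validation": 150, "test": 150},
    hyperparameters={"learning_rate": 1e-4, "batch_size": 16, "epochs": 50},
    training_statistics={"wall_time_hours": 6.5, "reported_score": "best"},
    environment={"framework": "PyTorch", "gpu_count": 2, "nodes": 1},
)

# Serialize to JSON so the record can be shared with the dataset summary.
report_json = json.dumps(asdict(report), indent=2)
print(report_json)
print("total samples:", report.total_samples())
```

Keeping the split sizes in one place makes the total sample count derivable rather than separately reported, which avoids the two numbers drifting apart.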

American College of Radiology's "Define-AI" Use Case Directory

The ACR Data Science Institute's Define-AI Use Case Directory was created to empower AI developers to produce algorithms that are clinically relevant, ethical, and effective. Each use case provides narrative descriptions and flow charts which specify the health care goal of the algorithm, the required clinical input, how it should integrate into the clinical workflow and how it will interface with users and tools. Publicly available datasets which could potentially be used to tackle these use cases are listed at the bottom of each Use Case page, many of which include TCIA datasets. Visit https://www.acrdsi.org/DSI-Services/Define-AI to learn more.

Third party tips and tutorials for applying deep learning to medical imaging data

RSNA Deep Learning Lab courses
https://mayo-radiology-informatics-lab.github.io/MIDeL/index.html - MIDeL is a website to help healthcare professionals and medical imaging scientists learn to apply deep learning methods to medical images. It consists of a comprehensive text (think of an electronic textbook) combined with actual code examples to help you learn about Deep Learning.
https://github.com/RSNA/MagiciansCorner - Notebooks, datasets, and other content for the Radiology: Artificial Intelligence series known as Magician's Corner by Brad Erickson
http://modelhub.ai/ - a repository of self-contained deep learning models pretrained for a wide variety of applications which includes many models trained with TCIA datasets along with example notebooks
https://www.youtube.com/watch?v=-XUKq3B4sdw - how a radiologist interprets lung CTs
https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial - how to pre-process images for deep learning
https://theaisummer.com/medical-image-coordinates/ - DICOM deep learning for medical imaging novices
https://developer.nvidia.com/clara-medical-imaging - NVIDIA package for simplifying deep learning tasks in medical imaging
https://forums.fast.ai/t/fastai-v2-has-a-medical-imaging-submodule/56117 - FastAI package for simplifying deep learning in medical imaging
"TCIA as a Centralized Data Resource for Development of AI" from RSNA 2019
https://www.kaggle.com/marcovasquez/basic-eda-data-visualization - RSNA intracranial hemorrhage guide

Important Information -- Because of changes in NIH policy, Collection downloads that previously required login-access are no longer available via TCIA.
