Supervised machine learning (ML) algorithms require labeled data for training and validation. What counts as labeled data depends on the problem being addressed. For example, an algorithm designed to analyze screening chest CT images and estimate the probability of a positive screening result (the probability that the person has cancer) may only need an outcome measure such as a pathology-confirmed cancer diagnosis, whereas a researcher focused on improving image segmentation would need accurate expert segmentations of the appropriate regions of interest. TCIA continues to improve its support for Artificial Intelligence and Machine Learning based research by asking data submitters for more detailed information and by improving our approaches to helping researchers find the data they need.
Finding annotated datasets on TCIA
Many collections on TCIA contain annotations which can be used for training and testing artificial intelligence models. However, users who are not familiar with medical imaging may benefit from assistance in learning how to identify and utilize data types suitable for common deep learning tasks such as image classification, object detection, and object segmentation. This page summarizes key information about finding image labels that may not be obvious to researchers who do not have a background in radiology or histopathology. It also provides links to third party sources which could be useful to data scientists who are new to working with medical images.
Many collections on TCIA contain information about regions of interest indicated in the images. In radiology images these are typically created by one or more radiologists hand-drawing regions of interest around objects such as the patient's tumor(s) or organs on each image. These kinds of data can be shared using a few different file formats.
- DICOM provides support for these kinds of data using SEG and RTSTRUCT modalities.
- Many popular open-source tools export these labels in other formats. Popular formats include NIFTI, NRRD, and MHA.
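As a rough illustration of the formats listed above, the following sketch guesses the format of a downloaded label file from its file name. The suffix table is an assumption made for this example, not a TCIA convention, and the function name is hypothetical. Note that a DICOM file cannot be identified as SEG or RTSTRUCT from its name alone; that requires reading the Modality attribute from the DICOM header.

```python
from pathlib import Path

# Assumed mapping from file suffix to annotation format (illustrative only).
SUFFIX_TO_FORMAT = {
    ".dcm": "DICOM",
    ".nii": "NIfTI",
    ".nii.gz": "NIfTI",
    ".nrrd": "NRRD",
    ".mha": "MHA",
}


def annotation_format(filename: str) -> str:
    """Return the likely format of a label file based on its suffix."""
    name = Path(filename).name.lower()
    # Check longer suffixes first so that ".nii.gz" is matched as a whole.
    for suffix in sorted(SUFFIX_TO_FORMAT, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TO_FORMAT[suffix]
    return "unknown"


# Hypothetical file names, used only to show the behavior.
for example in ["tumor_mask.nii.gz", "liver.NRRD", "1-1.dcm", "notes.txt"]:
    print(example, "->", annotation_format(example))
```

A lookup like this is only a first pass for sorting a download folder; the file contents remain the authoritative source for what a label file actually holds.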
On TCIA you can find these data in a couple of ways.
- When Browsing Collections
  - You can look for SEG / RTSTRUCT in the modality column to determine where DICOM segmentations or contours are available.
  - You can also filter for "Image Analyses" in the supporting data column. If a collection says "Image Analyses" but does not include SEG or RTSTRUCT in the modality column, this is typically because the analysis was shared in some other format. This could be segmentation data in NIFTI/NRRD/MHA formats, but it might also represent some other kind of analysis such as image classification.
- When Browsing Analysis Results derived from TCIA collections, simply use the filter above the table to search for "segmentations" which will find any instance of this in the Analysis Artifacts column.
- We also have a Jupyter notebook that shows how to use Python to search for and visualize segmentation data via our REST APIs.
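The modality-based search described above can be sketched in a few lines of Python. This is a minimal sketch, not the notebook's code: the base URL, the `getSeries` endpoint, and the `Collection` / `Modality` parameter and field names are assumptions that should be checked against the current TCIA REST API documentation, and the helper names and sample records are hypothetical. The sketch only builds the query URL and filters records that have already been retrieved, so it makes no network calls.

```python
from urllib.parse import urlencode

# Assumed base URL of the TCIA REST API; verify against the API guide.
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1"

# DICOM modalities that carry segmentations or contours.
ANNOTATION_MODALITIES = {"SEG", "RTSTRUCT"}


def series_query_url(collection: str, modality: str) -> str:
    """Build a series query URL for one collection and one modality."""
    query = urlencode({"Collection": collection, "Modality": modality})
    return f"{BASE_URL}/getSeries?{query}"


def annotation_series(series_records):
    """Keep only the series whose modality is SEG or RTSTRUCT."""
    return [r for r in series_records if r.get("Modality") in ANNOTATION_MODALITIES]


# Made-up records standing in for a decoded JSON response.
records = [
    {"SeriesInstanceUID": "1.2.3", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.4", "Modality": "SEG"},
    {"SeriesInstanceUID": "1.2.5", "Modality": "RTSTRUCT"},
]

print(series_query_url("EXAMPLE-COLLECTION", "SEG"))
print([r["SeriesInstanceUID"] for r in annotation_series(records)])
```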
TCIA includes a wealth of non-image data that could be utilized for image classification purposes.
- Image Analyses
  - You can filter the table for "Image Analyses" on the Browse Collections page to find datasets with this type of information. Image Analyses could include expert-derived image annotations (e.g. Where is the tumor located? What is the shape of the tumor?) or quantitative imaging features (e.g. What is the tumor volume? What is the texture of the tumor?).
  - The Browse Analysis Results page contains similar types of analysis data that were published by researchers who analyzed TCIA collections.
- Clinical data (e.g. outcomes, stage) - Filter the table for "clinical" on the Browse Collections page to find datasets with this type of information.
- Distinguishing between cancer types (e.g. low grade vs high grade gliomas) - Cancer Type is one of the columns on the Browse Collections page, making it easy to filter or search for datasets based on this criterion.
- Genomic/Proteomic subtypes - Filter the table for "genomics" or "proteomics" on the Browse Collections page to find datasets with this type of information. In most cases you will need to retrieve specific details about the patients' genomic/proteomic data from external databases such as NCI's Genomic Data Commons or Proteomic Data Commons. Please note these websites are not supported by TCIA staff, but we do coordinate with the teams that operate these archives to ensure that common patient identifiers are used, which enables you to link these data to TCIA images.
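Because the non-image data share patient identifiers with the images, attaching classification labels to image series amounts to a join on the patient ID. The sketch below illustrates this with plain dictionaries. The column names (`PatientID`, `SeriesInstanceUID`, `Grade`), the function name, and the sample rows are all assumptions made for this example; real clinical spreadsheets vary by collection.

```python
from collections import defaultdict


def link_by_patient(clinical_rows, series_rows, id_key="PatientID"):
    """Attach the list of series UIDs to each clinical record by patient ID."""
    series_by_patient = defaultdict(list)
    for series in series_rows:
        series_by_patient[series[id_key]].append(series["SeriesInstanceUID"])

    linked = []
    for row in clinical_rows:
        # Patients without any image series get an empty list.
        series_uids = series_by_patient.get(row[id_key], [])
        linked.append({**row, "series": series_uids})
    return linked


# Made-up rows standing in for a clinical table and a series listing.
clinical = [
    {"PatientID": "P-001", "Grade": "low"},
    {"PatientID": "P-002", "Grade": "high"},
]
series = [
    {"PatientID": "P-001", "SeriesInstanceUID": "1.2.3"},
    {"PatientID": "P-001", "SeriesInstanceUID": "1.2.4"},
]

for record in link_by_patient(clinical, series):
    print(record["PatientID"], record["Grade"], record["series"])
```

Keeping the join at the patient level also makes it straightforward to split training and testing sets by patient rather than by image.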
Interactive Python notebooks and tcia_utils package
There are a series of notebooks which demonstrate how to access and work with TCIA datasets using Python and our REST APIs. Most of them heavily leverage functionality from tcia_utils, which is a Python package that aims to provide functions to make it easier to work with TCIA datasets.
Guidance on sharing datasets related to Machine Learning or Artificial Intelligence studies on TCIA
In the case of "radiomics" and other quantitative imaging features it is critical to use standardized image feature definitions such as those outlined in this publication.
Deep Learning parameters are also critical for researchers who want to reproduce Deep Learning experiments. Where applicable, we recommend that data submitters include the following key pieces of information in their dataset summaries so that TCIA users can easily reproduce their study and compare their analysis results.
List of Deep Learning Parameters
- Deep Neural Network (DNN) Name - for example, VGG16, ResNet-101, UNet, etc., or a link to GitHub repository or manuscript for customized DNNs if applicable.
- Data Augmentation Methods - for example, color augmentation (HSV or RGB color space), transformation, noise, GAN, patch generation, downsizing parameters, etc.
- Training, Validation, and Testing Set Configuration - for example, number of samples per set, total number of samples, etc.
- Hyperparameters - for example, learning rate, early stopping, batch size, number of epochs, etc.
- Training Statistics - for example, wall time spent in training, accuracy metrics (including whether the average score or the best score is reported), etc.
- Training Environment - for example, GPU type, Deep Learning framework used such as TensorFlow/PyTorch, number of GPUs, number of nodes, etc.
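One way to capture the parameters listed above in a machine-readable form is a small structured record serialized to JSON. The following is a minimal sketch, not a TCIA-required schema: the class name, field names, and every value shown are placeholders invented for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingReport:
    """Hypothetical record mirroring the list of Deep Learning parameters."""
    network_name: str
    augmentation: list
    split_sizes: dict
    hyperparameters: dict
    training_statistics: dict
    environment: dict

    def total_samples(self) -> int:
        """Total number of samples across all sets."""
        return sum(self.split_sizes.values())

    def to_json(self) -> str:
        """Serialize the report with stable key order for easy comparison."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; replace with the details of the actual study.
report = TrainingReport(
    network_name="UNet",
    augmentation=["rotation", "gaussian noise"],
    split_sizes={"training": 700, "validation": 150, "testing": 150},
    hyperparameters={"learning_rate": 1e-4, "batch_size": 8, "epochs": 50},
    training_statistics={"wall_time_hours": 6.5, "score_reported": "best"},
    environment={"framework": "PyTorch", "gpu_type": "example GPU", "gpus": 1, "nodes": 1},
)

print(report.total_samples())
print(report.to_json())
```

A file like this could be included alongside the dataset summary so that other researchers can compare configurations without parsing free text.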
We also encourage you to review the following papers:
American College of Radiology's "Define-AI" Use Case Directory
The ACR Data Science Institute's Define-AI Use Case Directory was created to empower AI developers to produce algorithms that are clinically relevant, ethical, and effective. Each use case provides narrative descriptions and flow charts which specify the health care goal of the algorithm, the required clinical input, how it should integrate into the clinical workflow, and how it will interface with users and tools. Publicly available datasets which could potentially be used to tackle these use cases are listed at the bottom of each Use Case page, and many of these lists include TCIA datasets. Visit https://www.acrdsi.org/DSI-Services/Define-AI to learn more.
Third party tips and tutorials for applying deep learning to medical imaging data
- RSNA Deep Learning Lab courses
- https://mayo-radiology-informatics-lab.github.io/MIDeL/index.html - MIDeL is a website to help healthcare professionals and medical imaging scientists learn to apply deep learning methods to medical images. It consists of a comprehensive text (think of an electronic textbook) combined with actual code examples to help you learn about Deep Learning.
- https://github.com/RSNA/MagiciansCorner - Notebooks, datasets, other content for the Radiology:AI series known as Magicians Corner by Brad Erickson
- http://modelhub.ai/ - a repository of self-contained deep learning models pretrained for a wide variety of applications which includes many models trained with TCIA datasets along with example notebooks
- https://www.youtube.com/watch?v=-XUKq3B4sdw - how a radiologist interprets lung CTs
- https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial - how to pre-process images for deep learning
- https://theaisummer.com/medical-image-coordinates/ - DICOM deep learning for medical imaging novices
- https://developer.nvidia.com/clara-medical-imaging - NVIDIA package for simplifying deep learning tasks in medical imaging
- https://forums.fast.ai/t/fastai-v2-has-a-medical-imaging-submodule/56117 - FastAI package for simplifying deep learning in medical imaging
- "TCIA as a Centralized Data Resource for Development of AI" from RSNA 2019
- https://www.kaggle.com/marcovasquez/basic-eda-data-visualization - RSNA intracranial hemorrhage guide