Supervised machine learning (ML) algorithms require labeled data for training and validation. What counts as labeled data is a complex question that depends on the problem being addressed. For example, an algorithm designed to analyze screening chest CT images and estimate the probability of a positive screening result (the probability that the person has cancer) may only need an outcome measure such as a pathology-confirmed cancer diagnosis, whereas a researcher focused on improving image segmentation would need accurate expert segmentations of the appropriate regions of interest. TCIA continues to improve its support for artificial intelligence and machine learning research by asking data submitters for more detailed information and by improving our approaches to helping researchers find the data they need.
Finding annotated datasets on TCIA
Many collections on TCIA contain annotations which can be used for training and testing artificial intelligence models. However, users who are not familiar with medical imaging may benefit from help learning how to identify and utilize data types suitable for common deep learning tasks such as image classification, object detection, and object segmentation. This page summarizes key information about finding image labels that may not be obvious to researchers who do not have a background in radiology or histopathology. It also provides links to third party resources which could be useful to data scientists who are new to working with medical images.
Many collections on TCIA contain information about regions of interest indicated in the images. In radiology images these are typically created by one or more radiologists hand-drawing regions of interest around objects such as the patient's tumor(s) or organs on each image. These kinds of data can be shared using a few different file formats.
- DICOM provides support for these kinds of data using SEG and RTSTRUCT modalities.
- Many popular open-source tools export these labels in other formats. Popular formats include NIFTI, NRRD, and MHA.
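As a rough illustration of the formats listed above, the following Python sketch guesses the container format of a downloaded annotation file from its file name. The suffix table and the function name `guess_annotation_format` are assumptions made for this example, not part of any TCIA tool. Note that a file suffix cannot distinguish SEG from RTSTRUCT; that information is stored in the Modality attribute inside the DICOM file.

```python
# Assumed mapping from file suffix to annotation container format.
# DICOM files often have no suffix at all, so ".dcm" is only a hint.
SUFFIX_TO_FORMAT = {
    ".nii": "NIFTI",
    ".nii.gz": "NIFTI",
    ".nrrd": "NRRD",
    ".mha": "MHA",
    ".dcm": "DICOM",
}


def guess_annotation_format(filename: str) -> str:
    """Return a best-guess format name based on the file suffix alone."""
    name = filename.lower()
    # Check longer suffixes first so ".nii.gz" is matched as a whole.
    for suffix in sorted(SUFFIX_TO_FORMAT, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TO_FORMAT[suffix]
    return "unknown"


if __name__ == "__main__":
    for example in ["tumor_mask.nii.gz", "liver.NRRD", "contours.dcm", "notes.txt"]:
        print(example, "->", guess_annotation_format(example))
```

A suffix check like this is only a first pass; opening the file with a reader for that format is the reliable way to confirm what it contains.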
On TCIA you can find these data in a couple of ways.
- When Browsing Collections
  - You can look for SEG / RTSTRUCT in the modality column to determine where DICOM segmentations or contours are available.
  - You can also filter for "Image Analyses" in the supporting data column. If a collection says "Image Analyses" but does not include SEG or RTSTRUCT in the modality column, this is typically because the analysis was shared in some other format. This could be segmentation data in NIFTI/NRRD/MHA formats, but it might also represent some other kind of analysis such as image classification.
- When Browsing Analysis Results derived from TCIA collections, simply use the filter above the table to search for "segmentations", which will find any instance of this term in the Analysis Artifacts column.
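The same filtering logic can be applied offline if you keep your own copy of the collections table. The sketch below is a minimal example under stated assumptions: the rows, the collection names, and the keys `modalities` and `supporting_data` are hypothetical stand-ins, not a real TCIA export format.

```python
# Hypothetical rows standing in for a local copy of the Browse Collections table.
collections = [
    {"collection": "Example-A", "modalities": ["CT", "SEG"], "supporting_data": ["Image Analyses"]},
    {"collection": "Example-B", "modalities": ["MR"], "supporting_data": ["Image Analyses", "Clinical"]},
    {"collection": "Example-C", "modalities": ["CT", "RTSTRUCT"], "supporting_data": []},
    {"collection": "Example-D", "modalities": ["PT"], "supporting_data": ["Clinical"]},
]

DICOM_ANNOTATION_MODALITIES = {"SEG", "RTSTRUCT"}


def has_dicom_annotations(row: dict) -> bool:
    """True when the collection lists SEG or RTSTRUCT among its modalities."""
    return bool(DICOM_ANNOTATION_MODALITIES & set(row["modalities"]))


def has_other_analyses(row: dict) -> bool:
    """True when "Image Analyses" is listed but no DICOM annotation modality is."""
    return "Image Analyses" in row["supporting_data"] and not has_dicom_annotations(row)


dicom_annotated = [r["collection"] for r in collections if has_dicom_annotations(r)]
other_format = [r["collection"] for r in collections if has_other_analyses(r)]

if __name__ == "__main__":
    print("DICOM SEG/RTSTRUCT available:", dicom_annotated)
    print("Analyses in another format:", other_format)
```

Collections in the second group still need a manual look, since "Image Analyses" may mean non-DICOM segmentations or a different kind of analysis altogether.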
TCIA includes a wealth of non-image data that could be utilized for image classification purposes.
- Image Analyses
  - Filter the table for "Image Analyses" on the Browse Collections page to find datasets with this type of information. Image Analyses could include expert-derived image annotations (e.g. Where is the tumor located? What is the shape of the tumor?) or quantitative imaging features (e.g. What is the tumor volume? What is the texture of the tumor?).
  - The Browse Analysis Results page contains similar types of analysis data that were published by researchers who analyzed TCIA collections.
- Clinical data (e.g. outcomes, stage) - Filter the table for "clinical" on the Browse Collections page to find datasets with this type of information.
- Distinguishing between cancer types (e.g. low grade vs high grade gliomas) - Cancer Type is one of the columns on the Browse Collections page, making it easy to filter or search for datasets based on this criterion.
- Genomic/Proteomic subtypes - Filter the table for "genomics" or "proteomics" on the Browse Collections page to find datasets with this type of information. In most cases you will need to retrieve specific details about the patients' genomic/proteomic data from external databases such as NCI's Genomic Data Commons or Proteomic Data Commons. Please note these websites are not supported by TCIA staff, but we do coordinate with the teams that operate these archives to ensure that common patient identifiers are used, so that you can link these data to TCIA images.
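Because non-image data and images share patient identifiers, classification labels can be attached to images with a simple lookup. The following Python sketch illustrates the idea; the identifiers, the subtype names, and the field names `patient_id`, `series`, and `label` are all hypothetical, and real files will use whatever columns the data source provides.

```python
# Hypothetical image series records, one per downloaded series.
image_series = [
    {"patient_id": "PAT-001", "series": "CT chest"},
    {"patient_id": "PAT-002", "series": "CT chest"},
    {"patient_id": "PAT-003", "series": "MR brain"},
]

# Hypothetical labels keyed by the shared patient identifier,
# e.g. a subtype or outcome taken from a clinical or genomic table.
subtype_by_patient = {"PAT-001": "subtype-A", "PAT-003": "subtype-B"}


def label_series(series_rows, labels):
    """Attach a label to each series; collect patients that have no label."""
    labeled, unlabeled = [], []
    for row in series_rows:
        label = labels.get(row["patient_id"])
        if label is None:
            unlabeled.append(row["patient_id"])
        else:
            labeled.append({**row, "label": label})
    return labeled, unlabeled


labeled, unlabeled = label_series(image_series, subtype_by_patient)

if __name__ == "__main__":
    print("Labeled series:", labeled)
    print("Patients without a label:", unlabeled)
```

Tracking the unlabeled patients separately makes it easy to see how much of a collection is usable for a given classification task.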
Guidance on sharing datasets related to Machine Learning or Artificial Intelligence studies on TCIA
For "radiomics" and other quantitative imaging features, it is critical to use standardized image feature definitions such as those outlined in this publication.
Researchers need the Deep Learning parameters of a study in order to reproduce its experiments. Where applicable, we recommend that data submitters include the following key pieces of information in their dataset summaries so that TCIA users can easily reproduce their study and compare their analysis results.
List of Deep Learning Parameters
- Deep Neural Network (DNN) Name - for example, VGG16, ResNet-101, UNet, etc., or a link to GitHub repository or manuscript for customized DNNs if applicable.
- Data Augmentation Methods - for example, color augmentation (HSV or RGB color space), transformation, noise, GAN, patch generation, downsizing parameters, etc.
- Training, Validation, and Testing Set Configuration - for example, number of samples in each set, total number of samples, etc.
- Hyperparameters - for example, learning rate, early stopping, batch size, number of epochs, etc.
- Training Statistics - for example, wall time spent in training, accuracy metrics and whether the average score or best score is reported, etc.
- Training Environment - for example, GPU type, Deep Learning framework used such as TensorFlow/PyTorch, number of GPUs, number of nodes, etc.
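One possible way to capture the parameters listed above is a small structured record that can be saved as JSON alongside a dataset summary. This is a sketch, not a TCIA-defined schema: the class name `DeepLearningParameters`, its field names, and every value shown are placeholders chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DeepLearningParameters:
    """One field per item in the parameter list; names are assumptions."""
    network_name: str
    augmentation_methods: list
    split_sizes: dict
    hyperparameters: dict
    training_statistics: dict
    training_environment: dict


# Placeholder values only; they do not describe any real study.
params = DeepLearningParameters(
    network_name="ResNet-101",
    augmentation_methods=["HSV color augmentation", "random noise"],
    split_sizes={"training": 700, "validation": 150, "testing": 150},
    hyperparameters={"learning_rate": 1e-4, "batch_size": 16, "epochs": 50, "early_stopping": True},
    training_statistics={"wall_time_hours": 12.5, "reported_score": "best"},
    training_environment={"framework": "PyTorch", "gpu_type": "example GPU", "gpus": 2, "nodes": 1},
)

# Serialize to JSON so the record can be shared with the dataset summary.
summary_json = json.dumps(asdict(params), indent=2)

if __name__ == "__main__":
    print(summary_json)
```

A machine-readable record like this makes it easier for other researchers to compare runs, although a plain-text description covering the same items serves the same purpose.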
We also encourage you to review the following papers:
Third party tips and tutorials for applying deep learning to medical imaging data
- https://github.com/kirbyju/TCIA_Notebooks/blob/main/README.md - Tutorials for working with TCIA data in Jupyter Notebooks
- http://modelhub.ai/ - a repository of self-contained deep learning models pretrained for a wide variety of applications which includes many models trained with TCIA datasets along with example notebooks
- https://www.youtube.com/watch?v=-XUKq3B4sdw - how a radiologist interprets lung CTs
- https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial - how to pre-process images for deep learning
- https://theaisummer.com/medical-image-coordinates/ - DICOM deep learning for medical imaging novices
- https://developer.nvidia.com/clara-medical-imaging - NVIDIA package for simplifying deep learning tasks in medical imaging
- https://forums.fast.ai/t/fastai-v2-has-a-medical-imaging-submodule/56117 - FastAI package for simplifying deep learning in medical imaging
- "TCIA as a Centralized Data Resource for Development of AI" from RSNA 2019
- https://www.kaggle.com/marcovasquez/basic-eda-data-visualization - RSNA intracranial hemorrhage guide
- https://github.com/RSNA/AI-Deep-Learning-Lab - RSNA 2019 deep learning course
- https://github.com/RSNA/MagiciansCorner - Notebooks, datasets, other content for the Radiology:AI series known as Magicians Corner by Brad Erickson