At the National Cancer Institute's (NCI) direction, personnel from The Cancer Imaging Archive (TCIA) collect and curate clinical and pre-clinical (animal studies) Radiology and Pathology images, clinical trial data (including patient demographics and clinical outcomes), annotations and image derived features and other types of clinical research data (e.g., gene expression profiles). Data comes from NIH programs, funded research and clinical trials. The ultimate goal of TCIA is to make the data publicly available to increase transparency and reproducibility in cancer imaging research. To facilitate this goal, TCIA provides data de-identification, curation and hosting services to remove these burdens from individual investigators and institutions.
This document details TCIA's protocol for data collection, de-identification and curation so that submitting sites are comfortable with the protocol before agreeing to use the established procedures to accomplish these activities. The Department of Biomedical Informatics at UAMS hosts The Cancer Imaging Archive for the National Cancer Institute (NCI); this process is therefore implemented under the supervision of the University of Arkansas for Medical Sciences (UAMS) Institutional Review Board (IRB # 205568).
Obtaining permission to share your data
The value of TCIA increases as we receive new data sharing proposals from the research community. Researchers with the following objectives are encouraged to submit an application to publish their data:
- Meeting the data sharing requirements set forth by the National Cancer Institute (NCI) for grant or contract awards
- Meeting the data sharing requirements set forth by a peer reviewed journal for publication
- Sharing data that could stimulate discoveries in emerging areas of cancer imaging research (e.g. radiogenomics,
- Sharing data to be used in challenge competitions or for benchmarking and validating analysis techniques in image processing
We do not charge a fee for sharing your data through TCIA except in rare circumstances where proposals are extremely large. TCIA is funded by the National Cancer Institute, therefore all applications must have relevance to cancer research. Applications are reviewed monthly by the TCIA Advisory Group to assess their utility to the TCIA user community. A strong preference is given to data sets which can be fully public and do not impose any special access or usage restrictions. Proposals which contain supporting non-image data (e.g. patient outcomes, training classifiers/labels, tumor segmentations) are highly preferred to those which lack these characteristics.
If approved, submitting sites must sign our TCIA UAMS Data Transfer Agreement before data collection is initiated. For cases where submitters are not legally permitted to allow commercial use of their data, please see the TCIA Non-Commercial Data Submission Agreement. NCI and TCIA strongly discourage prohibiting commercial use, as this significantly hinders those who wish to use TCIA data to translate research into clinical patient care. See our Data Usage Policies and Restrictions page for more information about the rules for downstream use of these data.
Data Curation Overview
All data is fully de-identified in accordance with international standards, US laws and UAMS IRB protocol requirements. TCIA then makes the data freely and openly available under Creative Commons licensing, as shown below:
“This collection is freely available to browse, download, and use for commercial, scientific and educational purposes as outlined in the Creative Commons Attribution 3.0 Unported License. See TCIA's Data Usage Policies and Restrictions for additional details. Questions may be directed to firstname.lastname@example.org.”
Prior to uploading data to TCIA, the submitter must obtain IRB approval (or ethics board equivalent) from their institution allowing them to submit a de-identified version of their data to TCIA. The NCI and the submitting site IRB are jointly responsible for reviewing the consent under which data to be submitted to TCIA was originally collected.
All data is anonymized to the fullest extent possible at the submitting site, and then encrypted prior to transmission to UAMS. All incoming data is captured in a quarantine system and treated as if it contains PHI. All TCIA personnel are trained in HIPAA regulations and procedures. TCIA servers are managed by UAMS IT as if they were UAMS clinical systems. Once the full analysis and de-identification is complete, data is moved to a separate public repository and made available to the research community. This process has been reviewed by the UAMS Chief Security Officer.
Pathology Curation Steps
A TCIA submission expert will work with an Imaging point of contact from your site to receive your data. The submission expert will provide instructions to clean common locations where PHI might exist (e.g. file names, slide labels) and a link to upload the data into our secure UAMS Box system. Upon receipt, the slides are visually reviewed for burned-in PHI, incorrectly labeled slides, scan quality issues and mismatches between filenames and labels. Metadata fields are also reviewed to ensure they contain no PHI. Slide types supported include: Aperio (.svs, .tif), Hamamatsu (.vms, .vmu, .ndpi), Leica (.scn), MIRAX (.mrxs), Philips (.tiff), Sakura (.svslide), Trestle (.tif), Ventana (.bif, .tif) and Generic tiled TIFF (.tif).
Radiology Curation Steps
A TCIA submission expert will work with an Imaging point of contact from your site. The expert will provide all the required tools for de-identifying and sending your imaging data and will answer any questions you have throughout the process. These tools have been approved by NIH and comply with the Digital Imaging and Communications in Medicine (DICOM) international standard for medical image de-identification. Following industry best practices, TCIA uses a standards-based approach to de-identification of DICOM images and non-image data to ensure that all data made publicly available are free of protected health information (PHI). The TCIA de-identification process ensures that the HIPAA de-identification standard is met by following the Safe Harbor Method as defined in section 164.514(b)(2) of the HIPAA Privacy Rule, using the following steps:
- TCIA will help your technical point of contact (PACS administrator or designated IT technician, henceforth referred to as "submitter") install TCIA's software on a standard desktop computer. The software runs on regular Windows/Mac/Linux desktop computers and requires Java to also be installed. It does not require any specialized hardware (e.g. servers are not necessary).
- TCIA will help the submitter create mapping tables (which do not leave the submitting site) which our software will use to assign anonymous patient IDs and to offset study dates.
- TCIA will walk the submitter through software testing using a small sample set of their study data (e.g. 1-2 patients).
- TCIA will help the submitter export the full set of imaging studies from their local PACS (or wherever the data resides) into the TCIA software for processing.
- Note: Please do not utilize your PACS system de-identification or other de-identification software as this usually deletes critical information researchers will need to make use of the data.
- TCIA will help the submitter use the software to de-identify the images according to DICOM standards (Attribute Confidentiality Profile – DICOM PS 3.15: Appendix E) before they leave your institution, and then to transmit them to TCIA.
- TCIA quality control and curation staff will work with you to ensure the data are fully de-identified and received. Additional reviews are performed, and any remaining PHI is deleted if found.
- TCIA will publish the final data set with a descriptive page and announce its addition via our mailing list and social media channels.
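The mapping tables mentioned in the steps above can be pictured with a minimal Python sketch. It is illustrative only and is not TCIA's software: the table layout, the `ANON-0001` ID format, the function names and the offset values are all assumptions made for this example. It shows a local table that assigns each patient an anonymous ID and a fixed day offset, then applies that offset to DICOM-style `YYYYMMDD` study dates.

```python
from datetime import datetime, timedelta

# Hypothetical local mapping table (assumed layout). In the workflow described
# above, a table like this never leaves the submitting site.
# Key: real patient ID. Value: (anonymous ID, date offset in days).
MAPPING = {
    "MRN123456": ("ANON-0001", -37),
    "MRN987654": ("ANON-0002", 12),
}


def anonymize_patient_id(patient_id):
    """Return the anonymous ID; raises KeyError for an unmapped patient."""
    return MAPPING[patient_id][0]


def offset_study_date(patient_id, study_date):
    """Shift a YYYYMMDD date string by the patient's fixed offset."""
    offset_days = MAPPING[patient_id][1]
    shifted = datetime.strptime(study_date, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")
```

Because the same offset is applied to every date for a given patient, the interval between two of that patient's studies is unchanged after shifting, which is what keeps longitudinal analysis possible.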
Software used by TCIA
- Clinical Trials Processor (CTP)
- CTP is a stand-alone image processing application for DICOM (radiology) data. Our curation teams provide this tool to submitters in order to perform the initial de-identification of DICOM images before they are transferred to TCIA.
- CTP Wizard
- CTP Wizard is a custom extension of CTP developed by TCIA. It provides an interactive graphical user interface which can be used to more easily walk submitters through the process of importing, de-identifying, and exporting their DICOM data.
- Posda
- Posda is an open source application for the archival and curation of DICOM datasets. After receiving the data from submitters, our curation teams use Posda to perform additional quality checks and ensure all data was completely de-identified. It allows users to import DICOM data while tracking date and time received. Users can also prioritize multiple data submission streams based on assigned priority, identify and resolve duplicate unique identifiers (UIDs) submitted with different image or metadata, and check and edit data for DICOM conformance, consistency, and referential integrity.
DICOM De-identification Details
Following industry best practices, TCIA uses a standards-based approach to de-identification of DICOM images to ensure that images are free of protected health information (PHI). The TCIA de-identification process ensures that the HIPAA de-identification standard is met by following the Safe Harbor Method as defined in section 164.514(b)(2) of the HIPAA Privacy Rule. The standard for de-identification of DICOM objects is defined by Attribute Confidentiality Profile – DICOM PS 3.15: Appendix E. At the submitting site, a DICOM PS 3.15 compliant script removes or modifies DICOM tags deemed to be unsafe (see Table 1 for a complete listing). TCIA incorporates the “Basic Application Confidentiality Profile”, which is amended by inclusion of the following profile options: Clean Pixel Data Option, Clean Descriptors Option, Retain Longitudinal With Modified Dates Option, Retain Patient Characteristics Option, Retain Device Identity Option, and Retain Safe Private Option. The de-identification rules applied to each object are recorded by TCIA in the DICOM De-identification Method Code Sequence [0012,0064] by entering the Code Value, Coding Scheme Designator, and Code Meaning for each profile and option that was applied to the DICOM object during de-identification. The DICOM standard for de-identification of objects defines a minimum set of elements to de-identify in order to comply with the standard. It is up to the user doing the de-identification to ensure that PHI is removed or cleaned according to the laws and practices in place at the time de-identification occurs.
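As a rough illustration of how the applied profile and options might be recorded as coded entries, the Python sketch below builds one item per profile or option with the three fields named above. It is a sketch under stated assumptions, not TCIA's implementation: the code values are recalled from the DICOM controlled terminology (coding scheme "DCM") and should be verified against the current standard, and the plain-dictionary representation and the function name `build_method_code_items` are hypothetical.

```python
# Code values are an assumption recalled from the DICOM "DCM" terminology;
# verify against the current edition of the standard before relying on them.
APPLIED_PROFILES = [
    ("113100", "Basic Application Confidentiality Profile"),
    ("113101", "Clean Pixel Data Option"),
    ("113105", "Clean Descriptors Option"),
    ("113107", "Retain Longitudinal Temporal Information Modified Dates Option"),
    ("113108", "Retain Patient Characteristics Option"),
    ("113109", "Retain Device Identity Option"),
    ("113111", "Retain Safe Private Option"),
]


def build_method_code_items(profiles):
    """Return one coded entry (as a plain dict) per applied profile or option."""
    return [
        {
            "CodeValue": code,
            "CodingSchemeDesignator": "DCM",
            "CodeMeaning": meaning,
        }
        for code, meaning in profiles
    ]


items = build_method_code_items(APPLIED_PROFILES)
```

A DICOM toolkit would write these entries into the sequence element of each object; plain dictionaries are used here only to keep the example free of third-party dependencies.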
Private Tags - DICOM vendors use private tags extensively to store information about scans, and researchers sometimes need that information to make use of the data. Unfortunately, there are many cases where vendors do not make the conformance statement for a piece of equipment publicly available or do not adequately define what is stored in the private tags. When a submitting site sends DICOM data to TCIA, all private tags are retained and then de-identified by TCIA during curation of the data according to the Retain Safe Private Option, which allows for the retention of DICOM tags stored in the private fields. TCIA uses a private tag dictionary maintained by the Posda curation toolkit to decide the disposition of a vendor-written private tag. The Posda private tag dictionary is a compilation of 4 well-known private tag dictionaries:
- the TCIA De-identification Knowledge Base (DeID KB),
- Grassroots DICOM
The addition of the other 3 private tag dictionaries allows for an expanded set of scientifically useful tags to be retained. To implement the new Posda private tag dictionary, TCIA resolved any discrepancies between the 4 included dictionaries and assigned dispositions to all private tags ever seen by Posda. Unique values seen within the private tags were inspected to ensure that dispositions were correctly assigned. If a new private tag is encountered in the Posda database that does not have a private tag disposition, values are inspected in relation to the tag description together with values in the tags and a disposition is assigned. If there is no existing private tag description, an attempt is made to find a manufacturer’s definition of the tag. If no such description can be found the disposition is defined to remove the tag. TCIA will remove any private tags from the images that are not specified in the private tag dictionary or are defined as containing a form of PHI such as name, SSN, etc. All date and datetime private tags that are retained are offset using the same offset as applied to the standard tags for the image. All private tags containing UIDs are assigned a TCIA root and appended with a hashed value as done with the standard tags. This ensures all references to other images contained within TCIA are maintained. A manual inspection of all private tags is performed using tagSniffer reports and any PHI that may be found is removed, emptied, date offset, or hashed as appropriate.
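The disposition lookup and UID handling described above can be sketched in Python as follows. This is a hedged illustration, not Posda's code: the dictionary contents, the creator name `ACME_IMG_01`, the disposition labels, the placeholder root `1.2.3.4` and the choice of SHA-256 are all assumptions. It shows two ideas from the text: a private tag with no entry in the dictionary defaults to removal, and a UID is replaced by a fixed root plus a hash of the original, so the same input always maps to the same output and references between images stay consistent.

```python
import hashlib

# Placeholder root (assumption); a real archive would use its own registered root.
UID_ROOT = "1.2.3.4"

# Hypothetical dispositions keyed by (private creator, element pattern).
DISPOSITIONS = {
    ("ACME_IMG_01", "(0029,xx10)"): "keep",
    ("ACME_IMG_01", "(0029,xx20)"): "offset_date",
    ("ACME_IMG_01", "(0029,xx30)"): "hash_uid",
    ("ACME_IMG_01", "(0029,xx40)"): "remove",
}


def disposition_for(creator, element):
    """Tags absent from the dictionary default to removal."""
    return DISPOSITIONS.get((creator, element), "remove")


def remap_uid(original_uid):
    """Map a UID to root + decimal hash, staying within the 64-character UID limit."""
    digest = int(hashlib.sha256(original_uid.encode("ascii")).hexdigest(), 16)
    suffix = str(digest)[: 64 - len(UID_ROOT) - 1]
    return f"{UID_ROOT}.{suffix}"
```

Hashing rather than assigning random UIDs is what makes the mapping repeatable: a series UID referenced from two different objects is rewritten to the same value in both.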
The following table details the de-identification performed at the submitting site by way of a TCIA supplied de-identification script. All other tags not mentioned in the table below are reviewed and cleaned if necessary during our Posda curation.
Table 1 - DICOM Tags Modified or Removed at the source site (18 forms of PHI as currently defined by the Safe Harbor Method, DICOM PS 3.15 compliant)