Overview
At the direction of the National Cancer Institute (NCI), the staff of The Cancer Imaging Archive (TCIA) have accumulated a wealth of knowledge on best practices and procedures for DICOM image de-identification in the process of maintaining our archive. To share this information with the wider research community, we maintain the following knowledge base. This is a living document and will continue to be updated as we learn from our experiences. If you have feedback or questions, please contact us at feedback@cancerimagingarchive.net.
Background Information
Here are some presentations and papers which provide an overview of various aspects of DICOM de-identification:
- Image Data Sharing for Biomedical Research: Meeting the De-identification and Informatics Challenges publication, Journal of Digital Imaging (DOI: 10.1007/s10278-011-9422-x)
- Image Data Sharing for Biomedical Research: Meeting the De-identification and Informatics Challenges presentation, SIIM Annual Meeting, Washington, D.C., June 4, 2011
- De-identification Revisited - DICOM Supplement 142 presentation, DICOM Conference 2010
- Automated Standards-based Anonymization Profile for Image Sharing Using RSNA's Clinical Trial Processor poster with Q&A session, RSNA Annual Meeting, Chicago, IL, Nov 30, 2009
DICOM Private Data Elements
It is desirable to retain DICOM private data elements that contain parameters describing the acquisition while removing elements containing PHI. Performing this task requires understanding the mechanism defined by DICOM to support private elements. DICOM PS 3.5, section 7.8.1 states:
It is possible that multiple implementors may define Private Elements with the same (odd) group number. To avoid conflicts, Private Elements shall be assigned Private Data Element Tags according to the following rules.
a) Private Creator Data Elements numbered (gggg,0010-00FF) (gggg is odd) shall be used to reserve a block of Elements with Group Number gggg for use by an individual implementor. The implementor shall insert an identification code in the first unused (unassigned) Element in this series to reserve a block of Private Elements. The VR of the private identification code shall be LO (Long String) and the VM shall be equal to 1.
b) Private Creator Data Element (gggg,0010), is a Type 1 Data Element that identifies the implementor reserving element (gggg,1000-10FF), Private Creator Data Element (gggg,0011) identifies the implementor reserving elements (gggg,1100-11FF), and so on, until Private Creator Data Element (gggg,00FF) identifies the implementor reserving elements (gggg,FF00-FFFF).
c) Encoders of Private Data Elements shall be able to dynamically assign private data to any available (unreserved) block(s) within the Private group, and specify this assignment through the blocks corresponding Private Creator Data Element(s). Decoders of Private Data shall be able to accept reserved blocks with a given Private Creator identification code at any position within the Private group specified by the blocks corresponding Private Creator Data Element.
We will use data in group 0009 as a practical example. The table below shows an example of data that could be included in group 0009.
Tag | Description | Value |
---|---|---|
0009, 0010 | Private Creator Element | ACME |
0009, 1001 | Average Density | 15.5 |
0009, 1002 | Density Standard Deviation | 2.2 |
In the example, the element with tag (0009, 0010) is a private creator element with value "ACME". That reserves a block of elements for this manufacturer. The element (0009, 1001) is part of that block; the 10 in the element tag (1001) corresponds to the 10 that is in the tag of the Private Creator Element (0009, 0010).
This only becomes complex when different manufacturers want to use the same private group to store information. When this occurs in a single image, the creator of the image reserves a block (for example, 0010). When a second application wants to add data to that same group, it detects the block written by the creator and reserves a separate block (for example, 0011). The creator is not required to start at block 0010, but that appears to be common practice. Likewise, the second or third application is not required to use 0011 or 0012. Based on this encoding scheme, some observations are:
- If a collection of images is produced by equipment from different manufacturers, you may have collisions in the sets of private elements you want to retain and discard. For example, element (0009, 1001) from manufacturer A may contain an important physical parameter while that same element from manufacturer B may contain PHI.
- If the collection has images that are created by an acquisition modality and are then modified by another application (PACS, workstation), a private group may have multiple reserved blocks. Also, one cannot assume that the original creator will have always chosen reserved block 0010.
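The block mechanism above can be illustrated with a minimal sketch. It assumes a dataset is represented as a plain dictionary from (group, element) tags to string values. The helper names `private_creator_tag` and `private_key`, and the second vendor block, are hypothetical and exist only for illustration. The idea is to identify a private element by its creator string and the low byte of its element number, not by the block it happens to occupy.

```python
from typing import Dict, Optional, Tuple

Tag = Tuple[int, int]  # (group, element), e.g. (0x0009, 0x1001)


def private_creator_tag(tag: Tag) -> Optional[Tag]:
    """Return the Private Creator tag that reserves the block containing `tag`."""
    group, element = tag
    if group % 2 == 0:
        return None  # even groups are standard, not private
    block = element >> 8  # high byte of the element number: 0x10 for 0x1001
    if not 0x10 <= block <= 0xFF:
        return None  # a creator element itself, or outside any reservable block
    return (group, block)


def private_key(dataset: Dict[Tag, str], tag: Tag) -> Optional[Tuple[int, str, int]]:
    """Build a block-independent key: (group, creator string, low byte)."""
    creator_tag = private_creator_tag(tag)
    if creator_tag is None or creator_tag not in dataset:
        return None
    return (tag[0], dataset[creator_tag], tag[1] & 0xFF)


# The example table from the text, plus a second (hypothetical) vendor block.
dataset = {
    (0x0009, 0x0010): "ACME",
    (0x0009, 0x1001): "15.5",
    (0x0009, 0x1002): "2.2",
    (0x0009, 0x0011): "OTHER VENDOR",
    (0x0009, 0x1101): "free text",
}

print(private_key(dataset, (0x0009, 0x1001)))  # (9, 'ACME', 1)
print(private_key(dataset, (0x0009, 0x1101)))  # (9, 'OTHER VENDOR', 1)
```

Keying retain/discard decisions on the creator string rather than on the raw tag avoids the collision described in the first observation, and it keeps working when the original creator did not choose block 0010.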
DICOM Basic Attribute Confidentiality Profile
DICOM Standards Committee Working Group 18 wrote Supplement 142, which is now incorporated into the published DICOM Standard. The Attribute Confidentiality Profile (DICOM PS 3.15: Appendix E) provides a standard for image de-identification. It reduces the complexity involved in safely de-identifying DICOM image data while providing flexibility for scenarios that require preserving information needed for quality control and analysis in research. This is achieved by providing a number of Application Level Confidentiality Profiles, which include a Basic Profile along with a number of Option Profiles. These profiles provide the necessary instructions for how to safely clean DICOM elements which may contain PHI. The DICOM Standard, including Part 15, is available at the NEMA web site: http://medical.nema.org/standard.html. The original Supplement 142 guidance document can be obtained at ftp://medical.nema.org/medical/dicom/final/sup142_ft.doc. We recommend using the published standard because it is updated with any change proposals.
Appendix E of PS 3.15 documents a system for protecting attributes. We quote a small section of the document.
The Attributes listed in Table E.1-1 for each profile are contained in Standard IODs, or may be contained in Standard Extended IODs. An implementation claiming conformance to an Application Level Confidentiality Profile as a de-identifier shall protect or retain all instances of the Attributes listed in Table E.1-1, whether contained in the main dataset or embedded in an Item of a Sequence of Items. The following action codes are used in the table:
– D – replace with a non-zero length value that may be a dummy value and consistent with the VR
– Z – replace with a zero length value, or a non-zero length value that may be a dummy value and consistent with the VR
– X – remove
– K – keep (unchanged for non-sequence attributes, cleaned for sequences)
– C – clean, that is replace with values of similar meaning known not to contain identifying information and consistent with the VR
– U – replace with a non-zero length UID that is internally consistent within a set of Instances
– Z/D – Z unless D is required to maintain IOD conformance (Type 2 versus Type 1)
– X/Z – X unless Z is required to maintain IOD conformance (Type 3 versus Type 2)
– X/D – X unless D is required to maintain IOD conformance (Type 3 versus Type 1)
– X/Z/D – X unless Z or D is required to maintain IOD conformance (Type 3 versus Type 2 versus Type 1)
– X/Z/U* – X unless Z or replacement of contained instance UIDs (U) is required to maintain IOD conformance (Type 3 versus Type 2 versus Type 1 sequences containing UID references)
PS 3.15: E.2 then defines the Basic Application Level Confidentiality Profile which describes how to apply the scheme above with a number of options that determine the scope of protection that is provided. These definitions allow a system to follow a standard procedure and document in a standard way the behavior of that system.
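As an illustration of how the compound action codes quoted above depend on the attribute's type in the IOD, here is a minimal sketch. The function name `resolve_action` is hypothetical, and the sketch assumes the caller already knows whether the attribute is Type 1, 2 or 3 in the IOD at hand. The X/Z/U* code is deliberately not handled because it requires inspecting sequence contents.

```python
def resolve_action(code: str, attribute_type: str) -> str:
    """Pick a single action (D, Z, X, ...) from a possibly compound code.

    attribute_type is the attribute's type in the IOD: "1", "2" or "3".
    """
    alternatives = code.split("/")
    if "U*" in alternatives:
        raise ValueError("X/Z/U* needs sequence-aware handling, not covered here")
    if len(alternatives) == 1:
        return code  # D, Z, X, K, C and U are already unambiguous
    # Type 1 needs a non-empty value (D); Type 2 tolerates an empty value (Z).
    needed = {"1": "D", "2": "Z"}.get(attribute_type)
    if needed in alternatives:
        return needed
    return alternatives[0]  # the first alternative is the default action


for attribute_type in ("1", "2", "3"):
    print(attribute_type, resolve_action("X/Z/D", attribute_type))
```

For "X/Z/D" this prints D for Type 1, Z for Type 2 and X for Type 3, matching the "X unless Z or D is required" wording.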
Software Tools
CTP
TCIA utilizes the RSNA Clinical Trial Processor (CTP) software in conjunction with caBIG's National Biomedical Imaging Archive (NBIA) to de-identify and host the images in the archive. The Cancer Imaging Program's Informatics Team has been working closely with the developer of CTP since 2009 to incorporate support for this standard as it was being defined by WG18. A full summary and timeline of this project can be found at https://wiki.nci.nih.gov/display/CIP/Incorporation+of+DICOM+WG18+Supplement+142+into+CTP.
CTP provides an interface that allows application of any combination of the profiles to a set of images, and allows for application of an audit trail for retroactively tracking applied de‐identification. For images that are submitted to TCIA the staff begins with the Basic Application Confidentiality Profile (which is the most aggressive) in combination with the following options:
- Clean Descriptors Option: Removal of identification information from descriptive tags which contain unstructured plain text values over which an operator has control
- Retain Longitudinal Temporal Information with Modified Dates Option: Modification of tags that contain dates or times
- Retain Patient Characteristics Option: Retention of physical characteristics of the patient that are descriptive rather than identifying information (e.g. metabolic measures, body weight, etc.)
- Retain Device Identity Option: Retention of information about the characteristics of the device used to perform the acquisition
- Retain Safe Private Option: Retention of Private Attributes confirmed not to contain PHI
DICOM Tag Sniffer
In order to simplify our ability to implement some of the "clean" instructions specified in DICOM PS 3.15 a new tool was developed to help inspect the contents of DICOM elements which allow free text entry by a technician and Private Tags for potential PHI. This tool scans a folder and included subfolders for DICOM objects and produces several different outputs that depend on the mode used and input profiles. The software reads each DICOM object and iterates through each public and private element. The software then uses the profiles below to determine whether to retain the value of the element for later inspection:
- Confidentiality Profile: One input profile corresponds to the entries in table E.1-1 in DICOM PS 3.15. We list the attributes in the table and the coded values according to the table entries. When scanning the DICOM objects, each public element is checked against the data in the profile. If the element is found in the profile, the software knows if it should record the element value for later inspection or if the software can ignore it. For example, if the DICOM profile indicates the element is to be deleted, there is no reason to review the value in that element.
- The Confidentiality Profile input is augmented with elements that are known to contain physical parameters such as rows, columns or pixel spacing. Rather than tell the software to ignore values with a specific value representation, we list those elements explicitly.
- Modality Software Profile: This input profile describes the private elements that are documented in the conformance statement by the manufacturer. This file takes into account the Private Creator Data Elements described above and has a code table for indicating program actions (record the value, ignore the value, ...)
These outputs are relevant at different stages of the curation and image publication process.
- Element Inventory: is the set of DICOM tags that are found in the image set. The tags include only the hexadecimal tags (xxxx, yyyy) and no values. All public and private tags are listed, but each is listed only once. The Confidentiality Profile and Modality Software Profile are not consulted as no values are retained for review.
- Element Values, Pre-Deidentification: We want to examine element values to determine how to configure CTP scripts for proper de-identification. As mentioned above, we want to retain as many elements as possible while not exposing PHI. We also do not want to review all element values in all DICOM objects. We use a Confidentiality Profile that corresponds to the DICOM Basic Application Confidentiality Profile and a Modality Software Profile that properly describes the private elements in the DICOM objects.
- Element Values, Final Review: In this mode, we want to review the values in the DICOM objects just before publication. We have de-identified the data and want to analyze the data as a final check. In this mode, we use a different Confidentiality Profile and different Modality Software Profile. For the Confidentiality Profile, we only list elements that we know are physical parameters (rows, columns, ....) and do not include the DICOM references from PS 3.15, Table E.1-1. That will direct the software to record the element values. Likewise, the Modality Software Profile used will direct the software to record all values for later analysis.
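The difference between the Element Inventory and the Element Values outputs can be sketched as follows. This is not the Tag Sniffer's actual code. It assumes datasets are plain dictionaries from tags to values, and the `PROFILE` table, its "IGNORE" code and both function names are hypothetical stand-ins for the input profiles described above.

```python
from typing import Dict, Iterable, List, Set, Tuple

Tag = Tuple[int, int]
Dataset = Dict[Tag, str]

# Hypothetical profile: tag -> action. "X" mirrors a Table E.1-1 removal code;
# "IGNORE" marks physical parameters that never need review.
PROFILE: Dict[Tag, str] = {
    (0x0010, 0x1000): "X",       # Other Patient IDs: removed, nothing to review
    (0x0028, 0x0010): "IGNORE",  # Rows
    (0x0028, 0x0011): "IGNORE",  # Columns
}


def element_inventory(datasets: Iterable[Dataset]) -> List[Tag]:
    """Every tag seen in the image set, listed once, without values."""
    seen: Set[Tag] = set()
    for ds in datasets:
        seen.update(ds)
    return sorted(seen)


def values_for_review(datasets: Iterable[Dataset],
                      profile: Dict[Tag, str]) -> Dict[Tag, Set[str]]:
    """Collect distinct values for elements the profile does not settle."""
    review: Dict[Tag, Set[str]] = {}
    for ds in datasets:
        for tag, value in ds.items():
            if profile.get(tag) in ("X", "IGNORE"):
                continue  # deleted later or known physical parameter
            review.setdefault(tag, set()).add(value)
    return review


images = [
    {(0x0008, 0x1030): "CHEST CT", (0x0028, 0x0010): "512"},
    {(0x0008, 0x1030): "CHEST CT 2", (0x0010, 0x1000): "MRN123"},
]
print(element_inventory(images))
print(values_for_review(images, PROFILE))
```

Passing an empty profile to `values_for_review` records every value, which corresponds to the more exhaustive Final Review mode.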
We believe this tool might be useful to the rest of the research community, so it has been made freely available as an open source application. We have also created documentation on how researchers could use it in the context of their own projects. This can be found at https://mirgforge.wustl.edu/gf/project/dicomtagsniffer/.
TCIA De-identification Work Flow
The TCIA provides standards‐based curation support to ensure safe and thorough de‐identification of all images in the archive per federal HIPAA and HITECH regulations. In order to achieve this compliance without stripping the data of its scientific utility TCIA staff perform a redundant, thorough de‐identification and analysis procedure based on guidance provided by the industry experts in DICOM standards committee Working Group 18. Each collection submitted for publication is analyzed and de-identified as a whole using the steps listed below. All steps are completed before the collection is released for publication.
1. Each image in the collection is visually inspected to guarantee there is no PHI burned into the pixel data.
2. Tag Sniffer is used to review the collection and produce an Element Inventory that is annotated with data from the DICOM Basic Application Confidentiality Profile and our set of Modality Software Profiles. This produces the list of DICOM elements found in the collection with a simple annotation scheme:
   - One of the Basic Application Confidentiality Profile codes that indicates the DICOM scheme for de-identification (if the element is listed by DICOM)
   - A simple code from our Modality Software Profile (No PHI: Retain, PHI: Delete, Not Sure: Review)
   - No code, indicating the element is not registered
3. The Pre-Deidentification output of the Tag Sniffer is also generated. This will contain the set of elements in the collection and all values that need to be reviewed for PHI. If the Basic Application Confidentiality Profile or applicable Modality Software Profile indicates the attribute is to be cleaned or that the attribute is a physical parameter that does not contain PHI, there is no need to review that element at this step. We know that our de-identification script will process the element properly.
4. We combine the information from steps 2 and 3 to create a CTP de-identification script for the collection. In the event of multiple scanners from different manufacturers, we might create and apply different scripts based on manufacturer.
5. The CTP de-identification script (or scripts) is applied to the image collection and a separate copy of the images is created. That is, we retain the original set in case we need to repeat a step.
6. Tag Sniffer is used to review the de-identified images and create the Final Review output. This is a more complete output that is reviewed by analysts to guarantee there is no PHI carried forward after de-identification. Both public and private elements are included in the output for review.
7. If any errors are detected in de-identification in step 6, the CTP script is adjusted and the image set is processed again starting at step 5.
Only after this inspection is complete are the images made available to the general public. For general information on what to expect as an image provider please see our web site at http://www.cancerimagingarchive.net/provider.html.
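The annotation scheme used for the Element Inventory in step 2 of the work flow could be sketched as below. This is an illustrative assumption rather than the Tag Sniffer's implementation. The function name, the `DICOM:`/`MODALITY:` prefixes and the sample profile entries are hypothetical, and private elements are keyed by raw tag here only to keep the example short.

```python
from typing import Dict, Iterable, Optional, Tuple

Tag = Tuple[int, int]


def annotate_inventory(tags: Iterable[Tag],
                       dicom_profile: Dict[Tag, str],
                       modality_profile: Dict[Tag, str]) -> Dict[Tag, Optional[str]]:
    """Attach one annotation per tag, mirroring the three cases in step 2."""
    annotated: Dict[Tag, Optional[str]] = {}
    for tag in tags:
        if tag in dicom_profile:
            # Element is listed by DICOM: carry the action code (D, Z, X, ...).
            annotated[tag] = "DICOM:" + dicom_profile[tag]
        elif tag in modality_profile:
            # Element is documented by the vendor: Retain, Delete or Review.
            annotated[tag] = "MODALITY:" + modality_profile[tag]
        else:
            annotated[tag] = None  # not registered anywhere
    return annotated


dicom_profile = {(0x0010, 0x0010): "Z"}           # Patient's Name
modality_profile = {(0x0009, 0x1001): "Retain"}   # hypothetical vendor entry
inventory = [(0x0009, 0x1001), (0x0009, 0x1002), (0x0010, 0x0010)]

for tag, note in annotate_inventory(inventory, dicom_profile, modality_profile).items():
    print("(%04X,%04X)" % tag, note)
```

Elements that come back with no annotation are the ones that most need an analyst's attention before a CTP script is written.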
Manufacturer Specific Private Tags
As discussed above, medical manufacturers include private elements in their DICOM images to convey information not defined in the DICOM Standard. This section documents the information we have gathered by reading appropriate conformance statements.
The sections below describe information by manufacturer. That information is encoded in files that describe the private elements created by those manufacturers. Those files are part of the run time environment of the Tag Sniffer and are maintained in our forge: https://mirgforge.wustl.edu/gf/project/dicomtagsniffer/scmsvn/?action=browse&path=%2Ftrunk%2Fdeploy%2Fprofiles%2Fdevice-profiles%2F
GE Medical Systems
collect and curate clinical and pre-clinical (animal studies) Radiology and Pathology images, clinical trial data (including patient demographics and clinical outcomes), annotations and image derived features and other types of clinical research data (e.g., gene expression profiles). Data comes from NIH programs, funded research and clinical trials. The ultimate goal of TCIA is to make the data publicly available to increase transparency and reproducibility in cancer imaging research. To facilitate this goal, TCIA provides data de-identification, curation and hosting services to remove these burdens from individual investigators and institutions.
This document is intended to provide details of TCIA's protocol for data collection, de-identification and curation so that submitting sites are comfortable with the protocol prior to agreeing to use the established procedures to accomplish these activities. The Department of Biomedical Informatics at UAMS hosts The Cancer Imaging Archive for the National Cancer Institute (NCI); therefore, this process is implemented under the supervision of the University of Arkansas for Medical Sciences (UAMS) Institutional Review Board (IRB # 205568).
Obtaining permission to share your data
The value of TCIA increases as we receive new data sharing proposals from the research community. Researchers with the following objectives are encouraged to submit an application to publish their data:
- Meeting the data sharing requirements set forth by the National Cancer Institute (NCI) for grant or contract awards
- Meeting the data sharing requirements set forth by a peer reviewed journal for publication
- Sharing data that could stimulate discoveries in emerging areas of cancer imaging research (e.g. radiogenomics, immunotherapy)
- Sharing data to be used in challenge competitions or for benchmarking and validating analysis techniques in image processing
We do not charge a fee for sharing your data through TCIA except in rare circumstances where proposals are extremely large. TCIA is funded by the National Cancer Institute; therefore, all applications must have relevance to cancer research. Applications are reviewed monthly by the TCIA Advisory Group to assess their utility to the TCIA user community. A strong preference is given to data sets which can be fully public and do not impose any special access or usage restrictions. Proposals which contain supporting non-image data (e.g. patient outcomes, training classifiers/labels, tumor segmentations) are highly preferred to those which lack these characteristics.
If approved, submitting sites must sign our TCIA UAMS Data Transfer Agreement before data collection is initiated. For cases where submitters are not legally permitted to allow commercial use of their data, please see the TCIA Non-Commercial Data Submission Agreement. NCI and TCIA strongly discourage prohibiting commercial use, as this significantly hinders those who wish to use TCIA data to translate research into clinical patient care. See our Data Usage Policies and Restrictions page for more information about the rules for downstream use of these data.
Data Curation Overview
All data is fully de-identified in accordance with international standards, US laws and UAMS IRB protocol requirements. All data is anonymized to the fullest extent possible at the submitting site, and then encrypted prior to transmission to UAMS. All incoming data is captured in a quarantine system and treated as if it contains PHI. All TCIA personnel are trained in HIPAA regulations and procedures. TCIA servers are managed by UAMS IT as if they were UAMS clinical systems. Once the full analysis and de-identification is complete, data is moved to a separate public repository and made available to the research community. This process has been reviewed by the UAMS Chief Security Officer.
Pathology Curation Steps
A TCIA submission expert will work with an Imaging point of contact from your site to receive your data. The submission expert will provide instructions to clean common locations where PHI might exist (e.g. file names, slide labels) and a link to upload the data into our secure UAMS Box system. Upon receipt, the slides are visually reviewed for burned-in PHI, incorrect labels and scan quality issues, and to confirm that filenames match labels. Metadata fields are also reviewed to ensure they contain no PHI. Slide types supported include: Aperio (.svs, .tif), Hamamatsu (.vms, .vmu, .ndpi), Leica (.scn), MIRAX (.mrxs), Philips (.tiff), Sakura (.svslide), Trestle (.tif), Ventana (.bif, .tif) and Generic tiled TIFF (.tif).
Radiology Curation Steps
A TCIA submission expert will work with an Imaging point of contact from your site. The expert will provide all the required tools for de-identifying and sending your imaging data and will answer any questions you have throughout the process. These tools have been approved by NIH and comply with the Digital Imaging and Communications in Medicine (DICOM) international standard for medical image de-identification. Following industry best practices, TCIA uses a standards-based approach to de-identification of DICOM images and non-image data to ensure that all data made publicly available are free of protected health information (PHI). The TCIA de-identification process ensures that the HIPAA de-identification standard is met by following the Safe Harbor Method as defined in section 164.514(b)(2) of the HIPAA Privacy Rule, utilizing the following steps:
- TCIA will help your technical point of contact (PACS administrator or designated IT technician, henceforth referred to as "submitter") install TCIA's software on a standard desktop computer. The software runs on regular Windows/Mac/Linux desktop computers and requires Java to also be installed. It does not require any specialized hardware (e.g. servers are not necessary).
- TCIA will help the submitter create mapping tables (which do not leave the submitting site) which our software will use to assign anonymous patient IDs and to offset study dates.
- TCIA will walk the submitter through software testing using a small sample set of their study data (e.g. 1-2 patients).
- TCIA will help the submitter export the full set of imaging studies from their local PACS (or wherever the data resides) into the TCIA software for processing.
- Note: Please do not use your PACS system's de-identification feature or other de-identification software, as this usually deletes critical information researchers will need to make use of the data.
- TCIA will help the submitter use the software to de-identify the images according to DICOM standards (Attribute Confidentiality Profile – DICOM PS 3.15: Appendix E) before they leave your institution, and then to transmit them to TCIA.
- TCIA quality control and curation staff will work with you to ensure the data are fully de-identified and received. Additional reviews are performed, and any remaining PHI that is found is deleted.
- TCIA will publish the final data set with a descriptive page and announce its addition via our mailing list and social media channels.
Software used by TCIA
- Clinical Trial Processor (CTP)
  - CTP is a stand-alone image processing application for DICOM (radiology) data. Our curation teams provide this tool to submitters in order to perform the initial de-identification of DICOM images before they are transferred to TCIA.
- CTP Wizard
  - CTP Wizard is a custom extension of CTP developed by TCIA. It provides an interactive graphical user interface which can be used to more easily walk submitters through the process of importing, de-identifying, and exporting their DICOM data.
- Posda
  - Posda is an open source application for the archival and curation of DICOM datasets. After receiving the data from submitters, our curation teams use Posda to perform additional quality checks and ensure all data was completely de-identified. It allows users to import DICOM data while tracking the date and time received. Users can also prioritize multiple data submission streams based on assigned priority, identify and resolve duplicate unique identifiers (UIDs) submitted with different image or metadata, and check and edit data for DICOM conformance, consistency, and referential integrity.
DICOM De-identification Details
Following industry best practices, TCIA uses a standards-based approach to de-identification of DICOM images to ensure that images are free of protected health information (PHI). The TCIA de-identification process ensures that the HIPAA de-identification standard is met by following the Safe Harbor Method as defined in section 164.514(b)(2) of the HIPAA Privacy Rule. The standard for de-identification of DICOM objects is defined by Attribute Confidentiality Profile – DICOM PS 3.15: Appendix E. At the submitting site, a DICOM PS 3.15 compliant script removes or modifies DICOM tags deemed to be unsafe (See table 1 for a complete listing). TCIA incorporates the “Basic Application Confidentiality Profile”, which is amended by inclusion of the following profile options: Clean Pixel Data Option, Clean Descriptors Option, Retain Longitudinal With Modified Dates Option, Retain Patient Characteristics Option, Retain Device Identity Option, and Retain Safe Private Option. The de-identification rules applied to each object are recorded by TCIA in the DICOM sequence De-identification Method Code Sequence (0012,0064) by entering the Code Value, Coding Scheme Designator, and Code Meaning for each profile and option that was applied to the DICOM object during de-identification. The DICOM standard for de-identification of objects defines a minimum set of elements to de-identify to be in compliance with the standard. It is up to the user doing the de-identification to ensure that PHI is removed or cleaned according to the laws and practices in place at the time de-identification occurs.
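As a sketch of what the de-identification method code sequence described above might contain, the following builds one item per applied profile or option. The code values are written from our recollection of the DCM codes in DICOM PS 3.16 and should be verified against the current standard before use. The dictionary keys and the function name `method_code_items` are illustrative assumptions, not TCIA's actual script.

```python
from typing import Dict, List

# Assumed DCM codes for the profile and options named in the text above.
# Verify against DICOM PS 3.16 before relying on them.
PROFILE_CODES: Dict[str, str] = {
    "113100": "Basic Application Confidentiality Profile",
    "113101": "Clean Pixel Data Option",
    "113105": "Clean Descriptors Option",
    "113107": "Retain Longitudinal Temporal Information Modified Dates Option",
    "113108": "Retain Patient Characteristics Option",
    "113109": "Retain Device Identity Option",
    "113111": "Retain Safe Private Option",
}


def method_code_items(codes: Dict[str, str]) -> List[Dict[str, str]]:
    """One sequence item per applied profile/option, in code order."""
    return [
        {
            "CodeValue": value,
            "CodingSchemeDesignator": "DCM",
            "CodeMeaning": meaning,
        }
        for value, meaning in sorted(codes.items())
    ]


for item in method_code_items(PROFILE_CODES):
    print(item["CodeValue"], item["CodingSchemeDesignator"], item["CodeMeaning"])
```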
Base level de-identification
The Basic Application Confidentiality Profile requires that Patient Name and Patient ID are either blanked or modified. TCIA incorporates an ID mapping between the original Patient ID and the ID that the images will have within TCIA. The mapping table is created at the image submitting site, the mapping is performed before the images leave the site's host computer, and TCIA never sees the original Patient ID. The remapped Patient ID is also written to the Patient Name field. This covers the case where a DICOM viewer or application used by the TCIA user who downloaded the data requires a Patient Name to be present. To show that the Patient Identity has been removed, the term “YES” is written into DICOM tag (0012,0062) “PatientIdentityRemoved”.
In general, the Basic Application Profile specifies removal or modification of any tag that by definition would contain PHI that could be used either alone or together with other information to uniquely identify a subject. Detailed geographic information, dates, exam identifiers, patient demographics, free text entry fields, vendor private tags, etc. are removed to minimize the possibility of uniquely identifying an individual. The options to the DICOM de-identification standard allow for retention of information that helps make the data scientifically valuable, but as more options are added the chance of retaining PHI increases, so a rigorous de-identification process must be followed.
Exam Identifiers - DICOM makes extensive use of unique identifiers (UIDs) that could be used to identify a subject if a user had access to the PACS system at the institution where the images originated. The Basic Application Confidentiality Profile requires that all UIDs be removed or modified. TCIA uses its own root UID, appends an 8-digit string in the form of xxxx.yyyy (where xxxx is related to the collection and yyyy is related to a submitting site) and then appends a hashed value of the original UID. UIDs have no special meaning other than serving as unique identifiers, and the only reason TCIA adds the 8-digit string is to minimize the possibility of two images being assigned the same UID, as images come from many different sites. This technique ensures that images stay associated with the appropriate series, study, and subject, and that references between secondary capture images, structured reports, PET/CT, etc. remain valid references to images within TCIA. Any image resubmitted to TCIA will have the same UID, which avoids the same image appearing twice with a different identifier. Original accession numbers are hashed with a 16 bit string to prevent linking of DICOM objects back to the submitting site.
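The Patient ID remapping described above can be sketched in a few lines. This assumes a dataset is a plain dictionary keyed by attribute keyword and that the mapping table is a simple dictionary held at the submitting site. The function name `remap_patient` and the sample IDs are hypothetical.

```python
from typing import Dict


def remap_patient(ds: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `ds` with the mapped ID in both identity fields."""
    new_id = mapping[ds["PatientID"]]  # KeyError if the patient is not in the table
    out = dict(ds)  # leave the original dataset untouched
    out["PatientID"] = new_id
    out["PatientName"] = new_id  # viewers may require a non-empty name
    out["PatientIdentityRemoved"] = "YES"  # tag (0012,0062)
    return out


# The mapping table stays at the submitting site; only mapped IDs leave it.
mapping = {"MRN-0042": "COLL-01-0001"}
original = {"PatientID": "MRN-0042", "PatientName": "DOE^JANE", "Modality": "CT"}
print(remap_patient(original, mapping))
```

Raising on an unmapped patient, rather than passing the original ID through, is a deliberate choice in this sketch so that a gap in the mapping table cannot leak an identifier.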
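The UID scheme described above (root, then a collection and site component, then a hash of the original UID) could be sketched as follows. The source does not say which hash function is used, so MD5 is an assumption here, as are the placeholder root `1.2.3.4`, the sample collection and site codes, and the function name `remap_uid`. The result is truncated to the 64-character limit that DICOM places on UIDs.

```python
import hashlib


def remap_uid(original_uid: str, root: str, collection: str, site: str) -> str:
    """Deterministically derive a replacement UID from the original one."""
    # Components should not have leading zeros, so use codes such as "1234".
    prefix = f"{root}.{collection}.{site}."
    if len(prefix) >= 64:
        raise ValueError("prefix leaves no room for the hashed component")
    digest = hashlib.md5(original_uid.encode("ascii")).hexdigest()
    hashed = str(int(digest, 16))  # decimal digits only, no leading zero
    return (prefix + hashed)[:64]  # DICOM UIDs are at most 64 characters


old_study_uid = "1.2.840.113619.2.55.3.1234567890.100"
new_uid = remap_uid(old_study_uid, root="1.2.3.4", collection="1234", site="5678")
print(new_uid)
```

Because the mapping is a pure function of the original UID, every object that references this study UID is rewritten to the same new value, and a resubmitted image receives the same UID as before.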
Dates - The Retain Longitudinal With Modified Dates Option allows dates to be retained as long as they are modified from the original date. Date and Date-Time fields in TCIA DICOM image headers have been offset based on a random number, but the longitudinal relationship between dates is maintained. Therefore, a researcher won’t know the precise date the scan occurred, but if a follow up scan was performed 120 days later, that same 120 day difference between scans of a subject will exist in the TCIA images. Dates that occur in DICOM tags other than Date or Date-Time fields are removed. An example of this would be a date entered into the Series Description field. If the date is associated with a library for Code Meaning then that date is preserved as the date would be required to look up the meaning in the correct version of the library. To show that the dates have been modified, the term “MODIFIED” is written into DICOM tag 00280303 “LongitudinalTemporalInformationModified”.
Optionally, a computed “Days from Baseline” value (where the baseline is, e.g., the diagnosis) can be inserted in the DICOM tag (0012,0050) Clinical Trial Time Point ID, with the associated tag (0012,0051) Clinical Trial Time Point Description set to “Days from Baseline”. “Baseline Year” (e.g. year of diagnosis) can optionally be inserted in DICOM tag (0013,1051).
Patient Demographics – The Retain Patient Characteristics Option allows keeping some patient demographics for research purposes. The allowed fields are Patient’s Sex, Patient’s Age, Patient’s Size, Patient’s Weight, Ethnic Group, Smoking Status, and Pregnancy Status. If a subject is over 90 years of age, then the age must be listed as 90+. Allergies, Patient State (this is not where they live, rather their condition), Pre-Medication, and Special Needs are defined by the DICOM standard as “clean” and are kept by TCIA and examined for PHI along with all tags during curation. Other patient demographics such as birthdate, address, religious affiliations, etc. are removed or emptied.
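The date handling described above can be illustrated with a minimal sketch. It assumes dates are DICOM DA strings (YYYYMMDD) and that each subject has one fixed offset in days stored in the site's mapping table. The function names and the offset of -37 days are hypothetical.

```python
from datetime import datetime, timedelta


def shift_da(value: str, offset_days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) by a fixed number of days."""
    shifted = datetime.strptime(value, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


def days_between(earlier: str, later: str) -> int:
    """Number of days from one DA value to another."""
    delta = datetime.strptime(later, "%Y%m%d") - datetime.strptime(earlier, "%Y%m%d")
    return delta.days


offset = -37  # hypothetical per-subject offset, reused for every date of that subject
baseline, follow_up = "20100115", "20100515"  # scans 120 days apart

shifted_baseline = shift_da(baseline, offset)
shifted_follow_up = shift_da(follow_up, offset)

print(shifted_baseline, shifted_follow_up)
print(days_between(shifted_baseline, shifted_follow_up))  # 120
```

Because the same offset is applied to every date of a subject, the interval between scans is unchanged, and the same `days_between` calculation could supply a "Days from Baseline" value.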
The names of health care providers including staff, hospital name, assigned IDs etc. are removed from the DICOM objects in cases where there is enough detail to identify an individual or facility where the scan was done.
Free Text - The Clean Descriptors Option allows for DICOM tags where free text could be entered by a technician to be kept. The following tags fall under that option and are all kept, inspected, and cleaned of PHI by TCIA during the curation process: Allergies, Patient State, Study Description, Series Description, Admitting Diagnoses Description, Admitting Diagnoses Code Sequence, Derivation Description, Identifying Comments, Medical Alerts, Occupation, Additional Patient’s History, Patient Comments, Contrast Bolus Agent, Protocol Name, Acquisition Device Processing Description, Acquisition Comments, Acquisition Protocol Description, Contribution Description, Image Comments, Frame Comments, Reason for Study, Requested Procedure Description, Requested Contrast Agent, Study Comments, Discharge Diagnosis Description, Service Episode Description, Visit Comments, Scheduled Procedure Step Description, Performed Procedure Step Description, Comments on Performed Procedure Step, Requested Procedure Comments, Reason for Imaging Service Request, Imaging Service Request Comments, Interpretation Text, Interpretation Diagnosis Description, Impressions, and Results Comments. The TCIA de-identification script run at the submitting sites removes the field “Request Attributes Sequence” as that tag typically contains PHI and provides no scientific value. Many of these fields contain information valuable to research and are important to retain. For images that are submitted with missing Series Descriptions, TCIA will add text to Series Descriptions to help researchers during TCIA image searches. 
When a missing series description is encountered, TCIA staff use the following approach:
- Enter “LOCALIZER” if the ImageType contains the word localizer.
- Enter “Contrast” followed by the value of Contrast Bolus Agent if a value is present.
- If Contrast Bolus Agent is missing or empty, examine other tags to see whether the series was scanned with contrast (sites often use the Image Comments field to denote contrast).
- If the image is an MR, map the Scanning Sequence parameters into the Series Description.
- If none of those conditions apply, map Scan Options into the Series Description or simply enter “none”.
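The fallback rules above can be sketched as a small function. This is only an illustration under stated assumptions, not TCIA's curation code: the attributes are held in a plain dict keyed by DICOM keyword, the rules are tried in the order listed and the first match wins, a space separates “Contrast” from the agent name, and a keyword match on Image Comments stands in for the curator's judgment about other tags (a real review would need to catch phrases such as "without contrast"). The function name is hypothetical.

```python
def _as_text(value):
    """Flatten a single or multi-valued attribute into one trimmed string."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return " ".join(str(v).strip() for v in value if str(v).strip())
    return str(value).strip()


def fallback_series_description(attrs):
    """Suggest a Series Description for a series that was submitted without one.

    `attrs` is a hypothetical dict keyed by DICOM keyword, e.g.
    {"Modality": "MR", "ScanningSequence": ["SE", "IR"]}.
    """
    # Rule 1: localizer images.
    if "LOCALIZER" in _as_text(attrs.get("ImageType")).upper():
        return "LOCALIZER"

    # Rule 2: a contrast agent is recorded.
    agent = _as_text(attrs.get("ContrastBolusAgent"))
    if agent:
        return f"Contrast {agent}"  # assumption: space-separated

    # Rule 3: other tags hint at contrast. Assumption: a naive keyword
    # match on Image Comments replaces the curator's manual review.
    if "contrast" in _as_text(attrs.get("ImageComments")).lower():
        return "Contrast"

    # Rule 4: MR images take their Scanning Sequence values.
    if _as_text(attrs.get("Modality")) == "MR":
        sequence = _as_text(attrs.get("ScanningSequence"))
        if sequence:
            return sequence

    # Rule 5: fall back to Scan Options, then to "none".
    return _as_text(attrs.get("ScanOptions")) or "none"
```

For example, `fallback_series_description({"Modality": "MR", "ScanningSequence": ["SE", "IR"]})` yields `"SE IR"`, while an empty dict yields `"none"`.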
Devices - The Retain Device Identity Option of the DICOM de-identification standard allows for the retention of information related to the scanner used. The option allows for the following relevant tags to be retained: Station Name, Device Serial Number, Device UID, Plate ID, Generator ID, Cassette ID, Gantry ID, Detector ID, Scheduled Study Location, Scheduled Study Location AE Title, Scheduled Station AE Title, Scheduled Station Name, Scheduled Procedure Step Location, Performed Station AE Title, Performed Station Name, Performed Station Name Code Sequence, Scheduled Station Name Code Sequence, Scheduled Station Geographic Location Code Sequence, and Performed Station Geographic Location Code Sequence. TCIA removes Station Name as part of its de-identification process as Station Name often contains information related to the site where the scan occurred. The other tags listed above are retained if they are found to be free of PHI after TCIA curation of the submitted DICOM objects.
Private Tags - Vendors use private tags extensively to store information about scans, and researchers sometimes need that information to make use of the data. Unfortunately, there are many cases where vendors do not make the conformance statement for a piece of equipment publicly available or do not adequately define what is stored in the private tags. When a submitting site sends DICOM data to TCIA, all private tags are retained and then de-identified by TCIA during curation according to the Retain Safe Private Option, which allows for the retention of DICOM tags stored in the private fields. TCIA uses a private tag dictionary maintained by the Posda curation toolkit to decide the disposition of a vendor-written private tag. The Posda private tag dictionary is a compilation of 4 well-known private tag dictionaries:
- TCIA De-identification Knowledge Base (DeID KB)
- Grassroots DICOM
- DICOM3tools
- DCMTK
The addition of the other 3 private tag dictionaries allows for an expanded set of scientifically useful tags to be retained. To implement the new Posda private tag dictionary, TCIA resolved any discrepancies between the 4 included dictionaries and assigned dispositions to all private tags ever seen by Posda. Unique values seen within the private tags were inspected to ensure that dispositions were correctly assigned. If a new private tag without a disposition is encountered in the Posda database, the tag's values are inspected alongside its description and a disposition is assigned. If there is no existing private tag description, an attempt is made to find a manufacturer’s definition of the tag. If no such description can be found, the disposition is set to remove the tag. TCIA will remove any private tags from the images that are not specified in the private tag dictionary or are defined as containing a form of PHI such as name, SSN, etc. All date and datetime private tags that are retained are offset using the same offset as applied to the standard tags for the image. All private tags containing UIDs are assigned a TCIA root and appended with a hashed value, as is done with the standard tags. This ensures all references to other images contained within TCIA are maintained. A manual inspection of all private tags is performed using tagSniffer reports, and any PHI that may be found is removed, emptied, date offset, or hashed as appropriate.
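The disposition lookup, date offsetting, and UID hashing described above can be illustrated with a short sketch. None of this is Posda's or TCIA's actual code: the SHA-256 digest, the placeholder root `1.2.3.999`, the `EXAMPLE VENDOR` dictionary entries, and all function names are assumptions made for illustration. The one behavior taken directly from the text is that a private tag with no dictionary entry is removed.

```python
import hashlib
from datetime import datetime, timedelta

# Placeholder UID root (assumption); a real archive would use its own root.
EXAMPLE_ROOT = "1.2.3.999"

# Hypothetical dictionary entries keyed by
# (group, private creator, low byte of the element number).
PRIVATE_DISPOSITIONS = {
    (0x0019, "EXAMPLE VENDOR", 0x0C): "keep",
    (0x0019, "EXAMPLE VENDOR", 0x20): "offset_date",
    (0x0029, "EXAMPLE VENDOR", 0x01): "hash_uid",
}


def private_disposition(group, creator, element):
    """Look up a private tag; anything not in the dictionary is removed."""
    return PRIVATE_DISPOSITIONS.get((group, creator, element & 0xFF), "remove")


def shift_date(da_value, offset_days):
    """Shift a DICOM DA value (YYYYMMDD) by a number of days."""
    shifted = datetime.strptime(da_value, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


def shift_datetime(dt_value, offset_days):
    """Shift the date part of a DICOM DT value; the time part is untouched."""
    return shift_date(dt_value[:8], offset_days) + dt_value[8:]


def hash_uid(uid, root=EXAMPLE_ROOT):
    """Map a UID to root + '.' + decimal digest, capped at 64 characters.

    The mapping is deterministic, so every reference to the same original
    UID is rewritten to the same new UID.
    """
    digest = hashlib.sha256(uid.encode("ascii")).hexdigest()
    suffix = str(int(digest, 16))  # decimal digits only, no leading zero
    return f"{root}.{suffix[:64 - len(root) - 1]}"
```

Because `hash_uid` depends only on its input, a Referenced SOP Instance UID in one object and the SOP Instance UID of the object it points to are rewritten to the same value, which is what keeps cross-references intact. Applying the same `offset_days` to standard and private date tags preserves the intervals between them.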
Body Part Examined - When images are made public, a single body part examined, corresponding to the cancer of interest, is assigned to all images. If the collection consists of sarcoma images (or any other cancer affecting multiple organs within the image collection), there may be multiple body parts assigned, though only one to any series. In phantom collections, body part examined is simply labeled “PHANTOM”.
All Tags - The TCIA de-identification process ensures that every DICOM tag of every DICOM object is free of the 18 forms of PHI as currently defined by the Safe Harbor Method. At the submitting site, a DICOM PS 3.15 compliant script removes or modifies DICOM tags deemed to be unsafe (see Table 1 for a complete listing). At TCIA, a software routine known as tagSniffer extracts every unique value found within a collection being curated and prints them to a report. This report is examined by curators, and any actions necessary to remove PHI are applied when moving the images from the Intake server to the Public Server. Every DICOM image is inspected by curators for burned-in PHI. Once the images reach the Public Server, the tags are inspected by two curators for PHI using new tagSniffer reports. Images are spot checked for any burned-in PHI.
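The idea behind a unique-value report can be sketched in a few lines. This is not tagSniffer itself, whose implementation is not described here; it is a minimal illustration that assumes each DICOM object has already been read into a plain dict of tag name to value, and the function names and tab-separated report layout are hypothetical.

```python
from collections import defaultdict


def unique_values_by_tag(datasets):
    """Collect every distinct value seen for each tag across a collection."""
    seen = defaultdict(set)
    for dataset in datasets:
        for tag, value in dataset.items():
            seen[tag].add(str(value))
    # Sort tags and values so the report is stable between runs.
    return {tag: sorted(values) for tag, values in sorted(seen.items())}


def format_report(unique):
    """Render one 'tag<TAB>value' line per distinct value for manual review."""
    return "\n".join(
        f"{tag}\t{value}" for tag, values in unique.items() for value in values
    )
```

Reducing a collection to its distinct values means a curator reads each value once, however many images carry it.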
The following table details the de-identification performed at the submitting site by way of a TCIA supplied de-identification script. All other tags not mentioned in the table below are reviewed and cleaned if necessary during our Posda curation.
Table 1 - DICOM Tags Modified or Removed at the source site (18 forms of PHI as currently defined by the Safe Harbor Method, DICOM PS 3.15 compliant)
Tag | Name | Action |
00080050 | AccessionNumber | hash |
00184000 | AcquisitionComments | keep |
00400555 | AcquisitionContextSeq | remove |
00080022 | AcquisitionDate | incrementdate |
0008002a | AcquisitionDatetime | incrementdate |
00181400 | AcquisitionDeviceProcessingDescription | keep |
00189424 | AcquisitionProtocolDescription | keep |
00080032 | AcquisitionTime | keep |
00404035 | ActualHumanPerformersSequence | remove |
001021b0 | AdditionalPatientHistory | keep |
00380010 | AdmissionID | remove |
00380020 | AdmittingDate | incrementdate |
00081084 | AdmittingDiagnosesCodeSeq | keep |
00081080 | AdmittingDiagnosesDescription | keep |
00380021 | AdmittingTime | keep |
00102110 | Allergies | keep |
40000010 | Arbitrary | remove |
0040a078 | AuthorObserverSequence | remove |
00130010 | BlockOwner | CTP |
00180015 | BodyPartExamined | BODYPART |
00101081 | BranchOfService | remove |
00280301 | BurnedInAnnotation | keep |
00181007 | CassetteID | keep |
00400280 | CommentsOnPPS | keep |
00209161 | ConcatenationUID | hashuid |
00403001 | ConfidentialityPatientData | remove |
00700086 | ContentCreatorsIdCodeSeq | remove |
00700084 | ContentCreatorsName | empty |
00080023 | ContentDate | incrementdate |
0040a730 | ContentSeq | remove |
00080033 | ContentTime | keep |
0008010d | ContextGroupExtensionCreatorUID | hashuid |
00180010 | ContrastBolusAgent | keep |
0018a003 | ContributionDescription | keep |
00102150 | CountryOfResidence | remove |
00089123 | CreatorVersionUID | hashuid |
00380300 | CurrentPatientLocation | remove |
00080025 | CurveDate | incrementdate |
Group | curves | remove |
00080035 | CurveTime | keep |
0040a07c | CustodialOrganizationSeq | remove |
fffcfffc | DataSetTrailingPadding | remove |
00181200 | DateofLastCalibration | incrementdate |
0018700c | DateofLastDetectorCalibration | incrementdate |
00181012 | DateOfSecondaryCapture | incrementdate |
00120063 | DeIdentificationMethod | {Per DICOM PS 3.15 AnnexE. Details in 0012,0064} |
00120064 | DeIdentificationMethodCodeSequence | 113100/113101/113105/113107/113108/113109/113111 |
00082111 | DerivationDescription | keep |
0018700a | DetectorID | keep |
00181000 | DeviceSerialNumber | keep |
00181002 | DeviceUID | keep |
fffafffa | DigitalSignaturesSeq | remove |
04000100 | DigitalSignatureUID | remove |
00209164 | DimensionOrganizationUID | hashuid |
00380040 | DischargeDiagnosisDescription | keep |
4008011a | DistributionAddress | remove |
40080119 | DistributionName | remove |
300a0013 | DoseReferenceUID | hashuid |
00102160 | EthnicGroup | keep |
00080058 | FailedSOPInstanceUIDList | hashuid |
0070031a | FiducialUID | hashuid |
00402017 | FillerOrderNumber | empty |
00209158 | FrameComments | keep |
00200052 | FrameOfReferenceUID | hashuid |
00181008 | GantryID | keep |
00181005 | GeneratorID | keep |
00700001 | GraphicAnnotationSequence | remove |
00404037 | HumanPerformersName | remove |
00404036 | HumanPerformersOrganization | remove |
00880200 | IconImageSequence | remove |
00084000 | IdentifyingComments | keep |
00204000 | ImageComments | keep |
00284000 | ImagePresentationComments | remove |
00402400 | ImagingServiceRequestComments | keep |
40080300 | Impressions | keep |
00080012 | InstanceCreationDate | incrementdate |
00080014 | InstanceCreatorUID | hashuid |
00080081 | InstitutionAddress | remove |
00081040 | InstitutionalDepartmentName | remove |
00080082 | InstitutionCodeSequence | remove |
00080080 | InstitutionName | remove |
00101050 | InsurancePlanIdentification | remove |
00401011 | IntendedRecipientsOfResultsIDSequence | remove |
40080111 | InterpretationApproverSequence | remove |
4008010c | InterpretationAuthor | remove |
40080115 | InterpretationDiagnosisDescription | keep |
40080202 | InterpretationIdIssuer | remove |
40080102 | InterpretationRecorder | remove |
4008010b | InterpretationText | keep |
4008010a | InterpretationTranscriber | remove |
00083010 | IrradiationEventUID | hashuid |
00380011 | IssuerOfAdmissionID | remove |
00100021 | IssuerOfPatientID | remove |
00380061 | IssuerOfServiceEpisodeId | remove |
00281214 | LargePaletteColorLUTUid | hashuid |
001021d0 | LastMenstrualDate | incrementdate |
00280303 | LongitudinalTemporalInformationModified | MODIFIED |
04000404 | MAC | remove |
00080070 | Manufacturer | keep |
00081090 | ManufacturerModelName | keep |
00102000 | MedicalAlerts | keep |
00101090 | MedicalRecordLocator | remove |
00101080 | MilitaryRank | remove |
04000550 | ModifiedAttributesSequence | remove |
00203406 | ModifiedImageDescription | remove |
00203401 | ModifyingDeviceID | remove |
00203404 | ModifyingDeviceManufacturer | remove |
00081060 | NameOfPhysicianReadingStudy | remove |
00401010 | NamesOfIntendedRecipientsOfResults | remove |
00102180 | Occupation | keep |
00081070 | OperatorName | remove |
00081072 | OperatorsIdentificationSeq | remove |
00402010 | OrderCallbackPhoneNumber | remove |
00402008 | OrderEnteredBy | remove |
00402009 | OrderEntererLocation | remove |
04000561 | OriginalAttributesSequence | remove |
00101000 | OtherPatientIDs | remove |
00101002 | OtherPatientIDsSeq | remove |
00101001 | OtherPatientNames | remove |
00080024 | OverlayDate | incrementdate |
Group | overlays | remove |
00080034 | OverlayTime | keep |
00281199 | PaletteColorLUTUID | hashuid |
0040a07a | ParticipantSequence | remove |
00101040 | PatientAddress | remove |
00101010 | PatientAge | keep |
00100030 | PatientBirthDate | empty |
00101005 | PatientBirthName | remove |
00100032 | PatientBirthTime | remove |
00104000 | PatientComments | keep |
00100020 | PatientID | Re-Mapped |
00120062 | PatientIdentityRemoved | YES |
00380400 | PatientInstitutionResidence | remove |
00100050 | PatientInsurancePlanCodeSeq | remove |
00101060 | PatientMotherBirthName | remove |
00100010 | PatientName | Re-Mapped |
00102154 | PatientPhoneNumbers | remove |
00100101 | PatientPrimaryLanguageCodeSeq | remove |
00100102 | PatientPrimaryLanguageModifierCodeSeq | remove |
001021f0 | PatientReligiousPreference | remove |
00100040 | PatientSex | keep |
00102203 | PatientSexNeutered | keep |
00101020 | PatientSize | keep |
00380500 | PatientState | keep |
00401004 | PatientTransportArrangements | remove |
00101030 | PatientWeight | keep |
00400243 | PerformedLocation | remove |
00400241 | PerformedStationAET | keep |
00404030 | PerformedStationGeoLocCodeSeq | keep |
00400242 | PerformedStationName | keep |
00404028 | PerformedStationNameCodeSeq | keep |
00081052 | PerformingPhysicianIdSeq | remove |
00081050 | PerformingPhysicianName | remove |
00400250 | PerformProcedureStepEndDate | incrementdate |
00401102 | PersonAddress | remove |
00401101 | PersonIdCodeSequence | remove |
0040a123 | PersonName | empty |
00401103 | PersonTelephoneNumbers | remove |
40080114 | PhysicianApprovingInterpretation | remove |
00081048 | PhysicianOfRecord | remove |
00081049 | PhysicianOfRecordIdSeq | remove |
00081062 | PhysicianReadingStudyIdSeq | remove |
00402016 | PlaceOrderNumberOfImagingServiceReq | empty |
00181004 | PlateID | keep |
00400254 | PPSDescription | keep |
00400253 | PPSID | remove |
00400244 | PPSStartDate | incrementdate |
00400245 | PPSStartTime | keep |
001021c0 | PregnancyStatus | keep |
00400012 | PreMedication | keep |
Group | privategroups | keep |
00131010 | ProjectName | always |
00181030 | ProtocolName | keep |
00540016 | Radiopharmaceutical Information Sequence | process |
00181078 | Radiopharmaceutical Start DateTime | incrementdate |
00181079 | Radiopharmaceutical Stop DateTime | incrementdate |
00402001 | ReasonForImagingServiceRequest | keep |
00321030 | ReasonforStudy | keep |
04000402 | RefDigitalSignatureSeq | remove |
30060024 | ReferencedFrameOfReferenceUID | hashuid |
00380004 | ReferencedPatientAliasSeq | remove |
00080092 | ReferringPhysicianAddress | remove |
00080090 | ReferringPhysicianName | empty |
00080094 | ReferringPhysicianPhoneNumbers | remove |
00080096 | ReferringPhysiciansIDSeq | remove |
00404023 | RefGenPurposeSchedProcStepTransUID | hashuid |
00081140 | RefImageSeq | remove |
00081120 | RefPatientSeq | remove |
00081111 | RefPPSSeq | remove |
00081150 | RefSOPClassUID | keep |
04000403 | RefSOPInstanceMACSeq | remove |
00081155 | RefSOPInstanceUID | hashuid |
00081110 | RefStudySeq | remove |
00102152 | RegionOfResidence | remove |
300600c2 | RelatedFrameOfReferenceUID | hashuid |
00400275 | RequestAttributesSeq | remove |
00321070 | RequestedContrastAgent | keep |
00401400 | RequestedProcedureComments | keep |
00321060 | RequestedProcedureDescription | keep |
00401001 | RequestedProcedureID | remove |
00401005 | RequestedProcedureLocation | remove |
00321032 | RequestingPhysician | remove |
00321033 | RequestingService | remove |
00102299 | ResponsibleOrganization | remove |
00102297 | ResponsiblePerson | remove |
40084000 | ResultComments | keep |
40080118 | ResultsDistributionListSeq | remove |
40080042 | ResultsIDIssuer | remove |
300e0008 | ReviewerName | remove |
00404034 | ScheduledHumanPerformersSeq | remove |
0038001e | ScheduledPatientInstitutionResidence | remove |
0040000b | ScheduledPerformingPhysicianIDSeq | remove |
00400006 | ScheduledPerformingPhysicianName | remove |
00400001 | ScheduledStationAET | keep |
00404027 | ScheduledStationGeographicLocCodeSeq | keep |
00400010 | ScheduledStationName | keep |
00404025 | ScheduledStationNameCodeSeq | keep |
00321020 | ScheduledStudyLocation | keep |
00321021 | ScheduledStudyLocationAET | keep |
00321000 | ScheduledStudyStartDate | incrementdate |
00080021 | SeriesDate | incrementdate |
0008103e | SeriesDescription | keep |
0020000e | SeriesInstanceUID | hashuid |
00080031 | SeriesTime | keep |
00380062 | ServiceEpisodeDescription | keep |
00380060 | ServiceEpisodeID | remove |
00131013 | SiteID | SITEID |
00131012 | SiteName | SITENAME |
001021a0 | SmokingStatus | keep |
00181020 | SoftwareVersion | keep |
00080018 | SOPInstanceUID | hashuid |
00082112 | SourceImageSeq | remove |
00380050 | SpecialNeeds | keep |
00400007 | SPSDescription | keep |
00400004 | SPSEndDate | incrementdate |
00400005 | SPSEndTime | keep |
00400011 | SPSLocation | keep |
00400002 | SPSStartDate | incrementdate |
00400003 | SPSStartTime | keep |
00081010 | StationName | remove |
00880140 | StorageMediaFilesetUID | hashuid |
30060008 | StructureSetDate | incrementdate |
00321040 | StudyArrivalDate | incrementdate |
00324000 | StudyComments | keep |
00321050 | StudyCompletionDate | incrementdate |
00080020 | StudyDate | incrementdate |
00081030 | StudyDescription | keep |
00200010 | StudyID | empty |
00320012 | StudyIDIssuer | remove |
0020000d | StudyInstanceUID | hashuid |
00080030 | StudyTime | keep |
00200200 | SynchronizationFrameOfReferenceUID | hashuid |
0040db0d | TemplateExtensionCreatorUID | hashuid |
0040db0c | TemplateExtensionOrganizationUID | hashuid |
40004000 | TextComments | remove |
20300020 | TextString | remove |
00080201 | TimezoneOffsetFromUTC | remove |
00880910 | TopicAuthor | remove |
00880912 | TopicKeyWords | remove |
00880906 | TopicSubject | remove |
00880904 | TopicTitle | remove |
00081195 | TransactionUID | hashuid |
00131011 | TrialName | PROJECTNAME |
0040a124 | UID | hashuid |
Group | unspecifiedelements | keep |
0040a088 | VerifyingObserverIdentificationCodeSeq | remove |
0040a075 | VerifyingObserverName | empty |
0040a073 | VerifyingObserverSequence | remove |
0040a027 | VerifyingOrganization | remove |
00384000 | VisitComments | keep |
...
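To show how the common actions in Table 1 might be applied, here is a minimal sketch. It is not the TCIA-supplied script. It assumes a dataset held as a dict of tag string to value, copies only a handful of rows from the table, and covers only the `keep`, `remove`, `empty`, `incrementdate`, `hash`, and `hashuid` actions. The SHA-256 digest, the 16-character truncation for `hash`, and the function names are assumptions. Tags not listed default to `keep`, following the `unspecifiedelements` row.

```python
import hashlib
from datetime import datetime, timedelta

# A few rows copied from Table 1 (tag -> action).
ACTIONS = {
    "00080050": "hash",           # AccessionNumber
    "00080020": "incrementdate",  # StudyDate
    "00080080": "remove",         # InstitutionName
    "00100030": "empty",          # PatientBirthDate
    "0020000d": "hashuid",        # StudyInstanceUID
    "00081030": "keep",           # StudyDescription
}


def _digest(value):
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def deidentify(dataset, offset_days, uid_root):
    """Return a new dict with the table's actions applied to each tag."""
    result = {}
    for tag, value in dataset.items():
        action = ACTIONS.get(tag.lower(), "keep")
        if action == "remove":
            continue
        if action == "empty":
            result[tag] = ""
        elif action == "incrementdate" and value:
            # Shift the YYYYMMDD part; any trailing time part is preserved.
            day = datetime.strptime(value[:8], "%Y%m%d") + timedelta(days=offset_days)
            result[tag] = day.strftime("%Y%m%d") + value[8:]
        elif action == "hash":
            result[tag] = _digest(value)[:16]  # assumption: 16 hex characters
        elif action == "hashuid":
            suffix = str(int(_digest(value), 16))
            result[tag] = f"{uid_root}.{suffix[:64 - len(uid_root) - 1]}"
        else:
            result[tag] = value  # keep
    return result
```

Site-specific actions in the table such as `Re-Mapped`, `BODYPART`, `SITEID`, and `process` depend on configuration that is not described here, so they are left out of the sketch.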