Summary
This data collection is an updated and standardized version of the Digital Database for Screening Mammography (DDSM) for evaluation of future CADx and CADe systems (sometimes referred to generally as CAD) research in mammography. This data collection includes decompressed images, data selection and curation by a trained mammographer, updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data.
Published research results from work on decision support systems in mammography are difficult to replicate because there is no standard evaluation data set; most computer-aided diagnosis (CADx) and detection (CADe) algorithms for breast cancer in mammography are evaluated on private data sets or on unspecified subsets of public databases. As a result, the performance of methods cannot be compared directly and prior results cannot be replicated. We seek to resolve this substantial challenge by releasing an updated and standardized version of the Digital Database for Screening Mammography (DDSM) for evaluation of future CADx and CADe systems (sometimes referred to generally as CAD) research in mammography. Our dataset includes decompressed images, data selection and curation by a trained mammographer, updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data, formatted similarly to modern computer vision data sets.
Few well-curated public datasets have been provided for the mammography community. These include the Digital Database for Screening Mammography (DDSM), the Mammographic Imaging Analysis Society (MIAS) database, and the Image Retrieval in Medical Applications (IRMA) project. Although these public data sets are useful, they are limited in terms of data set size and accessibility. For example, most researchers using the DDSM do not leverage all its images for a variety of historical reasons. When the database was released in 1997, computational resources to process hundreds or thousands of images were not widely available. Additionally, the DDSM images are saved in non-standard compression files that require the use of decompression code that has not been updated or maintained for modern computers. Finally, the ROI annotations for the abnormalities in the DDSM were provided to indicate a general position of lesions, but not a precise segmentation for them. Therefore, many researchers must implement segmentation algorithms for accurate feature extraction.
While there are substantial challenges in using the DDSM for method evaluation, due to its size and other unique characteristics, we believe that it can still be a powerful resource for imaging research. The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database along with ground truth validation makes the DDSM a useful tool in the development and testing of decision support systems.
We release a curated subset of DDSM that we call the CBIS-DDSM (Curated Breast Imaging Subset of DDSM), which provides easily accessible data and improved ROI segmentation. We believe that this resource will contribute to the advancement of decision support system research in mammography. Furthermore, we hope that a modern, standardized mammography data distribution format will encourage others to follow suit and contribute to open research. We also recognize that modern mammography is now digital; once large collections of digital mammograms are available, a methodology similar to the one we used with DDSM could be applied to enhance CBIS-DDSM.
For scientific inquiries about this dataset, please contact Dr. Daniel Rubin, Department of Biomedical Data Science, Radiology, and Medicine, Stanford University School of Medicine (firstname.lastname@example.org). A manuscript describing the dataset in detail is under review in Scientific Data and will be linked here when published.
Choosing the Download option will provide you with a file to launch the TCIA Download Manager to download the entire collection. If you want to browse or filter the data to select only specific scans/studies please use the Search option.
Number of Patients
Number of Studies
Number of Series
Number of Images
Image Size (GB)
The CBIS-DDSM was created from DDSM by undertaking the following specific procedures:
1) Removal of questionable mass cases
Not all DDSM ROI annotations include suspicious lesions. Due to this issue, we acquired the assistance of a trained mammographer who reviewed the questionable cases. In this process, we found 254 images in which a mass was not clearly seen. These images were removed from the final data set.
2) Image Decompression
DDSM images are distributed as lossless JPEG (LJPEG) files, an obsolete image format. The only library capable of decompressing these images is the Stanford PVRG-JPEG Codec v1.1, which was last updated in 1993. We modified the PVRG-JPEG codec to compile successfully on OS X 10.10.5 (Yosemite) using Apple GCC clang-602.0.53. The decompression code outputs data as 8-bit raw binary bitmaps. We wrote Python tools to read this raw data and store it as 16-bit gray scale TIFF files. These files were later converted to DICOM.
This process is entirely lossless and preserved all information from the original DDSM files.
3) Image Processing
The original DDSM files were distributed with a set of bash and C tools for Linux to perform image correction and metadata processing. These tools were very difficult to refactor for use on modern systems. Instead, we re-implemented these tools in Python to be cross-platform and easy to understand for modern users. All images in the DDSM were derived from several different scanners at different institutions. The DDSM data descriptions provide methods to convert raw pixel data into 64-bit optical density values, which are standardized across all images. Optical density values were then re-mapped to 16-bit gray scale TIFF files. The DDSM automatically clips optical density values to be between 0.05 and 3.0 for noise reduction. We perform this clipping as well, but provide a flag to remove the clipping and retain the original optical density values.
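The clipping and re-mapping described above can be sketched as follows. This is a minimal illustration rather than the released tooling: the function names, the linear calibration form, and the choice to map the clipping range linearly onto the full 16-bit range are assumptions. The DDSM documentation defines the actual scanner-specific conversions, which are not reproduced here.

```python
OD_MIN, OD_MAX = 0.05, 3.0  # clipping range stated in the text


def gray_to_optical_density(gray, slope, intercept):
    """Hypothetical linear calibration: OD = intercept + slope * gray.

    Real DDSM calibrations are scanner-specific; the coefficients passed
    in here are placeholders, not values from the DDSM documentation.
    """
    return intercept + slope * gray


def clip_optical_density(od_values, clip=True):
    """Clip optical densities to [OD_MIN, OD_MAX]; clip=False keeps them as-is."""
    if not clip:
        return list(od_values)
    return [min(max(v, OD_MIN), OD_MAX) for v in od_values]


def remap_to_uint16(od_values, lo=OD_MIN, hi=OD_MAX):
    """Linearly map [lo, hi] onto [0, 65535]; values outside the range saturate."""
    out = []
    for v in od_values:
        frac = (v - lo) / (hi - lo)
        frac = min(max(frac, 0.0), 1.0)
        out.append(round(frac * 65535))
    return out


# Example with made-up optical densities.
clipped = clip_optical_density([0.0, 1.0, 3.5])
pixels = remap_to_uint16(clipped)
```

Passing `clip=False` corresponds to the flag mentioned above for retaining the original optical density values; a caller would then need to choose its own `lo` and `hi` for the re-mapping.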
4) Image Cropping
Several CAD tasks require only analyzing abnormalities (the portion of the image in the ROI) without needing the full mammogram image. We provide a set of convenience images, which are focused crops of abnormalities. Abnormalities were cropped by determining the bounding rectangle of the abnormality with respect to its ROI. We create square crops by extending the shorter edge of the rectangle to be the same size as the long edge. The centroid of the abnormality is located in the center of these square crops.
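The square-crop procedure can be illustrated with a short sketch. It assumes the ROI is available as a binary mask (a list of rows) and grows the shorter edge symmetrically around the bounding-box center, which only approximates centering on the abnormality's centroid. The function names and the border handling (shifting the square back inside the image) are assumptions, not details from the released tools.

```python
def bounding_box(mask):
    """Return (top, bottom, left, right) of the nonzero region, ends exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def square_crop_box(mask):
    """Extend the shorter edge of the bounding box to match the longer edge."""
    top, bottom, left, right = bounding_box(mask)
    height, width = bottom - top, right - left
    side = max(height, width)
    # Grow the shorter edge symmetrically about the bounding-box center.
    top -= (side - height) // 2
    left -= (side - width) // 2
    # Assumption: shift the square back inside the image if it overhangs.
    n_rows, n_cols = len(mask), len(mask[0])
    top = max(0, min(top, n_rows - side))
    left = max(0, min(left, n_cols - side))
    return top, top + side, left, left + side


def crop(image, box):
    """Slice the image with a (top, bottom, left, right) box."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]
```

If the image is smaller than the required square along one axis, the slice in `crop` simply truncates at the image border, so the result is no longer square in that case.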
5) Updating for precision segmentation
Mass margin and shape have long been proven to be significant indicators for diagnosis in mammography. Because of this, many methods are based on developing mathematical descriptions of the tumor outline. Due to the dependence of these methods on accurate ROI segmentation and the imprecise nature of many of the DDSM-provided annotations, we applied a lesion segmentation algorithm (described below) that is initialized by the general, original DDSM contours but is able to supply much more accurate ROIs. This was done only for masses and not calcifications. Lesion segmentation was accomplished by applying a modification to the local level set framework as presented in Chan and Vese [11]. Level set models follow a non-parametric deformable model and can therefore handle topological changes during evolution [11]. The Chan-Vese model is a region-based method that estimates spatial statistics of image regions and finds a minimal energy where the model best fits the image, resulting in convergence of the contour towards the desired object. Our modification of the local framework includes automated evaluation of the local region surrounding each contour point. For low-contrast lesions, a small local region is determined, which prevents excessive curve evolution. For noisy or heterogeneous lesions, on the other hand, a relatively large local region is assigned to the contour point to prevent convergence of the level set contour into local minima. Local frameworks require an initialization of the contour, so in our case the original DDSM annotation was used as the level set segmentation initialization.
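To make the region-based idea concrete, the sketch below implements only the data-fitting term of the global, piecewise-constant Chan-Vese model: it alternates between estimating the mean intensity inside and outside the current region and reassigning each pixel to the closer mean. It is not the localized modification described above, and it omits the contour-length regularization and the level set representation entirely. The function names are hypothetical, and the rough initial mask stands in for the original DDSM annotation.

```python
def region_means(image, mask):
    """Mean intensity inside and outside the mask, or None if a region is empty."""
    inside, outside = [], []
    for img_row, mask_row in zip(image, mask):
        for pixel, is_inside in zip(img_row, mask_row):
            (inside if is_inside else outside).append(pixel)
    if not inside or not outside:
        return None
    return sum(inside) / len(inside), sum(outside) / len(outside)


def fit_regions(image, init_mask, max_iter=50):
    """Iterate the Chan-Vese fitting term only (no curvature/length penalty)."""
    mask = [[bool(m) for m in row] for row in init_mask]
    for _ in range(max_iter):
        means = region_means(image, mask)
        if means is None:  # one region vanished; nothing left to fit
            break
        c1, c2 = means
        # A pixel belongs inside when it is closer to the inside mean.
        new_mask = [[(p - c1) ** 2 < (p - c2) ** 2 for p in row] for row in image]
        if new_mask == mask:  # converged
            break
        mask = new_mask
    return mask


# Toy example: a bright 3x3 "lesion" on a dark 5x5 background,
# initialized with a deliberately loose 4x4 outline.
image = [[10 if 1 <= r <= 3 and 1 <= c <= 3 else 1 for c in range(5)] for r in range(5)]
rough = [[1 <= r <= 4 and 1 <= c <= 4 for c in range(5)] for r in range(5)]
refined = fit_regions(image, rough)
```

Because this global version uses one pair of means for the whole image, it would struggle on the low-contrast or heterogeneous lesions that motivate the local framework in the text.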
6) Standardized Train/Test splits
The data were split into a training set and a testing set based on the BI-RADS category. This allows for an appropriate stratification for researchers working on CADe as well as CADx. The split was obtained using 20% of the cases for testing and the rest for training. The data were split for all mass cases and all calcification cases separately. Here “case” is used to indicate a particular abnormality, seen on both the CC and MLO views.
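A stratified 80/20 split of this kind could be sketched as below. The record layout (`id`, `birads`), the seed, and the rounding rule are hypothetical; the released train/test assignment is fixed by the collection itself and should be used for comparable results. Splitting at the level of the abnormality keeps its CC and MLO views in the same set.

```python
import random
from collections import defaultdict


def stratified_split(cases, key, test_fraction=0.2, seed=0):
    """Split cases into (train, test), sampling test_fraction from each stratum."""
    groups = defaultdict(list)
    for case in cases:
        groups[key(case)].append(case)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical mass cases: one record per abnormality, not per image.
mass_cases = [{"id": i, "birads": i % 5 + 1} for i in range(50)]
train, test = stratified_split(mass_cases, key=lambda c: c["birads"])
```

Following the text, the same function would be applied separately to the mass cases and to the calcification cases.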
Citations & Data Usage Policy
This collection is freely available to browse, download, and use for commercial, scientific and educational purposes as outlined in the Creative Commons Attribution 3.0 Unported License. See TCIA's Data Usage Policies and Restrictions for additional details. Questions may be directed to email@example.com.
Please be sure to include the following citations in your work if you use this data set:
(2016). Curated Breast Imaging Subset of DDSM. The Cancer Imaging Archive. http://dx.doi.org/10.7937/K9/TCIA.2016.7O02S9CY
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. (paper)
Other Publications Using This Data
TCIA maintains a list of publications which leverage our data. At this time we are not aware of any publications based on this data. If you have a publication you'd like to add please contact the TCIA Helpdesk.