Number of Subjects
Number of Studies
Number of Series
Number of Images
|Image Size (GB)||163.6|
* The image data for this collection is structured such that each subject has multiple patient IDs. For example, subject 00038 has 10 separate patient IDs, each of which encodes information about the scans it contains (e.g. Calc-Test_P_00038_LEFT_CC, Calc-Test_P_00038_RIGHT_CC_1). According to the DICOM metadata there appear to be 6,671 subjects, but the cohort contains only 1,566 actual subjects.
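The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not a tool shipped with the collection: it assumes every patient ID contains the subject number as the digits following `_P_`, which is inferred only from the two example IDs in the note. The third ID in the sample list is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the subject number is the run of digits after "_P_",
# as in "Calc-Test_P_00038_LEFT_CC".
SUBJECT_PATTERN = re.compile(r"_P_(\d+)")


def group_by_subject(patient_ids):
    """Map each subject number to the DICOM patient IDs that refer to it."""
    groups = defaultdict(list)
    for pid in patient_ids:
        match = SUBJECT_PATTERN.search(pid)
        if match is None:
            raise ValueError(f"no subject number found in patient ID: {pid!r}")
        groups[match.group(1)].append(pid)
    return dict(groups)


ids = [
    "Calc-Test_P_00038_LEFT_CC",
    "Calc-Test_P_00038_RIGHT_CC_1",
    "Mass-Training_P_00001_LEFT_MLO",  # hypothetical ID, for illustration only
]
subjects = group_by_subject(ids)
print(len(ids), "patient IDs map to", len(subjects), "subjects")
```

Counting the keys of the returned dictionary, rather than the distinct DICOM patient IDs, gives the number of actual subjects.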
The CBIS-DDSM contributors have provided the following additional options for subset download.
|Mass-Training Full Mammogram Images (DICOM)|
|Mass-Training ROI and Cropped Images (DICOM)|
|Calc-Training Full Mammogram Images (DICOM)|
|Calc-Training ROI and Cropped Images (DICOM)|
|Mass-Test Full Mammogram Images (DICOM)|
|Mass-Test ROI and Cropped Images (DICOM)|
|Calc-Test Full Mammogram Images (DICOM)|
|Calc-Test ROI and Cropped Images (DICOM)|
The CBIS-DDSM was created from DDSM by undertaking the following specific procedures:
1) Removal of questionable mass cases
Not all DDSM ROI annotations include suspicious lesions. Due to this issue, a trained mammographer reviewed the questionable cases. In this process, 254 images were identified in which a mass was not clearly seen. These images were removed from the final data set.
2) Image Decompression
DDSM images are distributed as lossless JPEG (LJPEG) files, an obsolete image format. The only library capable of decompressing these images is the Stanford PVRG-JPEG Codec v1.1, which was last updated in 1993. To address this, the PVRG-JPEG codec was modified to compile successfully on OS X 10.10.5 (Yosemite) using Apple GCC clang-602.0.53. The decompression code outputs data as 8-bit raw binary bitmaps. Python tools were developed to read this raw data and store it as 16-bit grayscale TIFF files. These files were later converted to DICOM.
This process is entirely lossless and preserves all information from the original DDSM files.
3) Image Processing
The original DDSM files were distributed with a set of bash and C tools for Linux that perform image correction and metadata processing. These tools were very difficult to refactor for use on modern systems. To address this, the tools were re-implemented in Python to be cross-platform and easy for modern users to understand. The images in the DDSM were acquired on several different scanners at different institutions. The DDSM data descriptions provide methods to convert raw pixel data into 64-bit optical density values, which are standardized across all images. Optical density values were then re-mapped to 16-bit grayscale TIFF files. The DDSM automatically clips optical density values to the range 0.05 to 3.0 for noise reduction. This clipping occurs in the CBIS-DDSM as well, but the new tools provide a flag to remove the clipping and retain the original optical density values.
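The clipping and 16-bit re-mapping steps can be sketched as follows. This is an illustration under stated assumptions, not the CBIS-DDSM code: the function names, the `clip` parameter standing in for the flag mentioned above, and the linear mapping onto 0..65535 are all hypothetical. The scanner-specific conversion from raw pixels to optical density is not reproduced here, so the example starts from optical density values.

```python
OD_MIN, OD_MAX = 0.05, 3.0  # clipping range stated in the text


def clip_optical_density(od_values, clip=True, od_min=OD_MIN, od_max=OD_MAX):
    """Clamp optical densities to [od_min, od_max]; pass them through if clip is False."""
    if not clip:
        return list(od_values)
    return [min(max(od, od_min), od_max) for od in od_values]


def to_uint16(od_values, lo, hi):
    """Linearly rescale values in [lo, hi] to integers in 0..65535.

    Assumption: a plain linear mapping. The actual mapping used for the
    collection is not specified in the text.
    """
    scale = 65535 / (hi - lo)
    return [round((od - lo) * scale) for od in od_values]


od = [0.01, 0.05, 1.5, 3.0, 3.4]  # made-up optical densities
clipped = clip_optical_density(od)
gray = to_uint16(clipped, OD_MIN, OD_MAX)
unclipped = clip_optical_density(od, clip=False)
print(clipped)
print(gray)
```

With clipping enabled, values below 0.05 and above 3.0 collapse onto the ends of the 16-bit range. With `clip=False` the original values are kept, and the caller has to choose a suitable range before re-mapping.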
4) Image Cropping
Several CAD tasks require analyzing only the abnormalities (the portion of the image in the ROI) and do not need the full mammogram image. A set of convenience images is therefore also provided, consisting of focused crops of abnormalities. Abnormalities were cropped by determining the bounding rectangle of the abnormality with respect to its ROI. The square crops were created by extending the shorter edge of the rectangle to the length of the longer edge. The centroid of the abnormality is located in the center of these square crops.
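A minimal sketch of the square-crop geometry, assuming the ROI is available as a 2D binary mask stored as nested lists. The function names are hypothetical. The sketch centers the square on the bounding rectangle and shifts it back inside the image at the borders. Both choices are assumptions, since the text does not say how border cases or off-center centroids were handled.

```python
def bounding_box(mask):
    """Return inclusive (top, left, bottom, right) of the nonzero pixels in a 2D mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no abnormality pixels")
    return rows[0], cols[0], rows[-1], cols[-1]


def square_crop_box(mask):
    """Extend the shorter edge of the bounding rectangle to match the longer one.

    Returns a half-open box (top, left, bottom, right) suitable for slicing.
    """
    top, left, bottom, right = bounding_box(mask)
    height = bottom - top + 1
    width = right - left + 1
    side = max(height, width)
    # Pad the shorter dimension on both sides; any odd pixel goes to the far side.
    top -= (side - height) // 2
    left -= (side - width) // 2
    # Assumption: shift the square back inside the image when it crosses a border.
    n_rows, n_cols = len(mask), len(mask[0])
    top = min(max(top, 0), max(n_rows - side, 0))
    left = min(max(left, 0), max(n_cols - side, 0))
    return top, left, top + side, left + side


mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
box = square_crop_box(mask)
top, left, bottom, right = box
crop = [row[left:right] for row in mask[top:bottom]]
```

Here the 2 x 4 bounding rectangle grows to a 4 x 4 square. If the longer edge exceeds the image size, slicing simply truncates the crop.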
5) Updating for precision segmentation
Mass margin and shape have long been established as significant indicators for diagnosis in mammography. Because of this, many methods are based on developing mathematical descriptions of the tumor outline. These methods depend on accurate ROI segmentation, and many of the DDSM-provided annotations are imprecise, so a lesion segmentation algorithm (described below) was applied. It is initialized with the general, original DDSM contours but is able to supply much more accurate ROIs. This was done only for masses and not for calcifications.
Lesion segmentation was accomplished by applying a modification of the local level set framework presented in Chan and Vese [11]. Level set models follow a non-parametric deformable model and can thus handle topological changes during evolution [11]. The Chan-Vese model is a region-based method that estimates spatial statistics of image regions and finds a minimal energy where the model best fits the image, resulting in convergence of the contour towards the desired object. The modification of the local framework includes automated evaluation of the local region surrounding each contour point. For low-contrast lesions, a small local region is chosen, which prevents excessive curve evolution. For noisy or heterogeneous lesions, a relatively large local region is assigned to the contour point to prevent convergence of the level set contour into local minima. Local frameworks require an initialization of the contour, so the original DDSM annotation was used to initialize the level set segmentation.
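For orientation, the energy minimized by the standard (global) Chan-Vese model is reproduced below. Here $u_0$ is the image, $C$ the evolving contour, $c_1$ and $c_2$ the mean intensities inside and outside $C$, and $\mu, \nu, \lambda_1, \lambda_2 \ge 0$ are weighting parameters. This is the textbook formulation only. The localized variant described above replaces the global region statistics with statistics computed in a neighborhood of each contour point, and its exact form and parameter values are not given here.

```latex
F(c_1, c_2, C) =
    \mu \,\mathrm{Length}(C)
  + \nu \,\mathrm{Area}\bigl(\mathrm{inside}(C)\bigr)
  + \lambda_1 \int_{\mathrm{inside}(C)}  \lvert u_0(x, y) - c_1 \rvert^2 \,dx\,dy
  + \lambda_2 \int_{\mathrm{outside}(C)} \lvert u_0(x, y) - c_2 \rvert^2 \,dx\,dy
```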
6) Standardized Train/Test splits
The data were split into a training set and a testing set based on the BI-RADS category. This allows for an appropriate stratification for researchers working on CADe as well as CADx. The split was obtained using 20% of the cases for testing and the rest for training. The data were split for all mass cases and all calcification cases separately. Here “case” is used to indicate a particular abnormality, seen on both the CC and MLO views.
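A stratified 80/20 split of this kind might look like the sketch below. It is not the procedure used to produce the published splits: the function name, the random seed, the rounding rule, and the case records are all assumptions for illustration. Splitting at the case level keeps the CC and MLO views of the same abnormality in the same set.

```python
import random
from collections import defaultdict


def stratified_split(cases, key, test_fraction=0.2, seed=0):
    """Split cases into (train, test), holding out test_fraction of each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)
    train, test = [], []
    for label in sorted(strata):
        members = strata[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical case records: (case identifier, BI-RADS category).
categories = [2] * 10 + [3] * 20 + [4] * 50 + [5] * 20
cases = [(f"case_{i:03d}", birads) for i, birads in enumerate(categories)]

train, test = stratified_split(cases, key=lambda case: case[1])
print(len(train), "training cases,", len(test), "test cases")
```

Mass and calcification cases would each be passed through such a function separately, matching the description above.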