This data collection is an updated and standardized version of the Digital Database for Screening Mammography (DDSM), intended for evaluating future computer-aided diagnosis (CADx) and detection (CADe) systems (sometimes referred to collectively as CAD) in mammography research. It includes decompressed images, data selection and curation by a trained mammographer, updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data. |
Published research results on decision support systems in mammography are difficult to replicate because no standard evaluation dataset exists; most computer-aided diagnosis (CADx) and detection (CADe) algorithms for breast cancer in mammography are evaluated on private datasets or on unspecified subsets of public databases. As a result, the performance of different methods cannot be compared directly, and prior results cannot be replicated. We seek to resolve this substantial challenge by releasing an updated and standardized version of the Digital Database for Screening Mammography (DDSM) for evaluating future CADx and CADe systems (sometimes referred to collectively as CAD) in mammography research. Our dataset includes decompressed images, data selection and curation by a trained mammographer, updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data, formatted similarly to modern computer vision datasets.
Few well-curated public datasets have been provided for the mammography community. These include the Digital Database for Screening Mammography (DDSM), the Mammographic Imaging Analysis Society (MIAS) database, and the Image Retrieval in Medical Applications (IRMA) project. Although these public datasets are useful, they are limited in size and accessibility. For example, for a variety of historical reasons, most researchers using the DDSM do not use all of its images. When the database was released in 1997, the computational resources needed to process hundreds or thousands of images were not widely available. Additionally, the DDSM images are stored in a non-standard compressed format, and the decompression code they require has not been updated or maintained for modern computers. Finally, the ROI annotations for the abnormalities in the DDSM indicate only the general position of lesions, not a precise segmentation of them. Therefore, many researchers must implement segmentation algorithms to extract features accurately.
Although using the DDSM for method evaluation poses substantial challenges, we believe that its size and other unique characteristics still make it a powerful resource for imaging research. The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database, together with its ground truth validation, makes the DDSM a useful tool for developing and testing decision support systems.
We release a curated subset of the DDSM, which we call CBIS-DDSM (Curated Breast Imaging Subset of DDSM), that provides easily accessible data and improved ROI segmentation. We believe that this resource will contribute to the advancement of decision support system research in mammography. Furthermore, we hope that a modern, standardized mammography data distribution format will encourage others to follow suit and contribute to open research. We also recognize that modern mammography is now digital; once large collections of digital mammograms are available, a methodology similar to the one we used with the DDSM could be applied to enhance CBIS-DDSM.
For scientific inquiries about this dataset, please contact Dr. Daniel Rubin, Department of Biomedical Data Science, Radiology, and Medicine, Stanford University School of Medicine (dlrubin@stanford.edu). A manuscript describing the dataset in detail is under review in Scientific Data and will be linked here when published.
|