
I Built a Breast Cancer Detection System End-to-End. Here’s What I Actually Learned. (Part 1: Data & Pipeline)

This isn’t a tutorial. It’s a breakdown of every decision, mistake, and insight from building a real ML pipeline on 300GB+ of raw mammography data, and why the hardest problems had nothing to do with the model.

This is Part 1 of a 2-part series. Part 1 covers problem framing, dataset design, class taxonomy, and the data pipeline. Part 2 covers modeling, hyperparameter tuning, the semi-supervised training loop, metrics, and Grad-CAM explainability.

TL;DR (read this if you’re short on time)

  • Dataset: VinDr-Mammo (PhysioNet), ~5,000 studies, 20,000+ images, ~300GB raw DICOM from a Vietnamese hospital system
  • Framed as object detection, not classification — localization matters clinically
  • 11 finding classes, including Spiculation, a manually annotated subclass we introduced after researching clinical literature
  • Biggest data challenge: study-level splits, annotation noise, and 8:1 class imbalance
  • Pipeline built for reproducibility on ASU’s Sol supercomputer — every step deterministic and versioned
  • Key lesson: the pipeline is 80% of the work. The model is almost incidental.

I went into this thinking it was a modeling problem. It wasn’t. Most of the actual work happened before I wrote a single training loop, and that took a while to accept.

This article is a detailed walkthrough of that system: the architecture, the tradeoffs, the metrics that actually mattered, and the things I’d do differently. If you’re building ML for healthcare or thinking about what it means to take a model from notebook to production, I hope this is useful.

The Problem With Framing This as “Just Object Detection”

Breast cancer detection sounds like a computer vision problem. And technically, it is: you’re drawing bounding boxes around suspicious regions in mammograms. But the moment you treat it like a standard COCO-style detection task, you run into problems that don’t exist in regular CV work.

The biggest issue is that a false negative isn’t just a metric hit; it’s a missed finding. Everything else flows from that. The annotations are also messier than you’d expect from a clinical dataset. Radiologists don’t always agree; some findings are marked at the study level with no image localization, and some labels are just inconsistent. You can’t trust the ground truth and move on.

The class structure is medically meaningful. You can’t just bucket everything into “finding” vs. “no finding” and call it done. The clinical categories matter, and they also vary wildly in frequency and annotation quality.

This forced a systems mindset from the start. Every pipeline decision downstream had to account for these constraints.

Dataset Design: Why Study-Level Thinking Changes Everything

The dataset is VinDr-Mammo, sourced from PhysioNet and originally collected from a Vietnamese hospital system. It contains approximately 5,000 studies and 20,000+ images. Each study includes the 4 standard mammography views: craniocaudal (CC) and mediolateral oblique (MLO) of each breast. This structure is not incidental; it’s clinically meaningful. Radiologists interpret mammograms across views, not in isolation.

One thing worth noting: this dataset comes from a Vietnamese clinical setting, which means the patient population, imaging equipment, and annotation conventions differ from Western datasets like CBIS-DDSM. That’s not a limitation; it’s actually what makes it valuable for building generalizable systems. But it does mean you can’t blindly assume annotation styles match what you’d see in a US or European clinical workflow.

I worked with two complementary subsets throughout the project:

CLAHE-enhanced dataset - the primary training set. All raw DICOMs were preprocessed with contrast-limited adaptive histogram equalization to improve local contrast and tissue visibility. Used for benchmarking both YOLOv8 and Faster R-CNN.

Spiculated subset - a manually annotated subset layered on top of the CLAHE data. Every spiculation boundary was labeled by hand. This is the harder, ongoing part of the project: fine-tuning specifically for sensitivity to these subtle malignancy indicators.

The core rule: split at the study level, not the image level. This sounds obvious, but it’s easy to violate when you’re working with a flat image directory and a standard train/val split script. If you split at the image level, you’ll end up with images from the same patient in both train and validation. Your model will appear to generalize. It won’t. It’ll just be memorizing patient-specific anatomy.

I enforced study-level grouping explicitly before any splitting happened: grouping by study_id, then assigning the full group to train or val, stratified by finding presence to maintain class distribution across splits.
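The grouping logic can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project’s actual code; `study_level_split` and its record format are hypothetical:

```python
# Illustrative study-level split sketch; function name and record
# format are hypothetical, not the project's actual code.
import random
from collections import defaultdict

def study_level_split(records, val_frac=0.2, seed=42):
    """records: list of (image_id, study_id, has_finding) tuples.
    Groups images by study_id, stratifies studies by finding presence,
    then assigns whole studies to train or val."""
    by_study = defaultdict(list)
    for image_id, study_id, has_finding in records:
        by_study[study_id].append((image_id, has_finding))

    # A study counts as "positive" if any of its images has a finding.
    pos = [s for s, imgs in by_study.items() if any(f for _, f in imgs)]
    neg = [s for s in by_study if s not in set(pos)]

    rng = random.Random(seed)
    val_studies = set()
    for bucket in (pos, neg):  # stratify: draw val studies from each bucket
        rng.shuffle(bucket)
        n_val = max(1, round(len(bucket) * val_frac))
        val_studies.update(bucket[:n_val])

    train = [i for s, imgs in by_study.items() if s not in val_studies for i, _ in imgs]
    val = [i for s, imgs in by_study.items() if s in val_studies for i, _ in imgs]
    return train, val
```

Real projects can get the same guarantee from scikit-learn’s StratifiedGroupKFold with groups=study_id; the point is that the unit of assignment is the study, never the image.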

Class Design: The Hardest Taxonomy Decision

The final classification structure spans both subsets: the 10 standard VinDr-Mammo finding classes, plus Spiculation as an 11th, manually annotated class.

The original VinDr-Mammo dataset had the 10 standard finding classes already defined. What we did on top of that was research-driven: we studied the clinical and radiological literature on spiculated lesions: what they look like, how they present across mammography views, and what distinguishes a true spiculation from other mass margin types. We then introduced Spiculation as a manually annotated subclass within the existing “Mass” category.

Every bounding box was drawn by hand, after visually reviewing the mammograms against radiological references. This wasn’t relabeling; it was extending the taxonomy based on clinical understanding of what actually indicates malignancy at the margin level.

Spiculated lesions (tumor boundaries with radiating extensions, like a starburst pattern) are one of the strongest imaging predictors of malignancy in mammography. Most public datasets don’t label them explicitly; they get folded into “Mass.” Separating them forces the model to learn morphologically distinct features, not just “there’s a dense region here.”

That’s also why the spiculated subset is still being fine-tuned. The class has fewer examples by nature, and getting recall up on it requires careful threshold tuning and additional pseudo-label rounds.

Handling the Imbalance

The numbers:

  • ~18,000 images with no findings
  • ~2,300 images with findings

An 8:1 ratio. Left unaddressed, the model learns to predict “No Finding” for everything and achieves 88% accuracy while being clinically useless.

What I did:

  • Downsampled “No Finding” so that, per training batch, no-finding images numbered only ~10–20% of the finding images
  • Class weighting in the loss function, scaled to inverse frequency
  • Focal loss to penalize confident wrong predictions on hard examples (especially rare classes)
  • Multi-label to dominant class mapping - for studies with multiple findings, I assigned the clinically dominant class rather than treating it as multi-label, which simplified training without losing too much signal
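The loss-side pieces, class weighting plus the focal term, can be written down compactly. A minimal sketch in plain Python (the function name and inputs are illustrative; a real detector applies this inside its classification head):

```python
import math

def focal_loss(probs, targets, alpha, gamma=2.0):
    """Illustrative focal loss with per-class weights.
    probs:   list of per-class probability rows (post-softmax)
    targets: true class index per row
    alpha:   per-class weights, e.g. scaled to inverse class frequency
    The (1 - p_t)**gamma factor shrinks the loss on easy, already-correct
    examples, so rare classes and hard examples dominate the gradient."""
    total = 0.0
    for row, t in zip(probs, targets):
        p_t = max(row[t], 1e-9)  # clamp to avoid log(0)
        total += -alpha[t] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(targets)
```

A confidently correct prediction contributes almost nothing; a confidently wrong one is penalized heavily, which is exactly the behavior you want on an 8:1 imbalanced dataset.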

The Data Pipeline: Built for Reproducibility, Not Just Speed

The pipeline took longer than any model experiment I ran. Here’s what it actually looked like.

```
raw_dicom_dir/
  {study_id}/
    {image_id}.dcm

→ Step 1: DICOM ingestion
   - Extract pixel arrays
   - Preserve windowing metadata (window center, width)
   - Handle MONOCHROME1 vs MONOCHROME2 inversion
→ Step 2: CLAHE preprocessing
   - Contrast-Limited Adaptive Histogram Equalization
   - Applied per tile, not globally: preserves local contrast in dense tissue
   - Output: normalized PNGs, preserving folder structure
→ Step 3: Annotation transformation
   - Source: DICOM SR or CSV annotation files
   - Output: YOLO-format .txt files (class_id, cx, cy, w, h, normalized)
   - Output: COCO-format JSON (for Faster R-CNN via Detectron2)
   - Edge-case handling: studies with image-level labels but no bounding boxes
→ Step 4: Study-level train/val split
   - Group by study_id
   - Stratified split on finding presence
   - Enforce that no study appears in both splits
→ Step 5: Dataset versioning
   - Each experiment uses a named dataset version
   - Manifest file records: split ratios, class counts, CLAHE params, downsample seed
```
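The genuinely fiddly part of Step 1 is the photometric interpretation: in MONOCHROME1 DICOMs, low pixel values are bright, so they must be inverted before everything downstream assumes one convention. A minimal sketch (in practice pydicom supplies the pixel array and the PhotometricInterpretation tag; `normalize_pixels` and its flat-list input are simplifications):

```python
def normalize_pixels(pixels, photometric, bits_stored=12):
    """Map raw DICOM pixel values to 8-bit, inverting MONOCHROME1
    (low value = bright) so every image follows the MONOCHROME2
    convention (low value = dark) before CLAHE and PNG export.
    Simplified sketch: real pixel data is a 2D array via pydicom."""
    max_val = (1 << bits_stored) - 1
    if photometric == "MONOCHROME1":
        pixels = [max_val - p for p in pixels]
    return [round(p * 255 / max_val) for p in pixels]
```

Skipping this inversion silently flips tissue contrast on a subset of images, which is the kind of bug a model will happily learn around without ever telling you.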

Why CLAHE specifically? Standard histogram equalization compresses global contrast, which can obscure subtle calcifications in dense breast tissue. CLAHE operates on local tiles, which preserves the micro-contrast that makes calcifications visible. In mammography, that difference is the difference between a model that learns to find calcifications and one that misses them entirely.

Why the manifest file? Because six weeks into the project, I could no longer remember which version of the dataset produced which results. A versioned manifest meant I could reproduce any experiment with the same images, same annotations, and same preprocessing params. This sounds like overhead. It isn’t. It’s the thing that made the iteration loop sustainable.
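A manifest needs almost no machinery: a JSON file plus a checksum is enough. The schema below is illustrative, not the project’s exact one:

```python
import hashlib
import json
from pathlib import Path

def write_manifest(path, *, version, split_ratios, class_counts,
                   clahe_params, downsample_seed):
    """Record everything needed to rebuild this dataset version.
    Field names are illustrative, not the project's actual schema."""
    manifest = {
        "version": version,
        "split_ratios": split_ratios,
        "class_counts": class_counts,
        "clahe_params": clahe_params,
        "downsample_seed": downsample_seed,
    }
    # Checksum over canonical JSON lets you detect silent drift later.
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["checksum"] = hashlib.sha256(blob).hexdigest()
    Path(path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Every experiment then references a manifest by name, so “which dataset produced this mAP?” has a one-line answer.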

On infrastructure: all of this ran on Sol, ASU’s high-performance supercomputer. Processing 300GB+ of raw DICOMs, running CLAHE across 20,000+ images, and training multiple YOLOv8 rounds isn’t something you do on a local machine or a free Colab session. Having access to Sol meant I could treat the pipeline as a real system, with parallel preprocessing jobs, proper storage, and GPU-backed training runs rather than hacking around compute constraints the whole time. It also reinforced the reproducibility requirement in a real way: in a shared HPC environment, you have to be explicit about environments, dependencies, and job configs, or nothing works twice.

What’s Next

The pipeline was ready. The data was clean, versioned, and structured. Now came the harder part: getting a model to actually learn from it.

In Part 2, I’ll cover the full modeling story: why I chose object detection over classification, how experiments across YOLOv8 s/m/l played out by class, the semi-supervised self-training loop that moved mAP from 0.54 to 0.68 without any architecture change, and what Grad-CAM revealed about where the model actually breaks.

Part 2 is live here: https://medium.com/@kamayanirai78/i-built-a-breast-cancer-detection-system-end-to-end-bf489ecf0286

Built with: YOLOv8 (Ultralytics), Faster R-CNN (Detectron2), PyTorch, CLAHE (OpenCV), Grad-CAM++, pydicom, FastAPI, Python. Compute: ASU Sol Supercomputer.

Dataset: VinDr-Mammo (PhysioNet) — ~5,000 studies, 20,000+ images, ~300GB raw DICOM. Originally collected from a Vietnamese hospital system. ASU research project.


This article was originally published in Towards AI on Medium.
