How a 96% model collapsed to near-random guessing and what we did about it

A retinal disease detection model scoring an mAUC of 0.963 sounds impressive. Until it drops to 0.534 the moment you test it on a different hospital’s images.
That single number, a 43-point collapse, changed the entire direction of our project. It meant a model we thought was clinically promising was, in practice, barely better than random guessing when faced with unfamiliar camera hardware and lighting conditions. Not a marginal performance drop. A failure.
That was the moment we stopped chasing accuracy. And started chasing generalization.
Why In-Distribution Accuracy Is Not Enough
Most published retinal AI papers follow a pattern that looks rigorous on paper but breaks in deployment: train on Dataset A, evaluate on Dataset A (held-out split), report accuracy. The metric is real, but it measures in-distribution performance: how well the model learned that specific dataset, not retinal disease in general.
The ODIR dataset reflects its own mix of camera configurations, lighting standards, and patient demographics. A model trained exclusively on it learns to exploit those statistical regularities just as much as the clinical features. When you introduce images from a different fundus camera or population, domain shift can hit hard: accuracy can drop by 15–30 percentage points, depending on how different the source and target domains are.
This is the shortcut learning problem: models find the easiest predictive signal, which is not always the clinically meaningful one.
When the Model Learned the Dataset Instead of the Disease
Standard deep learning validation relies on the i.i.d. (independent and identically distributed) assumption. Training, validation, and test sets are drawn from the same underlying distribution, so a model minimizing loss over the training split is only proving it can interpolate within that specific sandbox.
In real-world clinical medicine, this assumption completely collapses.
A model trained on a source distribution P_source(X, Y) gets deployed onto a completely different target distribution P_target(X, Y).
This break manifests in two dangerous ways:
- Covariate Shift: The raw input data distributions differ between environments: P_source(X) ≠ P_target(X). This happens when hospitals use different camera sensors, lighting configurations, or patient demographics.
- Concept Drift: The conditional distributions shift: P_source(Y|X) ≠ P_target(Y|X). This occurs when different clinical groups apply distinct diagnostic thresholds or subjective grading criteria for the same physical pathology.
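Written out with the same symbols, the two failure modes correspond to the two factors of the joint distribution. The block below is a restatement of the definitions above rather than a new result. One added detail: the textbook definition of covariate shift also assumes the labeling rule P(Y|X) stays fixed, which the short description above leaves implicit.

```latex
\begin{align*}
P_{\text{source}}(X, Y) &= P_{\text{source}}(Y \mid X)\, P_{\text{source}}(X) \\
P_{\text{target}}(X, Y) &= P_{\text{target}}(Y \mid X)\, P_{\text{target}}(X) \\[4pt]
\text{Covariate shift:}\quad
  & P_{\text{source}}(X) \neq P_{\text{target}}(X),
  \quad P_{\text{source}}(Y \mid X) = P_{\text{target}}(Y \mid X) \\
\text{Concept drift:}\quad
  & P_{\text{source}}(Y \mid X) \neq P_{\text{target}}(Y \mid X)
\end{align*}
```

In practice both factors can change at once, which is why a model can fail on a new hospital even when the images look superficially similar.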
The Shortcut Learning Trap
When a model minimizes empirical risk on a single source domain, the loss it optimizes is defined solely over that training distribution:
Optimization objective: θ* = argmin_θ E_{(x, y) ~ P_source} [Loss(f(x; θ), y)]
Medical imaging datasets are highly prone to hidden confounding factors: a specific optical lens coating, a camera-specific aspect ratio, or a text overlay used exclusively by one hospital. The model minimizes its loss by optimizing for these non-clinical shortcut features. When deployed onto a new domain where those artifacts are absent, the model’s internal logic collapses.
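A toy simulation makes the trap concrete. This is a minimal sketch on synthetic data, not our retinal pipeline: the two binary features, the agreement rates (0.80 and 0.98), and the helper names `make_domain` and `fit_stump` are all assumptions chosen for illustration. Feature 0 stands in for a genuine clinical cue, feature 1 for an acquisition artifact that tracks the label only in the source domain.

```python
import random


def make_domain(n, artifact_agreement, seed):
    """Synthetic domain: feature 0 is a 'clinical' cue, feature 1 an acquisition artifact."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        # The clinical cue agrees with the label 80% of the time in every domain.
        clinical = y if rng.random() < 0.80 else 1 - y
        # The artifact agrees with the label at a domain-specific rate.
        artifact = y if rng.random() < artifact_agreement else 1 - y
        rows.append((clinical, artifact))
        labels.append(y)
    return rows, labels


def accuracy(rows, labels, feature):
    """Accuracy of the classifier that predicts the label directly from one feature."""
    return sum(r[feature] == y for r, y in zip(rows, labels)) / len(labels)


def fit_stump(rows, labels):
    """Empirical risk minimization over single-feature classifiers."""
    return max(range(2), key=lambda f: accuracy(rows, labels, f))


# Source hospital: the artifact is almost perfectly predictive.
source_rows, source_labels = make_domain(2000, artifact_agreement=0.98, seed=0)
# Target hospital: the artifact carries no information about the label.
target_rows, target_labels = make_domain(2000, artifact_agreement=0.50, seed=1)

chosen = fit_stump(source_rows, source_labels)
source_acc = accuracy(source_rows, source_labels, chosen)
target_acc = accuracy(target_rows, target_labels, chosen)
print(f"chosen feature: {chosen}, source acc: {source_acc:.3f}, target acc: {target_acc:.3f}")
```

Risk minimization picks the artifact because it is the easier signal on the source data, and the same rule falls to roughly chance on the target. The clinical cue would have transferred, but nothing in the source-only objective rewards choosing it.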
Building the Multi-Dataset Training Pool
To systematically study domain shift, we aggregated three distinct public fundus photography datasets. Each brings a unique profile of structural, acquisition, and demographic characteristics:
- ODIR (Ocular Disease Intelligent Recognition): A real-world multi-disease collection captured across various clinical centers using distinct camera configurations, aspect ratios, and resolutions.
- RFMiD v1 (Retinal Fundus Multi-Disease Dataset): A highly heterogeneous dataset rich in rare clinical conditions, annotated by specialized ophthalmologists, but heavily imbalanced toward abnormal findings.
- RFMiD v2: An independent secondary release with altered device parameters, distinct lighting environments, and updated annotation protocols.
The Taxonomy Alignment Problem
The raw data presented a severe structural hurdle: completely inconsistent and overlapping label taxonomies. ODIR uses patient-level multi-label encoding with thousands of free-text strings, while RFMiD uses fine-grained, image-level multi-disease flags spanning over 40 rare conditions.
To create a uniform learning objective, we designed a Strict Harmonization Protocol that collapses both taxonomies into four clean target classes, the NDGC Taxonomy:
- N (Normal): Completely free of any visible ocular abnormality.
- D (Diabetic Retinopathy): Exhibits microaneurysms, hemorrhages, hard exudates, or neovascularization.
- G (Glaucoma): Characterized by increased cup-to-disc ratios, neuroretinal rim thinning, or optic nerve head cupping.
- C (Cataract): Characterized by significant lens opacity resulting in overall blur, vessel attenuation, or signal loss.
Code: Multi-Dataset Label Mapping
from typing import List, Optional

import numpy as np
import pandas as pd


class RetinalDatasetHarmonizer:
    """Maps dataset-specific labels onto the shared N/D/G/C multi-hot vector."""

    def __init__(self, target_classes: Optional[List[str]] = None):
        # Avoid a mutable default argument; fall back to the NDGC taxonomy.
        self.target_classes = target_classes or ['N', 'D', 'G', 'C']
        self.class_to_idx = {cls: idx for idx, cls in enumerate(self.target_classes)}

    def harmonize_rfmid(self, df: pd.DataFrame) -> pd.DataFrame:
        harmonized_data = []
        for _, row in df.iterrows():
            label_vec = np.zeros(len(self.target_classes), dtype=np.float32)
            is_dr = row.get('DR', 0) == 1
            is_glaucoma = row.get('GLAUCOMA', 0) == 1
            is_cataract = row.get('CATARACT', 0) == 1
            if is_dr:
                label_vec[self.class_to_idx['D']] = 1.0
            if is_glaucoma:
                label_vec[self.class_to_idx['G']] = 1.0
            if is_cataract:
                label_vec[self.class_to_idx['C']] = 1.0
            # Normal only if no target disease is flagged and Disease_Risk is 0.
            if not (is_dr or is_glaucoma or is_cataract) and row.get('Disease_Risk', 0) == 0:
                label_vec[self.class_to_idx['N']] = 1.0
            harmonized_data.append({'image_id': row['ID'], 'labels': label_vec})
        return pd.DataFrame(harmonized_data)

    def filter_unsupported_samples(self, df: pd.DataFrame) -> pd.DataFrame:
        # An all-zero vector means the image only shows diseases outside NDGC.
        mask = df['labels'].apply(lambda label_vec: np.sum(label_vec) > 0)
        return df[mask].reset_index(drop=True)
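The class above covers the RFMiD side, where labels already arrive as per-image flags. ODIR's free-text keywords need a different entry point. The following is a minimal sketch of how a per-eye keyword string could be mapped to the same NDGC vector. The function name `keywords_to_ndgc`, the substring rules in `KEYWORD_RULES`, and the comma-separated keyword format are illustrative assumptions, not the exact rules of our protocol.

```python
from typing import Dict, List

TARGET_CLASSES = ["N", "D", "G", "C"]

# Hypothetical substring rules, for illustration only; a production mapping
# would be reviewed term by term against the dataset's keyword vocabulary.
KEYWORD_RULES: Dict[str, List[str]] = {
    "N": ["normal fundus"],
    "D": ["diabetic retinopathy", "proliferative retinopathy"],
    "G": ["glaucoma"],
    "C": ["cataract"],
}


def keywords_to_ndgc(keywords: str) -> List[float]:
    """Map a free-text keyword string to a multi-hot [N, D, G, C] vector."""
    # Assumption: keywords are separated by ASCII or full-width commas.
    text = keywords.lower().replace("，", ",")
    terms = [term.strip() for term in text.split(",") if term.strip()]

    label_vec = [0.0] * len(TARGET_CLASSES)
    for idx, cls in enumerate(TARGET_CLASSES):
        patterns = KEYWORD_RULES[cls]
        if any(pattern in term for term in terms for pattern in patterns):
            label_vec[idx] = 1.0

    # A disease finding overrides a co-occurring "normal" keyword.
    if any(label_vec[1:]):
        label_vec[0] = 0.0
    return label_vec


print(keywords_to_ndgc("moderate non proliferative retinopathy，cataract"))
print(keywords_to_ndgc("hypertensive retinopathy"))
```

A keyword that matches none of the rules yields an all-zero vector, so it would be dropped by the same filter that `filter_unsupported_samples` applies to RFMiD rows.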
Harmonized Dataset Metrics
After applying the harmonization pipeline and enforcing strict exclusion criteria, the final dataset was carefully standardized to reduce inter-dataset inconsistencies and improve cross-domain learning reliability. During preprocessing, standalone AMD and retinitis pigmentosa samples were removed to maintain diagnostic consistency across all participating datasets.
The harmonized dataset combines images from three major retinal imaging sources: ODIR, RFMiD v1, and RFMiD v2. Each dataset contributes distinct imaging characteristics, helping the model learn under diverse acquisition conditions rather than overfitting to a single clinical environment.
ODIR contributed the largest portion of samples and introduced high variability in illumination, image resolution, and contrast levels. These characteristics simulate real-world clinical inconsistency and make generalization more challenging.
RFMiD v1 provided comparatively cleaner and professionally captured retinal images with centralized optic-disc alignment and better visual consistency, making it useful for stable feature learning.
RFMiD v2 contained fewer samples but introduced stronger class imbalance and sharper anatomical boundaries with cooler color distributions, further increasing domain diversity.
The final pool: 8,222 fundus photographs across three datasets, each with distinct camera hardware, lighting profiles, and annotation styles. Exactly the kind of diversity that breaks overfit models.
High-Fidelity Preprocessing Pipeline
Fundus photography is deeply affected by variations in physical capturing systems. Different camera apertures, lens types, and illumination bulbs alter color spaces and image clarity. Our tensor-level preprocessing pipeline targets three distinct issues:
- Aspect-Ratio Preserving Letterboxing: Pads images to a uniform square shape before downsampling to prevent anatomical distortion of the optic disc.
- Global Illumination Equalization: Applies local Gaussian blur subtraction to normalize brightness levels across different fields of view.
- CLAHE (Contrast-Limited Adaptive Histogram Equalization): Enhances local contrast of micro-vessels and minor hemorrhages across the green channel.
import cv2
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image


class ClinicalRetinalPreprocessor:
    def __init__(self, target_size: int = 384):
        self.target_size = target_size
        # Built once rather than on every call. The random flips, rotation, and
        # color jitter are training-time augmentations.
        self.transform_pipeline = T.Compose([
            T.Resize((self.target_size, self.target_size)),
            T.RandomHorizontalFlip(p=0.5),
            T.RandomVerticalFlip(p=0.5),
            T.RandomRotation(degrees=25, interpolation=T.InterpolationMode.BILINEAR),
            T.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.1, hue=0.05),
            T.ToTensor(),
            T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    def _apply_clahe(self, img: np.ndarray) -> np.ndarray:
        # Equalize only the lightness channel in LAB space; a and b are untouched.
        lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
        l, a, b = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
        cl = clahe.apply(l)
        limg = cv2.merge((cl, a, b))
        return cv2.cvtColor(limg, cv2.COLOR_LAB2RGB)

    def __call__(self, pil_img: Image.Image) -> torch.Tensor:
        # Force 3-channel RGB so the LAB conversion does not fail on grayscale or RGBA input.
        img_np = np.array(pil_img.convert('RGB'))
        img_enhanced = self._apply_clahe(img_np)
        return self.transform_pipeline(Image.fromarray(img_enhanced))
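The class above shows the CLAHE step followed by a direct resize. The other two steps, letterboxing and illumination equalization, can be sketched in plain NumPy as follows. This is an illustrative version under stated assumptions, not our production code: the function names, the black padding value, the `sigma` default, and the mid-gray offset of 128 are all choices made for the example.

```python
import numpy as np


def letterbox_to_square(img: np.ndarray, pad_value: int = 0) -> np.ndarray:
    """Center an H x W x C image on a square canvas without rescaling it."""
    height, width = img.shape[:2]
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    canvas = np.full((side, side) + img.shape[2:], pad_value, dtype=img.dtype)
    canvas[top:top + height, left:left + width] = img
    return canvas


def gaussian_blur(channel: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of a 2-D array with reflect padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(channel.astype(np.float64), radius, mode="reflect")
    # Convolve along rows, then along columns; "valid" removes the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def equalize_illumination(img: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Subtract a blurred copy of each channel and re-center around mid-gray."""
    out = np.empty(img.shape, dtype=np.float64)
    for c in range(img.shape[2]):
        channel = img[..., c].astype(np.float64)
        # The blur estimates slow-varying background brightness; 128 is an assumed offset.
        out[..., c] = channel - gaussian_blur(channel, sigma) + 128.0
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Usage on a synthetic wide image: pad to square first, then equalize.
wide = np.random.default_rng(0).integers(0, 256, size=(40, 64, 3), dtype=np.uint8)
square = letterbox_to_square(wide)
normalized = equalize_illumination(square, sigma=2.0)
print(square.shape, normalized.dtype)
```

Padding before resizing is what keeps the optic disc round: the resize then scales both axes by the same factor. One trade-off of a black border is that the blur subtraction produces an edge response where the image meets the padding, so a mask over the circular field of view may be worth adding.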
Testing the Model Inside and Outside Its Comfort Zone
To uncover the performance gap hidden by standard validation, our system runs two entirely separate evaluation tracks:
1. In-Domain Strategy: The model is trained on a portion of an individual dataset (e.g., ODIR) and evaluated on a completely separated validation split from that same dataset. This mimics traditional sandbox experiments.
2. Leave-One-Dataset-Out (LODO) Strategy: Given N distinct datasets, the model trains entirely on N-1 datasets (source domains) and evaluates exclusively on the remaining untouched dataset (target domain). This simulates what happens when a model ships to a completely new hospital network.
We define three explicit cross-domain evaluation tasks:
- Task 1: Trained on ODIR + RFMiD v2 → Tested on unseen: RFMiD v1
- Task 2: Trained on ODIR + RFMiD v1 → Tested on unseen: RFMiD v2
- Task 3: Trained on RFMiD v1 + RFMiD v2 → Tested on unseen: ODIR
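The three tasks are simply every way of holding out one dataset from the pool. As a small sketch, they can be enumerated from the dataset names alone; the helper name `enumerate_lodo_tasks` is an assumption made for this example.

```python
from typing import List, Tuple


def enumerate_lodo_tasks(dataset_names: List[str]) -> List[Tuple[List[str], str]]:
    """Return one (source_domains, target_domain) pair per held-out dataset."""
    tasks = []
    for target in dataset_names:
        sources = [name for name in dataset_names if name != target]
        tasks.append((sources, target))
    return tasks


tasks = enumerate_lodo_tasks(["RFMiD v1", "RFMiD v2", "ODIR"])
for task_id, (sources, target) in enumerate(tasks, start=1):
    print(f"Task {task_id}: train on {' + '.join(sources)} -> test on unseen {target}")
```

With N datasets this always yields N tasks, and no target dataset ever appears among its own sources.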
from typing import Dict, Tuple

import pandas as pd


class LODOEvaluationEngine:
    def __init__(self, dataset_registry: Dict[str, pd.DataFrame]):
        self.registry = dataset_registry

    def generate_lodo_splits(self, target_domain: str) -> Tuple[pd.DataFrame, pd.DataFrame]:
        if target_domain not in self.registry:
            raise ValueError(f"Target domain '{target_domain}' missing from execution registry.")
        # Every dataset except the held-out target becomes a source domain.
        source_frames = [df for name, df in self.registry.items() if name != target_domain]
        target_frame = self.registry[target_domain]
        # Pool the sources and shuffle with a fixed seed for reproducibility.
        unified_source_pool = pd.concat(source_frames, ignore_index=True)
        unified_source_pool = unified_source_pool.sample(frac=1.0, random_state=42).reset_index(drop=True)
        return unified_source_pool, target_frame
The LODO evaluation is deliberately unforgiving. It represents a strict binary test: either the learned features transfer to completely unfamiliar clinical environments, or they do not.
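To put a number on that test, the two tracks can be scored with the same metric and compared. The sketch below assumes mAUC means the macro-average of per-class ROC AUC over the four NDGC labels. The function names and the toy labels and scores are illustrative, and only the final two figures (0.963 and 0.534) come from our results.

```python
from typing import List, Sequence


def binary_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample.")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of the per-class AUCs for multi-label predictions."""
    n_classes = len(y_true[0])
    per_class = [
        binary_auc([row[c] for row in y_true], [row[c] for row in y_score])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy two-class example with made-up scores.
toy_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_score = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.6], [0.5, 0.7]]
toy_mauc = macro_auc(toy_true, toy_score)

# Generalization gap for the headline numbers from the opening of this article.
gap = round(0.963 - 0.534, 3)
print(f"toy mAUC: {toy_mauc:.3f}, generalization gap: {gap}")
```

The pairwise formulation is quadratic in the number of samples, which is fine for a sketch; a library implementation would be the practical choice at dataset scale.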
In Part 2, we cover the domain generalization algorithms we built to fix the problem, including our custom DGA-DG framework, and the full results comparison across all methods and tasks.
Why Your Retinal Disease Model Fails Outside the Lab was originally published in Towards AI on Medium.