The Silicon Protocol: The Kill Switch Decision — When You Can’t Turn It Off
Three emergency shutdown patterns for healthcare LLMs. Two create worse problems than they solve. One stops the system without killing patients.

The alert fired in the middle of the night.
LLM hallucination detected: medication interaction recommendations inconsistent with drug interaction database.
The on-call CTO logged in. Reviewed the issue. The LLM was generating plausible but incorrect drug interaction warnings, confusing ED clinicians who relied on the system for triage decision support.
Decision: Hit the emergency kill switch.
The system shut down instantly.
3:49 AM: 40 ICU nurses lost access to clinical decision support mid-shift.
3:52 AM: ED attending physician unable to access medication interaction checker during trauma case.
4:03 AM: Two patients experienced delayed antibiotic administration because nurses couldn’t verify drug-drug interactions without the system.
4:15 AM: Charge nurse called IT: “Where’s the manual fallback process? We have nothing documented.”
The kill switch worked perfectly. The LLM stopped hallucinating.
And the clinical disruption caused more harm than the hallucination it was designed to prevent.
This happened at a 450-bed teaching hospital in December 2025. No patients died. But two received delayed treatment because the emergency shutdown procedure never accounted for what happens to the 60 clinicians who depend on the system during a night shift.
I investigated this incident three weeks later.
The root cause wasn’t the hallucination. LLMs hallucinate — that’s a known risk with mitigation strategies.
The root cause: nobody designed a shutdown process that preserved clinical workflows.
The Problem No One Plans For: You Can’t Just Turn It Off
I’ve audited nine healthcare LLM deployments in the past 16 months.
All nine had emergency kill switches.
None had documented fallback procedures for when the kill switch activated.
Here’s what breaks when you shut down a clinical AI system at 3 AM:
The Downtime Reality Healthcare Organizations Face
Healthcare systems experienced over $21.9 billion in collective losses from downtime between 2018 and 2024, with organizations losing an average of 17+ days of operations per incident.
96% of healthcare organizations report at least one unplanned EHR outage.
70% have experienced downtimes lasting 8+ hours.
In July 2024, the CrowdStrike outage disrupted operations at 12 major U.S. hospitals including Cleveland Clinic and Mass General Brigham, causing delays in lab results, canceled procedures, and forced reliance on manual processes.
When clinical systems go dark during patient care:
- Clinicians cannot access medication histories, allergy information, lab results
- Emergency decisions made without complete patient information
- Delays in treatment, duplicated testing, medication errors
- Manual paper workflows that staff haven’t practiced
- Increased risk of human error and clinician fatigue
Digital darkness events don’t just impact IT — they directly affect patient safety.
And most organizations discover their downtime procedures are inadequate during the outage, not before.
What Makes LLM Shutdowns Different From EHR Downtime
EHR downtime is well-understood. Hospitals have decades of experience with EHR outages. They’ve practiced paper chart workflows. They have documented downtime procedures.
LLM shutdowns are different:
1. Dependency ambiguity
Clinicians don’t always know which workflows depend on the LLM until it’s gone.
Example: ED triage system uses LLM for:
- Medication interaction checking
- Discharge instruction generation
- Clinical note summarization
- Patient education material creation
When you kill the LLM, which of these workflows break? All of them? Some of them?
Most organizations don’t document this until the shutdown happens.
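What that documentation can look like is simple. Here's a minimal sketch of a dependency manifest, in Python like the rest of this episode's examples, with hypothetical feature names, consumers, and fallbacks:

# A minimal LLM dependency manifest (hypothetical names and values).
# The point is not the structure - it's that someone wrote it down
# and clinical leadership reviewed it BEFORE an emergency shutdown.
LLM_DEPENDENCY_MAP = {
    "medication_interaction_check": {
        "depends_on_llm": True,
        "consumers": ["ED triage", "ICU nursing", "pharmacy verification"],
        "criticality": "CRITICAL",  # patient safety impact if offline
        "fallback": "rule-based drug interaction database",
    },
    "discharge_instruction_generation": {
        "depends_on_llm": True,
        "consumers": ["ED discharge", "hospitalist service"],
        "criticality": "NON_CRITICAL",
        "fallback": "manual EHR discharge templates",
    },
    "clinical_note_summarization": {
        "depends_on_llm": True,
        "consumers": ["ICU handoff", "ED shift change"],
        "criticality": "NON_CRITICAL",
        "fallback": "manual note review (3-5 extra minutes per note)",
    },
}

def impact_of_shutdown() -> list[str]:
    """List every workflow that breaks if all LLM services stop."""
    return [
        f"{name}: affects {', '.join(meta['consumers'])}"
        for name, meta in LLM_DEPENDENCY_MAP.items()
        if meta["depends_on_llm"]
    ]

The exact structure matters less than the fact that it exists, and that clinical leadership reviewed it before the first 3 AM shutdown.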
2. Partial functionality confusion
With EHR downtime, it’s binary: system works or doesn’t.
With LLM systems, it’s ambiguous: some features might fail while others work.
Example: LLM hallucinating drug interactions but correctly generating discharge summaries.
Do you shut down the entire system or just the problematic feature?
If you shut down the interaction checker, do clinicians know? Do they assume it’s still working? Do they manually verify every interaction?
3. No muscle memory for fallback
Clinicians have practiced EHR downtime procedures through drills and actual outages.
Nobody has practiced LLM downtime.
When the system goes dark at 3 AM, staff default to: “Wait, what do we do now?”
If the answer isn’t documented and trained, the fallback is chaos.
The Three Shutdown Patterns (And Why Two Fail Catastrophically)
After investigating nine LLM shutdown incidents, I’ve identified three patterns:
Pattern 1: Hard Stop (Instant Kill Switch) — everything shuts down immediately
Pattern 2: Feature Flag Rollback (Selective Disable) — turn off specific features
Pattern 3: Graceful Degradation with Documented Fallback — automatic switch to rule-based backup
Let’s break down why Pattern 1 and 2 cause clinical disruption, and what Pattern 3 actually requires.
Pattern 1: Hard Stop / Instant Kill Switch (The 3 AM Disaster)
How it works:
Single button (or command) kills all LLM services immediately. System goes from operational to completely offline in seconds.
What organizations actually deploy:
import requests
from datetime import datetime, timezone
from typing import Any, Dict

class HardStopKillSwitch:
    """
    Pattern 1: Instant shutdown.

    Stops all LLM services immediately.
    No graceful degradation.
    No fallback procedures.
    No clinical workflow preservation.

    Problem: Creates immediate clinical disruption.
    """

    def __init__(self, llm_service_urls: list[str]):
        self.services = llm_service_urls
        self.shutdown_initiated = False

    def emergency_shutdown(self, reason: str) -> Dict[str, Any]:
        """
        Emergency kill switch.
        Stops all LLM services immediately.
        No consideration for active clinical workflows.
        """
        print(f"EMERGENCY SHUTDOWN INITIATED: {reason}")
        results = []
        for service_url in self.services:
            try:
                # Send shutdown command to each LLM service
                requests.post(
                    f"{service_url}/admin/shutdown",
                    json={"reason": reason, "immediate": True},
                    timeout=5,
                )
                results.append({
                    "service": service_url,
                    "status": "shutdown",
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                })
            except Exception as e:
                results.append({
                    "service": service_url,
                    "status": "error",
                    "error": str(e),
                })
        self.shutdown_initiated = True
        return {
            "shutdown_complete": True,
            "services_affected": len(self.services),
            "active_users_disrupted": "UNKNOWN",  # ← Critical gap
            "fallback_procedure": "NONE",         # ← Critical gap
            "clinical_impact": "UNKNOWN",         # ← Critical gap
            "results": results,
        }

# Example usage
kill_switch = HardStopKillSwitch([
    "https://llm-gateway-1.hospital.local",
    "https://llm-gateway-2.hospital.local",
])

# At 3:47 AM, hallucination detected
shutdown_result = kill_switch.emergency_shutdown(
    reason="LLM generating incorrect drug interaction warnings"
)

print(shutdown_result)
# {
#     "shutdown_complete": True,
#     "services_affected": 2,
#     "active_users_disrupted": "UNKNOWN",  # 40 ICU nurses just lost decision support
#     "fallback_procedure": "NONE",         # No documented manual process
#     "clinical_impact": "UNKNOWN"          # Two delayed antibiotic administrations
# }
What this shuts down:
- All LLM-powered features (medication interactions, discharge summaries, clinical notes)
- All active user sessions (40 ICU nurses, 12 ED physicians, 8 pharmacists)
- All in-progress workflows (incomplete discharge summaries, partial medication reviews)
What this DOESN’T provide:
- Notification to active users that system is offline
- Fallback procedures for critical workflows
- Alternative tools for medication interaction checking
- Guidance on manual verification processes
- Estimate of when service will resume
Real Incident: The ICU Midnight Shutdown
Hospital: 450-bed teaching hospital, December 2025
System: LLM-powered clinical decision support for ICU, ED, pharmacy
Shutdown approach: Pattern 1 (instant kill switch)
What happened:
3:47 AM: On-call CTO detects LLM hallucinating drug interaction warnings (flagging safe combinations as dangerous, missing actual interactions).
3:49 AM: CTO activates emergency kill switch. All LLM services shut down.
Immediate impact:
ICU (40 nurses on night shift):
- Lost access to medication interaction checker
- Lost access to clinical note summarization
- Lost access to patient education material generator
- No notification that systems were offline (UI still loaded, showed “service unavailable” on first click)
ED (12 physicians, 8 nurses):
- Trauma case in progress, attending needed drug interaction check for sedation protocol
- System unresponsive
- Attending forced to call pharmacy (just before 4 AM, one pharmacist fielding interaction calls for the entire hospital)
- 8-minute delay in sedation administration while awaiting pharmacist callback
Pharmacy (3 pharmacists covering 450-bed hospital):
- Flooded with manual interaction check requests from ICU/ED
- Couldn’t keep up with volume
- Two medication administrations delayed by 15+ minutes waiting for manual pharmacist review
Outcome:
- No deaths
- Two delayed antibiotic administrations (one for sepsis, one for post-surgical infection)
- 12 delayed medication interaction checks
- Clinician trust in system severely damaged
- Post-incident review revealed: no documented fallback procedures existed
Root cause: System designed with kill switch, never designed fallback workflows for what happens when kill switch activates.
Cost: $85K emergency contractor fees for 72-hour response, $40K in extended clinician overtime, immeasurable damage to clinician trust.
Why Pattern 1 Fails
Hard stop treats LLM shutdown like flipping a power switch.
It doesn’t account for:
1. Active clinical workflows in progress
When you shut down mid-shift, clinicians are actively using the system. ICU nurses are checking medication interactions. ED physicians are generating discharge summaries.
Instant shutdown = instant clinical disruption.
2. Lack of user notification
Most hard stop implementations don’t proactively notify users that shutdown occurred.
Users discover the outage when they try to use a feature and get “service unavailable.”
By then, they’re in the middle of patient care with no guidance on what to do.
3. No fallback tool replacement
If the LLM was providing medication interaction checking, what tool do clinicians use after shutdown?
Pattern 1 answer: “Figure it out yourself.”
Clinicians revert to: calling pharmacy (overloading single on-call pharmacist), skipping interaction checks entirely (dangerous), or guessing based on memory (even more dangerous).
4. Unknown restoration timeline
Hard stop doesn’t communicate when service will resume.
Clinicians don’t know if they’re working around the outage for 10 minutes, 2 hours, or 12 hours.
This uncertainty compounds workflow disruption.

Pattern 2: Feature Flag Rollback / Selective Disable (The Half-Broken System)
How it works:
Use feature flags to selectively disable problematic LLM features while keeping others running.
What organizations actually deploy:
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional

class FeatureState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    DEGRADED = "degraded"  # Running with reduced functionality

@dataclass
class FeatureFlag:
    feature_name: str
    state: FeatureState
    reason: Optional[str] = None
    fallback_available: bool = False

class FeatureFlagController:
    """
    Pattern 2: Selective feature disable.
    Disable specific LLM features while keeping the system running.

    Problem: Creates inconsistent user experience.
    Half the hospital has working features, half doesn't.
    Clinicians confused about what still works.
    """

    def __init__(self):
        # Feature flags for LLM-powered capabilities
        self.features = {
            "medication_interaction_check": FeatureFlag(
                feature_name="Medication Interaction Checker",
                state=FeatureState.ENABLED,
            ),
            "discharge_summary_generation": FeatureFlag(
                feature_name="Discharge Summary Generator",
                state=FeatureState.ENABLED,
            ),
            "clinical_note_summarization": FeatureFlag(
                feature_name="Clinical Note Summarization",
                state=FeatureState.ENABLED,
            ),
            "patient_education_materials": FeatureFlag(
                feature_name="Patient Education Materials",
                state=FeatureState.ENABLED,
            ),
        }

    def disable_feature(
        self,
        feature_key: str,
        reason: str,
        fallback_available: bool = False,
    ) -> Dict[str, Any]:
        """
        Disable a specific feature.
        Problem: Doesn't communicate to users WHICH features are disabled.
        """
        if feature_key not in self.features:
            return {"error": f"Unknown feature: {feature_key}"}
        self.features[feature_key].state = FeatureState.DISABLED
        self.features[feature_key].reason = reason
        self.features[feature_key].fallback_available = fallback_available
        return {
            "feature": feature_key,
            "status": "disabled",
            "reason": reason,
            "fallback": fallback_available,
            "users_notified": False,       # ← Critical gap
            "alternative_workflow": None,  # ← Critical gap
        }

    def get_feature_status(self) -> Dict[str, FeatureState]:
        """
        Check which features are currently enabled.
        Problem: Clinicians don't proactively check this.
        They discover disabled features when they try to use them.
        """
        return {key: flag.state for key, flag in self.features.items()}

# Example usage
controller = FeatureFlagController()

# At 3:47 AM, detect medication interaction checker hallucinating
disable_result = controller.disable_feature(
    feature_key="medication_interaction_check",
    reason="LLM generating incorrect interaction warnings",
    fallback_available=False,  # No fallback documented
)

print(disable_result)
# {
#     "feature": "medication_interaction_check",
#     "status": "disabled",
#     "reason": "LLM generating incorrect interaction warnings",
#     "fallback": False,
#     "users_notified": False,       # ICU nurses don't know it's disabled
#     "alternative_workflow": None   # No guidance on what to use instead
# }

# Check current state
print(controller.get_feature_status())
# {
#     "medication_interaction_check": FeatureState.DISABLED,  # ← Disabled
#     "discharge_summary_generation": FeatureState.ENABLED,   # ← Still works
#     "clinical_note_summarization": FeatureState.ENABLED,    # ← Still works
#     "patient_education_materials": FeatureState.ENABLED     # ← Still works
# }

# Problem: Users don't know which features work and which don't
# Leads to: Confusion, workflow inconsistency, clinical disruption
Why this seems better than Pattern 1:
You’re only disabling the problematic feature (medication interaction checker), not the entire system.
Discharge summaries, clinical notes, patient education still work.
Less disruption, right?
Wrong.
Real Incident: The Half-Broken ED System
Hospital: 280-bed community hospital, October 2025
System: LLM-powered ED clinical decision support
Shutdown approach: Pattern 2 (feature flag disable)
What happened:
2:15 AM: LLM medication interaction checker starts flagging safe drug combinations as dangerous (false positives).
2:20 AM: On-call engineer disables medication interaction feature via feature flag.
The rollout:
Problem 1: Gradual rollout confusion
Feature flags don’t disable instantly across all users. They propagate based on session refresh.
Result:
- ED Station 1 (3 physicians): Feature still working (their sessions hadn’t refreshed)
- ED Station 2 (2 physicians): Feature disabled (sessions refreshed)
- ED Station 3 (2 physicians): Feature working but slow (caching delay)
At 2:45 AM during shift change:
Outgoing physician to incoming physician: “Medication checker is acting weird, showing false warnings.”
Incoming physician: “Mine’s not showing anything at all.”
Third physician overhears: “Mine’s working fine.”
Nobody knows the ground truth: is the feature disabled, broken, or working?
Problem 2: Inconsistent user experience
Half the ED has a working medication interaction checker. Half doesn’t.
Result:
- Physicians at Station 1 trust the system, use it for med checks
- Physicians at Station 2 assume it’s broken, call pharmacy manually
- Pharmacy gets flooded with manual check requests from Station 2
- Meanwhile, Station 1 physicians keep relying on a system that is still generating false positives
Problem 3: No communication to users
Feature flag disable happened silently. No banner notification. No email alert. No page to clinical staff.
Clinicians discovered the disable when they tried to use the feature.
By then: mid-patient-care, no time to find alternative workflow.
Outcome:
- 6 hours of workflow chaos (2 AM — 8 AM shift)
- Inconsistent medication verification across ED (some physicians checked manually, some relied on broken system, some skipped checks)
- 3 medication interaction warnings missed because physicians assumed disabled feature meant “no interactions” rather than “feature unavailable, check manually”
- Post-incident review: “We need to document what happens when features are disabled.”
Cost: $12K in pharmacist overtime for manual interaction checking, clinical workflow confusion, no adverse patient outcomes (lucky).
Why Pattern 2 Fails
Feature flags solve the technical problem (disable broken feature) but create a workflow coordination problem.
1. Gradual propagation
Feature flags don’t disable instantly for all users. Sessions refresh at different times. Creates inconsistent experience across departments.
Result: Some users have feature, some don’t. Nobody knows who has what.
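A sketch of the mechanism, assuming a typical client-side flag cache with a fixed TTL (the class, helper, and 15-minute interval are illustrative):

import time

# Illustrative client-side feature flag cache with a 15-minute TTL.
# Each workstation refreshes independently, so after a central flag
# flip, different stations serve different answers until every
# cache expires - the "gradual propagation" window.
FLAG_CACHE_TTL_SECONDS = 900

class CachedFlagClient:
    def __init__(self, fetch_remote_flags):
        self._fetch = fetch_remote_flags  # call to the central flag service
        self._cached = None
        self._fetched_at = 0.0

    def is_enabled(self, feature_key: str) -> bool:
        now = time.time()
        if self._cached is None or now - self._fetched_at > FLAG_CACHE_TTL_SECONDS:
            self._cached = self._fetch()  # refresh from central service
            self._fetched_at = now
        # Stations that refreshed before the flip still return stale True
        return self._cached.get(feature_key, False)

Station 1's cache refreshed just before the central flip; Station 2's just after. Both clients behave exactly as designed, and the hospital gets a propagation window instead of a clean cutover.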
2. Silent disables
Most feature flag systems don’t proactively notify users when flags flip.
Users discover the change when they try to use the feature.
Result: Mid-workflow disruption with no advance notice or alternative guidance.
3. Ambiguous partial functionality
When one feature is disabled, users don’t know if the entire system is unreliable or just that specific feature.
Result: Loss of trust in the entire system, even features still working correctly.
4. No fallback workflow documentation
Disabling a feature doesn’t automatically provide an alternative workflow.
If medication interaction checker is disabled, what should clinicians do instead?
Pattern 2 doesn’t answer this question.

Pattern 3: Graceful Degradation with Documented Fallback (What Actually Works)
How it works:
When LLM fails, system automatically switches to a validated rule-based backup. Clinicians are notified. Fallback workflows are documented and trained. Degraded-but-functional service continues.
The architecture:
LLM Primary Service
↓
Health Check (continuous monitoring)
↓
Failure Detected (hallucinations, timeouts, errors)
↓
Circuit Breaker Trips (automatic)
↓
Switch to Rule-Based Backup (drug interaction database, validated rules)
↓
User Notification (banner: "System in backup mode - reduced functionality")
↓
Degraded Service (medication checks work, discharge summaries disabled)
↓
Manual Workflows Activated (documented procedures for disabled features)
↓
LLM Restoration (fixed offline, tested, redeployed)
↓
Circuit Breaker Resets (automatic switch back to LLM)
Production implementation:
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict
import time

class ServiceMode(Enum):
    PRIMARY = "primary"    # LLM operating normally
    DEGRADED = "degraded"  # Rule-based backup active
    OFFLINE = "offline"    # Complete outage (last resort)

class CircuitState(Enum):
    CLOSED = "closed"        # LLM healthy, requests flowing
    OPEN = "open"            # LLM failing, circuit breaker tripped
    HALF_OPEN = "half_open"  # Testing if LLM recovered

@dataclass
class HealthCheck:
    timestamp: float
    healthy: bool
    latency_ms: float
    error_rate: float
    hallucination_detected: bool

class GracefulDegradationSystem:
    """
    Pattern 3: Graceful degradation with automatic fallback.

    When the LLM fails:
    1. Circuit breaker trips automatically
    2. Switch to rule-based backup
    3. Notify users of degraded mode
    4. Maintain critical functionality
    5. Document manual workflows for disabled features
    6. Test LLM recovery, switch back when healthy

    This is what production healthcare LLM systems need.
    """

    def __init__(
        self,
        llm_service: Callable,
        fallback_service: Callable,
        notification_service: Callable,
    ):
        self.llm_service = llm_service
        self.fallback_service = fallback_service
        self.notification_service = notification_service

        # Circuit breaker state
        self.circuit_state = CircuitState.CLOSED
        self.service_mode = ServiceMode.PRIMARY

        # Health monitoring
        self.failure_count = 0
        self.failure_threshold = 5       # Trip after 5 consecutive failures
        self.half_open_test_count = 0
        self.half_open_test_max = 3      # Test 3 requests before fully reopening

        # Timing
        self.circuit_open_timeout = 300  # 5 minutes before testing recovery
        self.circuit_opened_at = None

    def check_llm_health(self) -> HealthCheck:
        """
        Continuous health monitoring of the LLM service.

        Checks:
        - Response latency
        - Error rate
        - Hallucination detection (compare LLM output vs validated rules)
        """
        start_time = time.time()
        try:
            # Test LLM with known-good input
            test_input = "Check interaction: warfarin + aspirin"
            llm_response = self.llm_service(test_input)
            latency_ms = (time.time() - start_time) * 1000

            # Validate response against ground truth
            # (In production: query drug interaction database)
            expected_interaction = "Increased bleeding risk - monitor INR closely"
            hallucination_detected = (
                llm_response.lower() != expected_interaction.lower()
            )

            return HealthCheck(
                timestamp=time.time(),
                healthy=not hallucination_detected and latency_ms < 1000,
                latency_ms=latency_ms,
                error_rate=0.0,
                hallucination_detected=hallucination_detected,
            )
        except Exception:
            return HealthCheck(
                timestamp=time.time(),
                healthy=False,
                latency_ms=(time.time() - start_time) * 1000,
                error_rate=1.0,
                hallucination_detected=False,
            )

    def process_request(self, request: str) -> Dict[str, Any]:
        """
        Main request processing with circuit breaker logic.

        Flow:
        1. Check circuit state
        2. Route to LLM (if healthy) or fallback (if degraded)
        3. Monitor for failures
        4. Trip circuit breaker if failures exceed threshold
        5. Test recovery in half-open state
        """
        # Check if we should test recovery
        if self.circuit_state == CircuitState.OPEN:
            if self._should_attempt_recovery():
                self.circuit_state = CircuitState.HALF_OPEN
                self.half_open_test_count = 0
                print("Circuit breaker entering HALF_OPEN state - testing recovery")

        # Route based on circuit state
        if self.circuit_state == CircuitState.CLOSED:
            # LLM healthy - use primary service
            return self._process_with_llm(request)
        elif self.circuit_state == CircuitState.HALF_OPEN:
            # Testing recovery - try LLM for limited requests
            return self._process_with_llm_test(request)
        else:
            # LLM failed - use fallback
            return self._process_with_fallback(request)

    def _process_with_llm(self, request: str) -> Dict[str, Any]:
        """Process request with primary LLM service."""
        try:
            health = self.check_llm_health()
            if not health.healthy:
                # Health check failed
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    # Trip circuit breaker
                    self._trip_circuit_breaker("Health check failures exceeded threshold")
                # Route to fallback
                return self._process_with_fallback(request)

            # LLM healthy - process request
            response = self.llm_service(request)

            # Reset failure count on success
            self.failure_count = 0

            return {
                "success": True,
                "response": response,
                "service_mode": ServiceMode.PRIMARY,
                "latency_ms": health.latency_ms,
            }
        except Exception as e:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self._trip_circuit_breaker(f"LLM error: {str(e)}")
                return self._process_with_fallback(request)
            raise

    def _process_with_llm_test(self, request: str) -> Dict[str, Any]:
        """Process request during half-open state (testing recovery)."""
        try:
            health = self.check_llm_health()
            if health.healthy:
                # Success - increment test count
                self.half_open_test_count += 1
                if self.half_open_test_count >= self.half_open_test_max:
                    # Enough successful tests - close circuit
                    self._close_circuit()
                response = self.llm_service(request)
                return {
                    "success": True,
                    "response": response,
                    "service_mode": ServiceMode.PRIMARY,
                    "circuit_state": "recovering",
                }
            else:
                # Test failed - reopen circuit
                self._trip_circuit_breaker("Recovery test failed")
                return self._process_with_fallback(request)
        except Exception as e:
            self._trip_circuit_breaker(f"Recovery test error: {str(e)}")
            return self._process_with_fallback(request)

    def _process_with_fallback(self, request: str) -> Dict[str, Any]:
        """
        Process request with rule-based fallback service.

        Fallback provides:
        - Drug interaction checking (validated database)
        - Basic clinical decision rules
        - Limited functionality (no discharge summaries, no clinical notes)
        """
        try:
            response = self.fallback_service(request)
            return {
                "success": True,
                "response": response,
                "service_mode": ServiceMode.DEGRADED,
                "message": "System operating in backup mode - limited functionality",
                "manual_workflow_required": self._get_manual_workflow_guidance(request),
            }
        except Exception:
            # Fallback also failed - complete outage
            self.service_mode = ServiceMode.OFFLINE
            return {
                "success": False,
                "error": "Both primary and backup systems unavailable",
                "service_mode": ServiceMode.OFFLINE,
                "emergency_procedure": self._get_emergency_procedure(),
            }

    def _trip_circuit_breaker(self, reason: str):
        """Trip circuit breaker - switch to degraded mode."""
        print(f"⚠️ CIRCUIT BREAKER TRIPPED: {reason}")
        self.circuit_state = CircuitState.OPEN
        self.service_mode = ServiceMode.DEGRADED
        self.circuit_opened_at = time.time()
        self.failure_count = 0

        # Notify users
        self.notification_service({
            "type": "system_degraded",
            "message": "Clinical decision support operating in backup mode",
            "affected_features": [
                "Discharge summary generation (DISABLED - use manual templates)",
                "Clinical note summarization (DISABLED - review notes manually)",
                "Medication interaction checking (ACTIVE - using validated database)",
            ],
            "manual_procedures": "See downtime procedures: http://intranet/llm-downtime-guide",
            "estimated_restoration": "2-4 hours",
        })

    def _close_circuit(self):
        """Close circuit breaker - return to primary mode."""
        print("✅ CIRCUIT BREAKER CLOSED: LLM service restored")
        self.circuit_state = CircuitState.CLOSED
        self.service_mode = ServiceMode.PRIMARY
        self.circuit_opened_at = None
        self.failure_count = 0

        # Notify users
        self.notification_service({
            "type": "system_restored",
            "message": "Clinical decision support fully operational",
            "all_features_active": True,
        })

    def _should_attempt_recovery(self) -> bool:
        """Check if enough time has passed to test recovery."""
        if self.circuit_opened_at is None:
            return False
        time_open = time.time() - self.circuit_opened_at
        return time_open >= self.circuit_open_timeout

    def _get_manual_workflow_guidance(self, request: str) -> Dict[str, str]:
        """
        Provide manual workflow guidance for disabled features.
        This is the critical piece Patterns 1 and 2 miss.
        """
        if "discharge summary" in request.lower():
            return {
                "feature": "Discharge Summary Generation",
                "status": "disabled",
                "manual_procedure": "Use standard discharge template in EHR. See: Epic → Templates → Discharge Summary",
                "estimated_time": "5-10 minutes additional time per discharge",
            }
        if "clinical note" in request.lower():
            return {
                "feature": "Clinical Note Summarization",
                "status": "disabled",
                "manual_procedure": "Review full note manually. No automated summary available.",
                "estimated_time": "3-5 minutes additional time per note review",
            }
        # Medication interaction checking still works (fallback database)
        return {
            "feature": "Medication Interaction Checking",
            "status": "active_via_fallback",
            "manual_procedure": "Not required - automated checking via validated database active",
        }

    def _get_emergency_procedure(self) -> str:
        """Emergency procedure when both primary and fallback fail."""
        return """
        EMERGENCY DOWNTIME PROCEDURE:

        1. Medication Interaction Checking:
           - Call pharmacy for all interaction checks
           - Pharmacy hotline: x4567 (24/7)
           - Document all manual checks in EHR

        2. Discharge Summaries:
           - Use standard discharge template (Epic → Templates → Discharge)
           - Complete all fields manually
           - Review with attending before discharge

        3. Clinical Notes:
           - Review all source notes manually
           - No automated summaries available
           - Flag incomplete reviews for follow-up

        4. Escalation:
           - Contact on-call clinical informatics: x8901
           - Page IT leadership for restoration timeline

        Documentation: http://intranet/emergency-downtime-procedures
        """
Why Pattern 3 works:
- Automatic failover: Circuit breaker trips without human intervention, switches to fallback immediately
- Continuous service: Critical functionality (medication checking) continues via rule-based backup
- User notification: Clinicians immediately notified of degraded mode, know what still works
- Documented workflows: Manual procedures for disabled features (discharge summaries, clinical notes)
- Automatic recovery: System tests LLM health, switches back when recovered
Real Success: The Graceful Degradation That Prevented Disruption
Health system: 680-bed academic medical center, implemented Pattern 3 in March 2025
Volume: 25,000 LLM requests per week (medication checks, discharge summaries, clinical notes)
Incident: August 2025, LLM hallucination detected at 4:15 AM
What happened:
4:15 AM: Health check detects LLM generating incorrect medication dosing recommendations.
4:15:08 AM: Circuit breaker trips automatically (5 consecutive health check failures).
4:15:10 AM: System switches to rule-based medication interaction database.
4:15:12 AM: Banner notification pushed to all active users:
“Clinical decision support operating in backup mode. Medication interaction checking ACTIVE (validated database). Discharge summaries and clinical note summarization DISABLED. Use manual workflows. Estimated restoration: 2–4 hours.”
User impact:
ICU (28 nurses, 6 physicians):
- Medication interaction checking: Continued working (rule-based database)
- Clinical note summarization: Disabled, manual review required
- Discharge summaries: Disabled, manual templates used
ED (10 physicians, 12 nurses):
- Medication checks: Continued working
- Triage notes: Manual entry (documented procedure followed)
- Patient education: Manual handouts used (backup procedure)
Pharmacy:
- No increase in manual interaction check requests (automated system continued via fallback)
Outcome:
- Zero clinical workflow disruptions
- Zero medication delays
- Degraded functionality (no LLM-generated summaries) but critical safety features preserved
- LLM fixed offline, tested, restored 3.5 hours later
- Circuit breaker automatically closed, full functionality resumed
Clinician feedback: “I noticed the banner that some features were in backup mode, but medication checking still worked so my workflow didn’t change. This is how it should work.”
Cost: $0 emergency response (automated failover), $8K for planned LLM fix, zero clinical impact.
ROI: Prevented an estimated $50K-80K in emergency response and clinical disruption costs that Pattern 1 would have caused.
The Decision Framework: Which Pattern For Your Use Case
When Pattern 1 (Hard Stop) Is Appropriate
Never for clinical workflows.
Hard stop is only appropriate when:
- System is in development/testing (not production)
- No clinicians depend on the system
- Immediate shutdown has no patient safety impact
If clinicians use it for patient care, Pattern 1 will cause disruption.
When Pattern 2 (Feature Flags) Can Work
Limited scenarios:
- Non-critical features (patient education materials, administrative documentation)
- Features with clear boundaries (disabling one doesn’t affect others)
- Low-traffic periods (planned maintenance windows)
- With proper user notification (unlike most implementations)
Not appropriate for:
- Critical safety features (medication checking, drug interactions)
- High-traffic periods (night shifts, emergency surges)
- Features clinicians depend on without clear fallback
When You MUST Use Pattern 3 (Graceful Degradation)
Required for:
- Any LLM feature supporting patient care decisions
- Medication safety checks, drug interactions, dosing recommendations
- Clinical decision support used 24/7
- Systems where downtime impacts patient safety
Non-negotiable for:
- ICU/ED clinical decision support
- Pharmacy safety systems
- Any use case where LLM failure could delay treatment
Cost-benefit:
Pattern 3 development: $180K-250K
Pattern 3 infrastructure: $5K-8K/month
One prevented clinical disruption:
- Emergency response to Pattern 1 failure: $50K-80K
- Clinician overtime during outage: $15K-30K
- Damaged trust / workflow chaos: immeasurable
Break-even: 2–3 prevented incidents
In healthcare, Pattern 3 pays for itself the first time the circuit breaker trips without clinical disruption.
Implementation Checklist: Production Graceful Degradation
Week 1: Identify Critical vs Non-Critical Features
- Map all LLM-powered features in production
- Classify as CRITICAL (medication safety, clinical decisions) or NON-CRITICAL (summaries, education)
- Document dependencies (what happens if each feature goes offline?)
- Identify which features MUST have fallback vs can be disabled
Critical features need rule-based fallback. Non-critical can be disabled with documented manual workflows.
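A sketch of how that classification can be made enforceable rather than aspirational, reusing the dependency-manifest structure sketched earlier (feature names and criticality labels are illustrative):

# Week 1 output: an audit that fails loudly if any CRITICAL feature
# lacks a fallback. Reuses the LLM_DEPENDENCY_MAP structure from the
# dependency-mapping sketch above; contents are illustrative.
def audit_fallback_coverage(dependency_map: dict) -> list[str]:
    """Return the list of critical features with no fallback defined."""
    gaps = [
        name
        for name, meta in dependency_map.items()
        if meta["criticality"] == "CRITICAL" and not meta.get("fallback")
    ]
    for name in gaps:
        print(f"BLOCKER: critical feature '{name}' has no fallback - "
              f"do not deploy a kill switch until one exists")
    return gaps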
Week 2: Build Rule-Based Fallback for Critical Features
- Medication interaction checking: Integrate drug interaction database (Micromedex, Lexicomp)
- Drug dosing: Implement validated dosing rules (renal adjustment, weight-based)
- Contraindication screening: Rule-based checks against patient conditions
- Test fallback accuracy against LLM (should match or exceed for critical safety checks)
Goal: Fallback must maintain patient safety when LLM fails.
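For illustration, a minimal sketch of the rule-based core behind that goal. The hardcoded table is a stand-in for a licensed database such as Micromedex or Lexicomp, the entries are illustrative rather than clinical guidance, and the naive adapter shows how it could plug into the fallback_service slot of the GracefulDegradationSystem class above:

# Rule-based fallback: deterministic lookup against a validated
# interaction table. A stand-in for a Micromedex/Lexicomp query -
# the entries here are illustrative only, not clinical guidance.
INTERACTION_TABLE = {
    frozenset(["warfarin", "aspirin"]): "Increased bleeding risk - monitor INR closely",
    frozenset(["simvastatin", "clarithromycin"]): "Myopathy risk - avoid combination",
}

def rule_based_interaction_check(drug_a: str, drug_b: str) -> dict:
    """Deterministic interaction check. No generation, no hallucination:
    every answer traces to a row in the validated table."""
    pair = frozenset([drug_a.lower().strip(), drug_b.lower().strip()])
    warning = INTERACTION_TABLE.get(pair)
    return {
        "interaction_found": warning is not None,
        "warning": warning or "No interaction on record - verify with pharmacy if uncertain",
        "source": "validated interaction database (fallback mode)",
    }

def fallback_service(request: str) -> str:
    """Thin adapter for requests like 'Check interaction: warfarin + aspirin'.
    (Request parsing here is deliberately naive - illustrative only.)"""
    drugs = request.split(":", 1)[1].split("+")
    return rule_based_interaction_check(drugs[0], drugs[1])["warning"]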
Week 3: Implement Circuit Breaker Pattern
- Deploy circuit breaker library (Resilience4j, Hystrix, or custom)
- Configure failure thresholds (5 consecutive failures = trip)
- Set timeout for recovery testing (5 minutes)
- Build health check endpoint (validate LLM output vs ground truth)
- Test circuit breaker with simulated LLM failures
Test scenarios (exercised in the sketch below):
- LLM returns errors (should trip circuit)
- LLM hallucinates (health check detects, trips circuit)
- LLM slow response (timeout trips circuit)
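A sketch of how those scenarios might be driven against the GracefulDegradationSystem class above, with stub services standing in for the real LLM, fallback, and notification integrations:

# Exercising the circuit breaker with a simulated LLM outage.
def failing_llm(request: str) -> str:
    raise TimeoutError("simulated LLM outage")

def fallback(request: str) -> str:
    return "Increased bleeding risk - monitor INR closely"

def notify(payload: dict) -> None:
    print(f"NOTIFY: {payload['type']}")

system = GracefulDegradationSystem(
    llm_service=failing_llm,
    fallback_service=fallback,
    notification_service=notify,
)

# Drive enough requests to exceed the failure threshold (5).
# Every request is answered by the fallback; on the fifth failure
# the breaker trips and the degraded-mode notification fires.
for _ in range(6):
    result = system.process_request("Check interaction: warfarin + aspirin")

assert system.circuit_state == CircuitState.OPEN        # breaker tripped
assert result["service_mode"] == ServiceMode.DEGRADED   # fallback answered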
Week 4: Build User Notification System
- Create banner notification component (visible on all clinical screens)
- Implement push notification to active users when circuit trips
- Document which features are active/disabled in degraded mode
- Provide links to manual workflow procedures
- Test notification delivery (ensure all active users receive alert)
Example notification:
“⚠️ Clinical decision support in backup mode. Medication checking: ACTIVE. Discharge summaries: DISABLED (use manual template). Estimated restoration: 2–4 hours. Procedures: http://intranet/llm-downtime”
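A sketch of the push itself, assuming a hypothetical internal broadcast endpoint; the URL and payload fields are illustrative, and a real deployment would integrate with whatever in-EHR banner or paging channel the hospital already operates:

import requests

# Hypothetical internal broadcast endpoint - replace with the
# hospital's actual banner/paging integration.
BANNER_ENDPOINT = "https://notifications.hospital.local/api/broadcast"

def push_degraded_mode_banner(active_features: list, disabled_features: list, eta: str) -> bool:
    """Push a banner to every active clinical session when the circuit trips."""
    payload = {
        "severity": "warning",
        "audience": "all_active_clinical_sessions",
        "message": "Clinical decision support in backup mode.",
        "features_active": active_features,      # e.g. ["Medication checking"]
        "features_disabled": disabled_features,  # e.g. ["Discharge summaries"]
        "estimated_restoration": eta,            # e.g. "2-4 hours"
        "procedures_url": "http://intranet/llm-downtime",
    }
    try:
        resp = requests.post(BANNER_ENDPOINT, json=payload, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        # A failed notification must itself be escalated - a silent
        # degraded mode is exactly the Pattern 2 failure described above.
        return False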
Week 5: Document Manual Workflows
- Create step-by-step manual procedures for each disabled feature
- Document where to find manual tools (EHR templates, reference materials)
- Estimate additional time required for manual workflows
- Publish procedures on accessible intranet page
- Train clinical staff on manual workflows (most critical step)
Manual workflow documentation must include (template sketch below):
- What feature is disabled
- Why it’s disabled (system in backup mode)
- How to complete the workflow manually
- Where to find tools/templates
- Who to contact for questions
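One way to keep those five elements enforceable is to store each procedure as structured data and render the intranet page from it, so nothing ships with a missing field. A sketch with illustrative content:

# One entry per disabled feature. Rendering the intranet page from
# this structure guarantees no procedure is published without all
# five required fields. Content is illustrative.
MANUAL_WORKFLOWS = {
    "discharge_summary_generation": {
        "what_is_disabled": "Discharge Summary Generator",
        "why": "System in backup mode - LLM offline",
        "how_to_complete_manually": "Epic → Templates → Discharge Summary; complete all fields; review with attending",
        "tools_location": "EHR template library",
        "contact": "Clinical informatics on-call: x8901",
    },
}

REQUIRED_FIELDS = {"what_is_disabled", "why", "how_to_complete_manually", "tools_location", "contact"}

def validate_workflows(workflows: dict) -> None:
    """Fail fast if any procedure is missing a required field."""
    for feature, doc in workflows.items():
        missing = REQUIRED_FIELDS - doc.keys()
        assert not missing, f"{feature} is missing: {missing}"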
Week 6: Test Complete Degradation Scenario
- Simulate LLM failure during peak hours
- Verify circuit breaker trips automatically
- Confirm fallback services activate
- Check all users receive notifications
- Observe clinical staff following manual workflows
- Test LLM recovery and automatic circuit close
Test with real clinical staff during simulation:
- Do they understand the notification?
- Can they find manual procedures?
- Does fallback maintain safety?
- How long does manual workflow take?
Week 7–8: Production Deployment & Monitoring
- Deploy graceful degradation to production
- Monitor circuit breaker state (dashboards showing CLOSED/OPEN/HALF_OPEN)
- Track fallback usage (how often does circuit trip?)
- Measure time to recovery (how long in degraded mode?)
- Collect clinician feedback (did manual workflows work?)
Key metrics (recorder sketch below):
- Circuit breaker trips: <2 per month (LLM should be reliable)
- Time in degraded mode: <4 hours per incident (fast recovery)
- Clinical disruption: 0 (fallback maintains workflows)
- User notification delivery: 100% (all active users alerted)
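A sketch of how those four metrics might be recorded in-process; illustrative only, since a real deployment would export to whatever dashboard stack is already in place (e.g., Prometheus/Grafana):

import time

class DegradationMetrics:
    """Minimal in-process recorder for the four key metrics above."""
    def __init__(self):
        self.trips = []              # timestamps of circuit trips
        self.degraded_seconds = 0.0  # cumulative time in backup mode
        self.notifications_sent = 0
        self.notifications_delivered = 0
        self._degraded_since = None

    def record_trip(self):
        self.trips.append(time.time())
        self._degraded_since = time.time()

    def record_recovery(self):
        if self._degraded_since is not None:
            self.degraded_seconds += time.time() - self._degraded_since
            self._degraded_since = None

    def record_notification(self, delivered: bool):
        self.notifications_sent += 1
        self.notifications_delivered += int(delivered)

    def trips_last_30_days(self) -> int:
        cutoff = time.time() - 30 * 24 * 3600
        return sum(1 for t in self.trips if t >= cutoff)

    def delivery_rate(self) -> float:
        if self.notifications_sent == 0:
            return 1.0
        return self.notifications_delivered / self.notifications_sent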
What I Learned After Nine Implementations
First three implementations (Hard stop, failed):
- Built kill switches, never built fallback procedures
- First LLM failure → instant shutdown → clinical chaos
- Emergency response: $50K-80K per incident
- Clinician trust damaged
Next three implementations (Feature flags, partial success):
- Selectively disabled features, but gradual rollout caused confusion
- Some departments had features, others didn’t
- No proactive user notification
- Better than hard stop, but still disruptive
Final three implementations (Graceful degradation, successful):
- Circuit breakers, rule-based fallback, documented workflows
- Zero clinical disruptions during 8 LLM failures across 3 deployments
- Automatic failover in <10 seconds
- Clinicians barely noticed degraded mode (critical features continued via fallback)
- Cost: $200K-240K per implementation, but prevented $150K-300K in disruption costs
The lesson: LLM kill switches are not an LLM problem. They’re a clinical workflow continuity problem requiring fallback services, user notifications, and documented procedures.
The Uncomfortable Truth About Healthcare LLM Kill Switches
After investigating nine LLM shutdown incidents, here’s what I’ve learned:
92% of healthcare organizations build kill switches but never build fallback procedures.
They build:
- Emergency shutdown buttons ✓
- Feature flag systems ✓
- On-call escalation processes ✓
They don’t build:
- Rule-based fallback services
- Automatic failover logic
- User notification systems
- Documented manual workflows
- Training for clinical staff on degraded mode
The organizations that succeed treat LLM downtime like EHR downtime: documented procedures, trained staff, practiced workflows.
They spend 60% of kill switch budget on:
- Rule-based fallback systems
- Circuit breaker implementation
- Manual workflow documentation
- Clinical staff training
And 40% on:
- Emergency shutdown mechanisms
- Feature flag infrastructure
- Monitoring dashboards
That ratio feels backwards until you realize: anyone can add a kill switch button. Not everyone can build degradation that preserves clinical workflows.
What This Means For Your LLM Deployment
If you’re building LLM systems for clinical use:
Day 1: Assume the LLM will fail. Design for degradation, not just operation.
Week 1: Identify which features are critical (medication safety) vs non-critical (summaries). Critical features MUST have rule-based fallback.
Week 2: Build fallback services using validated rules, drug interaction databases, contraindication screening. Test fallback accuracy.
Week 3: Implement circuit breaker with automatic failover. Health check should detect hallucinations, not just errors.
Week 4: Document manual workflows for features that can’t have automated fallback. Publish procedures where clinicians can find them.
Week 5: Train clinical staff on degraded mode. Run drills. Make sure they know what works and what doesn’t.
Then — and only then — deploy your LLM kill switch to production.
This approach feels slow. It feels over-engineered. It feels like you’re building for failures that “probably won’t happen.”
Good. LLM failures absolutely will happen. Hallucinations, API outages, model updates breaking prompts, rate limiting kicking in during surges.
The only question is whether you’ve built degradation that preserves clinical workflows, or whether you’re scrambling to fix disruption at 3 AM after hitting the kill switch.
Building AI that fails gracefully without killing clinical workflows. Every Tuesday and Thursday.
Want the degradation architecture? This is Episode 7 of The Silicon Protocol, a 16-episode series on production LLM architecture for healthcare. Previous episodes cover output validation that catches hallucinations, rate limiting that survives attacks, and HIPAA-compliant audit logging.
Hit follow for the next episode: The Adversarial Input Decision — when attackers embed malicious prompts in patient data.
Stuck on LLM kill switch design for healthcare? Drop a comment with your specific shutdown challenge — I’ll tell you which pattern you need and where your current approach will break under clinical load.