MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
arXiv:2604.21026v2
Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator…