Anurita Das - Provide.ai

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

Anurita Das / April 27, 2026

arXiv:2604.21026v2
Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator…
