cs.PF - Provide.ai

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

Jędrzej Maczan / April 6, 2026

arXiv:2604.02344v1 Announce Type: new
Abstract: WebGPU’s security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. …

cs.AR, cs.LG, cs.PF

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang, Murali Annavaram / April 6, 2026

arXiv:2604.02556v1 Announce Type: new
Abstract: Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enable…

cs.AI, cs.DC, cs.LG, cs.PF

Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing

Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Abdulaziz Tabbakh / April 3, 2026

arXiv:2603.20920v2 Announce Type: replace-cross
Abstract: The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concer…

cs.AI, cs.DC, cs.LG, cs.PF, cs.SE

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Tara Saba, Anne Ouyang, Xujie Si, Fan Long / April 3, 2026

arXiv:2604.01489v1 Announce Type: new
Abstract: High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algor…

cs.AI, cs.DC, cs.LG, cs.PF

Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization

Heet Nagoriya, Komal Rohit / April 3, 2026

arXiv:2604.02131v1 Announce Type: cross
Abstract: Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory…

cs.DC, cs.LG, cs.PF

A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems

Beste Oztop, Dhruva Kulkarni, Zhengji Zhao, Ayse Kivilcim Coskun, Kadidia Konate / April 3, 2026

arXiv:2604.02158v1 Announce Type: cross
Abstract: Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utili…

cs.LG, cs.PF, cs.SE

Risk-Aware Batch Testing for Performance Regression Detection

Ali Sayedsalehi, Peter C. Rigby, Gregory Mierzwinski / April 2, 2026

arXiv:2604.00222v1 Announce Type: cross
Abstract: Performance regression testing is essential in large-scale continuous-integration (CI) systems, yet executing full performance suites for every commit is prohibitively expensive. Prior work on performa…

cs.AI, cs.DC, cs.PF

When AI Bends Metal: AI-Assisted Optimization of Design Parameters in Sheet Metal Forming

/ April 1, 2026

arXiv:2511.22302v2 Announce Type: replace
Abstract: Numerical simulations have revolutionized the industrial design process by reducing prototyping costs, design iterations, and enabling product engineers to explore the design space more efficiently. …

cs.AI, cs.PF

Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs

Yi Liu / April 1, 2026

arXiv:2603.28823v1 Announce Type: cross
Abstract: Scaling laws relate model quality to compute budget (FLOPs), but practitioners face wall-clock time constraints, not compute budgets. We study optimal model sizing under fixed time budgets from 5 minut…

cs.AI, cs.LG, cs.PF

Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations

Mayank Jha / March 31, 2026

arXiv:2603.26823v1 Announce Type: new
Abstract: The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimi…