GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference
arXiv:2603.28708v1 Announce Type: new
Abstract: This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameter…