Provide.ai

FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

/ May 5, 2026

arXiv:2604.19021v2 Announce Type: replace
Abstract: Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta …

cs.CV

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

/ May 5, 2026

arXiv:2605.01391v1 Announce Type: new
Abstract: Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to captur…

cs.CV

Act in Collusion: Distributed Multi-Target Backdoor Attacks in Federated Learning

/ May 5, 2026

arXiv:2411.03926v3 Announce Type: replace
Abstract: Federated learning (FL) is widely used in Internet-of-Things (IoT) systems, but its distributed training process also exposes it to backdoor attacks. Existing studies mainly consider single-target or…

cs.CV

Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

/ May 5, 2026

arXiv:2605.01450v1 Announce Type: new
Abstract: Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view im…

cs.AI, cs.RO

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

/ May 5, 2026

arXiv:2605.02037v1 Announce Type: cross
Abstract: We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system i…

cs.CV, cs.RO

VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

/ May 5, 2026

arXiv:2605.01365v1 Announce Type: cross
Abstract: Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with …

cs.AI, cs.CV

SRGAN-CKAN: Expressive Super-Resolution with Nonlinear Functional Operators under Minimal Resources

/ May 5, 2026

arXiv:2605.01459v1 Announce Type: cross
Abstract: Single-Image Super-Resolution (SISR) aims to reconstruct a High-Resolution (HR) image from a Low-Resolution (LR) observation, a fundamentally ill-posed problem where high-frequency details are severely…

cs.AI, cs.CL, cs.CV

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

/ May 5, 2026

arXiv:2604.28123v2 Announce Type: replace-cross
Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (R…

cs.CV, cs.HC, cs.LG

Multimodal Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

/ May 5, 2026

arXiv:2604.11730v3 Announce Type: replace-cross
Abstract: Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person …

cs.AI, cs.CV, cs.LG

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

/ May 5, 2026

arXiv:2506.09082v5 Announce Type: replace-cross
Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on b…