Multimodal Graph Network Modeling for Human-Object Interaction Detection with PDE Graph Diffusion
arXiv:2509.12554v3 Announce Type: replace
Abstract: Existing GNN-based Human-Object Interaction (HOI) detection methods rely on simple MLPs to fuse instance features and propagate information. However, this mechanism is largely empirical and lacks a targeted information propagation process. To address this problem, we propose Multimodal Graph Network Modeling (MGNM) for HOI detection with Partial Differential Equation (PDE) graph diffusion. Specifically, we first design a multimodal graph network framework that explicitly models the HOI detection task within a four-stage graph structure. Next, we propose a novel PDE diffusion mechanism to facilitate information propagation within this graph. This mechanism leverages multimodal features to propagate information via a white-box PDE diffusion equation. Furthermore, we design a variational information squeezing (VIS) mechanism to refine the multimodal features extracted from CLIP, thereby mitigating the impact of noise inherent in pretrained Vision-Language Models. Extensive experiments demonstrate that MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method yields significant performance gains while maintaining an effective balance between rare and non-rare categories.
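To make the idea of PDE-based propagation on a graph concrete, the sketch below integrates the standard graph heat equation dX/dt = -LX with explicit Euler steps, where L = D - A is the combinatorial Laplacian. This is a generic textbook illustration and not the MGNM diffusion equation, which the abstract does not specify. The function names, the toy three-node graph (one human node linked to two object nodes), the feature values, and the step size are all assumptions chosen for illustration.

```python
import numpy as np


def graph_laplacian(adj):
    """Combinatorial Laplacian L = D - A of a symmetric adjacency matrix."""
    degree = np.diag(adj.sum(axis=1))
    return degree - adj


def diffuse(features, adj, step=0.1, n_steps=10):
    """Explicit Euler integration of the graph heat equation dX/dt = -L X.

    Each step moves every node's features toward those of its neighbours.
    The step size must stay below 2 / (largest eigenvalue of L) for stability.
    """
    lap = graph_laplacian(adj)
    x = np.asarray(features, dtype=float).copy()
    for _ in range(n_steps):
        x = x - step * (lap @ x)
    return x


# Hypothetical toy graph: node 0 (human) connected to nodes 1 and 2 (objects).
adj = np.array(
    [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
)

# Hypothetical 2-dimensional node features.
features = np.array(
    [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
)

smoothed = diffuse(features, adj)
print(smoothed)
```

Because the Laplacian of an undirected graph has zero column sums, this update conserves the total of each feature channel while shrinking differences between connected nodes. That explicit, inspectable update rule is what distinguishes a diffusion equation from an opaque MLP-based message function.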