ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning

arXiv:2210.05513v2

Abstract: We introduce ViFiCon, a self-supervised contrastive scheme that learns a cross-modal association between vision and wireless modalities. Specifically, the system uses pedestrian data collected from RGB-D camera footage and WiFi Fine Time Measurements (FTM) from a user's smartphone. Depth data from RGB-D (vision domain) is inherently linked to an observable pedestrian, but FTM data (wireless domain) is associated only with a smartphone on the network. We represent temporal sequences from both the vision and wireless domains by stacking multi-person depth data sequences within an image representation. This simplicity allows scene-wide processing with fewer vision and wireless features, alleviating the privacy and energy costs associated with transmitting IMU data. To facilitate self-supervised learning, we design a scene-wide synchronization pretext task for our network and then employ the learned representation for the downstream multimodal association task. We show that, compared to fully supervised state-of-the-art models, ViFiCon achieves 92.63% vision-to-wireless association accuracy over a 25-frame (2.5 s) sliding window, determining which bounding box corresponds to which smartphone device, without hand-labeled association examples for training. Extensive experimental results demonstrate ViFiCon's applicability in real-world systems where wireless data annotations are scarce.
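To make the cross-modal contrastive idea concrete, below is a minimal sketch, not the authors' code: two per-modality encoders map a 25-frame window of depth features and FTM features to a shared embedding space, a symmetric InfoNCE-style loss pulls matched (vision, wireless) windows together, and the downstream association picks, for each bounding-box track, the most similar phone embedding. The encoder architecture, feature dimensions, and loss temperature are illustrative assumptions; only the 25-frame (2.5 s) window and the vision-to-wireless matching objective come from the abstract.

```python
# Minimal sketch (not the ViFiCon implementation) of cross-modal
# contrastive association; architectures and dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

WINDOW = 25  # 25-frame (2.5 s) sliding window, as in the abstract


class SeqEncoder(nn.Module):
    """Encodes a stacked temporal sequence from one modality into an embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, WINDOW, in_dim) -> (B, WINDOW*in_dim)
            nn.Linear(in_dim * WINDOW, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


def info_nce(z_vis: torch.Tensor, z_wifi: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: matched (vision, wireless) pairs in the batch are
    positives; all other cross-modal pairs serve as negatives."""
    logits = z_vis @ z_wifi.t() / tau             # (B, B) similarity matrix
    targets = torch.arange(z_vis.size(0))         # i-th row matches i-th column
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage: 8 matched windows of depth features (dim 4) and FTM features (dim 2).
vis_enc, wifi_enc = SeqEncoder(in_dim=4), SeqEncoder(in_dim=2)
depth = torch.randn(8, WINDOW, 4)
ftm = torch.randn(8, WINDOW, 2)
loss = info_nce(vis_enc(depth), wifi_enc(ftm))
loss.backward()

# Downstream association: for each bounding-box track, pick the phone whose
# embedding is most similar over the current window.
with torch.no_grad():
    sim = vis_enc(depth) @ wifi_enc(ftm).t()
    matches = sim.argmax(dim=1)  # box i -> phone matches[i]
```

In this sketch the pretext signal comes purely from temporal co-occurrence (which windows were recorded together), so no hand-labeled box-to-phone pairs are needed, mirroring the self-supervised setup the abstract describes.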
