Andre Williams et al.

Cross-modal place recognition, specifically Image-to-PointCloud (I2P) localization, is fundamental to robust self-localization and navigation in autonomous systems. It faces significant challenges, however, including the inherent semantic gap between modalities, severe environmental variations, viewpoint differences, and stringent real-time computational demands. This paper introduces CrossLoc, a novel attention-enhanced framework designed for efficient and robust I2P place recognition. Our method begins with comprehensive data preprocessing, including field-of-view (FoV) alignment and the generation of high-quality dense depth maps from sparse LiDAR point clouds. A dual-stream feature encoder, built on lightweight, partially weight-shared EfficientNet-B0 variants, extracts local features from both RGB images and dense depth maps. A core contribution is our Transformer-based Cross-Modal Attention Fusion Module, which dynamically integrates visual and geometric information by letting RGB features query geometric context, yielding highly discriminative fused representations. These fused features are then aggregated into compact global descriptors by an Adaptive Generalized Mean (GeM) Pooling layer. Trained end-to-end with a Triplet Loss on the KITTI dataset and validated on the diverse HAOMO dataset, CrossLoc achieves leading accuracy and remarkable runtime efficiency, significantly outperforming prior methods. Ablation studies confirm the contributions of the attention fusion and adaptive pooling mechanisms, while detailed analyses demonstrate superior feature discriminability and robustness to challenging environmental conditions. The combination of high accuracy, robustness, and real-time capability positions CrossLoc as a practical solution for real-world autonomous applications.
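To make the pipeline concrete, the sketch below illustrates the two components the abstract highlights: a cross-modal attention step in which RGB feature tokens query depth-derived geometric tokens, and a GeM pooling layer with a learnable exponent that condenses the fused tokens into a global descriptor. This is a minimal PyTorch sketch under assumed shapes and names; `CrossModalFusion`, `GeMPooling`, the 256-dimensional token size, and the 0.3 triplet margin are illustrative placeholders, not the paper's implementation.

```python
# Minimal sketch of cross-modal attention fusion and GeM pooling.
# Class names and hyperparameters are illustrative assumptions,
# not CrossLoc's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """RGB tokens attend to depth (geometric) tokens via cross-attention."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, depth_tokens):
        # rgb_tokens, depth_tokens: (B, N, dim) flattened feature maps.
        # RGB features act as queries; depth features supply keys and
        # values, so visual cues retrieve geometric context.
        fused, _ = self.attn(query=rgb_tokens, key=depth_tokens,
                             value=depth_tokens)
        return self.norm(rgb_tokens + fused)  # residual connection


class GeMPooling(nn.Module):
    """Generalized Mean pooling with learnable exponent p.

    p = 1 recovers average pooling; p -> infinity approaches max pooling,
    which is what makes the pooling 'adaptive' when p is trained.
    """

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, tokens):
        # tokens: (B, N, dim) -> global descriptor: (B, dim)
        clamped = tokens.clamp(min=self.eps)  # keep pow well-defined
        return clamped.pow(self.p).mean(dim=1).pow(1.0 / self.p)
```

A hypothetical end-to-end training step then reduces to computing descriptors for an anchor, a positive, and a negative sample and applying a standard triplet margin loss. Random tensors stand in for the token outputs of the dual-stream EfficientNet-B0 encoder here:

```python
fusion, gem = CrossModalFusion(), GeMPooling()
loss_fn = nn.TripletMarginLoss(margin=0.3)  # margin value is illustrative

def descriptor(rgb, depth):
    # Fuse the two modalities, pool to a global vector, L2-normalize.
    return F.normalize(gem(fusion(rgb, depth)), dim=-1)

B, N, D = 4, 196, 256  # assumed batch, token count, and feature size
anc = descriptor(torch.randn(B, N, D), torch.randn(B, N, D))
pos = descriptor(torch.randn(B, N, D), torch.randn(B, N, D))
neg = descriptor(torch.randn(B, N, D), torch.randn(B, N, D))

loss = loss_fn(anc, pos, neg)
loss.backward()  # gradients flow through attention, pooling, and p
```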