In post-disaster damage assessment, Visual Question Answering (VQA) systems are essential for identifying the severity and scope of damage. However, counting-related tasks, such as determining the number of vehicles or flooded buildings, remain a significant challenge for current deep learning models. To address this issue, we propose DeVANet (DeBERTa Vision Attention Network), a novel architecture aimed at enhancing counting accuracy in VQA for post-disaster scenarios. We leverage DeBERTa for language modeling and introduce an innovative image embedding module in which local-global attention guides Vision Mamba features to precisely extract both small- and large-object features. Our fusion mechanism applies self-attention to both text and image features, followed by bi-directional cross-attention and co-attention to strengthen multimodal integration. We tackle VQA as both a classification and a regression problem by employing separate MLPs for each task: one handling discrete class predictions and the other generating continuous values for counting tasks. A joint loss function, combining weighted cross-entropy and negative binomial loss, ensures optimized performance across both tasks. Extensive experiments on the FloodNet and RescueNet datasets demonstrate that DeVANet achieves significant improvements in counting accuracy and overall VQA performance compared to state-of-the-art methods, and detailed ablation studies validate the effectiveness of each component of the architecture.
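To make the joint objective more concrete, the following is a minimal PyTorch sketch rather than the exact implementation from the paper: it assumes a mean/dispersion parameterization of the negative binomial count head and a hypothetical balancing weight `lambda_count`; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class JointVQALoss(nn.Module):
    """Illustrative joint loss: weighted cross-entropy for the classification
    head plus a negative binomial NLL for the counting head (hypothetical names)."""

    def __init__(self, class_weights, lambda_count=1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(weight=class_weights)  # weighted cross-entropy term
        self.lambda_count = lambda_count                      # assumed balancing weight

    @staticmethod
    def nb_nll(mu, theta, y, eps=1e-8):
        # Negative binomial negative log-likelihood with mean `mu` and
        # dispersion `theta` (both assumed positive, e.g. via a softplus).
        log_p = (torch.lgamma(y + theta) - torch.lgamma(theta) - torch.lgamma(y + 1)
                 + theta * torch.log(theta / (theta + mu) + eps)
                 + y * torch.log(mu / (theta + mu) + eps))
        return -log_p.mean()

    def forward(self, class_logits, count_mu, count_theta, class_targets, count_targets):
        loss_cls = self.ce(class_logits, class_targets)
        loss_count = self.nb_nll(count_mu, count_theta, count_targets.float())
        return loss_cls + self.lambda_count * loss_count
```

In this sketch, `class_logits` would come from the classification MLP and `(count_mu, count_theta)` from the counting MLP; how the two loss terms are weighted against each other is a tunable choice that the abstract does not specify.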
In the field of remote sensing, semantic segmentation of Unmanned Aerial Vehicle (UAV) imagery is crucial for tasks such as land resource management, urban planning, precision agriculture, and economic assessment. Traditional methods use Convolutional Neural Networks (CNNs) for hierarchical feature extraction but are limited by their local receptive fields, which restrict comprehensive contextual understanding. To overcome these limitations, we propose a combination of transformer and attention mechanisms to improve object classification, leveraging their superior information modeling capabilities to enhance scene understanding. In this paper, we present SwinFAN (Swin-based Focal Axial attention Network), a U-Net framework featuring a Swin transformer as the encoder and a novel decoder that introduces two new components for enhanced semantic segmentation of urban remote sensing images. The first proposed component is a Guided Focal-Axial (GFA) attention module, which combines local and global contextual information, enhancing the model's ability to discern intricate details and complex structures. The second component is an innovative Attention-based Feature Refinement Head (AFRH) designed to improve the precision and clarity of segmentation outputs through self-attention and convolutional techniques. Comprehensive experiments demonstrate that our proposed architecture significantly outperforms state-of-the-art models in accuracy. More specifically, our method achieves mIoU (mean Intersection over Union) improvements of 1.9% on UAVid, 3.6% on Potsdam, 1.9% on Vaihingen, and 0.8% on LoveDA.
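As an illustration of how local and global context can be fused in such a decoder block, the sketch below combines a depthwise-convolutional local branch with axial self-attention over rows and columns. This is a plausible reading of a focal-axial design rather than the paper's exact GFA module; the class name, branch layout, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class GuidedFocalAxialAttention(nn.Module):
    """Illustrative local-global block: a convolutional (local) branch fused
    with axial self-attention along image rows and columns (global branch)."""

    def __init__(self, dim, heads=4):  # `dim` must be divisible by `heads`
        super().__init__()
        self.local = nn.Sequential(    # local branch: depthwise convolution
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)  # fuse the two branches

    def forward(self, x):              # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)

        # Global branch, axis 1: attend along each row (sequences of W tokens).
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        rows = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)

        # Global branch, axis 2: attend along each column (sequences of H tokens).
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        cols = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)

        # Residual fusion of local and global context.
        return x + self.proj(torch.cat([local, rows + cols], dim=1))
```

Axial attention keeps the global branch's cost linear in each spatial dimension, which is why it pairs naturally with a window-based Swin encoder; any guiding or focal weighting specific to the paper's GFA module would sit on top of a block like this.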