AUTHOREA
Yakoub Bazi
Professor of Computer Engineering
Saudi Arabia
Public Documents: 2
RAGCap: Retrieval-Augmented Generation for Style-Aware Remote Sensing Image Captioning
Yakoub Bazi
and 2 more
March 08, 2025
Remote Sensing (RS) image captioning has traditionally relied on specialized models tailored to domain-specific tasks. The emergence of large vision-language models (VLMs) offers a promising alternative due to their versatility across tasks and domains. However, fine-tuning VLMs for specific applications introduces significant challenges, including computational overhead, overfitting risks, and reduced generalization capabilities. To address these limitations, we propose RAGCap, a Retrieval-Augmented Generation framework that leverages pre-trained VLMs for RS captioning without the need for fine-tuning. Our approach employs a similarity-based retrieval model (SigLIP) to identify relevant image-caption pairs from the training set. These retrieved examples, along with the test image, are processed by a multi-image capable VLM (Qwen2VL) using a carefully designed prompt structure. This enables the model to generate captions that not only accurately describe the test image but also preserve the domain-specific style. Extensive evaluations on three RS benchmark datasets demonstrate that RAGCap achieves competitive performance compared to fine-tuned models while offering enhanced efficiency and generalization. Our framework provides a practical and scalable solution, maintaining the versatility of VLMs while effectively adapting to domain-specific requirements. Code will be available at: https://github.com/BigData-KSU/RAGCap
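The abstract describes a two-stage pipeline: SigLIP-based retrieval of similar image-caption pairs, followed by in-context captioning with a multi-image VLM (Qwen2VL). The sketch below illustrates that idea under stated assumptions; the specific checkpoints (google/siglip-base-patch16-224, Qwen/Qwen2-VL-7B-Instruct), the number of retrieved examples, and the prompt wording are illustrative choices, not the configuration released with the paper.

```python
# Minimal sketch of a retrieval-augmented captioning pipeline in the spirit of RAGCap.
# Checkpoints, k, and the prompt are assumptions, not the authors' exact setup.
import torch
from transformers import AutoProcessor, AutoModel, Qwen2VLForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Step 1: embed images with SigLIP and retrieve similar training examples ---
siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224").to(device)
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

@torch.no_grad()
def embed(images):
    inputs = siglip_proc(images=images, return_tensors="pt").to(device)
    feats = siglip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(test_image, train_images, train_captions, k=3):
    # Cosine similarity between the test image and every training image.
    # In practice the training embeddings would be precomputed and cached.
    sims = embed([test_image]) @ embed(train_images).T        # shape (1, N)
    top = sims.squeeze(0).topk(k).indices.tolist()
    return [(train_images[i], train_captions[i]) for i in top]

# --- Step 2: prompt a multi-image VLM (Qwen2-VL) with the retrieved pairs ---
vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto")
vlm_proc = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def caption(test_image, examples):
    content, images = [], []
    for img, cap in examples:                     # in-context image-caption pairs
        content += [{"type": "image"}, {"type": "text", "text": f"Caption: {cap}"}]
        images.append(img)
    content += [{"type": "image"},
                {"type": "text", "text": "Write a caption for this image in the same style."}]
    images.append(test_image)
    text = vlm_proc.apply_chat_template(
        [{"role": "user", "content": content}],
        tokenize=False, add_generation_prompt=True)
    inputs = vlm_proc(text=[text], images=images, return_tensors="pt").to(vlm.device)
    out = vlm.generate(**inputs, max_new_tokens=64)
    return vlm_proc.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0]
```

Because the VLM only sees the retrieved pairs at inference time, no weights are updated, which is what keeps the approach free of fine-tuning overhead.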
LoRA-CLIP: Efficient Low-Rank Adaptation of Large CLIP Foundation Model for Scene Classification
Mohamad Mahmoud Al Rahhal
and 2 more
February 12, 2025
Scene classification in remote sensing (RS) imagery has been extensively investigated using both learning-from-scratch approaches and fine-tuning of ImageNet pre-trained models. Meanwhile, CLIP (Contrastive Language-Image Pretraining) has emerged as a powerful foundation model for vision-language tasks, demonstrating remarkable zero-shot capabilities across various domains. Its image encoder is a key component in many vision instruction-tuning models, enabling effective alignment of text and visual modalities for diverse tasks. However, its potential for supervised RS scene classification remains unexplored. This work investigates the efficient adaptation of large CLIP models (containing over 300M parameters) through Low-Rank Adaptation (LoRA), specifically targeting the attention layers. By applying LoRA to CLIP's attention mechanisms, we can effectively adapt the vision model for specialized scene classification tasks with minimal computational overhead, requiring fewer training epochs than traditional fine-tuning methods. Our extensive experiments demonstrate the promising capabilities of LoRA-CLIP. By training only a small set of additional parameters, LoRA-CLIP outperforms models pre-trained on ImageNet, demonstrating the clear advantages of using image-text pretrained backbones for scene classification.
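As a rough illustration of the adaptation the abstract describes, the sketch below injects LoRA adapters into the attention projections of a large CLIP vision encoder (via the peft library) and attaches a linear classification head. The checkpoint, LoRA rank, target module names, and class count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of LoRA adaptation of a CLIP image encoder for scene classification,
# in the spirit of LoRA-CLIP. Hyperparameters below are assumptions for illustration.
import torch.nn as nn
from transformers import CLIPVisionModel, AutoProcessor
from peft import LoraConfig, get_peft_model

# Large CLIP vision backbone (~300M parameters); standard CLIP preprocessing applies.
backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
hidden_size = backbone.config.hidden_size          # 1024 for ViT-L/14

# Inject low-rank adapters into the attention projections of every transformer block;
# all original weights stay frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                      target_modules=["q_proj", "v_proj"])
backbone = get_peft_model(backbone, lora_cfg)
backbone.print_trainable_parameters()              # a small fraction of total weights

class LoRACLIPClassifier(nn.Module):
    """LoRA-adapted CLIP vision encoder followed by a linear classification head."""
    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, pixel_values):
        pooled = self.encoder(pixel_values=pixel_values).pooler_output
        return self.head(pooled)

model = LoRACLIPClassifier(backbone, hidden_size, num_classes=45)  # e.g. 45 RS scene classes

# A standard supervised training loop over (image, label) batches would follow,
# optimizing only the LoRA adapters and the linear head.
```

Only the adapter matrices and the head receive gradients, which is what keeps the parameter count and epoch budget small relative to full fine-tuning.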