Text-to-image (T2I) generation has gained increasing attention because it provides flexible and intuitive synthetic data for downstream geoscience applications. However, obtaining large-scale, high-quality image-text datasets in remote sensing (RS) is challenging due to high annotation costs and the specialized domain knowledge required. In this paper, we propose a dual-loop data cleaning (DLDC) method that leverages contrastive multimodal quality evaluation to automatically generate high-quality RS image-text training data. By constructing an external generation loop (EGL) built on a multimodal foundation model and an internal evaluation loop (IEL) based on contrastive learning metrics, DLDC automatically performs layout description and image-text matching evaluation on remote sensing images. Our approach effectively filters out noisy samples and curates a refined dataset without manual intervention. Experimental results show that the dual-loop evaluation accurately determines the optimal data cleaning ratio for different scenes, improving image generation quality. Compared with state-of-the-art (SOTA) pre-trained models, our fine-tuned models reduce FID by over 35%, increase CLIP scores by more than 25%, and improve RemoteCLIP scores by over 10.5%. These results demonstrate that our automatically generated image-text data is of comparable quality to manually annotated data, opening new pathways for rapid, cost-effective, and reliable RS data generation.
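To make the internal evaluation loop concrete, the sketch below illustrates one plausible way an image-text matching filter could work: score candidate RS image-caption pairs with a contrastive model and keep only the best-matching fraction, mirroring the idea of a data cleaning ratio. This is a minimal illustration, not the paper's implementation; the model checkpoint, the `filter_pairs` helper, and the `keep_ratio` value are assumptions for demonstration.

```python
# Illustrative sketch of contrastive image-text filtering (not the paper's code).
# Assumes torch, transformers, and Pillow are installed; the model name and
# keep_ratio below are hypothetical choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_pairs(pairs, keep_ratio=0.7):
    """pairs: list of (image_path, caption). Returns the top keep_ratio fraction
    ranked by image-text cosine similarity."""
    scores = []
    for image_path, caption in pairs:
        inputs = processor(text=[caption], images=Image.open(image_path),
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        # Cosine similarity serves as the image-text matching score.
        scores.append(torch.nn.functional.cosine_similarity(img, txt).item())
    ranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return [pair for pair, _ in ranked[:keep]]
```

In practice, `keep_ratio` would correspond to the scene-dependent cleaning ratio selected by the dual-loop evaluation rather than a fixed constant.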