Recent research has increasingly focused on advancing multimodal models beyond their unimodal counterparts to ensure robustness in real-world scenarios. Remaining effective under various types of noise requires resilience to distribution shifts, which is commonly pursued through data augmentation. However, MixGen, a widely used vision-language augmentation technique, suffers a notable performance decline under real-world conditions with perturbed data. Quantitative and qualitative analyses on a retrieval task, using ground-truth ranking and Grad-CAM, reveal that this decline arises because models trained with MixGen-augmented data rely on spurious correlations. In response, we propose RobustMixGen, a novel data augmentation method that considers both image and text content to address this challenge. To strengthen modality-specific handling, we introduce a procedure that categorizes object and background classes in advance. By employing CutMixUp for image synthesis and Conjunction Concat for text synthesis, the technique mitigates spurious correlations. The effectiveness of RobustMixGen is demonstrated on a retrieval task, where it improves Recall@K Mean by 0.21% over existing models. Moreover, under perturbed data in a distribution-shift scenario, it shows gains of 17.11% on image perturbations and 2.77% on text perturbations as measured by MMI, establishing it as a more robust data augmentation technique.
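As a concrete illustration of the two synthesis steps, the Python sketch below shows one plausible realization: a CutMixUp operation that blends two images globally (MixUp) and then pastes a cut region of the second image into the result (CutMix), and a Conjunction Concat operation that joins two captions with a connective. The function names, the fixed mixing coefficient, and the connective word "and" are illustrative assumptions rather than the paper's exact implementation, and the advance object/background categorization used to pair samples is omitted here.

```python
import numpy as np

def cutmixup(img_a, img_b, mix_lam=0.5, cut_ratio=0.5, rng=None):
    """Blend two same-shaped HxWxC float images globally (MixUp), then paste
    a random rectangle from img_b into the blend (CutMix). Assumed sketch,
    not the paper's reference code."""
    rng = rng or np.random.default_rng()
    mixed = mix_lam * img_a + (1.0 - mix_lam) * img_b  # global MixUp blend
    h, w = img_a.shape[:2]
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    top = int(rng.integers(0, h - cut_h + 1))
    left = int(rng.integers(0, w - cut_w + 1))
    # CutMix step: overwrite the rectangle with unmixed pixels from img_b
    mixed[top:top + cut_h, left:left + cut_w] = \
        img_b[top:top + cut_h, left:left + cut_w]
    return mixed

def conjunction_concat(text_a, text_b, conjunction="and"):
    """Join two captions with a conjunction instead of raw concatenation;
    the specific connective word is an assumption."""
    return f"{text_a.rstrip('. ')} {conjunction} {text_b.rstrip('. ')}"

# Example: synthesize one augmented image-text pair from two originals.
img_a = np.random.rand(224, 224, 3)
img_b = np.random.rand(224, 224, 3)
new_img = cutmixup(img_a, img_b)
new_txt = conjunction_concat("a dog runs on the grass", "a red car is parked")
```

Under this reading, the advance object/background categorization would decide which two samples to pair, so that the mixed image and conjoined caption combine distinct object and background content rather than reinforcing the object-background co-occurrences identified as spurious.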