Today, remote sensing (RS) offers Earth observation data with diverse temporal, spatial, and spectral characteristics by leveraging different kinds of sensors, forming a multimodal data framework. Shifting the perspective from orbit to ground, text, points of interest (POIs), street-view images, and other geospatial data provide information that is related to but distinct from RS images. To enrich information capacity and build a comprehensive understanding, multimodal learning in the RS field attempts to process and apply data of various modalities simultaneously. However, supervised multimodal approaches often require expensive human annotation, which prevents the data from being fully exploited. To alleviate this issue, self-supervised learning (SSL) has become an attractive way to learn from unlabeled data, extracting meaningful representations through effective pretext learning objectives. Its label-free training, feature-extraction capability, and task-agnostic nature allow SSL to scale up data and model size easily, paving the way for RS multimodal foundation models (FMs). In this survey, we systematically review the evolving field of RS multimodal SSL. In terms of data modalities, this review covers not only research utilizing multimodal RS images but also studies that integrate RS images with other forms of geospatial data, providing a comprehensive overview of data integration scenarios. At the methodology level, multimodal SSL requires the synergy of a learning objective and a data fusion module. To provide a systematic framework for understanding the trends and challenges of RS multimodal SSL approaches, we present a structured taxonomy of methods organized by multimodal SSL objective and data fusion strategy. For each type of method, we summarize its characteristics, key elements, and common scenarios. 
As for application areas, we categorize them into four classes: image processing, image understanding, vision-language understanding, and socioeconomic prediction. In addition, we provide a systematic review of RS multimodal FMs based on SSL. Finally, we discuss challenges and future directions of RS multimodal SSL. We hope that this review will serve as a starting point for researchers to examine recent advances and explore RS multimodal SSL.