The rapid growth of generative artificial intelligence has led to an explosion of multimodal data, intensifying the demand for adaptive cross-modal retrieval systems. However, persistent semantic and distributional gaps across modalities continue to hinder effective alignment. To alleviate these challenges without incurring prohibitive annotation costs, unsupervised cross-modal retrieval (UCMR) has emerged as an attractive alternative; nevertheless, it often struggles to establish reliable semantic correspondences due to the absence of supervision. To address this limitation, we propose MiRA (Multi-granularity Consensus Representation Alignment), an unsupervised framework that enhances semantic consistency and representation robustness via hierarchical, noise-aware learning. MiRA progressively refines multi-granular features and aligns cross-modal representations through consensus-driven optimization. Specifically, it comprises two key components: (1) Progressive Agreement Clustering (PAC), which improves pseudo-label reliability by enforcing cross-layer consistency and reducing uncertainty from single-granularity representations; and (2) Cross-modal Robust Association (CRA), which leverages the refined pseudo-labels to guide representation learning under a hybrid contrastive-consistency objective, promoting discriminative alignment while suppressing noisy associations. Extensive experiments on four benchmark datasets demonstrate that MiRA consistently surpasses nine state-of-the-art UCMR methods, confirming its effectiveness in capturing hierarchical semantics and achieving robust cross-modal alignment without supervision.
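The cross-layer agreement idea behind PAC can be illustrated with a minimal sketch. All names below are hypothetical (the paper does not specify this procedure in detail): we assume two cluster assignments per sample, e.g. from a coarse-granularity and a fine-granularity representation, and keep a pseudo-label only when the two assignments agree after aligning cluster ids by majority overlap.

```python
# Hypothetical agreement filter in the spirit of PAC; names and logic are
# illustrative assumptions, not the paper's actual implementation.
from collections import Counter

def consensus_pseudo_labels(labels_coarse, labels_fine):
    """Keep a sample's pseudo-label only when its two single-granularity
    cluster assignments agree after mapping fine cluster ids to the coarse
    id they overlap most. Returns {sample_index: pseudo_label}; samples
    whose assignments disagree are dropped as unreliable."""
    # Count, for each fine cluster, how often it co-occurs with each coarse id.
    overlap = {}
    for c, f in zip(labels_coarse, labels_fine):
        overlap.setdefault(f, Counter())[c] += 1
    # Map each fine cluster to its majority-overlap coarse cluster.
    fine_to_coarse = {f: cnt.most_common(1)[0][0] for f, cnt in overlap.items()}
    return {i: c for i, (c, f) in enumerate(zip(labels_coarse, labels_fine))
            if fine_to_coarse[f] == c}

# Toy example: fine cluster 5 mostly co-occurs with coarse 0, cluster 7 with 1.
labels_coarse = [0, 0, 1, 1, 2]
labels_fine   = [5, 5, 7, 7, 5]
print(consensus_pseudo_labels(labels_coarse, labels_fine))
# → {0: 0, 1: 0, 2: 1, 3: 1}  (sample 4 is dropped: its fine cluster maps to 0, not 2)
```

Only the retained samples would then feed CRA's contrastive-consistency objective, which is how the consensus step suppresses noisy associations.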