Predicting patient response to cancer immunotherapy remains a critical challenge due to the complex and heterogeneous nature of the tumor microenvironment. Existing single-modal analyses often fail to capture the intricate, complementary signals across data types that influence treatment outcomes. To address this, we propose MAFNet, a novel Multi-modal Attention Fusion Network designed to integrate diverse patient-derived data, including pathological images and genomic and transcriptomic profiles. MAFNet incorporates a Hierarchical Attention Fusion Module (HAFM) for tailored per-modality feature encoding, a Transformer-based Cross-Modal Interaction Learning (CMIL) component to model inter-modal dependencies, and a Multi-task Self-supervised Pre-training strategy for robust representation learning. Evaluated on the TCGA Lung Adenocarcinoma and Melanoma cohorts and validated on an external GEO dataset, MAFNet achieved superior immunotherapy response prediction, significantly outperforming single-modal deep learning models and simple concatenation-based fusion methods. It further demonstrated strong generalizability, high interpretability through attention visualizations, and effective overall survival prediction. An ablation study confirmed the critical contribution of each proposed component, and a human evaluation highlighted the model's clinical plausibility and utility. Although MAFNet is computationally more intensive than these baselines, its enhanced predictive accuracy, generalizability, and interpretability position it as a powerful decision-support tool for advancing personalized immunotherapy.
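
To make the fusion idea concrete, the sketch below illustrates one plausible way a Transformer-based cross-modal interaction component could combine three modality embeddings into a single response logit. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CrossModalFusion`, the dimensions (`d_model=256`, `n_heads=4`, two encoder layers), the learnable per-modality tokens, and the mean-pooling readout are all hypothetical choices for exposition.

```python
# Minimal sketch (not the authors' implementation) of Transformer-based
# cross-modal fusion over three modality embeddings, in the spirit of
# MAFNet's CMIL component. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses per-modality embeddings with multi-head self-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # One learnable token per modality marks its identity in the sequence
        # (pathology image, genomics, transcriptomics).
        self.modality_embed = nn.Parameter(torch.randn(3, d_model) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # binary response logit

    def forward(self, image_z, genomic_z, transcript_z):
        # Each input: (batch, d_model) embedding from a modality-specific encoder.
        tokens = torch.stack([image_z, genomic_z, transcript_z], dim=1)
        tokens = tokens + self.modality_embed  # broadcast over the batch
        fused = self.encoder(tokens)           # self-attention across modalities
        pooled = fused.mean(dim=1)             # average the three fused tokens
        return self.head(pooled)               # (batch, 1) response logit

# Usage with random stand-in embeddings:
if __name__ == "__main__":
    model = CrossModalFusion()
    batch = 8
    logits = model(
        torch.randn(batch, 256),  # pathology-image embedding
        torch.randn(batch, 256),  # genomics embedding
        torch.randn(batch, 256),  # transcriptomics embedding
    )
    print(logits.shape)  # torch.Size([8, 1])
```

In this framing, the self-attention weights over the three modality tokens are also what would be visualized to obtain the attention-based interpretability described above.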