Accurate terrain elevation estimation from remote sensing data is essential for a wide range of geographic applications. In particular, image-based elevation estimation has attracted growing attention due to advances in optical sensors and automated analysis algorithms such as machine learning. In this context, deep learning methods, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have recently improved the feature extraction ability and estimation accuracy of this task. Despite the distinct advantages of each architectural paradigm, current methods often struggle to discern subtle height variations in complex scenes and to extract features at both large and small scales effectively. Although vision foundation models have driven significant advances in remote sensing analysis, their effectiveness for height estimation remains unexplored. In this study, we introduce foundation models to elevation estimation and propose a novel Depth to Elevation (Depth2Elevation) model, marking the first application of the Depth Anything Model (DAM) to height estimation from remote sensing images. First, we introduce a scale modulator that modulates part of the encoder blocks in the original DAM, enabling it to capture subtle representations of localized objects at different scales. Second, we further enhance the model's representational capability with a resolution-agnostic decoder architecture, which enables DAM to learn features at different spatial scales efficiently. We conducted comprehensive experiments on several benchmark datasets. Compared to strong baselines, our method achieves an average relative improvement of up to 42% on the latest large-scale benchmark dataset, GAMUS, and shows the best generalization ability across different scenarios.
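The sketch below is not the authors' implementation; it is a minimal, assumption-laden illustration of the two ideas named in the abstract: attaching lightweight "scale modulator" adapters to part of a frozen DAM-style ViT encoder, and decoding to the input resolution rather than a fixed size. All class names, layer sizes, and the choice of a bottleneck adapter with bilinear upsampling are hypothetical placeholders.

```python
# Hedged sketch only: illustrates modulating partial encoder blocks of a frozen
# ViT-style encoder and a resolution-agnostic regression head. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViTBlock(nn.Module):
    """Stand-in for one frozen DAM encoder block (pre-norm attention + MLP)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        y = self.norm1(x)
        a, _ = self.attn(y, y, y)
        x = x + a
        return x + self.mlp(self.norm2(x))


class ScaleModulator(nn.Module):
    """Hypothetical bottleneck adapter that re-weights token features; starts as identity."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.gate * self.up(F.gelu(self.down(x)))


class Depth2ElevationSketch(nn.Module):
    def __init__(self, dim=384, depth=12, modulated_last=4, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(ViTBlock(dim) for _ in range(depth))
        for p in self.blocks.parameters():   # pretrained encoder stays frozen
            p.requires_grad = False
        # Scale modulators attached only to the last few ("partial") encoder blocks.
        self.mods = nn.ModuleDict(
            {str(i): ScaleModulator(dim) for i in range(depth - modulated_last, depth)}
        )
        self.head = nn.Sequential(nn.Conv2d(dim, dim // 2, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(dim // 2, 1, 1))

    def forward(self, img):
        B, _, H, W = img.shape
        x = self.embed(img)                       # B x dim x H/p x W/p
        h, w = x.shape[-2:]
        x = x.flatten(2).transpose(1, 2)          # B x N x dim tokens
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if str(i) in self.mods:               # modulate partial encoder blocks
                x = self.mods[str(i)](x)
        x = x.transpose(1, 2).reshape(B, -1, h, w)
        height = self.head(x)
        # Resolution-agnostic decoding: upsample to whatever the input size was,
        # instead of assuming a fixed training resolution.
        return F.interpolate(height, size=(H, W), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = Depth2ElevationSketch()
    out = model(torch.randn(1, 3, 256, 320))      # any size divisible by the patch size
    print(out.shape)                              # torch.Size([1, 1, 256, 320])
```

In this sketch only the adapters and the head are trainable, which mirrors the general practice of adapting a frozen foundation model with a small number of task-specific parameters; the actual Depth2Elevation design may differ.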