Transformer networks have attracted significant scholarly attention for their superior ability to model long-range dependencies, making them effective in video object tracking. However, existing Transformer-based trackers still show limitations in handling scale variation, complex background changes, and occlusion, primarily because they underutilize the target's spatiotemporal feature information. Specifically, these trackers struggle to maintain the target's semantic integrity, adequately perceive crucial regions, and incorporate historical information. To address these challenges, this paper introduces several enhancements. First, it replaces the pixel-level attention mechanism with a multi-scale cyclic-shifted window attention mechanism, preserving the target's semantic integrity and improving the model's ability to capture scale variations. Second, the convolution theorem is applied to convert the attention computation over cyclically shifted spatial samples into entry-wise multiplication of the unshifted samples in the frequency domain, improving computational efficiency. Third, a selective elimination module is proposed to focus on critical regions resembling those in the search frame, effectively suppressing background interference. Finally, a head network assesses the reliability of candidate targets and reintegrates reliable samples into the tracking network as dynamic templates, fully leveraging spatiotemporal information from historical frames. Extensive comparative experiments demonstrate that the proposed model outperforms current state-of-the-art trackers on several public datasets.
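
As a minimal sketch of the frequency-domain step (our notation, not taken from the paper, and assuming the shift-attention scores take the correlation form below): let $q, k \in \mathbb{R}^{N}$ be flattened query and key features within a window, and let the score for cyclic shift $s$ aggregate entry-wise products of $q$ with the key shifted by $s$. These scores form a circular cross-correlation, which the convolution theorem evaluates as entry-wise multiplication in the frequency domain:
\[
A(s) \;=\; \sum_{n=0}^{N-1} q(n)\, k\big((n+s) \bmod N\big)
\qquad\Longleftrightarrow\qquad
A \;=\; \mathcal{F}^{-1}\!\left( \overline{\mathcal{F}(q)} \odot \mathcal{F}(k) \right),
\]
where $\mathcal{F}$ denotes the discrete Fourier transform, $\overline{\,\cdot\,}$ complex conjugation, and $\odot$ the Hadamard product. All $N$ shifted scores are thus obtained from the unshifted samples in $O(N \log N)$ time, rather than $O(N^{2})$ by materializing each cyclic shift explicitly.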