Traditional DOA estimation methods include beamforming, maximum likelihood estimation, subspace-based methods, and sparsity-inducing methods. They estimate the DOA by relating the received signal to the geometric characteristics of the array. However, factors such as low signal-to-noise ratio, a small number of snapshots, array errors, coherent signals, and broadband signals can seriously degrade their performance. Existing improvements, such as spatial smoothing and compressed sensing for coherent sources and band division for broadband sources, often come at the expense of resolution. In addition, traditional methods tend to generalize poorly and fail to give satisfactory estimates in complex situations. To address these problems, some studies have proposed machine learning and deep learning methods for DOA estimation. However, the generalization ability of machine learning methods is weaker than that of deep learning methods, and most of them are evaluated only on synthetic data, which cannot guarantee their performance in practical applications. Most deep learning methods model DOA estimation as a classification problem on a grid, which limits the accuracy of the estimates: higher accuracy requires a finer grid, which significantly increases the computational cost. Like the machine learning methods above, most deep learning methods do not report experimental results on measured data.
This paper proposes a novel DOA estimation method based on the Transformer model. The method abandons grid-based classification and treats DOA estimation directly as a regression problem that minimizes the estimation error. First, compared with the standard Transformer, the model in this paper adds a sensor-based attention mechanism designed specifically for DOA estimation.
It can be proved through rigorous mathematical derivation that the output of the sensor-based attention module admits a pseudo-singular-value decomposition whose eigenvalue matrix is the same as that of the MUSIC method, which means that the output lies in the space spanned by the (projected) signal and noise eigenvectors. When an eigenvalue is large, the corresponding eigenvector dominates the spanned space, which forces the model to concentrate on the most important eigenvectors. Second, the complexity of the sensor-based attention mechanism is significantly lower than that of the original attention mechanism, dropping from O(N^2) to O(M^2), where N is the number of snapshots and M is the number of sensors. Third, we conducted simulation experiments covering low signal-to-noise ratio, few-snapshot, array error, coherent signal, and broadband signal scenarios, and the results show that our method adapts well to all of them. Fourth, to verify the practical applicability of our model, we transferred it to measured data and tested it there, and the results show that our method still performs well. Fifth, to cope with possible environmental changes in practical applications, we set up a dedicated generalization experiment. This experiment explores how well the model generalizes to unseen scenarios, including different signal-to-noise ratios and different array error strengths, and it achieves satisfactory results. Finally, since our model needs to know the number of sources in advance, and this number is sometimes unknown in practice, we slightly modify the DOA estimation model, replacing the regression head with a classification head to estimate the number of sources. The results show that the average estimation accuracy is about 98%, which further enhances the applicability of the method.
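The complexity argument can be illustrated with a minimal NumPy sketch. It is not the model proposed in this paper: it assumes real-valued snapshots, omits the learned query, key, and value projections, and uses illustrative sizes N = 256 and M = 8 (all names below are hypothetical). It only shows that attending over snapshots yields an N x N score matrix while attending over sensors yields an M x M one, and that the unnormalized sensor-axis dot products coincide with N times the sample covariance matrix, the quantity whose eigendecomposition MUSIC relies on.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_scores(tokens):
    """Row-wise softmax of scaled dot products between tokens (one token per row).

    Assumption: identity query/key projections, used here only to expose
    the shape of the score matrix.
    """
    d = tokens.shape[1]
    return softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)


rng = np.random.default_rng(0)
N, M = 256, 8                      # snapshots, sensors (illustrative values)
X = rng.standard_normal((N, M))    # real-valued stand-in for array snapshots

snapshot_scores = attention_scores(X)    # tokens are snapshots: shape (N, N)
sensor_scores = attention_scores(X.T)    # tokens are sensors:   shape (M, M)

# Without softmax and scaling, the sensor-axis dot products are X^T X,
# i.e. N times the (real-valued) sample covariance matrix.
gram = X.T @ X
R_hat = X.T @ X / N
eigvals = np.linalg.eigvalsh(R_hat)      # ascending, non-negative eigenvalues

print(snapshot_scores.shape, sensor_scores.shape)
```

With these sizes the snapshot-axis score matrix has 65,536 entries and the sensor-axis one has 64, which is the O(N^2) versus O(M^2) gap described above whenever the number of snapshots exceeds the number of sensors.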