This investigation yielded three conclusions. First, a set of counterexamples was constructed to establish the main theorem, which asserts that the optimal scaling factor for a model implemented with a self-attention function to converge and reach its best attainable accuracy is not $1/\sqrt{d_k}$. Second, a theoretical analysis showed the existence of a kinetic energy acquired and stored in the model weights, and several empirical observations were found to support this interpretation. Consequently, a self-attention model such as a transformer exhibits an inertial effect that memorizes the original value of its scaling factor; once the optimal scaling factor is discovered early in training, it may be fixed for the subsequent training phase. Finally, the first and second results imply that this scaling factor is independent of the mean and variance of the entries of the input, query, and key matrices (based on the observed counterexamples and a proof that the self-attention function is a stack of Ising models), and that the scaling factor represents a form of kinetic energy (based on the inertial effect). The first two results also confirm the conjecture that the best scaling factor is determined by the entire topology of the neural network function.
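For reference, a minimal restatement of the object under discussion, assuming the standard scaled dot-product attention of Vaswani et al.; the symbol $s$ is introduced here only for illustration and is not notation from the preceding analysis:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(s \, Q K^{\top}\right) V,
\qquad s_{\text{standard}} = \frac{1}{\sqrt{d_k}},
\]
where the main theorem asserts that the optimal choice of the scalar $s$ is, in general, not $1/\sqrt{d_k}$.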