Efficient pre-training of massive neural architectures remains a critical challenge because of high computational cost and resource demands, particularly when training increasingly complex models on extensive datasets. Small token initialization offers a novel approach: it reduces the dimensionality of token embeddings during early training, accelerating training while lowering memory and energy consumption. The method starts with low-dimensional embeddings at the onset of training and incrementally expands them as the model learns more complex representations, maintaining competitive accuracy throughout. Experimental results demonstrated that models initialized with reduced-dimension embeddings required significantly fewer floating-point operations, incurred less memory overhead, and converged faster than models trained with traditional initialization techniques. The analysis also revealed that the expansion process preserved high performance in terms of perplexity, token coverage, and generalization on downstream tasks. Ultimately, small token initialization offers a scalable, resource-efficient alternative for pre-training large-scale models without sacrificing performance, making it a compelling option for both academic research and practical applications where computational efficiency is paramount.
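To make the grow-as-you-train idea concrete, the sketch below shows one plausible realization in PyTorch: tokens are embedded at a reduced width and projected up to the model width, and an `expand` step widens the embedding table while preserving the columns learned so far. All names here (`GrowingEmbedding`, `expand`, `d_start`) and the specific expansion mechanics are illustrative assumptions, not the method's actual implementation.

```python
import torch
import torch.nn as nn


class GrowingEmbedding(nn.Module):
    """Illustrative sketch: embed tokens at a reduced width, then project
    up to the full model width. Starting narrow keeps the embedding table
    (and its optimizer state) small; expand() widens it in place.
    NOTE: this is a hypothetical realization of small token initialization,
    not the paper's reference code.
    """

    def __init__(self, vocab_size: int, d_model: int, d_start: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_start)
        self.up_proj = nn.Linear(d_start, d_model, bias=False)

    @torch.no_grad()
    def expand(self, d_new: int) -> None:
        """Grow the embedding width, copying the already-learned columns."""
        d_old = self.embed.embedding_dim
        assert d_new >= d_old, "expansion must not shrink the embedding"
        new_embed = nn.Embedding(self.vocab_size, d_new)
        new_embed.weight[:, :d_old] = self.embed.weight
        new_embed.weight[:, d_old:] *= 0.02  # small init for new columns
        new_proj = nn.Linear(d_new, self.d_model, bias=False)
        new_proj.weight[:, :d_old] = self.up_proj.weight
        new_proj.weight[:, d_old:] = 0.0  # new columns start inert
        self.embed, self.up_proj = new_embed, new_proj

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.up_proj(self.embed(token_ids))


# Usage sketch: widen on an assumed schedule during pre-training.
emb = GrowingEmbedding(vocab_size=32_000, d_model=1024, d_start=128)
for step, d_new in [(10_000, 256), (50_000, 512), (200_000, 1024)]:
    emb.expand(d_new)  # optimizer state must be rebuilt after each expansion
```

Zero-initializing the new projection columns means each expansion is function-preserving: the model's outputs are unchanged at the moment of growth, and the new capacity is recruited gradually by gradient descent. In practice, the optimizer would need to be re-created (or its state remapped) after each expansion, since the parameter tensors are replaced.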