Medical images aid in disease diagnosis, but manual analysis is time-consuming. To streamline this process, a Vision Transformer-based framework is used for automatic multi-class image classification, addressing the limitations of small datasets and static deep learning frameworks. The COVID-19 CXR dataset, the SARS-CoV-2 CXR dataset, and the COVID-19 Posterior-Anterior Chest Radiography (X-ray) dataset have been used for training, validation, and testing. Each training image is divided into fixed-size patches, which are flattened into visual tokens. These tokens, combined with positional encodings, are fed to the transformer encoder. The encoded visual tokens are then passed to an MLP head followed by a softmax layer to classify the images. Experimental results show that the Vision Transformer-based detection model achieves an overall accuracy of 0.7346, with a precision of 0.6997, sensitivity of 0.7318, and F1-score of 0.7126.
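The tokenization pipeline described above (patch extraction, flattening into visual tokens, positional encoding, and a softmax classification head) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper names (`image_to_tokens`, `sinusoidal_positions`) are hypothetical, the transformer encoder is replaced by a simple mean-pooling stand-in, the weights are untrained, and a sinusoidal positional encoding is assumed (ViT models often learn the encoding instead).

```python
import numpy as np

def image_to_tokens(image, patch_size):
    # Split an (H, W) image into non-overlapping patch_size x patch_size
    # patches and flatten each patch into a 1-D "visual token".
    # (Hypothetical helper for illustration.)
    H, W = image.shape
    tokens = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            tokens.append(image[i:i + patch_size, j:j + patch_size].reshape(-1))
    return np.stack(tokens)  # shape: (num_patches, patch_size**2)

def sinusoidal_positions(n_tokens, dim):
    # One common positional-encoding choice; the paper may use a learned one.
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    enc = np.zeros((n_tokens, dim))
    enc[:, 0::2] = np.sin(angle[:, 0::2])
    enc[:, 1::2] = np.cos(angle[:, 1::2])
    return enc

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224))          # stand-in for a CXR image
tokens = image_to_tokens(image, patch_size=16)   # (196, 256) visual tokens
tokens = tokens + sinusoidal_positions(*tokens.shape)

# Stand-in for the transformer encoder + MLP head: mean-pool the tokens,
# apply an untrained linear layer, then softmax over 3 example classes.
W = rng.standard_normal((tokens.shape[1], 3)) * 0.01
probs = softmax(tokens.mean(axis=0) @ W)
```

A 224x224 image with 16x16 patches yields 196 tokens of dimension 256, matching the standard ViT patch-embedding arithmetic; the softmax output is a probability distribution over the classes.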