Dewant Katare et al.

Autonomous applications that rely on AI models and algorithms as their backbone require high-performance compute and memory resources for efficient deployment and data processing. This demand grows further with larger data sizes and more computationally complex algorithms. Stochastic computing, precision scaling, and model compression techniques such as quantization and pruning have been used to address these issues. Although these approaches benefit model training and inference, the tradeoff between performance and efficiency remains challenging. Our proposed approach comprises three approximation schemes that target this tradeoff. The first scheme uses approximate multipliers. The second approximates convolution operations with a minimal number of multiplications, and the third applies variational inference together with quantization-aware training and post-training quantization. We evaluate the proposed schemes on CNNs, DNNs, and vision transformers using metrics such as multiply-accumulate operations, floating-point operations, accuracy, latency, and on-device power consumption. Tests show that our methods achieve up to 43% model compression while keeping accuracy within 3 to 4% of the baseline, with an approximately 38% reduction in energy consumption compared to the baseline model. The proposed strategies enable optimized edge computing deployments by striking a balanced tradeoff between model performance and energy efficiency. Code can be accessed at Approximate-Models.
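
To make the first scheme concrete, the sketch below illustrates one common family of approximate multipliers: truncation-based multiplication, which zeroes low-order operand bits before multiplying. This is not taken from the paper; the function name approx_multiply, the truncation width, and the use of plain Python integers are illustrative assumptions, and the authors' actual multiplier design may differ.

```python
def approx_multiply(a: int, b: int, trunc_bits: int = 4) -> int:
    """Truncation-based approximate multiplier (illustrative sketch only).

    Low-order bits of both operands are cleared before the multiply,
    which shrinks the partial-product array in a hardware realization
    at the cost of a bounded numerical error.
    """
    mask = ~((1 << trunc_bits) - 1)      # e.g. trunc_bits=4 clears the 4 LSBs
    return (a & mask) * (b & mask)


if __name__ == "__main__":
    exact = 187 * 243
    approx = approx_multiply(187, 243, trunc_bits=4)
    rel_err = abs(exact - approx) / exact
    print(f"exact={exact}, approx={approx}, relative error={rel_err:.2%}")
```

Dropping partial-product rows in this way is what yields the area and energy savings that approximate-multiplier schemes exploit, at the cost of a controlled loss in numerical precision.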