A Quantization-Aware Optimization Framework for Efficient Deep Neural Network Inference
by Mohamed Almoudane
Published: January 29, 2026 • DOI: 10.47772/IJRISS.2026.10100169
Abstract
The growing demand for deploying deep neural network (DNN) inference on resource-constrained platforms has intensified challenges related to computational cost, memory footprint, and energy efficiency [1], [2]. Quantization is widely adopted to address these constraints; however, conventional low-bit quantization methods often suffer from severe accuracy degradation, commonly referred to as the performance cliff phenomenon [3], [4].
In this work, we propose a unified Quantization-Aware Optimization Framework (QAOF) that bridges high-precision floating-point training and efficient integer-only inference. The framework incorporates a multi-level, layer-wise sensitivity analysis based on the average Hessian trace to characterize loss curvature and guide precision allocation across the network [5]. To mitigate accuracy loss caused by inter-channel and inter-layer distribution mismatch in hybrid architectures, we further introduce Quantization-Aware Distribution Scaling (QADS), which adaptively aligns weight and activation distributions prior to quantization. In addition, computationally expensive operations are replaced with piecewise linear, integer-friendly formulations to enable efficient execution on low-power hardware [6].
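To make the distribution-alignment idea concrete, the sketch below illustrates per-channel scale selection before symmetric int8 quantization, in the spirit of the QADS step described above. This is a minimal, hypothetical illustration using NumPy, not the paper's implementation: the function names (`qads_align`, `quantize_int8`) and the max-abs scaling rule are assumptions chosen for clarity. The key point it demonstrates is that per-channel scales keep the reconstruction error of each channel bounded by half a quantization step, even when channel magnitudes differ by orders of magnitude.

```python
import numpy as np

def quantize_int8(x, scale):
    # Uniform symmetric int8 quantization: scale, round, clip to [-127, 127].
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def qads_align(weights):
    """Hypothetical sketch of distribution scaling prior to quantization.

    weights: array of shape (out_channels, in_features).
    Each output channel gets its own scale so that its max-abs value maps
    to the full int8 range, reducing inter-channel distribution mismatch.
    """
    per_channel_max = np.abs(weights).max(axis=1, keepdims=True)
    scales = per_channel_max / 127.0  # one scale per output channel
    return quantize_int8(weights, scales), scales

# Channels with deliberately mismatched magnitudes (0.01x to 5x).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)) * np.array([[0.1], [1.0], [5.0], [0.01]])

q, s = qads_align(w)
dequant = q.astype(np.float32) * s          # integer weights back to float
err = np.abs(w - dequant).max()             # worst-case reconstruction error
```

With a single tensor-wide scale, the smallest-magnitude channel would be rounded almost entirely to zero; the per-channel scales avoid this, which is the mismatch problem the abstract attributes to hybrid architectures.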
Extensive evaluations on representative architectures, including ResNet, MobileNet, and Vision Transformers (ViT), demonstrate that QAOF achieves substantial efficiency gains with minimal accuracy impact. Across standard benchmarks, the proposed method delivers up to 4.2× inference speedup and up to 75% memory reduction, while keeping accuracy loss below 0.4%. Finally, we provide practical guidelines for selecting between post-training quantization and quantization-aware training under diverse hardware deployment scenarios [7], [8].