Explainable Deep Learning for Age and Gender Prediction from Facial Images: A Comparative Study of VGG16, ResNet50, and EfficientNet with Grad-CAM and SHAP
by Yassir Elhaj
Published: April 27, 2026 • DOI: 10.47772/IJRISS.2026.100400063
Abstract
Automatic age estimation and gender classification from facial images are two of the most intensively studied problems in computer vision, with applications in human-computer interaction, biometric surveillance, targeted marketing, healthcare monitoring, and forensic analysis. Despite remarkable advances in convolutional neural network architectures over the past decade, the black-box nature of deep learning models continues to pose significant challenges for interpretability, trustworthiness, and accountability, particularly in sensitive deployment contexts. This paper presents a comparative study of three deep learning architectures (VGG16, ResNet50, and EfficientNet-B3) for simultaneous age and gender prediction from facial images, with a strong emphasis on model explainability. Our framework uses the UTKFace dataset, which comprises over 20,000 face images spanning ages 1 to 116 across multiple ethnicities. We describe a preprocessing pipeline that uses Multitask Cascaded Convolutional Networks (MTCNN) for face detection and alignment, followed by standardized normalization and extensive data augmentation. Both Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP) are integrated into the evaluation workflow to provide visual and quantitative insight into the regions and features that drive model decisions. Experimental results show that EfficientNet-B3 performs best, with a Mean Absolute Error (MAE) of 4.37 years for age estimation and a gender classification accuracy of 96.8%, while requiring a substantially smaller computational footprint than the other two architectures. ResNet50 offers a strong middle ground between accuracy and training efficiency, whereas VGG16, though interpretable, is both less accurate and more computationally expensive.
Our explainability analysis reveals that all three models predominantly attend to periocular regions, nasolabial folds, and frontal skull geometry for age estimation, while gender classification relies more heavily on jaw contour, brow ridge prominence, and lip morphology. These findings underscore the importance of integrating explainability tools into the facial analysis pipeline and provide practical guidance for practitioners deploying deep learning systems in real-world, ethically sensitive environments.
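For readers unfamiliar with Grad-CAM, the core computation behind the heatmaps mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the function name grad_cam, the nested-list layout, and the final rescaling to [0, 1] for display are assumptions made here, and in practice the activations and gradients would come from the last convolutional layer of the trained network.

```python
from typing import List

Grid = List[List[float]]  # one H x W map stored as nested lists


def grad_cam(activations: List[Grid], gradients: List[Grid]) -> Grid:
    """Combine K feature maps into a single Grad-CAM heatmap.

    activations[k][i][j]: output of channel k at position (i, j).
    gradients[k][i][j]:   derivative of the target score (for example an
                          age output or a gender logit) with respect to
                          that activation.
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weight: global average of the gradient over all positions.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]

    # Weighted sum of the feature maps, then ReLU to keep only the
    # positions that push the target score up.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Rescale to [0, 1] for display (a visualisation convenience assumed
    # here, not part of the Grad-CAM definition).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


if __name__ == "__main__":
    # Toy example: two 2x2 channels with constant gradients +1 and -1.
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    for row in grad_cam(acts, grads):
        print(row)
```

In the toy example the first channel receives weight +1 and the second weight -1, so only the bottom row of the map, where the first channel dominates, stays positive after the ReLU. Overlaying such a map on the input face, upsampled to image size, yields region-level attributions of the kind reported above.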