A Self-Evolving Transformer-Based Machine Vision Framework for Adaptive Industrial Defect Diagnosis under Non-Stationary Environments

by Vincent Kibet

Published: April 3, 2026 • DOI: 10.47772/IJRISS.2026.100300271

Abstract

The detection of industrial defect systems is usually challenging when dealing with non-stationary production processes, as they are faced with constantly changing lighting, material characteristics, and defect patterns. The standard Convolutional Neural Network (CNN)-based systems are unable to cope with such changes in distribution without retraining and manually re-labelling of the data. This study presented a self-evolving machine vision framework, which used transformers to adapt to changes in the environment based on continuous meta-learning and uncertainty-based pseudo-labelling. This was also integrated with six basic components, including a backbone of Vision Transformer (ViT) that learned multi-scale features. The adaptable memory module included episodic defect pattern storage, a distribution shift detector based on Maximum Mean Discrepancy (MMD), a meta-learning engine based on the Model-Agnostic Meta-Learning (MAML) algorithm, a self-supervised evolution mechanism coupled with confidence-driven sample selection, and an uncertainty quantification module that uses Monte Carlo Dropout. The proposed framework had a high precision of 94.7% when used in 10 labelled samples on three industrial datasets (steel surface defects, semiconductor wafer inspection, and textile fabric anomaly) with 8.3%-12.6% over the state-of-the-art mechanisms. Even in extreme lighting conditions (96.2%), the system was also able to adapt to new defects within 45 minutes without interrupting the production line. The architecture was 47 times faster in false-positive than ResNet-50, and at 42 FPS on edge devices, meaning that it will be possible to deploy in industry in real time. The self-improving mechanism enabled continuous improvement since 89.4% of pseudo-labels attained a confidence level of more than 95%, illustrating that it does not require constant human supervision.