Abstract:
Remote sensing image classification is a critical component of Earth
observation (EO) systems, traditionally dominated by convolutional neural
networks (CNNs) and other deep learning techniques. However, the advent of
Transformer-based architectures and large-scale pre-trained models has
significantly shifted this landscape, offering enhanced performance and efficiency. This study
focuses on identifying the most effective pre-trained model for land use
classification in onboard satellite processing, with an emphasis on achieving high
accuracy, computational efficiency, and robustness against noisy data
conditions commonly encountered during satellite-based inference. Through
extensive experimentation, we compared traditional CNN-based models,
ResNet-based models, and various pre-trained vision Transformer models. Our
findings demonstrate that pre-trained Transformer models, particularly
MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in
accuracy and efficiency. These models achieve high performance with reduced
computational requirements and exhibit greater resilience during inference
under noisy conditions. While MobileViTV2 excelled on clean validation data,
EfficientViT-M2 proved more robust when handling noise, making it the most
suitable model for onboard satellite Earth observation tasks. In conclusion,
EfficientViT-M2 is the optimal choice for reliable and efficient remote sensing
image classification in satellite operations, achieving 98.76\% in accuracy,
precision, and recall. Specifically, EfficientViT-M2 delivered the highest
performance across all metrics, excelled in training efficiency (1,000 s) and
inference time (10 s), and demonstrated the greatest robustness (overall
robustness score of 0.79).