Power systems are critical infrastructures, and unmanned aerial vehicles (UAVs) have become an important tool for power equipment inspection. However, semantic segmentation of aerial power equipment remains challenging due to the scarcity of high-quality multi-modal datasets and the difficulty of accurately identifying small targets such as power lines.
In a study published in Pattern Recognition, a research team led by Prof. CHAO Jianshu from the Fujian Institute of Research on the Structure of Matter of the Chinese Academy of Sciences developed a novel RGB-D semantic segmentation framework, M³WaveGNet, for UAV-based power equipment inspection.
To address the lack of publicly available multi-modal datasets for aerial power equipment inspection, the researchers first developed the AirSim Power System Dataset (APSD) using the AirSim simulation platform. APSD contains more than 4,000 RGB-D image pairs collected from multiple urban and industrial environments and covers targets such as power lines, power poles, street lights, and traffic lights.
In addition to APSD, the researchers introduced M³WaveGNet, a lightweight semantic segmentation network that integrates multi-modal information and multi-level wavelet analysis. The framework employs a Multi-Modal and Multi-Level Wavelet Feature Fusion Encoder, which uses multi-resolution wavelet decomposition to preserve fine-grained details while enhancing semantic representation. Through a Stage-Level Feature Exchange (SFE) strategy, RGB and depth features interact throughout the encoding process, enabling effective local cross-modal feature fusion.
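To make the idea of multi-resolution wavelet decomposition concrete, the following minimal sketch applies a two-level 2D Haar transform to a tiny image. It is an illustration only: the article does not say which wavelet family M³WaveGNet uses or how the sub-bands enter the encoder, so the Haar wavelet and the function names haar_step and haar_pyramid are assumptions made for this example.

```python
def haar_step(image):
    """One level of an orthonormal 2D Haar transform.

    `image` is a list of rows with even height and width.
    Returns the half-resolution approximation plus three detail bands.
    NOTE: Haar is an assumed choice; the article does not name the wavelet.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            tl, tr = image[r][c], image[r][c + 1]          # top-left, top-right
            bl, br = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2)  # local average (coarse content)
            c_row.append((tl - tr + bl - br) / 2)  # change across columns
            r_row.append((tl + tr - bl - br) / 2)  # change across rows
            d_row.append((tl - tr - bl + br) / 2)  # diagonal change
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, (col_diff, row_diff, diag)


def haar_pyramid(image, levels):
    """Repeat haar_step on the approximation band to get several resolutions."""
    details = []
    current = image
    for _ in range(levels):
        current, bands = haar_step(current)
        details.append(bands)
    return current, details


# A 4x4 image containing a one-pixel-wide vertical line, loosely like a thin cable.
line_image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
coarse, detail_bands = haar_pyramid(line_image, levels=2)
```

In this toy case the thin line survives in the column-difference band at both levels even though the spatial resolution is halved each time, which is the general intuition behind using wavelet detail bands to retain fine structures during downsampling.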
Furthermore, inspired by recent advances in state space models, the team designed a Multi-modal Fusion Module (MmFM) based on a novel multi-input single-output state space architecture. Unlike conventional fusion methods that rely on simple concatenation or attention mechanisms, the proposed module explicitly models interactions between RGB and depth modalities in both spatial and channel dimensions, facilitating efficient global feature fusion with linear computational complexity.
Extensive experiments demonstrated the effectiveness of the proposed framework. Compared with traditional RGB-only segmentation methods, M³WaveGNet improves the mean Intersection-over-Union (mIoU) by more than 25%. On APSD, the method achieves an mIoU of 83.57% while maintaining real-time inference capability at over 60 FPS. Notably, it achieves an IoU of 91.29% for power-line segmentation, significantly outperforming existing state-of-the-art RGB-D segmentation approaches.
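For readers unfamiliar with the metric, the sketch below shows how per-class IoU and mIoU are conventionally computed from predicted and ground-truth label maps. It follows the standard definition (intersection divided by union, averaged over classes that occur) and is not taken from the authors' evaluation code; the function names and the tiny label maps are hypothetical.

```python
def per_class_iou(pred, target, num_classes):
    """Intersection-over-Union for each class from two integer label maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1  # pixel counts toward both intersection and union
                union[p] += 1
            else:
                union[p] += 1  # false positive for class p
                union[t] += 1  # false negative for class t
    # Classes that never appear in either map have no defined IoU.
    return [inter[k] / union[k] if union[k] else None for k in range(num_classes)]


def mean_iou(ious):
    """Average IoU over the classes for which it is defined."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


# Hypothetical 2x3 label maps: class 0 = background, class 1 = power line.
prediction = [[0, 0, 1],
              [0, 1, 1]]
ground_truth = [[0, 1, 1],
                [0, 1, 1]]
ious = per_class_iou(prediction, ground_truth, num_classes=3)
miou = mean_iou(ious)
```

Because a thin class such as a power line covers few pixels, a single misclassified pixel changes its IoU noticeably, which is one reason small targets are hard to score well on. Benchmark evaluations normally accumulate the intersection and union counts over the whole test set before dividing, rather than averaging per-image scores.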
To further evaluate the trade-off between segmentation accuracy and inference speed, the researchers proposed a new metric named IoU-Fscore, inspired by the classical F-score. This metric provides a quantitative assessment of the balance between mIoU and FPS, offering a practical benchmark for real-time UAV perception systems. The proposed framework was also validated on several public datasets, including TTPLA, Mid-Air, and Cityscapes, where it consistently achieves competitive or superior performance compared with existing RGB-D segmentation methods.
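The article does not give the formula for IoU-Fscore, so the following is only a guess at what an F-score-style combination of accuracy and speed could look like, not the published definition. It assumes FPS is first normalized against a user-chosen reference frame rate and capped at 1, and then combined with mIoU through the classical weighted harmonic mean; the function name speed_accuracy_f and the parameters fps_ref and beta are hypothetical.

```python
def speed_accuracy_f(miou, fps, fps_ref, beta=1.0):
    """F-score-style balance between accuracy and speed.

    ASSUMPTION: this is NOT the paper's IoU-Fscore formula, which the article
    does not state. It only mirrors the classical F-beta form:
        F = (1 + beta^2) * A * S / (beta^2 * A + S)
    where A is mIoU in [0, 1] and S is FPS normalized by `fps_ref`, capped at 1.
    """
    if fps_ref <= 0:
        raise ValueError("fps_ref must be positive")
    accuracy = miou
    speed = min(fps / fps_ref, 1.0)  # hypothetical normalization of FPS
    denom = beta ** 2 * accuracy + speed
    if denom == 0:
        return 0.0  # both terms are zero, so the score is zero
    return (1 + beta ** 2) * accuracy * speed / denom


# Two hypothetical models judged against a 60 FPS real-time target.
slow_but_accurate = speed_accuracy_f(miou=0.85, fps=15.0, fps_ref=60.0)
fast_and_accurate = speed_accuracy_f(miou=0.80, fps=60.0, fps_ref=60.0)
```

Under these assumptions a harmonic mean penalizes a model that is strong on one axis and weak on the other, so the slower model scores lower despite its higher mIoU. The actual metric in the paper may normalize or weight the two terms differently.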
This study demonstrates the effectiveness of combining multi-modal perception, wavelet-based multi-resolution analysis, and state space modeling for UAV-based infrastructure inspection. The proposed APSD dataset, M³WaveGNet framework, and IoU-Fscore metric provide valuable resources and technical foundations for future research on intelligent power system inspection and autonomous UAV perception.

Illustration of the Research (Image by Prof. CHAO’s group)
Contact:
Prof. CHAO Jianshu
Fujian Institute of Research on the Structure of Matter
Chinese Academy of Sciences
Email: jchao@fjirsm.ac.cn