Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/33009
Title: BANet: Enhancing Weakly Aligned Multimodal Object Detection via Balanced Bidirectional Alignment Network
Authors: Shi, Y
Li, G
Shen, Z
Meng, H
Pang, Y
Keywords: multimodal object detection; weak alignment; feature alignment; cross-modal fusion; remote sensing
Issue Date: 17-Mar-2026
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Citation: Shi, Y. et al. (2026) 'BANet: Enhancing Weakly Aligned Multimodal Object Detection via Balanced Bidirectional Alignment Network', IEEE Transactions on Geoscience and Remote Sensing, 0 (early access), pp. 1–13. doi: 10.1109/tgrs.2026.3674946.
Abstract: Multimodal object detection in remote sensing imagery has achieved remarkable performance, largely owing to its ability to exploit complementary information from multiple modalities. However, most existing methods suffer substantial performance degradation under weakly aligned conditions, primarily due to the asymmetric utilization of information across modalities. We therefore propose a novel multimodal object detection network, termed Balanced Bidirectional Alignment Network (BANet), which improves detection accuracy on weakly aligned multimodal remote sensing imagery by adopting a dual-path architecture and incorporating a dedicated Weakly Aligned Module (WAM) that explicitly mitigates misalignment and enhances cross-modal feature interaction. Specifically, the WAM comprises three cooperative components. First, the Adaptive Cross-Modal Correlation Module (ACMCM) establishes semantic correspondence by jointly modeling global dependencies and local similarities in a bidirectional manner. Second, the Symmetric Offset Generator (SOG) adopts a coarse-to-fine strategy to produce stable, symmetric offsets, enabling precise and robust spatial alignment. Finally, the Progressive Fusion Strategy (PFS) adaptively integrates the original and aligned features through learnable weighting, preserving modality-specific characteristics while enhancing both spatial alignment and semantic consistency. Extensive experiments on the DroneVehicle and VEDAI multimodal remote sensing datasets demonstrate the superiority of the proposed method over other advanced multimodal remote sensing object detectors. Notably, BANet achieves the best performance on both datasets with only 8.8M parameters, highlighting its effectiveness and efficiency for real-time UAV applications.
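To make the fusion idea in the abstract concrete, the sketch below illustrates a learnable-weighted combination of a modality's original features with their spatially aligned counterpart, in the spirit of the Progressive Fusion Strategy (PFS). This is a minimal PyTorch sketch under stated assumptions: the class name, the 1x1-convolution gate, and the tensor shapes are illustrative choices of ours, not the authors' implementation.

import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Illustrative sketch: fuse original features with their spatially
    aligned counterpart via a learned, spatially varying weight map.
    The gating design is an assumption, not BANet's actual PFS."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv predicts a per-pixel fusion weight from both feature maps.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, original: torch.Tensor, aligned: torch.Tensor) -> torch.Tensor:
        # w in (0, 1): how much of the aligned feature to trust at each location.
        w = self.gate(torch.cat([original, aligned], dim=1))
        # A convex combination keeps modality-specific content (original)
        # while injecting the alignment-corrected signal (aligned).
        return (1.0 - w) * original + w * aligned

if __name__ == "__main__":
    fuse = WeightedFusion(channels=64)
    rgb_feat = torch.randn(2, 64, 32, 32)    # hypothetical RGB branch features
    ir_aligned = torch.randn(2, 64, 32, 32)  # hypothetical aligned infrared features
    out = fuse(rgb_feat, ir_aligned)
    print(out.shape)  # torch.Size([2, 64, 32, 32])

A per-pixel weight (rather than a single scalar) lets the fusion lean on the aligned features where alignment is reliable and fall back to the original features elsewhere, which matches the abstract's stated goal of preserving modality-specific characteristics.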
URI: https://bura.brunel.ac.uk/handle/2438/33009
DOI: https://doi.org/10.1109/tgrs.2026.3674946
ISSN: 0196-2892
Other Identifiers: ORCiD: Guoquan Li https://orcid.org/0000-0001-8022-743X
ORCiD: Zhilong Shen https://orcid.org/0000-0002-2170-3907
ORCiD: Hongying Meng https://orcid.org/0000-0002-8836-1382
ORCiD: Yu Pang https://orcid.org/0000-0002-7507-5387
Appears in Collections: Department of Electronic and Electrical Engineering Research Papers

Files in This Item:
File: FullText.pdf
Description: For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Size: 34.11 MB
Format: Adobe PDF


Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.