Abstract: The latest advances in intelligent driving are concentrated in the environmental perception layer, where sensor data fusion is critical to system performance. Point cloud data provides accurate 3D spatial descriptions but is unordered and sparse, whereas image data, with its regular distribution and dense semantics, can compensate for the limitations of single-modality detection when fused with point clouds. However, existing fusion algorithms suffer from limited semantic information and insufficient interaction between modalities, leaving room for improvement in high-precision multi-modal 3D object detection. To address this, this paper proposes a multi-sensor fusion method that generates pseudo-point clouds from RGB images via depth completion and combines them with real point clouds to identify regions of interest (RoIs). The method introduces three key improvements: deformable attention-based multi-layer feature extraction, which adaptively expands the receptive field toward target regions; 2D sparse convolution for efficient pseudo-point cloud feature extraction, exploiting their regular distribution in the image domain; and a two-stage feedback mechanism that applies multi-modal cross-attention at the feature level to resolve data alignment issues and an efficient decision-level fusion strategy for interactive training across stages. Together, these components resolve the trade-off between pseudo-point-cloud accuracy and computational cost, substantially improving feature extraction efficiency and detection accuracy. Experiments on the KITTI dataset demonstrate the method's superior performance in 3D traffic element detection, validating its effectiveness and offering a new approach to multi-modal fusion for autonomous driving perception.
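To make the pseudo-point-cloud generation step concrete, the sketch below shows how a completed depth map can be back-projected into 3D points using a standard pinhole camera model. This is a minimal illustrative sketch, not the paper's implementation: the function name `depth_to_pseudo_points`, the use of NumPy, and the optional attachment of RGB colors as point features are all assumptions introduced here for clarity.

```python
import numpy as np

def depth_to_pseudo_points(depth, K, rgb=None):
    """Back-project a completed depth map into a pseudo-point cloud (illustrative sketch).

    depth : (H, W) array of metric depths, e.g. output of a depth-completion network.
    K     : (3, 3) camera intrinsic matrix.
    rgb   : optional (H, W, 3) image whose colors are appended as point features (assumption).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel coordinate grid
    z = depth.reshape(-1)
    valid = z > 0                                              # keep pixels with valid depth
    u, v, z = u.reshape(-1)[valid], v.reshape(-1)[valid], z[valid]

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx                                      # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)                       # (N, 3) camera-frame points

    if rgb is not None:
        colors = rgb.reshape(-1, 3)[valid] / 255.0             # normalize colors to [0, 1]
        points = np.concatenate([points, colors], axis=1)      # (N, 6) points with color features
    return points
```

Because every pseudo-point corresponds to a pixel, the resulting cloud retains the regular image-domain layout that the paper exploits with 2D sparse convolution for feature extraction.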