MM-VSM: Multi-Modal Vehicle Semantic Mesh and Trajectory Reconstruction for Image-Based Cooperative Perception
Recent advancements in cooperative 3D object detection have demonstrated significant potential for enhancing autonomous driving by integrating roadside infrastructure data. However, deploying comprehensive LiDAR-based cooperative perception systems remains prohibitively expensive and requires precisely annotated 3D data to function robustly. This paper proposes an improved multi-modal method that integrates LiDAR-based shape references into a previously mono-camera-based semantic vertex reconstruction framework, enabling robust and cost-effective monocular and cooperative pose estimation after reconstruction. A novel camera–LiDAR loss function is proposed that combines the re-projection loss from a multi-view camera system with LiDAR shape constraints. Experimental evaluations on the Argoverse dataset and real-world experiments demonstrate significantly improved shape reconstruction robustness and accuracy, which in turn improves pose estimation performance. The effectiveness of the algorithm is demonstrated through a real-world smart valet parking application, evaluated in our university parking area with real vehicles. Our approach enables accurate 6-DoF pose estimation using an inexpensive IP camera without requiring context-specific training, thereby advancing the state of the art in monocular and cooperative image-based vehicle localization.
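As a minimal sketch of how such a combined camera–LiDAR objective could be composed (the notation here is illustrative and not fixed by the abstract: $v_k$ denotes a semantic mesh vertex, $\pi_j(\cdot)$ the projection into camera view $j$, $\hat{u}_{jk}$ the detected 2D keypoint, $\rho(\cdot)$ a robust penalty, $\mathcal{P}$ the LiDAR reference point set, and $\lambda$ a weighting factor), the loss might take the form of a multi-view re-projection term plus a chamfer-style shape term:

\[
\mathcal{L}(V) \;=\; \underbrace{\sum_{j}\sum_{k}\rho\big(\lVert \pi_j(v_k) - \hat{u}_{jk}\rVert_2\big)}_{\text{multi-view re-projection}} \;+\; \lambda \underbrace{\sum_{p \in \mathcal{P}} \min_{k}\,\lVert p - v_k \rVert_2^2}_{\text{LiDAR shape constraint}}
\]

Here $\lambda$ would balance image evidence against the LiDAR shape reference; the exact formulation used by the method is given in the paper body.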