The multi-sensor fusion of camera and LiDAR with semantic 3D depth sensing for enhanced perception in autonomous driving systems

Yildiz, Ahmet Serhat

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/33449

Title:	The multi-sensor fusion of camera and LiDAR with semantic 3D depth sensing for enhanced perception in autonomous driving systems
Authors:	Yildiz, Ahmet Serhat
Advisors:	Meng, H;Swash, M
Keywords:	YOLOv8 object detection;Depth completion;Sparse depth map;Dense depth estimation;Object projection and localisation
Issue Date:	2026
Publisher:	Brunel University London
Abstract:	The development of autonomous driving technology depends on precise perception of the environment and on effective object detection, classification, and tracking, all of which are essential for safe navigation and decision-making. However, existing multi-sensor fusion methods focus on generating dense depth maps by combining camera and LiDAR data, where distance is represented using colour values. In these depth maps, object shapes are often unclear, and small or partially occluded objects are difficult to detect. As a result, it is challenging to detect an object accurately and to extract that object's specific distance information. Therefore, this research focuses on detecting specific important objects and directly extracting their distance information from a sparse depth map in a more efficient and reliable way. This thesis explores multi-sensor fusion strategies, specifically fusing camera and LiDAR data, to enhance the perception capabilities of autonomous driving systems. While LiDAR sensors provide accurate depth measurements, their limited resolution restricts detailed scene understanding. Conversely, cameras provide rich semantic information but lack depth precision. This research addresses these sensor limitations by integrating YOLOv8, a state-of-the-art object detection framework, with LiDAR point cloud data, using precise camera–LiDAR projection based on calibrated transformation matrices. The system is evaluated using the KITTI object detection benchmark, demonstrating improved range resolution and detection robustness under complex driving conditions. This work presents three main contributions: CNN-based traffic sign classification, camera–LiDAR fusion with depth enhancement and projection, and novel depth estimation techniques for distance measurement.
Firstly, a Convolutional Neural Network (CNN) model is developed, trained, and evaluated on the German Traffic Sign Recognition Benchmark (GTSRB) to achieve reliable classification of traffic signs under varying real-world conditions. The CNN architecture has several convolutional layers with activation functions (including ReLU, Leaky ReLU, and GELU), each followed by a max-pooling layer that systematically reduces the spatial dimensions while keeping important features. The extracted features are then passed to fully connected (FC) layers for classification. The network uses the Adam optimiser with categorical cross-entropy loss and employs regularisation methods. The study evaluates and compares the performance of the different activation functions to analyse their impact on recognition accuracy and model robustness. The final model classifies traffic signs in the GTSRB test set reliably, offering a dependable vision input source for further perception tasks in autonomous driving systems. Secondly, this study introduces a complete camera–LiDAR fusion framework that improves depth perception by using calibration data and transformation matrices, which include both intrinsic and extrinsic parameters, to project 3D LiDAR point clouds onto 2D camera images. Using homogeneous coordinate transformations and matrix multiplication, the projection pipeline maps LiDAR points onto the image plane, creating a sparse depth map that is aligned with the RGB data. LiDAR point clouds are naturally sparse and of low resolution, especially in systems with fewer vertical beams. To address this problem, interpolation is used to make the depth map denser and to emulate higher-resolution point distributions. This upsampling process is applied to both bounding box and segmentation mask regions to evaluate the efficacy of the different spatial priors.
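The projection step described above can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function name `project_lidar_to_image`, the 3x3 intrinsic matrix `K`, and the 4x4 extrinsic matrix `T_cam_lidar` are assumptions introduced here for illustration, and keeping the nearest point when several fall on one pixel is likewise an assumed design choice.

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, T_cam_lidar, image_shape):
    """Project N x 3 LiDAR points into a sparse depth map of shape (H, W).

    K           : assumed 3x3 camera intrinsic matrix
    T_cam_lidar : assumed 4x4 extrinsic transform (LiDAR frame -> camera frame)
    Pixels with no LiDAR return are left at 0.
    """
    h, w = image_shape
    n = points_xyz.shape[0]

    # Homogeneous coordinates so the rigid transform is one matrix product.
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T)[:3]              # (3, N)

    # Discard points behind the camera.
    pts_cam = pts_cam[:, pts_cam[2] > 0]

    # Perspective projection with the intrinsics.
    uvw = K @ pts_cam
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    z = pts_cam[2]

    # Keep only points that land inside the image.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], z[inside]

    # Write far points first so the nearest point wins on shared pixels.
    depth = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth
```

For KITTI-style calibration files, the single extrinsic matrix would typically be the product of the rectification and LiDAR-to-camera matrices; that composition is omitted here to keep the sketch short.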
The study provides a foundation for future perception pipelines in autonomous driving systems. Thirdly, a novel distance estimation method is proposed based on fused camera–LiDAR data. Several depth extraction techniques are introduced and evaluated, including Point-by-Point (PbyP), Complete Region Depth Extraction (CoRDE), Central Region Depth Extraction (CeRDE), and Grid Central Region Depth Extraction (GCRDE). These methods are tested across various object categories (e.g., cars, trucks, bicycles) and occlusion levels (0 to 3) using metrics such as extraction time, accuracy, and Root Mean Square Error (RMSE). Results show that segmentation mask-based methods, especially CeRDE and GCRDE, achieve higher depth estimation accuracy and lower RMSE, particularly for large and occluded objects. However, bounding box methods such as PbyP and CoRDE maintain faster processing times, favouring real-time applications. GeRDE provides a balanced solution, offering both high accuracy and computational efficiency. Overall, this thesis contributes to perception in autonomous driving systems by demonstrating that deep learning-enhanced sensor fusion and optimised depth extraction can significantly improve the performance and reliability of perception systems under complex real-world conditions.
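The abstract names the depth extraction methods but does not define them, so the following is only a hedged sketch of the general idea of reading one distance per object from a sparse depth map, using either the whole detection box or a central part of it. The function names `region_depth` and `rmse`, the `central_fraction` parameter, and the use of the median are all assumptions made here; they should not be read as the thesis's definitions of CoRDE, CeRDE, or GCRDE.

```python
import numpy as np

def region_depth(depth_map, box, central_fraction=1.0):
    """Median of the valid (non-zero) depths inside a box, or None if empty.

    box              : (x1, y1, x2, y2) in pixels, e.g. from an object detector
    central_fraction : 1.0 uses the whole box; smaller values shrink the
                       window around the box centre (an assumed heuristic
                       for reducing background points near the box edges)
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * central_fraction / 2
    half_h = (y2 - y1) * central_fraction / 2

    xa, xb = int(round(cx - half_w)), int(round(cx + half_w))
    ya, yb = int(round(cy - half_h)), int(round(cy + half_h))

    patch = depth_map[ya:yb, xa:xb]
    valid = patch[patch > 0]          # zero marks pixels without a LiDAR return
    if valid.size == 0:
        return None
    return float(np.median(valid))

def rmse(estimates, ground_truth):
    """Root Mean Square Error between estimated and reference distances."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean((est - ref) ** 2)))
```

In this sketch, a box that also covers distant background returns can yield a misleading whole-box median, whereas a smaller central window is more likely to contain only points on the object, at the cost of having fewer points to work with.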
Description:	This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London
URI:	https://bura.brunel.ac.uk/handle/2438/33449
Appears in Collections:	Electronic and Electrical Engineering;Department of Electronic and Electrical Engineering Theses

Files in This Item:

File	Description	Size	Format
FulltextThesis.pdf	Embargoed until 14/06/2029	48.11 MB	Adobe PDF
