Autonomous driving decision-making requires a deep semantic understanding of traffic scenes. In this paper, we propose SGVLM (Semantic Graph Vision-Language Model), a vision-language architecture that enhances autonomous driving decision-making through depth-integrated semantic scene graph fusion. Key objects are represented as nodes (category, state) and spatial-semantic relations as edges, enriched with pixel-wise depth estimates from Depth-Anything-V2 to capture accurate inter-object distances. These structured graph features are aggregated by a two-layer Graph Attention Network and projected into FastVLM's FastViTHD feature space. A cross-modal triplet fusion layer then jointly integrates graph embeddings, visual features, and natural-language queries. Using Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, SGVLM_7B achieves relative improvements of 25.9% in BLEU-4 and 18.6% in ROUGE-L over the InternVL4Drive-v2 baseline on the DriveLM-nuScenes benchmark, and attains 94.56% accuracy on collision-warning decision tasks in our safety-critical TTSG-data scenarios. These results demonstrate that depth-integrated semantic scene graph fusion substantially improves the model's ability to generate actionable driving decisions in complex traffic conditions.
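The graph-processing path sketched above (node features → two-layer graph attention aggregation → projection into the visual feature space) can be illustrated in miniature. This is a toy, dependency-free sketch only: the function names, the scalar attention weight standing in for a learned attention vector, and the 2-d/3-d dimensions are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gat_layer(node_feats, edges, attn_w):
    """One toy graph-attention layer.

    node_feats: list of node feature vectors (e.g. encoding category/state/depth).
    edges: adjacency list per node, including a self-loop.
    attn_w: scalar scaling of the dot-product attention score -- a stand-in
            for the learned attention parameters of a real GAT.
    """
    out = []
    for i, h_i in enumerate(node_feats):
        neigh = edges[i]
        scores = [attn_w * sum(a * b for a, b in zip(h_i, node_feats[j]))
                  for j in neigh]
        alphas = softmax(scores)
        # Attention-weighted aggregation of neighbor features.
        agg = [sum(alpha * node_feats[j][d] for alpha, j in zip(alphas, neigh))
               for d in range(len(h_i))]
        out.append(agg)
    return out

def project(feats, W):
    """Linear projection of each graph embedding into a target feature space.
    W has shape (out_dim, in_dim); here it mimics the map into the
    vision encoder's feature space."""
    return [[sum(w * f for w, f in zip(row, h)) for row in W] for h in feats]

# Toy scene graph: 3 objects with 2-d features, chain-connected with self-loops.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [[0, 1], [1, 0, 2], [2, 1]]

h1 = gat_layer(nodes, edges, attn_w=1.0)   # layer 1
h2 = gat_layer(h1, edges, attn_w=1.0)      # layer 2
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2-d graph space -> 3-d "visual" space
graph_tokens = project(h2, W)
```

In the actual model these projected graph tokens would be fused with visual features and the language query; here they simply demonstrate the aggregate-then-project structure.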