AUTHOREA
SGVLM: Depth-Integrated Semantic Scene Graph Fusion for Enhanced Autonomous Driving D...
Yiming Han, Yiran Tao, Xiang Cui, and 2 more

September 13, 2025
Autonomous driving decision-making requires a deep semantic understanding of traffic scenes. In this paper, we propose the SGVLM (Semantic Graph Vision-Language Model) architecture: a vision-language model that enhances autonomous driving decision-making through depth-integrated semantic scene graph fusion. Key objects are represented as nodes (category, state) and spatial-semantic relations as edges, enriched with pixel-wise depth estimates from Depth-Anything-V2 to capture accurate inter-object distances. These structured graph features are aggregated via a two-layer Graph Attention Network and projected into FastVLM's FastViTHD feature space. A cross-modal triplet fusion layer then jointly integrates graph embeddings, visual features, and natural-language queries. Leveraging Low-Rank Adaptation (LoRA) for efficient fine-tuning, SGVLM_7B achieves relative improvements of 25.9% in BLEU-4 and 18.6% in ROUGE-L over the InternVL4Drive-v2 baseline on the DriveLM-nuScenes benchmark, and attains 94.56% accuracy on collision-warning decision tasks in our TTSG-data safety-critical scenarios. These results demonstrate that depth-integrated semantic scene graph fusion substantially enhances the model's ability to generate actionable driving decisions under complex traffic conditions.
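The aggregation step described above can be sketched in miniature. The snippet below is a hedged illustration only, not the paper's implementation: it builds a toy three-node scene graph, runs two stacked single-head graph-attention layers (in the style of Veličković et al.'s GAT, which the abstract's "two-layer Graph Attention Network" presumably resembles), and linearly projects the result into a placeholder "visual" feature space. All dimensions, the random weights, and the adjacency pattern are invented for demonstration; the real model would use learned parameters and FastViTHD's actual feature dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

def gat_layer(h, adj, W, a):
    """One single-head graph-attention layer (GAT-style sketch).

    h:   (N, F_in) node features
    adj: (N, N) binary adjacency (1 = edge; includes self-loops)
    W:   (F_in, F_out) shared linear map
    a:   (2 * F_out,) attention vector
    """
    z = h @ W                                          # (N, F_out)
    N = z.shape[0]
    # Attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
    e = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            s = np.concatenate([z[i], z[j]]) @ a
            e[i, j] = s if s > 0 else 0.2 * s
    e = np.where(adj > 0, e, -1e9)                     # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # row-wise softmax
    return np.maximum(alpha @ z, 0)                    # ReLU of weighted sum

# Toy scene graph: ego vehicle, car ahead, pedestrian (hypothetical example).
# Node features would encode category, state, and depth-derived distance.
N, F_in, F_hid, F_vis = 3, 6, 8, 16                    # F_vis: assumed visual dim
h = rng.normal(size=(N, F_in))
adj = np.array([[1.0, 1.0, 1.0],                       # ego relates to both objects
                [1.0, 1.0, 0.0],
                [1.0, 0.0, 1.0]])

# Two stacked GAT layers, then projection into the (assumed) visual space.
W1, a1 = rng.normal(size=(F_in, F_hid)), rng.normal(size=2 * F_hid)
W2, a2 = rng.normal(size=(F_hid, F_hid)), rng.normal(size=2 * F_hid)
W_proj = rng.normal(size=(F_hid, F_vis))

g = gat_layer(gat_layer(h, adj, W1, a1), adj, W2, a2)
graph_tokens = g @ W_proj                              # (N, F_vis): graph embeddings
print(graph_tokens.shape)
```

In the full architecture, these projected graph tokens would be one of the three inputs (alongside visual features and the language query) to the cross-modal triplet fusion layer.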
