Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across simulation platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture: heterogeneous high-level driving commands generated by the target ADVLM (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes to generate safety-critical scenarios, yielding a benchmark of 220 common routes and 220 threat scenarios. Experiments across 4 state-of-the-art ADVLMs and 16 different combinations validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions. Furthermore, we design a physical control abstraction layer that translates simulation actions into actuation signals, enabling closed-loop evaluation of ADVLMs on 3 physical vehicles. Bench2ADVLM is flexible and extensible, supporting diverse VLMs and enabling deployment across heterogeneous vehicles. To our knowledge, this is the first work to establish a closed-loop evaluation framework for ADVLMs, offering a principled path toward the scalable, reliable deployment of ADVLMs.
Overview of the Bench2ADVLM benchmark. The framework includes a dual-system adaptation architecture for translating high-level driving commands into mid-level control actions, a physical control abstraction layer for mapping mid-level control actions to low-level actuation signals, and a self-reflective scenario generation module for probing potential failure modes.
Physical-world evaluation is performed on the AGILE·X sandbox using Jetbot and LIMO. The vehicles collect real-time sensor data and send it to the dual-system adaptation architecture, which produces high-level commands and mid-level control actions. The physical control abstraction layer then translates these into platform-specific low-level actuation signals, closing the control loop.
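The closed loop can be summarized as a single observe–reason–actuate step. Below is a minimal Python sketch of one iteration, assuming illustrative interfaces (get_sensor_data, generate_command, interpret, to_actuation, and apply are hypothetical names, not the released Bench2ADVLM API).

```python
from dataclasses import dataclass

@dataclass
class MidLevelAction:
    """Standardized mid-level control action produced by the slow system."""
    throttle: float  # normalized [0, 1]
    steer: float     # normalized [-1, 1]
    brake: float     # normalized [0, 1]

def closed_loop_step(vehicle, advlm, gvlm, abstraction_layer) -> None:
    """One control-loop iteration: observe -> fast system -> slow system -> actuate."""
    observation = vehicle.get_sensor_data()                        # real-time camera / LiDAR / IMU data
    command = advlm.generate_command(observation)                  # fast system: free-form high-level command
    action: MidLevelAction = gvlm.interpret(command, observation)  # slow system: standardized mid-level action
    signal = abstraction_layer.to_actuation(action)                # platform-specific low-level actuation signal
    vehicle.apply(signal)                                          # actuate; the next observation reflects it
```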
The ADVLM is prompted with P3 (perception–prediction–planning) queries, and a general-purpose VLM (GVLM) fuses the answers into a unified scene description.
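A rough sketch of this prompting scheme is given below; the exact query wording and the fusion prompt are assumptions for illustration, not the prompts used in the paper.

```python
P3_QUERIES = {
    "perception": "Describe the key objects, lanes, and traffic signals in the current frame.",
    "prediction": "How are the surrounding agents likely to move over the next few seconds?",
    "planning":   "What maneuver should the ego vehicle take next, and why?",
}

def query_p3(advlm, frame):
    """Ask the target ADVLM the three P3 questions about one camera frame."""
    return {stage: advlm.answer(image=frame, question=q) for stage, q in P3_QUERIES.items()}

def fuse_with_gvlm(gvlm, answers):
    """Let the general-purpose VLM fuse the three answers into one scene description."""
    prompt = ("Summarize the following perception, prediction, and planning answers "
              "into a single coherent driving-scene description:\n"
              + "\n".join(f"[{stage}] {answer}" for stage, answer in answers.items()))
    return gvlm.answer(question=prompt)
```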
Main experimental results on Bench2ADVLM. † and ☆ indicate the use of Continuous Numerical Generation and Discrete Classification Selection parsing modes, respectively. Blue subscripts denote the standard deviation (±std) over multiple runs.
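The two parsing modes differ in how a model's textual output is turned into an executable action. The sketch below illustrates the distinction; the value format and the discrete action vocabulary are assumptions, not the benchmark's actual specification.

```python
import re

def parse_continuous(text: str):
    """Continuous Numerical Generation: extract numeric control values from
    free-form output such as 'throttle=0.4, steer=-0.1'."""
    values = {k: float(v) for k, v in
              re.findall(r"(throttle|steer|brake)\s*[=:]\s*(-?\d+\.?\d*)", text.lower())}
    return values or None  # None flags an unparseable response

DISCRETE_ACTIONS = ["accelerate", "decelerate", "keep speed", "turn left", "turn right", "stop"]

def parse_discrete(text: str):
    """Discrete Classification Selection: map the output onto one option
    from a fixed, predefined action set."""
    lowered = text.lower()
    for action in DISCRETE_ACTIONS:
        if action in lowered:
            return action
    return None
```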
Insight 1: ADVLMs lack fine-grained control and show limited closed-loop performance, with low Success Rate and Driving Score highlighting the gap to deployment readiness.
Experimental results on Bench2ADVLM under threat scenarios. Blue subscripts denote the standard deviation (±std) over multiple runs, while red superscripts indicate the performance drop (-drop) compared to the main results.
Model performance over different scenarios on Bench2ADVLM.
Insight 2: LLaVA shows a milder decline than LLaMA on behavior-quality metrics (e.g., Efficiency), while LLaMA performs better on basic driving metrics (e.g., Success Rate).
We use two autonomous driving platforms: Jetbot and LIMO. Both platforms are equipped with onboard sensors, including cameras, LiDAR, and IMU, and are capable of standard motion control. Jetbot, featuring stronger onboard computational resources, is suited for AI-intensive workloads, while LIMO emphasizes actuation stability and supports multiple driving modes, including differential, Ackermann, tracked, and Mecanum configurations. Both platforms adopt ROS as the internal communication and control framework. The figure below showcases the real-world evaluation setup and representative experimental outcomes.
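Since both platforms speak ROS, the physical control abstraction layer can, in its simplest form, publish velocity commands derived from the mid-level actions. The snippet below is a minimal sketch of that idea; the topic name, node name, and scaling constants are assumptions that would need to match each vehicle's configuration.

```python
#!/usr/bin/env python
import rospy
from geometry_msgs.msg import Twist

MAX_LINEAR = 0.5   # m/s, conservative cap for the sandbox (assumed)
MAX_ANGULAR = 1.0  # rad/s (assumed)

def action_to_twist(throttle: float, steer: float, brake: float) -> Twist:
    """Map a normalized mid-level action to a velocity command."""
    msg = Twist()
    msg.linear.x = 0.0 if brake > 0.5 else throttle * MAX_LINEAR
    msg.angular.z = -steer * MAX_ANGULAR  # sign convention depends on the platform
    return msg

if __name__ == "__main__":
    rospy.init_node("bench2advlm_actuation")                 # hypothetical node name
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)   # common, but platform-specific, topic
    rate = rospy.Rate(10)                                    # 10 Hz control loop
    while not rospy.is_shutdown():
        pub.publish(action_to_twist(throttle=0.3, steer=0.0, brake=0.0))
        rate.sleep()
```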
To quantitatively assess real-world driving performance, we design a structured evaluation strategy centered on the lane-following task. The driving sandbox is partitioned into ten distinct route segments with varying geometric and traffic complexity. Each AD vehicle drives each route three times, and we report the average results to ensure statistical reliability. The primary evaluation metric is the route completion rate, defined as the percentage of the planned trajectory successfully traversed by the vehicle without crossing the yellow boundary lines or colliding with obstacles.
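For concreteness, a sketch of this metric under the definition above is shown below, assuming distance-based measurement along the planned route; the exact implementation details are not part of the text.

```python
def route_completion_rate(planned_length_m: float, traversed_length_m: float) -> float:
    """Route completion rate (%): share of the planned trajectory covered before the
    run ends (route finished, yellow boundary line crossed, or obstacle hit)."""
    return 100.0 * min(traversed_length_m / planned_length_m, 1.0)

def average_over_runs(rates):
    """Each vehicle drives each route three times; the mean is reported."""
    return sum(rates) / len(rates)
```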
First-person views captured every 0.5 seconds in RouteScenario-3749. The images show 6 frames per scene.
First-person views captured every 0.5 seconds in RouteScenario-2082. The images show 6 frames per scene.
BibTeX Code Here