VULCAN

Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation
for Indoor Fire-Disaster Response

Multi-agent search and rescue in an indoor fire environment with smoke, heat, and sensor degradation.

Abstract

Indoor fire disasters pose severe challenges to autonomous search and rescue due to dense smoke, high temperatures, and rapidly evolving conditions. In such time-critical scenarios, multi-agent cooperative navigation is particularly valuable, as it enables faster and broader exploration than single-agent approaches. However, existing multi-agent navigation systems are primarily vision-based and designed for benign indoor settings, leading to significant performance degradation under fire-driven dynamic conditions. In this paper, we present VULCAN, a multi-agent cooperative navigation framework based on multi-modal perception and vision–language models (VLMs), tailored for indoor fire-disaster response. We extend the Habitat-Matterport3D benchmark by simulating physically realistic fire scenarios, including smoke diffusion, thermal hazards, and sensor degradation. We evaluate representative multi-agent cooperative navigation baselines in both normal and fire-driven environments. Our results reveal critical failure modes of existing methods in fire scenarios and underscore the necessity of robust perception and hazard-aware planning for reliable multi-agent search and rescue.

System Design Overview

VULCAN is a hazard-aware multi-agent navigation framework that integrates multi-modal perception, vision–language reasoning, and cooperative planning to enable robust operation in indoor fire-disaster environments.

System design overview of the VULCAN framework
Overview of the VULCAN system architecture. Each agent performs multi-modal perception and fusion, constructs hazard-aware global maps, and plans safe and efficient exploration using a VLM-based global planner and hazard-aware FMM local planner.
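The core idea behind the hazard-aware local planner is that path cost should penalize hazard exposure, not just distance. The sketch below illustrates this with a Dijkstra grid search rather than the FMM formulation used in VULCAN; the function name, grid encoding, and `hazard_weight` parameter are illustrative assumptions, not the paper's implementation.

```python
import heapq

def plan_hazard_aware(occ, hazard, start, goal, hazard_weight=10.0):
    """Grid planner that trades path length against hazard exposure.

    occ:    2D list, 1 = obstacle, 0 = free space
    hazard: 2D list of non-negative hazard intensities (e.g. heat risk)
    Returns a list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(occ), len(occ[0])
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            # Reconstruct the path by walking predecessors back to start.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist.get(cell, float("inf")):
            continue
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and occ[nr][nc] == 0:
                # Step cost = unit distance + weighted hazard at the next cell.
                nd = d + 1.0 + hazard_weight * hazard[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    return None
```

With `hazard_weight` large, the planner detours around hot cells even when the direct route is geometrically shorter; setting it to zero recovers a plain shortest-path planner.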

Perception Degradation under Fire Conditions

Fire-induced smoke severely degrades multi-modal perception in indoor environments. Using Gazebo-based fire simulations with a high-fidelity physics engine and particle-emitter support, we qualitatively demonstrate modality-dependent perception degradation under increasing smoke density.

RGB Camera

Depth Camera

Thermal Camera

Lidar Sensor

Modality-dependent perception degradation under increasing smoke density in simulated indoor fire environments.
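A simple way to reason about why optical modalities degrade with smoke density is the Beer-Lambert attenuation law, which models the fraction of light surviving a path through a scattering medium. The sketch below applies it to estimate an effective depth-sensing range; the uniform-density assumption, function names, and the extinction coefficient are illustrative simplifications, not the particle-level model used in our Gazebo simulations.

```python
import math

def smoke_transmittance(density, distance, extinction_coeff=1.0):
    """Fraction of light surviving a path through smoke (Beer-Lambert law).

    density:  smoke particle density, assumed spatially uniform here
    distance: path length from sensor to surface
    """
    return math.exp(-extinction_coeff * density * distance)

def degraded_depth_range(max_range, density, min_transmittance=0.05,
                         extinction_coeff=1.0):
    """Effective range at which returns drop below a usable threshold."""
    if density <= 0:
        return max_range
    # Invert Beer-Lambert: distance where transmittance hits the threshold.
    limit = -math.log(min_transmittance) / (extinction_coeff * density)
    return min(max_range, limit)
```

The exponential form explains the qualitative picture above: RGB and depth collapse quickly as density grows, while thermal imaging, whose long-wave infrared band is attenuated far less by smoke particles, retains usable range.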

Multi-Agent Cooperative Navigation under Normal and Fire Conditions

We qualitatively compare multi-agent navigation behaviors under normal and fire-induced environments. Different semantic targets are used as proxies for human presence, highlighting how fire hazards affect individual perception and cooperative route planning.

Task I: Finding a Bed (Proxy of Humans)

Beds are used as proxies for human locations in indoor rescue scenarios.

Agent 0 View (Normal Environment)

Agent 0 View (Fire Environment)

Agent 1 View (Normal Environment)

Agent 1 View (Fire Environment)

Cooperative Route Planning (Normal Environment)

Cooperative Route Planning (Fire Environment)

Task II: Finding a Toilet (Proxy of Humans)

Toilets serve as alternative proxies for human presence.

Agent 0 View (Normal Environment)

Agent 0 View (Fire Environment)

Agent 1 View (Normal Environment)

Agent 1 View (Fire Environment)

Cooperative Route Planning (Normal Environment)

Cooperative Route Planning (Fire Environment)

All experiments are conducted in the Habitat simulator. Open-vocabulary object detection and class-agnostic segmentation are implemented using YOLOv8 and Mobile-SAM. A vision–language model (GPT-4o) assigns frontier goals based on the global multi-agent state. Each episode involves two robots with shared start positions and different initial orientations.
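The VLM-based frontier assignment can be sketched as: serialize the global multi-agent state and frontier candidates into a prompt, query the model, and parse a structured reply. The helper below is a minimal illustration with the actual model call abstracted behind a callable; the prompt wording, dictionary schema, and function names are assumptions, not our exact implementation.

```python
import json

def assign_frontiers(agents, frontiers, query_vlm):
    """Ask a VLM to match each agent to a frontier goal.

    agents:    list of dicts like {"id": 0, "pos": [x, y]}
    frontiers: list of dicts like {"id": "F1", "centroid": [x, y], "area": 12}
    query_vlm: callable prompt -> reply string; abstracts the model API
    Expects the reply to be JSON mapping agent id -> frontier id.
    """
    prompt = (
        "You coordinate a multi-robot search team in a burning building.\n"
        f"Agents: {json.dumps(agents)}\n"
        f"Frontier candidates: {json.dumps(frontiers)}\n"
        "Assign each agent one frontier id so coverage is maximized and "
        "hazardous regions are avoided. Reply with JSON of the form "
        '{"agent_id": "frontier_id"}.'
    )
    reply = query_vlm(prompt)
    assignment = json.loads(reply)
    valid_ids = {f["id"] for f in frontiers}
    # Keep only assignments that name a real frontier candidate.
    return {int(a): f for a, f in assignment.items() if f in valid_ids}
```

Validating the reply against the known frontier ids guards against the hallucinated outputs that, as shown later, become more frequent under degraded perception.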

Evaluation

We evaluate representative multi-agent navigation baselines, including both VLM-based planners and conventional map-based methods, under normal and simulated indoor fire conditions to assess their navigation success rate, exploration efficiency, and safety awareness. The quantitative results are summarized in Tables 1 and 2.

Table 1. Performance comparison under normal conditions

Method          NS       SR      SPL     CHE
Greedy          219.03   0.686   0.322   0
Cost-Utility    199.60   0.628   0.315   0
Random Sample   206.62   0.631   0.258   0
Co-NavGPT       185.43   0.666   0.388   0

Table 2. Performance comparison under fire conditions (arrows indicate the direction of change relative to the normal-condition results in Table 1)

Method          NS (↑)   SR (↓)  SPL (↓)  CHE (↑)
Greedy          267.40   0.651   0.319    11.233
Cost-Utility    207.49   0.608   0.306    8.517
Random Sample   214.96   0.625   0.251    7.174
Co-NavGPT       187.89   0.660   0.381    4.873
We evaluate all methods using four metrics: Number of Steps (NS), Success Rate (SR), Success weighted by Path Length (SPL), and Cumulative Hazard Exposure (CHE). By comparing the performance across normal and fire scenarios, we observe substantial performance degradation for existing approaches originally designed for clean indoor environments. This highlights the critical need for robust multi-modal perception and hazard-aware planning in fire-disaster response applications.
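The two path-quality metrics can be made concrete as follows. SPL follows the standard definition (success weighted by the ratio of shortest-path length to actual path length, averaged over episodes); the CHE computation below, summing hazard intensity over visited cells, reflects our reading of the metric, and the episode-dictionary keys are illustrative.

```python
def spl(episodes):
    """Success weighted by Path Length, averaged over episodes.

    Each episode dict holds: success (bool), shortest_len, agent_len.
    Failed episodes contribute 0; a successful episode contributes
    shortest_len / max(agent_len, shortest_len).
    """
    total = 0.0
    for ep in episodes:
        if ep["success"]:
            total += ep["shortest_len"] / max(ep["agent_len"],
                                              ep["shortest_len"])
    return total / len(episodes)

def che(trajectory, hazard_map):
    """Cumulative Hazard Exposure: hazard intensity summed over the
    (row, col) cells visited along the trajectory."""
    return sum(hazard_map[r][c] for r, c in trajectory)
```

Under this formulation, the zero CHE values in Table 1 follow directly from an all-zero hazard map in normal conditions, while a planner that detours around hot zones trades a longer path (lower SPL) for lower CHE.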

Performance Degradation Analysis under Fire-driven Scenarios

We further analyze the underlying causes of the observed performance degradation under fire-driven scenarios. Our analysis reveals that perception uncertainty, incomplete map construction, and hazard-unaware decision making jointly contribute to navigation failures in smoke-filled and thermally hazardous environments.

Analysis of performance degradation causes under fire scenarios
The comparison highlights three failure modes: (a, d) Perception failure: smoke causes missed detections (confidence drop in a) and false positives (hallucinations in d); (b, e) Inefficient exploration: agents redundantly revisit already-explored areas and achieve reduced overall coverage under fire conditions; (c, f) Unsafe planning: without hazard awareness, agents plan paths through high-risk zones (f) despite the thermal hazards shown in the ground-truth map (c).