TRACE = Textual Representation of Allocentric Context from Egocentric Video
Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
TRACE induces multimodal models to generate a text-based allocentric representation of the 3D environment as an intermediate reasoning trace for spatial question answering.
Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas.
TRACE encodes meta-context, camera trajectory, and entities as an intermediate reasoning trace for spatial QA with MLLMs.
Existing MLLMs struggle to construct structured abstractions of 3D environments from egocentric video.
TRACE induces a structured allocentric representation with meta-context, a camera trajectory, and an entity registry.
Performance improves consistently across backbones, parameter scales, and training schemas.
Text as a Spatial Interface
TRACE turns egocentric video into a text-based allocentric representation that multimodal models can reason over.
Existing MLLMs often over-rely on 2D visual cues instead of building hierarchical abstractions of the 3D scene.
TRACE explicitly models meta-context, camera trajectory, and detailed object entities as an intermediate reasoning trace.
The resulting allocentric representation supports routing, measurement, order, and relational reasoning while improving spatial QA on VSI-Bench and OST-Bench.
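To illustrate why an allocentric representation helps with measurement and relational questions, here is a minimal sketch. It assumes each entity has already been assigned a 2D position in a shared global frame (x to the right, y forward, viewed from above). The object names, coordinates, and helper functions are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]

# Hypothetical entity positions in a shared global frame, in meters.
positions: Dict[str, Point] = {
    "door": (0.0, 0.0),
    "table": (0.0, 2.0),
    "sofa": (-1.5, 1.0),
    "lamp": (3.0, 4.0),
}


def distance(a: str, b: str) -> float:
    """Measurement: straight-line distance between two registered entities."""
    return math.dist(positions[a], positions[b])


def side_of(standing_at: str, facing: str, target: str) -> str:
    """Relation: is `target` to the left or right when standing at one entity and facing another?"""
    ax, ay = positions[standing_at]
    fx, fy = positions[facing]
    tx, ty = positions[target]
    # The sign of the 2D cross product of (facing - standing) and (target - standing)
    # tells on which side of the viewing direction the target lies.
    cross = (fx - ax) * (ty - ay) - (fy - ay) * (tx - ax)
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "ahead or behind"


print(distance("door", "lamp"))
print(side_of("door", "table", "sofa"))
```

Once positions live in one global frame, such questions reduce to simple geometry that no longer depends on which video frame an object appeared in.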
How TRACE Structures Reasoning
TRACE aligns a global coordinate system with the room layout, logs the camera trajectory, and registers visible objects with their key attributes, estimated positions, and spatial relations.
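The three components described above can be pictured as a small data structure. The following is a sketch only: the class names, field names, and JSON serialization are assumptions made for illustration, and the actual TRACE schema used in the paper may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MetaContext:
    room_type: str          # hypothetical field: coarse scene label
    axes: str               # how the global frame is aligned with the room layout
    units: str = "meters"


@dataclass
class TrajectoryStep:
    step: int                       # order of this pose along the video
    position: Tuple[float, float]   # camera (x, y) in the global frame
    facing: str                     # coarse heading, e.g. "+y"


@dataclass
class Entity:
    name: str
    attributes: Dict[str, str]      # key attributes such as color or size
    position: Tuple[float, float]   # estimated (x, y) in the global frame
    relations: List[str] = field(default_factory=list)


@dataclass
class Trace:
    meta: MetaContext
    trajectory: List[TrajectoryStep]
    entities: List[Entity]

    def to_text(self) -> str:
        """Serialize the whole record as JSON text that a model could emit or read."""
        return json.dumps(asdict(self), indent=2)


example = Trace(
    meta=MetaContext(room_type="living room", axes="+y points from the door to the window"),
    trajectory=[TrajectoryStep(0, (0.0, 0.0), "+y"), TrajectoryStep(1, (0.5, 1.0), "+x")],
    entities=[Entity("sofa", {"color": "gray"}, (1.0, 2.5), ["left of the table"])],
)
print(example.to_text())
```

Keeping the record as plain text means the same representation can be written by the model and read back during reasoning.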
Spatial Descriptor
The model is prompted to emit a schema-compliant TRACE before the final response instead of an unconstrained rationale.
TRACE Structure
Meta Context, Trajectory, and Entity Registry are stored in a structured allocentric cache for inference-time reuse.
Reasoning Parser
Final answers are generated by jointly reasoning over the original video and the previously constructed TRACE.
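The prompt-then-parse flow described in these three steps might look roughly like the sketch below. The instruction wording, the `<trace>` and `<answer>` tag names, and both helper functions are hypothetical, not the paper's prompt, and the call to the multimodal model itself is left out.

```python
import re
from typing import Tuple

# Hypothetical instruction text; the paper's actual prompt is not reproduced here.
TRACE_INSTRUCTIONS = (
    "First write an allocentric description of the scene inside <trace>...</trace>, "
    "covering meta-context, camera trajectory, and an entity registry. "
    "Then give the final answer inside <answer>...</answer>."
)

_TRACE_RE = re.compile(r"<trace>(.*?)</trace>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def build_prompt(question: str) -> str:
    """Combine the schema instructions with the spatial question."""
    return f"{TRACE_INSTRUCTIONS}\n\nQuestion: {question}"


def split_response(response: str) -> Tuple[str, str]:
    """Separate the intermediate representation from the final answer."""
    trace = _TRACE_RE.search(response)
    answer = _ANSWER_RE.search(response)
    if trace is None or answer is None:
        raise ValueError("response is missing a <trace> or <answer> block")
    return trace.group(1).strip(), answer.group(1).strip()


reply = "<trace>\nmeta: kitchen\n</trace>\n<answer>B</answer>"
trace_text, final_answer = split_response(reply)
print(final_answer)
```

Rejecting replies that lack either block is one simple way to enforce the schema-first ordering. A real pipeline would more likely retry the request than raise an error.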
Consistent Gains Across Models and Settings
Gemini 3 Pro (VSI-Bench)
Absolute gain over Direct prompting.
Qwen2.5-VL-72B (VSI-Bench)
Open-weight performance also improves under the same setup.
MiMo-VL-7B (VSI-Bench)
Compact models benefit from explicit geometric grounding.
Gemini 3 Pro (OST-Bench)
TRACE remains effective in multi-turn embodied scene understanding.
MiMo-VL-7B (OST-Bench)
The strongest uplift appears on the compact open-source model.
Entity grounding contributes the largest effect.
- TRACE is the strongest method on VSI-Bench across Gemini 3 Pro, Qwen2.5-VL-72B, and MiMo-VL-7B.
- On OST-Bench, TRACE improves over Direct by +1.20 on Gemini 3 Pro and +2.39 on MiMo-VL-7B.
- One-stage inference is stronger than two-stage on Gemini and Qwen, while text-only remains competitive.
- Cross-environment gains remain stable across ARKitScenes, ScanNet, and ScanNetPP.
TRACE yields consistent, state-of-the-art gains over Direct prompting across backbones and scales.
The Qwen series still trails Gemini 3 on both spatial reasoning and visual perception.
Use TRACE in Your Work
TRACE is a prompting approach that uses the Textual Representation of Allocentric Context from Egocentric Video as an intermediate reasoning trace for spatial understanding.
Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. 2026. Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), San Diego, California, USA. Association for Computational Linguistics.
@inproceedings{hua-etal-2026-unleashing,
  title = {Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning},
  author = {Hua, Jiacheng and Yin, Yishu and Wu, Yuhang and Wang, Tai and Huang, Yifei and Liu, Miao},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = jul,
  year = {2026},
  address = {San Diego, California, USA},
  publisher = {Association for Computational Linguistics},
  note = {To appear}
}
@misc{hua2026unleashingspatialreasoningmultimodal,
  title = {Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning},
  author = {Hua, Jiacheng and Yin, Yishu and Wu, Yuhang and Wang, Tai and Huang, Yifei and Liu, Miao},
  year = {2026},
  eprint = {2603.23404},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2603.23404}
}