ACL 2026 Main (Oral)

TRACE = Textual Representation of Allocentric Context from Egocentric Video

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

TRACE induces multimodal models to generate a text-based allocentric representation of the 3D environment as an intermediate reasoning trace for spatial question answering.

Jiacheng Hua¹,², Yishu Yin¹, Yuhang Wu¹, Tai Wang², Yifei Huang³,², Miao Liu¹†

¹ College of AI, Tsinghua University (Beijing)

² Shanghai Artificial Intelligence Laboratory (Shanghai)

³ The University of Tokyo (Tokyo)

Abstract

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas.

Figure (Teaser): The challenge of spatial reasoning from egocentric video and the TRACE representation.
TRACE encodes meta-context, camera trajectory, and entities as an intermediate reasoning trace for spatial QA with MLLMs.

Why it matters

Existing MLLMs struggle to construct structured abstractions of 3D environments from egocentric video.

What TRACE changes

It induces a structured allocentric representation comprising meta-context, a camera trajectory, and an entity registry.

What we observe

Performance improves consistently across backbones, parameter scales, and training schemas.

Overview

Text as a Spatial Interface

TRACE turns egocentric video into a text-based allocentric representation that multimodal models can reason over.

Problem

Existing MLLMs often over-rely on 2D visual cues instead of building hierarchical abstractions of the 3D scene.

Approach

TRACE explicitly models meta-context, camera trajectory, and detailed object entities as an intermediate reasoning trace.

Outcome

The resulting allocentric representation supports routing, measurement, order, and relational reasoning while improving spatial QA on VSI-Bench and OST-Bench.
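To make the measurement and relational cases concrete, here is a minimal sketch of how such questions reduce to simple geometry once objects and the observer have coordinates in a shared allocentric frame. The function names, the 2D grid, and the heading convention (0 degrees along +y, increasing clockwise) are assumptions made for illustration and are not taken from the paper.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y) points on the allocentric grid."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def relative_direction(observer_xy, heading_deg, target_xy):
    """Classify a target as front/right/back/left of an observer.

    Assumed convention: heading 0 points along +y and headings grow
    clockwise, so heading 90 points along +x.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    # Bearing of the target, measured clockwise from the +y axis.
    bearing = math.degrees(math.atan2(dx, dy))
    # Angle relative to the observer's heading, wrapped into [-180, 180).
    rel = (bearing - heading_deg + 180) % 360 - 180
    if -45 <= rel <= 45:
        return "front"
    if 45 < rel < 135:
        return "right"
    if -135 < rel < -45:
        return "left"
    return "back"
```

For example, with the observer at the origin facing +x (heading 90), an object at (0, 5) is classified as "left", and `distance((0, 0), (3, 4))` evaluates to 5.0 grid units.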

TRACE Schema

< Meta Context, Trajectory, Entity Registry >

Meta Context

Room topology, grid alignment, and observer initialization.

Trajectory

Timestamps, estimated positions, headings, and motion context.

Entity Registry

Timestamped objects with signatures, coordinates, and relations.

Inference Interface

Egocentric Video → TRACE → Spatial QA → Answer

Method

How TRACE Structures Reasoning

Step 1

Spatial Descriptor

The model is prompted to emit a schema-compliant TRACE before the final response, instead of an unconstrained rationale.

Step 2

TRACE Structure

Meta Context, Trajectory, and Entity Registry are stored in a structured allocentric cache for inference-time reuse.

Step 3

Reasoning Parser

Final answers are generated by jointly reasoning over the original video and the previously constructed TRACE.
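The three steps can be sketched as a small two-call loop: request a TRACE, check that it has the expected sections, cache it, then answer with the video and the TRACE together. This is a hedged illustration rather than the authors' implementation: `generate` is a hypothetical callable standing in for any MLLM client, the prompt wording is invented, and the section check is a simple substring test.

```python
from typing import Callable, Dict, List, Optional, Tuple

REQUIRED_SECTIONS = ("Meta Context", "Trajectory", "Entity Registry")

# Hypothetical model interface: (video_reference, prompt) -> text output.
Generate = Callable[[str, str], str]


def descriptor_prompt() -> str:
    """Step 1: ask for a TRACE with the required sections (invented wording)."""
    sections = ", ".join(REQUIRED_SECTIONS)
    return ("Watch the egocentric video and write an allocentric description "
            f"of the scene with these sections, in order: {sections}.")


def parser_prompt(trace_text: str, question: str) -> str:
    """Step 3: ask the question with the TRACE included in the prompt."""
    return ("Using the video and the TRACE below, answer the question.\n\n"
            f"TRACE:\n{trace_text}\n\nQuestion: {question}")


def missing_sections(trace_text: str) -> List[str]:
    """Return the required section names that do not appear in the text."""
    return [s for s in REQUIRED_SECTIONS if s not in trace_text]


def answer_with_trace(generate: Generate, video: str, question: str,
                      cache: Optional[Dict[str, str]] = None) -> Tuple[str, str]:
    """Build (or reuse) a TRACE for the video, then answer the question."""
    cache = {} if cache is None else cache
    if video not in cache:
        trace_text = generate(video, descriptor_prompt())
        missing = missing_sections(trace_text)
        if missing:
            raise ValueError(f"TRACE is missing sections: {missing}")
        cache[video] = trace_text  # Step 2: keep the TRACE for reuse
    answer = generate(video, parser_prompt(cache[video], question))
    return cache[video], answer
```

Passing the same `cache` dictionary across questions about one video means the TRACE is generated once and reused, so each follow-up question costs a single model call.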

Results

Consistent Gains Across Models and Settings

VSI-Bench
Gemini 3 Pro

+7.54

Absolute gain over Direct prompting.

VSI-Bench
Qwen2.5-VL-72B

+3.10

Open-weight performance also improves under the same setup.

VSI-Bench
MiMo-VL-7B

+1.63

Compact models benefit from explicit geometric grounding.

OST-Bench
Gemini 3 Pro

+1.20

TRACE remains effective in multi-turn embodied scene understanding.

OST-Bench
MiMo-VL-7B

+2.39

The strongest uplift appears on the compact open-source model.

Ablation Signal

-5.24 without Entity Registry
-1.92 without Trajectory

Entity grounding contributes the largest effect.

Reading the results

TRACE is the strongest method on VSI-Bench across Gemini 3 Pro, Qwen2.5-VL-72B, and MiMo-VL-7B.
On OST-Bench, TRACE improves over Direct by +1.20 on Gemini 3 Pro and +2.39 on MiMo-VL-7B.
One-stage inference is stronger than two-stage on Gemini and Qwen, while text-only remains competitive.
Cross-environment gains remain stable across ARKitScenes, ScanNet, and ScanNetPP.

Figure (VSI Gains): Performance gains across models on VSI-Bench.
TRACE yields consistent, state-of-the-art gains over Direct prompting across backbones and scales.

Figure (Decomposition): Decomposition analysis of spatial descriptor and reasoning parser.
The Qwen series still trails Gemini 3 on both spatial reasoning and visual perception.

Citation & Resources

Use TRACE in Your Work

TRACE is a prompting approach in which the model first writes a Textual Representation of Allocentric Context from Egocentric Video and then uses it as an intermediate reasoning trace for spatial understanding.

Venue

ACL 2026 Main Conference

Location

San Diego, California, USA

Contact

Jiacheng Hua: hjc21@mails.tsinghua.edu.cn
Miao Liu: miaoliu@mail.tsinghua.edu.cn

Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. 2026. Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), San Diego, California, USA. Association for Computational Linguistics.

BibTeX

@inproceedings{hua-etal-2026-unleashing,
  title     = {Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning},
  author    = {Hua, Jiacheng and Yin, Yishu and Wu, Yuhang and Wang, Tai and Huang, Yifei and Liu, Miao},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = jul,
  year      = {2026},
  address   = {San Diego, California, USA},
  publisher = {Association for Computational Linguistics},
  note      = {To appear}
}

@misc{hua2026unleashingspatialreasoningmultimodal,
  title         = {Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning},
  author        = {Hua, Jiacheng and Yin, Yishu and Wu, Yuhang and Wang, Tai and Huang, Yifei and Liu, Miao},
  year          = {2026},
  eprint        = {2603.23404},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.23404}
}