STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory

Mingfeng Yuan1, Hao Zhang2, Mahan Mohammadi1, Runhao Li1, Jinjun Shan2, Steven L. Waslander1
University of Toronto1, York University2

Key Features

  • Long-Horizon Multimodal Robot Memory (OmniMem). We introduce a unified, task-agnostic memory that integrates 3D primitives (geometry and semantics), temporally aligned video captions (dynamic scene descriptions), and keyframe visual memory, enabling joint spatial, temporal, and semantic reasoning over long-duration robot experiences (see the illustrative sketch after this list).
  • Scalable Task-Conditioned Retrieval via Information Bottleneck (STaR). STaR applies the Information Bottleneck principle to distill a compact, non-redundant, and information-rich subset of memories tailored to a given task, avoiding the inefficiency and hallucination risks of naïve Retrieval-Augmented Generation (RAG).
  • Agentic RAG for Planning, Retrieval, and Reasoning. We propose an agentic workflow in which an MLLM autonomously plans search strategies, issues memory retrieval calls, and reasons over STaR-distilled evidence, enabling precise answers and reliable execution for navigation and downstream robotic actions.
  • Extensive Evaluation and Real-Robot Deployment. STaR is evaluated on long-horizon navigation VQA benchmarks, including NaVQA (campus-scale indoor/outdoor scenes) and WH-VQA, a warehouse benchmark built in Isaac Sim containing many visually similar objects, and is further validated through end-to-end deployment on a real Husky mobile robot.
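
For concreteness, the sketch below shows one way the three OmniMem databases could be laid out in code. All class names and fields are illustrative assumptions for this page, not the released data format.

```python
from dataclasses import dataclass, field

@dataclass
class CaptionEntry:
    """Temporally aligned video caption (dynamic scene description)."""
    t_start: float             # segment start timestamp (s)
    t_end: float               # segment end timestamp (s)
    text: str                  # natural-language caption
    object_ids: list[int]      # 3D primitives mentioned in the caption

@dataclass
class Primitive3D:
    """3D primitive carrying geometry and semantics."""
    object_id: int
    label: str                                  # open-vocabulary semantic label
    centroid: tuple[float, float, float]        # position in the map frame (m)

@dataclass
class Keyframe:
    """Keyframe visual memory for fine-grained visual checks."""
    timestamp: float
    image_path: str            # path to the stored RGB frame
    pose: list[float]          # camera pose (x, y, z, qx, qy, qz, qw)

@dataclass
class OmniMem:
    """Unified, task-agnostic multimodal memory (illustrative layout)."""
    captions: list[CaptionEntry] = field(default_factory=list)
    primitives: dict[int, Primitive3D] = field(default_factory=dict)
    keyframes: list[Keyframe] = field(default_factory=list)
```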

🎥 STaR Demo Videos

Isaac Sim (Warehouse)

Real Robot Deployment

🧠 Method Overview

STaR System Overview

STaR System Overview. Our framework consists of three stages. (Left) Memory construction: the robot records RGB and posed depth data to build a multimodal memory composed of three complementary databases -- video captions, 3D primitives, and visual keyframes -- jointly forming OmniMem. (Middle) User query and reasoning: given text or multimodal queries, an agentic planner (MLLM) retrieves task-relevant memories through an Information Bottleneck, performs contextual reasoning, and outputs structured answers (location, time, or description). (Right) Evaluation: we evaluate STaR on both the NaVQA dataset (campus) and the WH-VQA dataset (warehouse), which cover spatial, temporal, and descriptive question types across short-, medium-, and long-term memory settings. The evaluation examines three key capabilities: long-horizon cross-modal memory construction, task-conditioned memory retrieval, and contextual reasoning. We also validate multimodal query and navigation tasks in a warehouse simulated with Isaac Sim.
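
As a rough illustration of the middle stage, the loop below shows how an MLLM planner might alternate between planning retrieval calls and reasoning over distilled evidence. `plan_search`, `retrieve`, `distill`, and `answer` are hypothetical interfaces assumed for this sketch, not the released API.

```python
def agentic_query(query, memory, mllm, distill, max_rounds=3):
    """Hypothetical agentic RAG loop over OmniMem (all interfaces assumed).

    memory.retrieve(db, keys)        -> list of raw memory entries
    mllm.plan_search(query, evidence) -> plan with .done, .db, .keys
    mllm.answer(query, evidence)      -> structured answer (location / time / description)
    distill(query, entries)           -> compact, non-redundant evidence subset (STaR)
    """
    evidence = []
    for _ in range(max_rounds):
        plan = mllm.plan_search(query, evidence)   # decide what to look up next
        if plan.done:                              # planner judges evidence sufficient
            break
        raw = memory.retrieve(plan.db, plan.keys)  # query captions / primitives / keyframes
        evidence.extend(distill(query, raw))       # keep only task-relevant, non-redundant items
    return mllm.answer(query, evidence)            # contextual reasoning -> structured output
```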

STaR Task-Conditioned Retrieval

Task-conditioned retrieval and contextual reasoning. Given an open-ended user query, STaR embeds task cues and queries the memory database to retrieve relevant video captions with timestamps and associated detected objects (caption-induced primitives). These retrieved cues define a task-specific working set of 3D primitives, over which STaR applies an Information Bottleneck–based clustering to merge neighboring primitives into compact, task-relevant groups. Captions are then grouped by cluster, and a single representative caption is selected from each group to form a non-redundant evidence set. When necessary, the robot further loads keyframe images to resolve fine-grained visual details, enabling contextual reasoning and the generation of actionable outputs, such as object locations, shelf indices, and navigation targets for downstream tasks.
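
The distillation step can be pictured as below: cluster the caption-induced primitives by spatial proximity, group captions by cluster, and keep one representative caption per group. This is a simplified, greedy stand-in for the paper's Information Bottleneck objective; the radius, data layout, and function name are illustrative assumptions.

```python
import numpy as np

def distill_evidence(primitives, captions, query_emb, radius=1.5):
    """Simplified sketch of STaR-style evidence distillation (not the exact IB objective).

    primitives: dict {object_id: (x, y, z) centroid in the map frame}
    captions:   list of (embedding, text, object_ids) tuples
    query_emb:  unit-normalized query embedding
    Returns one representative caption per primitive cluster.
    """
    # 1. Greedy spatial clustering of the task-specific working set of primitives.
    clusters, centers = {}, []
    for oid, p in primitives.items():
        p = np.asarray(p, dtype=float)
        dists = [np.linalg.norm(p - c) for c in centers]
        if dists and min(dists) < radius:
            clusters[oid] = int(np.argmin(dists))   # merge into the nearest cluster
        else:
            clusters[oid] = len(centers)            # start a new cluster
            centers.append(p)

    # 2. Group captions by cluster and keep the most query-relevant one per group.
    best = {}  # cluster_id -> (relevance, caption_text)
    for emb, text, oids in captions:
        relevance = float(np.dot(emb, query_emb))
        for cid in {clusters[o] for o in oids if o in clusters}:
            if cid not in best or relevance > best[cid][0]:
                best[cid] = (relevance, text)

    # 3. Non-redundant evidence set: one representative caption per cluster.
    return [text for _, text in best.values()]
```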

🚘 On-Device Deployment

STaR Demo

STaR deployed on a Husky robot for indoor and outdoor experiments, supporting both text-based and multimodal queries.

Citation

If you find this work helpful, please consider citing:

@article{Yuan2026STaR,
      title={STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory}, 
      author={Mingfeng Yuan and Hao Zhang and Mahan Mohammadi and Runhao Li and Jinjun Shan and Steven L. Waslander},
      year={2026},
      eprint={2602.09255},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.09255}, 
}