Vision-Language and Task and Motion Planning

Language-guided robot task planning, vision-language interpreters, and integrated task and motion planning (TAMP) for manipulation.

This theme covers vision-language models for robot task planning: interpreting natural language instructions and scene observations to generate executable manipulation plans. It includes the Vision-Language Interpreter (ViLaIn) for task planning, a grounded vision-language interpreter for integrated TAMP, and one-shot vision-language guided motion generation (e.g., KeyMPs, which sequences dynamic movement primitives (DMPs) for occlusion-rich tasks). The goal is to bridge high-level language instructions and low-level motion execution for flexible, interpretable robot control.

(Shirai et al., 2024; Siburian et al., 2025; Anarossi et al., 2025)

References

2025

  1. Under Review
    Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
    Jeremy Siburian, Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Michael Görner, and 1 more author
    2025
  2. IEEE Access
    KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks
    Edgar Anarossi, Yuhwan Kwon, Hirotaka Tahara, Shohei Tanaka, Keisuke Shirai, and 4 more authors
    In IEEE Access, 2025

2024

  1. ICRA
    Vision-Language Interpreter for Robot Task Planning
    Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Atsushi Hashimoto, Shohei Tanaka, and 4 more authors
    In IEEE International Conference on Robotics and Automation (ICRA), 2024