---
pipeline_tag: robotics
library_name: transformers
license: mit
---
This repository contains models for the VLN-PE Benchmark, as presented in the paper *Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities*.

VLN-PE introduces a physically realistic Vision-and-Language Navigation platform supporting humanoid, quadruped, and wheeled robots, and systematically evaluates several egocentric VLN methods in physical robotic settings.

For more details, visit the project page or the main GitHub repository.
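Since the table below links individual checkpoints, here is a minimal sketch of fetching one with `huggingface_hub`. The `repo_id` and `filename` values are illustrative placeholders, not the actual paths in this repository; substitute the values from the download links.

```python
# Minimal sketch: download a single checkpoint file from the Hub.
# NOTE: repo_id and filename are illustrative placeholders -- replace
# them with the actual repository id and file path of the model you want.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="InternRobotics/VLN-PE",   # placeholder repo id
    filename="cma/ckpt.pth",           # placeholder checkpoint path
)
print(ckpt_path)  # local filesystem path to the cached file
```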
## VLN-PE Benchmark
Results on R2R (VLN-PE), reported separately for the Val Seen and Val Unseen splits. TL: trajectory length (m); NE: navigation error (m, lower is better); FR: fall rate (%, lower is better); StR: stuck rate (%, lower is better); OS: oracle success rate (%); SR: success rate (%); SPL: success weighted by path length (%).

| Model | Dataset/Benchmark | Seen TL | Seen NE | Seen FR | Seen StR | Seen OS | Seen SR | Seen SPL | Unseen TL | Unseen NE | Unseen FR | Unseen StR | Unseen OS | Unseen SR | Unseen SPL | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Zero-shot transfer evaluation from VLN-CE** | | | | | | | | | | | | | | | | |
| Seq2Seq-Full | R2R VLN-PE | 7.80 | 7.62 | 20.21 | 3.04 | 19.3 | 15.2 | 12.79 | 7.73 | 7.18 | 18.04 | 3.04 | 22.42 | 16.48 | 14.11 | model |
| CMA-Full | R2R VLN-PE | 6.62 | 7.37 | 20.06 | 3.95 | 18.54 | 16.11 | 14.61 | 6.58 | 7.09 | 17.07 | 3.79 | 20.86 | 16.93 | 15.24 | model |
| **Train on VLN-PE** | | | | | | | | | | | | | | | | |
| Seq2Seq | R2R VLN-PE | 10.61 | 7.53 | 27.36 | 4.26 | 32.67 | 19.75 | 14.68 | 10.85 | 7.88 | 26.8 | 5.57 | 28.13 | 15.14 | 10.77 | model |
| CMA | R2R VLN-PE | 11.13 | 7.59 | 23.71 | 3.19 | 34.94 | 21.58 | 16.1 | 11.16 | 7.98 | 22.64 | 3.27 | 33.11 | 19.15 | 14.05 | model |
| RDP | R2R VLN-PE | 13.26 | 6.76 | 27.51 | 1.82 | 38.6 | 25.08 | 17.07 | 12.7 | 6.72 | 24.57 | 3.11 | 36.9 | 25.24 | 17.73 | model |
| Seq2Seq+ | R2R VLN-PE | 10.22 | 7.75 | 33.43 | 3.19 | 30.09 | 16.86 | 12.54 | 9.88 | 7.85 | 26.27 | 6.52 | 28.79 | 16.56 | 12.7 | model |
| CMA+ | R2R VLN-PE | 8.86 | 7.14 | 23.56 | 3.5 | 36.17 | 25.84 | 21.75 | 8.79 | 7.26 | 21.75 | 3.27 | 31.4 | 22.12 | 18.65 | model |
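The SR and SPL columns follow the standard VLN definitions: the fraction of episodes that succeed, and success weighted by the ratio of shortest-path length to the longer of the taken and shortest paths. As a reference, here is a minimal sketch of computing both from per-episode records; the `Episode` fields are illustrative names, not part of any VLN-PE API.

```python
# Sketch of the standard VLN success metrics reported above.
# The Episode dataclass and its field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool         # ended within the success-distance threshold
    path_length: float    # length of the executed trajectory (m)
    shortest_path: float  # geodesic shortest-path distance to the goal (m)

def success_rate(episodes: list[Episode]) -> float:
    """SR: percentage of episodes that succeed."""
    return 100.0 * sum(e.success for e in episodes) / len(episodes)

def spl(episodes: list[Episode]) -> float:
    """SPL: success weighted by shortest_path / max(path_length, shortest_path)."""
    total = sum(
        e.success * e.shortest_path / max(e.path_length, e.shortest_path)
        for e in episodes
    )
    return 100.0 * total / len(episodes)
```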
## Citation

If you find our work helpful, please cite:
```bibtex
@inproceedings{vlnpe,
  title={Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities},
  author={Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}

@misc{internnav2025,
  title={{InternNav: InternRobotics'} open platform for building generalized navigation foundation models},
  author={InternNav Contributors},
  howpublished={\url{https://github.com/InternRobotics/InternNav}},
  year={2025}
}
```