---
pipeline_tag: robotics
library_name: transformers
license: mit
---

This repository contains models for the **VLN-PE Benchmark**, as presented in the paper [Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities](https://huggingface.co/papers/2507.13019).

VLN-PE introduces a physically realistic Vision-and-Language Navigation platform supporting humanoid, quadruped, and wheeled robots, and systematically evaluates several ego-centric VLN methods in physical robotic settings.

For more details, visit the [project page](https://crystalsixone.github.io/vln_pe.github.io/) or the main [GitHub repository](https://github.com/InternRobotics/InternNav).

## VLN-PE Benchmark

Metrics: TL (trajectory length, m), NE (navigation error, m), FR (fall rate, %), StR (stuck rate, %), OS (oracle success rate, %), SR (success rate, %), and SPL (success weighted by path length, %).

<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
<tr>
<th class="tg-c3ow" rowspan="2"><span style="font-weight:bold">Model</span></th>
<th class="tg-0pky" rowspan="2"><span style="font-weight:bold">Dataset/Benchmark</span></th>
<th class="tg-c3ow" colspan="7"><span style="font-weight:bold">Val Seen</span></th>
<th class="tg-c3ow" colspan="7"><span style="font-weight:bold">Val Unseen</span></th>
<th class="tg-fymr" rowspan="2">Download</th>
</tr>
<tr>
<th class="tg-fymr">TL</th>
<th class="tg-fymr">NE</th>
<th class="tg-fymr">FR</th>
<th class="tg-fymr">StR</th>
<th class="tg-fymr">OS</th>
<th class="tg-fymr">SR</th>
<th class="tg-fymr">SPL</th>
<th class="tg-fymr">TL</th>
<th class="tg-fymr">NE</th>
<th class="tg-fymr">FR</th>
<th class="tg-fymr">StR</th>
<th class="tg-fymr">OS</th>
<th class="tg-fymr">SR</th>
<th class="tg-fymr">SPL</th>
</tr></thead>
<tbody>
<tr>
<td class="tg-c3ow" colspan="17">Zero-shot transfer evaluation from VLN-CE</td>
</tr>
<tr>
<td class="tg-0pky">Seq2Seq-Full</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">7.80</td>
<td class="tg-0pky">7.62</td>
<td class="tg-0pky">20.21</td>
<td class="tg-0pky">3.04</td>
<td class="tg-0pky">19.3</td>
<td class="tg-0pky">15.2</td>
<td class="tg-0pky">12.79</td>
<td class="tg-0pky">7.73</td>
<td class="tg-0pky">7.18</td>
<td class="tg-0pky">18.04</td>
<td class="tg-0pky">3.04</td>
<td class="tg-0pky">22.42</td>
<td class="tg-0pky">16.48</td>
<td class="tg-0pky">14.11</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/zero_shot/seq2seq" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">CMA-Full</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">6.62</td>
<td class="tg-0pky">7.37</td>
<td class="tg-0pky">20.06</td>
<td class="tg-0pky">3.95</td>
<td class="tg-0pky">18.54</td>
<td class="tg-0pky">16.11</td>
<td class="tg-0pky">14.61</td>
<td class="tg-0pky">6.58</td>
<td class="tg-0pky">7.09</td>
<td class="tg-0pky">17.07</td>
<td class="tg-0pky">3.79</td>
<td class="tg-0pky">20.86</td>
<td class="tg-0pky">16.93</td>
<td class="tg-0pky">15.24</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/zero_shot/cma" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-c3ow" colspan="17">Train on VLN-PE</td>
</tr>
<tr>
<td class="tg-0pky">Seq2Seq</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">10.61</td>
<td class="tg-0pky">7.53</td>
<td class="tg-0pky">27.36</td>
<td class="tg-0pky">4.26</td>
<td class="tg-0pky">32.67</td>
<td class="tg-0pky">19.75</td>
<td class="tg-0pky">14.68</td>
<td class="tg-0pky">10.85</td>
<td class="tg-0pky">7.88</td>
<td class="tg-0pky">26.8</td>
<td class="tg-0pky">5.57</td>
<td class="tg-0pky">28.13</td>
<td class="tg-0pky">15.14</td>
<td class="tg-0pky">10.77</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/seq2seq" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">CMA</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">11.13</td>
<td class="tg-0pky">7.59</td>
<td class="tg-0pky">23.71</td>
<td class="tg-0pky">3.19</td>
<td class="tg-0pky">34.94</td>
<td class="tg-0pky">21.58</td>
<td class="tg-0pky">16.1</td>
<td class="tg-0pky">11.16</td>
<td class="tg-0pky">7.98</td>
<td class="tg-0pky">22.64</td>
<td class="tg-0pky">3.27</td>
<td class="tg-0pky">33.11</td>
<td class="tg-0pky">19.15</td>
<td class="tg-0pky">14.05</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/cma" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">RDP</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">13.26</td>
<td class="tg-0pky">6.76</td>
<td class="tg-0pky">27.51</td>
<td class="tg-0pky">1.82</td>
<td class="tg-0pky">38.6</td>
<td class="tg-0pky">25.08</td>
<td class="tg-0pky">17.07</td>
<td class="tg-0pky">12.7</td>
<td class="tg-0pky">6.72</td>
<td class="tg-0pky">24.57</td>
<td class="tg-0pky">3.11</td>
<td class="tg-0pky">36.9</td>
<td class="tg-0pky">25.24</td>
<td class="tg-0pky">17.73</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/rdp" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">Seq2Seq+</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">10.22</td>
<td class="tg-0pky">7.75</td>
<td class="tg-0pky">33.43</td>
<td class="tg-0pky">3.19</td>
<td class="tg-0pky">30.09</td>
<td class="tg-0pky">16.86</td>
<td class="tg-0pky">12.54</td>
<td class="tg-0pky">9.88</td>
<td class="tg-0pky">7.85</td>
<td class="tg-0pky">26.27</td>
<td class="tg-0pky">6.52</td>
<td class="tg-0pky">28.79</td>
<td class="tg-0pky">16.56</td>
<td class="tg-0pky">12.7</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/seq2seq_plus" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">CMA+</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">8.86</td>
<td class="tg-0pky">7.14</td>
<td class="tg-0pky">23.56</td>
<td class="tg-0pky">3.5</td>
<td class="tg-0pky">36.17</td>
<td class="tg-0pky">25.84</td>
<td class="tg-0pky">21.75</td>
<td class="tg-0pky">8.79</td>
<td class="tg-0pky">7.26</td>
<td class="tg-0pky">21.75</td>
<td class="tg-0pky">3.27</td>
<td class="tg-0pky">31.4</td>
<td class="tg-0pky">22.12</td>
<td class="tg-0pky">18.65</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/cma_plus" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
</tbody></table>
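
Each Download link in the table points to a sub-folder of this repository, so one checkpoint can be fetched without cloning everything. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the folder layout matches the links above. The helper names `checkpoint_subdir` and `download_checkpoint` and the default `checkpoints` directory are hypothetical and not part of InternNav.

```python
REPO_ID = "InternRobotics/VLN-PE"

# Sub-folders copied from the Download links in the table
# (setting -> model folder names). The layout is assumed to match those links.
KNOWN_CHECKPOINTS = {
    "zero_shot": {"seq2seq", "cma"},
    "fine_tuned": {"seq2seq", "cma", "rdp", "seq2seq_plus", "cma_plus"},
}


def checkpoint_subdir(setting: str, model: str) -> str:
    """Return the repository sub-folder for a given setting and model."""
    if model not in KNOWN_CHECKPOINTS.get(setting, set()):
        raise ValueError(f"No checkpoint listed for {setting}/{model}")
    return f"r2r/{setting}/{model}"


def download_checkpoint(setting: str, model: str, local_dir: str = "checkpoints") -> str:
    """Download only the files under one checkpoint sub-folder.

    Hypothetical helper: it wraps huggingface_hub.snapshot_download and
    restricts the download with allow_patterns.
    """
    # Imported lazily so the path helper above works without the package.
    from huggingface_hub import snapshot_download

    subdir = checkpoint_subdir(setting, model)
    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{subdir}/*"],
        local_dir=local_dir,
    )
```

For example, `download_checkpoint("fine_tuned", "cma_plus")` would fetch the CMA+ files into `checkpoints/r2r/fine_tuned/cma_plus`. Loading and evaluating the weights is handled by the InternNav code base linked above and is not shown here.
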

## Citation

If you find our work helpful, please cite:

```bibtex
@inproceedings{vlnpe,
  title={Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities},
  author={Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
@misc{internnav2025,
  title = {{InternNav: InternRobotics'} open platform for building generalized navigation foundation models},
  author = {InternNav Contributors},
  howpublished={\url{https://github.com/InternRobotics/InternNav}},
  year = {2025}
}
```