Update README.md
README.md CHANGED
@@ -131,16 +131,14 @@ It also closes the gap to proprietary Claude models.
 ### GUI Grounding Performance
 <div align="center">
 
-| **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
-|-------|-----------|---------------|----------------|
-| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
-| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
-| UI-TARS-72B | 57.1 | 90.3 | 38.1 |
-| **OpenCUA-
-| **OpenCUA-
-| **OpenCUA-
-| **OpenCUA-32B** | **59.6** | **93.4** | 55.3 |
-| **OpenCUA-72B** | - | 92.9 | **60.8** |
+| **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** | **UI-Vision** |
+|-------|-----------|---------------|----------------|----------------|
+| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
+| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
+| UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
+| **OpenCUA-7B** | 55.3 | 92.3 | 50.0 | 29.7 |
+| **OpenCUA-32B** | **59.6** | **93.4** | 55.3 | 33.3 |
+| **OpenCUA-72B** | 59.2 | 92.9 | **60.8** | **37.3** |
 </div>
 
 
@@ -159,7 +157,7 @@ It also closes the gap to proprietary Claude models.
 
 # 🚀 Quick Start
 <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
-<strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B):</strong>
+<strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>
 
 To align with our training infrastructure, we have modified the model in two places:
 <ul style="margin-top: 8px;">
@@ -184,8 +182,8 @@ Download the model weights from Hugging Face:
 ```python
 from huggingface_hub import snapshot_download
 snapshot_download(
-    repo_id="xlangai/OpenCUA-
-    local_dir="OpenCUA-
+    repo_id="xlangai/OpenCUA-72B",
+    local_dir="OpenCUA-72B",
     local_dir_use_symlinks=False
 )
 ```
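After the download, loading along the following lines should work. This is a minimal sketch rather than the repo's official script; it assumes the checkpoint ships custom modeling code (hence `trust_remote_code=True`):

```python
from transformers import AutoImageProcessor, AutoModelForCausalLM, AutoTokenizer

# Minimal loading sketch (assumed flow, not the official script):
# OpenCUA checkpoints carry custom modeling code, so trust_remote_code=True.
model_path = "OpenCUA-72B"  # the local_dir used in snapshot_download above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",   # pick bf16/fp16 automatically where supported
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)
image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
```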
@@ -274,7 +272,7 @@ def run_inference(model, tokenizer, image_processor, messages, image_path):
     return output_text
 
 # Example usage
-model_path = "OpenCUA/OpenCUA-
+model_path = "OpenCUA/OpenCUA-72B" # or other model variants
 image_path = "screenshot.png"
 instruction = "Click on the submit button"
 
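For context, one way the example usage above might be wired into a full call. The `run_inference` signature comes from the hunk header; the exact `messages` schema is an assumption here, so treat the role/content fields as illustrative:

```python
# Hypothetical glue code: model, tokenizer, and image_processor come from the
# loading step above; the messages format below is assumed, not confirmed by the diff.
messages = [
    {"role": "system", "content": "You are a GUI agent."},  # assumed system prompt
    {"role": "user", "content": [
        {"type": "image", "image": image_path},  # the screenshot
        {"type": "text", "text": instruction},   # the task instruction
    ]},
]
output_text = run_inference(model, tokenizer, image_processor, messages, image_path)
print(output_text)  # expected: a pyautogui action, e.g. pyautogui.click(x=..., y=...)
```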
@@ -306,13 +304,14 @@ python huggingface_inference.py
 Command for running OpenCUA models (e.g., OpenCUA-72B) in OSWorld:
 ```bash
 python run_multienv_opencua.py \
-
-
-
-
-
-
-
+    --headless \
+    --observation_type screenshot \
+    --model OpenCUA-72B \
+    --result_dir ./results \
+    --test_all_meta_path evaluation_examples/test_nogdrive.json \
+    --max_steps 100 \
+    --num_envs 30 \
+    --coordinate_type qwen25
 ```
 <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
 <em>Currently we only support Hugging Face inference. We are implementing vLLM support for OpenCUA models. Please stay tuned.</em>
@@ -430,6 +429,7 @@ OpenCUA models are intended for **research and educational purposes only**.
 <li><strong><code>OpenCUA/OpenCUA-Qwen2-7B</code></strong> → Relative coordinates</li>
 <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> → Absolute coordinates</li>
 <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> → Absolute coordinates</li>
+<li><strong><code>OpenCUA/OpenCUA-72B</code></strong> → Absolute coordinates</li>
 </ul>
 </div>
 
@@ -447,7 +447,7 @@ OpenCUA models are intended for **research and educational purposes only**.
     return abs_x, abs_y
 ```
 
-- **OpenCUA-7B
+- **OpenCUA-7B, OpenCUA-32B, OpenCUA-72B** (Qwen2.5-based): Output **absolute coordinates** after smart resize
 ```python
 # Example output: pyautogui.click(x=960, y=324)
 # These are coordinates on the smart-resized image, not the original image
```
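Because the Qwen2.5-based checkpoints emit coordinates on the smart-resized image, a back-mapping step is needed before acting on the real screen. The sketch below follows the rounding scheme of Qwen2.5-VL's `smart_resize` (sides rounded to multiples of 28 within a pixel budget); the default pixel limits here are assumptions:

```python
import math

def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=12845056):
    # Round each side to a multiple of `factor` while keeping the
    # total pixel count inside [min_pixels, max_pixels].
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def model_xy_to_screen(x, y, orig_width, orig_height):
    # Map a coordinate predicted on the smart-resized image back to
    # the original screenshot resolution before calling pyautogui.
    new_h, new_w = smart_resize(orig_height, orig_width)
    return round(x * orig_width / new_w), round(y * orig_height / new_h)

# e.g. on a 2560x1440 screenshot, the model's (960, 324) maps to a
# slightly different point in the original resolution
print(model_xy_to_screen(960, 324, 2560, 1440))
```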