---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-image
---

## ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

[![Paper (arXiv)](https://img.shields.io/badge/Paper-ControlThinker-d32f2f.svg?logo=arXiv)](https://arxiv.org/abs/2506.03596) [![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20HF%20-Model-yellow)](https://huggingface.co/maplebb/ControlThinker) [![Hugging Face Paper](https://img.shields.io/badge/Paper-HF-blue)](https://huggingface.co/papers/2506.03596) [GitHub Repository](https://github.com/maplebb/controlthinker)

ControlThinker is a framework for controllable image generation that follows a "comprehend-then-generate" paradigm based on visual reasoning. It addresses the semantic gap between input text prompts and target images by using a Multimodal Large Language Model (MLLM) to extract latent semantics from control images. These semantics are used to enrich the prompt, which the paper reports improves both the visual quality and the semantic consistency of the generated images.

The model was presented in the paper [ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning](https://huggingface.co/papers/2506.03596).

<p align="center"><img src="https://github.com/maplebb/controlthinker/raw/main/asset/image/teaser.png" width="95%"></p>
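
The "comprehend-then-generate" flow described above can be pictured as two stages chained together. The sketch below is only a conceptual illustration: `comprehend_then_generate`, `toy_reason`, and `toy_generate` are hypothetical names, not part of the ControlThinker code base, and passing the control image to the generator as well is an assumption. The toy stand-ins replace the MLLM and the image generator so the snippet runs without model weights.

```python
from typing import Any, Callable, Tuple

# Hypothetical signatures used only for this illustration.
Reasoner = Callable[[Any, str], str]   # (control_image, prompt) -> enriched prompt
Generator = Callable[[Any, str], Any]  # (control_image, enriched prompt) -> image


def comprehend_then_generate(
    control_image: Any,
    prompt: str,
    reason: Reasoner,
    generate: Generator,
) -> Tuple[str, Any]:
    """Chain the two stages: enrich the prompt first, then generate from it."""
    # Stage 1 (comprehend): derive extra semantics from the control image.
    enriched_prompt = reason(control_image, prompt)
    # Stage 2 (generate): condition generation on the enriched prompt.
    image = generate(control_image, enriched_prompt)
    return enriched_prompt, image


# Toy stand-ins for the MLLM and the generator (assumptions, not real models).
def toy_reason(control_image: Any, prompt: str) -> str:
    return f"{prompt} Visible structure: {control_image}."


def toy_generate(control_image: Any, enriched_prompt: str) -> dict:
    return {"control": control_image, "prompt": enriched_prompt}


enriched, image = comprehend_then_generate(
    "edge map of a dog", "A dog by a waterfall.", toy_reason, toy_generate
)
print(enriched)
```

In a real pipeline the two callables would wrap the MLLM reasoning step and the image generator; the point of the sketch is only that the generator sees the enriched prompt rather than the original one.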

## Usage

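The example below generates a 768x768 image from a text prompt. It imports `FlexARInferenceSolver` from an `inference_solver` module, which is not part of `transformers`; it presumably ships with the GitHub repository linked above, so run the example from an environment where that module is importable.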

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="maplebb/ControlThinker",
    precision="bf16",
    target_size=768,
)

# The instruction and the image description are separated by a newline.
q1 = (
    "Generate an image of 768x768 according to the following prompt:\n"
    "Image of a dog playing water, and a waterfall is in the background."
)

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]

# You can save and display the generated image
new_image.save("generated_dog.png")
new_image.show()
```
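
To reuse the query format from the example above with other descriptions, the string can be built by a small helper. This is a minimal sketch: `build_generation_prompt` is a hypothetical name, not part of the repository, and whether the checkpoint supports resolutions other than 768 is an assumption you should verify before changing `size` (the `target_size` passed to the solver would presumably need to match).

```python
def build_generation_prompt(description: str, size: int = 768) -> str:
    """Format a text-to-image query in the style of the usage example.

    `size` is the side length of the square image requested in the query.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return (
        f"Generate an image of {size}x{size} according to the following prompt:\n"
        f"{description.strip()}"
    )


q1 = build_generation_prompt(
    "Image of a dog playing water, and a waterfall is in the background."
)
print(q1)
```

The resulting `q1` can be passed as `qas=[[q1, None]]` exactly as in the example above.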

## License

ControlThinker is licensed under the Apache License 2.0.

## ✍️ Citation

```bibtex
@article{han2025controlthinker,
  title={ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning},
  author={Han, Feng and Jiao, Yang and Chen, Shaoxiang and Xu, Junhao and Chen, Jingjing and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2506.03596},
  year={2025}
}
```