---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-to-image
library_name: transformers
---
## Introduction
We introduce **LongCat-Image**, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation, designed to address core challenges of current leading models: multilingual text rendering, photorealism, deployment efficiency, and developer accessibility.
### Key Features
- 🌟 **Exceptional Efficiency and Performance**: With only **6B parameters**, LongCat-Image surpasses numerous open-source models that are several times larger across multiple benchmarks, demonstrating the immense potential of efficient model design.
- 🌟 **Powerful Chinese Text Rendering**: LongCat-Image renders common Chinese characters with higher accuracy and stability than existing SOTA open-source models, and achieves industry-leading coverage of the Chinese character set.
- 🌟 **Remarkable Photorealism**: Through an innovative data strategy and training framework, LongCat-Image achieves remarkable photorealism in generated images.
[//]: # (For more details, please refer to the comprehensive [***LongCat-Image Technical Report***](https://arxiv.org/abs/2412.11963).)
## Quick Start
### Installation
Clone the repo:
```shell
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
```
Install dependencies:
```shell
# create conda environment
conda create -n longcat-image python=3.10
conda activate longcat-image
# install other requirements
pip install -r requirements.txt
python setup.py develop
```
### Run Text-to-Image Generation
> [!TIP]
> Leveraging a stronger LLM for prompt refinement can further enhance image generation quality. Please refer to [inference_t2i.py](https://github.com/meituan-longcat/LongCat-Image/blob/main/scripts/inference_t2i.py#L28) for detailed usage instructions; a minimal sketch of the idea is shown after the example below.

> [!CAUTION]
> **Special Handling for Text Rendering**
>
> For both Text-to-Image and Image Editing tasks involving text generation, **you must enclose the target text within quotes (`""`)**.
>
> **Reason:** The tokenizer applies **character-level encoding** only to content found inside quotes. Omitting the quotation marks significantly degrades text rendering quality.
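For instance, a prompt that should render the words LongCat Image on a sign could be written as follows; the surrounding scene description is only an illustrative example:

```python
# Text meant to appear in the image goes inside quotes so the tokenizer
# encodes it character by character.
prompt = 'A neon shop sign glowing at night that reads "LongCat Image", wet street reflecting the light.'
```

The full end-to-end example follows: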
```python
import torch
from transformers import AutoProcessor
from longcat_image.models import LongCatImageTransformer2DModel
from longcat_image.pipelines import LongCatImagePipeline
device = torch.device('cuda')
checkpoint_dir = './weights/LongCat-Image'

# Load the tokenizer/processor and the diffusion transformer from the local checkpoint
text_processor = AutoProcessor.from_pretrained(checkpoint_dir, subfolder='tokenizer')
transformer = LongCatImageTransformer2DModel.from_pretrained(
    checkpoint_dir,
    subfolder='transformer',
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device)

pipe = LongCatImagePipeline.from_pretrained(
    checkpoint_dir,
    transformer=transformer,
    text_processor=text_processor
)

# pipe.to(device, torch.bfloat16)  # Uncomment on high-VRAM devices for faster inference
pipe.enable_model_cpu_offload()  # Offload to CPU to save VRAM (requires ~17 GB); slower but prevents OOM
# Chinese prompt; English gloss: "A young Asian woman in a yellow knit sweater with a white necklace.
# Her hands rest on her knees and her expression is serene. The background is a rough brick wall;
# warm afternoon sunlight creates a calm, cozy atmosphere. A medium-distance shot highlights her
# expression and the details of her clothing, with soft light emphasizing her features."
prompt = '一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。'
image = pipe(
    prompt,
    height=768,
    width=1344,
    guidance_scale=4.5,
    num_inference_steps=50,
    num_images_per_prompt=1,
    generator=torch.Generator("cpu").manual_seed(43),
    enable_cfg_renorm=True,
    enable_prompt_rewrite=True,  # Reuse the text encoder as a built-in prompt rewriter
).images[0]
image.save('./t2i_example.png')
```
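As noted in the tip above, refining the prompt with a stronger LLM before generation can improve image quality. The repository's own refinement prompt lives in [inference_t2i.py](https://github.com/meituan-longcat/LongCat-Image/blob/main/scripts/inference_t2i.py#L28); the snippet below is only a minimal sketch of the idea, assuming any strong instruct LLM can serve as the rewriter (the model name, system prompt, and `refined_prompt` variable are illustrative choices, not part of LongCat-Image). It reuses the `pipe` object constructed in the example above.
```python
import torch
from transformers import pipeline

# Hypothetical prompt-refinement step: the model below is only an example choice,
# and this system prompt is illustrative, not the one shipped in scripts/inference_t2i.py.
rewriter = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Expand the user's short image prompt into a detailed, visually "
                                  "specific description. Keep any quoted text unchanged and in quotes."},
    {"role": "user", "content": 'A storefront at dusk with a sign that reads "LongCat Image"'},
]
# The chat-style text-generation pipeline returns the full message list;
# the last message is the assistant's rewritten prompt.
refined_prompt = rewriter(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]

# Reuse the LongCat-Image pipeline from the example above
image = pipe(
    refined_prompt,
    height=768,
    width=1344,
    guidance_scale=4.5,
    num_inference_steps=50,
    generator=torch.Generator("cpu").manual_seed(43),
).images[0]
image.save('./t2i_refined_example.png')
```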