Dataset Gallery

ConsistCompose3M

A multimodal dataset of 3.4M samples for layout-grounded image composition, annotated with layout and identity information.

Overview

Data Statistics

[Figure: data statistics for Text2Image, Single-Reference, and Multi-Reference]
Category 1

Layout Text to Image

Category 2

Single Subject

[Image gallery: six examples, each shown as Subject → Background → ROI → Output]
Category 2 — Full Wall

Single Subject Image Wall

Category 3

Multi-subject

[Image gallery: six examples, each shown as Sub 1 → Sub 2 → Background → ROI → Output]
Category 3 — Full Wall

Multi-subject Image Wall (FLUX)

Category 3 — Qwen

Multi-subject Image Wall (Qwen Image)

Citation

BibTeX

If you find our work useful, please cite:

@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and
          Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}