UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

1Fudan University, 2Shanghai Innovation Institute,
3Hunyuan, Tencent, 4Shanghai AI Lab,
5Shanghai Jiao Tong University

Overview


Benchmark Statistics


Benchmark Statistics.
(a) Word clouds for English and Chinese prompts in both short and long forms; (b) overall prompt length distribution; and (c) distribution of testpoint counts per prompt for short versus long versions.

Evaluation Dimensions


Qualitative Cases.
We present qualitative examples of T2I models evaluated across our specified dimensions.

Benchmark Construction and Offline Evaluation Model Training.


Evaluation Accuracy Comparison.
Our dedicated evaluation model achieves substantially higher evaluation accuracy across all test points than the commonly used offline evaluation VLM, Qwen2.5-VL-72B.

English Short Prompt Evaluation


English Long Prompt Evaluation


Chinese Short Prompt Evaluation


Chinese Long Prompt Evaluation


BibTeX


@article{UniGenBench++,
  title={UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Bu, Jiazi and Zhou, Yujie and Xin, Yi and He, Junjun and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.18701},
  year={2025}
}