GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
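GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation [115.5] We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. We introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality. Reference translation of the paper (metadata) (Thu, 18 Dec 2025 18:26:56 GMT)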
"[While] GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time, resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models." In short, the paper proposes a new version of GenEval.
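To make the "absolute error" figure concrete, here is a minimal sketch of how drift between an automated benchmark score and a human-judged score could be measured per model. It assumes drift is simply the absolute difference in percentage points; the function name `absolute_drift`, the model names, and every number below are hypothetical illustrations, not the paper's method or data.

```python
def absolute_drift(benchmark_scores, human_scores):
    """Return per-model absolute error, in percentage points, between
    an automated benchmark score and a human-judged score.

    Assumption: both inputs map model name -> score on a 0-100 scale.
    """
    return {
        model: abs(benchmark_scores[model] - human_scores[model])
        for model in benchmark_scores
    }


# Hypothetical scores, invented purely for illustration.
benchmark = {"model_a": 80.0, "model_b": 90.0}
human = {"model_a": 78.0, "model_b": 75.0}

drift = absolute_drift(benchmark, human)

# The model on which the automated metric disagrees most with humans.
worst_model = max(drift, key=drift.get)
print(drift, worst_model)
```

A large gap for the newest models alongside a small gap for older ones would be one sign that the automated metric no longer tracks human judgment.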